muellan / metacache

memory efficient, fast & precise taxonomic classification system for metagenomic read mapping
GNU General Public License v3.0

Segmentation fault (core dumped) #27

Closed jaimeortiz-david closed 5 months ago

jaimeortiz-david commented 2 years ago

I am getting this error every time I run a query against the database.

Reading database metadata ...
Reading 1 database part(s) ...
Completed database reading.
% custom query sketching settings: -sketchlen 32 -winlen 127 -winstride 112
Classifying query sequences.
Per-Read mappings will be written to file: /local/workdir/metacache/results_DBsimulated/reduced_32/sample_kmer32.txt_sample_10_1m_R1.fq_sample_10_1m_R2.fq.txt
Per-Taxon mappings will be written to file: /local/workdir/metacache/results_DBsimulated/reduced_32/abund_extraction.txt_sample_10_1m_R1.fq_sample_10_1m_R2.fq.txt
[> ] 0%Segmentation fault (core dumped)

Funatiq commented 2 years ago

Hi! Could you please give more information on when this happens? Is this immediately at the beginning of the query? Is there anything in the output files?

Have you tried different input files for the query? Can you try input files with only a few sequences?
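One way to produce such a small test input is to copy only the first few records of each FASTQ file. The sketch below is not part of MetaCache; it assumes plain four-line FASTQ records (no wrapped sequence lines), and the function names and output file names are made up for illustration.

```python
import gzip
from itertools import islice


def open_maybe_gzip(path, mode="rt"):
    """Open a plain or gzip-compressed text file, decided by the .gz extension."""
    return gzip.open(path, mode) if str(path).endswith(".gz") else open(path, mode)


def head_fastq(src, dst, n_reads=100):
    """Copy the first n_reads records (4 lines each) from src to dst.

    Returns the number of complete records actually written.
    Assumption: every record is exactly 4 lines long.
    """
    with open_maybe_gzip(src) as fin, open(dst, "w") as fout:
        lines = list(islice(fin, 4 * n_reads))
        fout.writelines(lines)
    return len(lines) // 4


# Hypothetical usage with the paired files from the log above; taking the
# same number of records from both files keeps the mates in sync:
#   head_fastq("sample_10_1m_R1.fq", "small_R1.fq", 100)
#   head_fastq("sample_10_1m_R2.fq", "small_R2.fq", 100)
```

If the crash still occurs with a handful of reads, the query input is unlikely to be the cause.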

jaimeortiz-david commented 2 years ago

Hi,

Thank you for your response. The crash happens immediately at the beginning of the query. The output files are empty, so I do not have more information to figure out the specific problem.

I have tried different input files, including simulated reads. How many sequences do you suggest I try as a minimum? I will cut down the number of sequences in the input file.

PS. On an additional note, I am curious to know if a reference database could be built using only raw reads from the micro-organisms of interest?

Funatiq commented 2 years ago

I am not sure what's going wrong. If the error occurs for every input you tried, the database could be corrupt. You could try to rebuild the database or try a different database / different genomes. Make sure you have enough disk space available to save the database. The query should work for any number of sequences (even a single sequence).
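For the disk space check, a minimal sketch using Python's standard library could look like this. The helper names and the `needed_gib` estimate are assumptions for illustration; the thread does not say how large the database will be.

```python
import shutil


def free_gib(path="."):
    """Free space, in GiB, on the filesystem that contains path."""
    return shutil.disk_usage(path).free / 2**30


def has_room(path, needed_gib):
    """True if at least needed_gib GiB are free where the database is saved.

    needed_gib is your own estimate of the database size.
    """
    return free_gib(path) >= needed_gib


# Hypothetical usage before rebuilding:
#   has_room("/local/workdir/metacache", needed_gib=50)
```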

You can build a database from sequence reads, but for this you might need to create your own taxonomy mapping files (see here: https://github.com/muellan/metacache/blob/master/docs/building.md#target-to-taxon-mapping).

jaimeortiz-david commented 2 years ago

Hi,

Thank you so much for your response. I will try to build the database again; maybe one of the reference genome files is corrupted.
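A quick way to test the compressed reference files for truncation or corruption is to decompress each one completely and see whether an error is raised. This is a sketch with a made-up function name, using only Python's standard `gzip` module; it checks the gzip container only, not whether the FASTA content inside is valid.

```python
import gzip
import zlib


def gzip_is_readable(path, chunk_size=1 << 20):
    """Return True if the gzip file at path decompresses to the end without error.

    Truncated files raise EOFError, files with a bad header or checksum raise
    OSError (gzip.BadGzipFile), and damaged compressed data raises zlib.error.
    A missing file also counts as not readable (FileNotFoundError is an OSError).
    """
    try:
        with gzip.open(path, "rb") as fh:
            while fh.read(chunk_size):
                pass
    except (OSError, EOFError, zlib.error):
        return False
    return True


# Hypothetical usage on one of the reference genomes:
#   gzip_is_readable("GCA_000233375.4_ICSASG_v2_genomic.fna.gz")
```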

Thank you for your guidance on creating a database directly from sequence reads. As an example, I am trying to build a database using simulated Illumina HiSeq paired-end short reads (150 bp) from chicken and salmon, generated with ART. Could I use both paired-end files as input for the database command? For example:

metacache build mydatabase fastq_folder -pairfiles

Finally, do I have to create only the assembly_summary.txt or do I also need to build my own accession2taxid?

Also, please find attached an example of my assembly_summary.txt file.

Thank you very much for your help.

Best wishes,

Jaime


reduced DB

assembly_accession                                                     taxid  organism_name
GCA_000233375.4_ICSASG_v2_genomic.fna.gz                               8030   Salmo salar
GCA_016700215.2_bGalGal1.pat.whiteleghornlayer.GRCg7w_genomic.fna.gz   9031   Gallus gallus

Funatiq commented 2 years ago

> Could I use both paired-end files as input for the database command?

When building a database all sequences are processed separately, so it is not possible to use the paired information.

> Finally, do I have to create only the assembly_summary.txt or do I also need to build my own accession2taxid?

Either is fine. The assembly_summary.txt at the end of your post should be sufficient. MetaCache's build process will tell you if the sequences could be ranked using the files you provided.
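For anyone assembling such a mapping file by script, here is a small sketch that writes the three columns from the attached example. The tab separator and the exact header spelling are assumptions on my side; the mapping section of the MetaCache build documentation is the authority on the expected layout. `write_mapping` is a made-up helper name.

```python
import csv

# Rows taken from the example file posted above: (file name, taxid, organism).
ROWS = [
    ("GCA_000233375.4_ICSASG_v2_genomic.fna.gz", 8030, "Salmo salar"),
    ("GCA_016700215.2_bGalGal1.pat.whiteleghornlayer.GRCg7w_genomic.fna.gz",
     9031, "Gallus gallus"),
]


def write_mapping(path, rows):
    """Write a tab-separated mapping file with one header line and one row per target."""
    with open(path, "w", newline="") as fh:
        writer = csv.writer(fh, delimiter="\t", lineterminator="\n")
        writer.writerow(["assembly_accession", "taxid", "organism_name"])
        writer.writerows(rows)


# Hypothetical usage:
#   write_mapping("assembly_summary.txt", ROWS)
```

After building, the build output should show whether the sequences could be ranked with this file, as mentioned above.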

punnettsun commented 2 years ago

I just wanted to add that I also had this segmentation fault right after the database was "read". I was able to fix my specific issue by using a higher-memory node. Without it, it seems that the database is not read properly even though the log file says the database was read.
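A rough pre-flight check along these lines is to compare the size of the database files on disk with the memory available on the node. This is only a sketch: the idea that memory use roughly tracks on-disk size is my assumption, the `headroom` factor is arbitrary, and `available_memory_bytes` relies on `sysconf` names that exist on Linux but not on every platform.

```python
import os


def total_size(paths):
    """Sum of the on-disk sizes, in bytes, of the given database files."""
    return sum(os.path.getsize(p) for p in paths)


def available_memory_bytes():
    """Currently available physical memory in bytes (Linux-specific sysconf names)."""
    return os.sysconf("SC_AVPHYS_PAGES") * os.sysconf("SC_PAGE_SIZE")


def likely_fits(db_bytes, avail_bytes, headroom=1.5):
    """True if the database, scaled by an arbitrary headroom factor, fits in memory."""
    return db_bytes * headroom <= avail_bytes


# Hypothetical usage, passing whatever files the build step produced:
#   likely_fits(total_size(db_files), available_memory_bytes())
```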