muellan / metacache

memory efficient, fast & precise taxonomic classification system for metagenomic read mapping
GNU General Public License v3.0

Segmentation fault when using --lowest species #28

Open donovan-h-parks opened 2 years ago

donovan-h-parks commented 2 years ago

Hi,

I've run into an issue where MetaCache runs as expected using the following parameters, but crashes with "Command terminated by signal 11" when the -lowest species flag is added:

-pairfiles -no-map -taxids -lineage -separate-cols -threads 32 -abundances profile.tsv -abundance-per species -out classification.log

Is there a set of incompatible flags I'm using or is it possible that using the -lowest flag has uncovered a bug?

Thanks, Donovan

donovan-h-parks commented 2 years ago

Interestingly, everything works if I use -lowest subspecies, which makes me think there is a sequence that somehow has an invalid species name. I'm using the recommended RefSeq DB with the NCBI taxonomy as per the MetaCache instructions. I've noticed that NCBI sometimes has genomes with an invalid taxon ID (i.e. the NCBI taxonomy has been updated, but the associated genome data has not been updated yet). Perhaps a similar issue is happening here.

Funatiq commented 2 years ago

Hi Donovan! I'm not sure where bad taxonomy data could cause a segfault. Invalid taxon ids should be ignored by MetaCache. Does this error happen only with abundance output? Can you please check if the per-read output works (dropping -no-map) with default output / -taxids-only?

donovan-h-parks commented 2 years ago

Can I send you the data that is causing the bug? It is ~100 GB, but I can upload it to an FTP site if you can make one available.