soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
MIT License

LCA fails with segmentation fault #703

Open itsmisterbrown opened 1 year ago

itsmisterbrown commented 1 year ago

Expected Behavior

Taxonomy assignment of viral OTU sequences (nucleotide) using the 2bLCA method against a custom-formatted amino acid database from IMG/VR

Current Behavior

The LCA step dies with a segmentation fault on a small test dataset that has previously run successfully against Antônio Camargo's ICTV MMseqs2 protein database (https://github.com/apcamargo/ictv-mmseqs2-protein-database).

For reference, I have also allocated 40 cores and 700 GB of RAM to this job, which fails after consuming only 178 GB of memory.

Steps to Reproduce (for bugs)

Please make sure to execute the reproduction steps with newly recreated and empty tmp folders.

I have formatted the IMG/VR v4 7.1 AA database as recommended (https://github.com/soedinglab/MMseqs2/wiki#create-a-seqtaxdb-by-manual-annotation-of-a-sequence-database) and I've created a custom taxdump using taxonkit. The custom taxdb was created without issue:

mmseqs createdb --dbtype 1 IMGVR_all_proteins-high_confidence.faa.gz IMG_tax_db/IMG_tax_db
createdb --dbtype 1 IMGVR_all_proteins-high_confidence.faa.gz IMG_tax_db/IMG_tax_db 

MMseqs Version:         14.7e284
Database type           1
Shuffle input database  true
Createdb mode           0
Write lookup file       1
Offset of numeric ids   0
Compressed              0
Verbosity               3

Converting sequences
[112567430] 8m 8s 166ms
Time for merging to IMG_tax_db_h: 0h 0m 39s 840ms
Time for merging to IMG_tax_db: 0h 1m 54s 537ms
Database type: Aminoacid
Time for processing: 0h 14m 27s 634ms
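
For reference, the file passed to --tax-mapping-file in the next step is, per the wiki page linked above, a two-column tab-separated mapping from each sequence identifier (as it appears in the FASTA headers) to a taxid from the custom taxdump. A purely illustrative excerpt (identifiers and the second taxid invented for the example) would look like:

IMGVR_UViG_0000001|protein_1	1446979566
IMGVR_UViG_0000002|protein_2	12345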

#integrate all into a complete mmseqs2 taxdb
mmseqs createtaxdb IMG_tax_db/IMG_tax_db /home/bbrow6/tmp --ncbi-tax-dump IMG_taxdump --tax-mapping-file UVIG_taxid_mapping_cleaned

createtaxdb IMG_tax_db/IMG_tax_db /home/bbrow6/tmp --ncbi-tax-dump IMG_taxdump --tax-mapping-file UVIG_taxid_mapping_cleaned 

MMseqs Version:         14.7e284
NCBI tax dump directory IMG_taxdump
Taxonomy mapping file   UVIG_taxid_mapping_cleaned
Taxonomy mapping mode   0
Taxonomy db mode        1
Threads                 28
Verbosity               3

Loading nodes file ... Done, got 6986 nodes
Loading merged file ... Done, added 0 merged nodes.
Loading names file ... Done
Init RMQ ...Done

The job was submitted with the following batch script, including parameters:

#PBS -M bryan.brown@seattlechildrens.org
#PBS -m a
#PBS -l mem=700gb
#PBS -l nodes=1:ppn=40
#PBS -P a675a67f-9204-4f66-9785-891b95c7d3da
#PBS -q paidq
#PBS -o /home/bbrow6/script_output/job-mmseqs_easytax_050523.out
#PBS -e /home/bbrow6/script_error/job-mmseqs_easytax_050523.err

cd /home/bbrow6/taxonomy_stuffs
export DBs=/home/bbrow6/JGI/IMG_VR_2022_12_19_7.1/IMG_tax_db
export OTU_dir=/home/bbrow6/vaginal_virome/Run_021723/identified_viral_sequences/OTUs/geNomad/genomad_output_1000bps/clustered_spades_cross_assembly_contigs_gt1000bps_summary/

source activate mmseqs2
module load OpenMPI

mmseqs easy-taxonomy $OTU_dir/clustered_spades_cross_assembly_contigs_gt1000bps_virus.fna $DBs/IMG_tax_db vag_taxonomy_results_IMG tmp -e 1e-5 -s 6 --blacklist "" --tax-lineage 1
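
For context (a sketch, not part of the original job): easy-taxonomy is a wrapper that roughly chains createdb, taxonomy (search plus lca) and createtsv, so the same run can be expressed modularly. That keeps the intermediate databases outside of tmp and makes it easier to re-run only the step that crashes; the database and output names below are placeholders:

mmseqs createdb $OTU_dir/clustered_spades_cross_assembly_contigs_gt1000bps_virus.fna queryDB
mmseqs taxonomy queryDB $DBs/IMG_tax_db taxResultDB tmp -e 1e-5 -s 6 --blacklist "" --tax-lineage 1
mmseqs createtsv queryDB taxResultDB vag_taxonomy_results_IMG.tsv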

MMseqs Output (for bugs)

Please make sure to also post the complete output of MMseqs. You can use gist.github.com for large output.

Full output and error attached below

tmp/10336174962539687461/taxonomy_tmp/11653652317365833767/tmp_taxonomy/6923600097584969791/taxonomy.sh: line 58: 78000 Segmentation fault (core dumped) "$MMSEQS" lca "${TARGET}" "${LCAIN}" "${RESULTS}" ${LCA_PAR}
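
Since the crash happens inside the lca module itself, one way to narrow it down (a debugging sketch; <LCAIN_resultDB> is a placeholder for whatever ${LCAIN} points to in the failing taxonomy.sh call) is to re-run just that call with a single thread, or under gdb to capture a backtrace:

mmseqs lca $DBs/IMG_tax_db <LCAIN_resultDB> lca_debug --threads 1 -v 3
# or, to get a stack trace of the segfault:
gdb --args mmseqs lca $DBs/IMG_tax_db <LCAIN_resultDB> lca_debug --threads 1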

Context

Providing context helps us come up with a solution and improve our documentation for the future.

Your Environment

Include as many relevant details about the environment you experienced the bug in.

job-mmseqs_easytax_050523_error.txt job-mmseqs_easytax_050523_out.txt

milot-mirdita commented 1 year ago

How large is the database you created? Would it be possible to share it?

What does your tax mapping (UVIG_taxid_mapping_cleaned) look like? It seems to contain some very large taxid values (1446979566). Maybe I didn't correctly consider that they could be so large.
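
Could you check which taxids in the mapping are missing from your custom nodes.dmp? Something like this should work (a sketch; it assumes UVIG_taxid_mapping_cleaned is the two-column sequence-identifier/taxid file and that the taxonkit dump uses the standard NCBI nodes.dmp field separators):

# largest taxids present in the mapping
cut -f2 UVIG_taxid_mapping_cleaned | sort -un | tail
# taxids used in the mapping but absent from the custom nodes.dmp
cut -d'|' -f1 IMG_taxdump/nodes.dmp | tr -d '\t ' | sort -u > known_taxids.txt
cut -f2 UVIG_taxid_mapping_cleaned | sort -u | comm -23 - known_taxids.txt | head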

yosei-yung commented 12 months ago

I got a similar error to itsmisterbrown's: the LCA step dies with a segmentation fault. Here is my command line, and I have also attached my log and error files: out.txt err.txt

   mmseqs easy-taxonomy \
    test.fasta nr.smag.mmetsp.gvog.faaDB \
    DB_NR.SMAG.DB_tax_result_test \
    tmp \
    --orf-filter 0 \
    --threads 16 \
    --lca-ranks superkingdom,phylum,class,order,family,genus \
    --split-memory-limit 500G

Please help me find out what is wrong with my command.