soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
MIT License

easy-taxonomy error when using GTDB #806

Open AstrobioMike opened 10 months ago

AstrobioMike commented 10 months ago

Thanks for maintaining this great software! I'm having an issue with easy-taxonomy using GTDB (but not NCBI with the same query input), described below. I'm using a conda install of version 15.6f452.

Thanks for any help!

Expected Behavior

Completing without error

Current Behavior

Fails at aggregatetaxweights with the following:

Missing key 0 in tax result                                       ] 0.00% 1 eta -
Error: aggregatetaxweights died
Error: Search died

Full log here: easy-tax-full-log-error.txt

Steps to Reproduce (for bugs)

Install

mamba create -y -n mmseqs2 -c conda-forge -c bioconda -c defaults mmseqs2==15.6f452
conda activate mmseqs2

DB setup

mmseqs databases GTDB mmseqs2-GTDB-db tmp

Making small test data

wget -O e-coli.fasta.gz https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/005/845/GCA_000005845.2_ASM584v2/GCA_000005845.2_ASM584v2_genomic.fna.gz

gunzip e-coli.fasta.gz

grep -c ">" e-coli.fasta
# there is only one contig, so safe to just pull some lines

printf ">contig_1\n" > contigs.fasta
sed -n '100,1200p' e-coli.fasta >> contigs.fasta
printf ">contig_2\n" >> contigs.fasta
sed -n '20000,20600p' e-coli.fasta >> contigs.fasta
printf ">contig_3\n" >> contigs.fasta
sed -n '26000,26200p' e-coli.fasta >> contigs.fasta
# that's 3 contigs: about 88,000 bp, 48,000 bp, and 16,000 bp
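
The contig sizes quoted in the comment above can be checked directly from the file. This is a minimal sketch, assuming a POSIX awk is available; `contig_lengths` is a hypothetical helper name used for illustration and is not part of MMseqs2.

```shell
# contig_lengths: print "<name><TAB><length>" for each record in a FASTA file.
# Hypothetical helper for illustration only; not an MMseqs2 command.
contig_lengths() {
    awk '
        /^>/ {
            # A new header: flush the previous record, then start counting again.
            if (name != "") printf "%s\t%d\n", name, len
            name = substr($1, 2)   # first word of the header, without ">"
            len = 0
            next
        }
        { len += length($0) }      # sequence line: add its length
        END { if (name != "") printf "%s\t%d\n", name, len }
    ' "$1"
}

# Usage:
#   contig_lengths contigs.fasta
```

If the sequence lines in e-coli.fasta are 80 characters wide (an assumption about the downloaded file, not something stated in this report), the three `sed` ranges of 1,101, 601, and 201 lines would give lengths close to the figures in the comment.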

Running the program

mmseqs easy-taxonomy contigs.fasta mmseqs2-GTDB-db GTDB-tax-result tax-tmp \
       --threads 20 --tax-lineage 1 --compressed 1 --remove-tmp-files 0

MMseqs Output (for bugs)

Fails at aggregatetaxweights with the following:

Missing key 0 in tax result                                       ] 0.00% 1 eta -
Error: aggregatetaxweights died
Error: Search died

Full log here: easy-tax-full-log-error.txt
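
Because the run used `--remove-tmp-files 0`, the intermediate files should still be in `tax-tmp`. One cheap first check is to list any zero-byte files there. This is only a sketch: it assumes, without confirmation from the log, that an empty or truncated intermediate could be related to the "Missing key 0" message. `list_empty_files` is a hypothetical helper name, not an MMseqs2 command.

```shell
# list_empty_files: print every zero-byte regular file under a directory.
# Hypothetical helper for illustration only; not an MMseqs2 command.
list_empty_files() {
    # POSIX find counts size in 512-byte blocks, rounded up,
    # so "-size 0" matches only files that are completely empty.
    find "$1" -type f -size 0 -print | sort
}

# Usage (tax-tmp is the temporary directory from the command above):
#   list_empty_files tax-tmp
```

An empty file is not necessarily an error, so any hits are a starting point for inspection rather than a diagnosis.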

Context

Trying to get taxonomy output via GTDB with lineage info added. Using the NCBI database completes successfully on the same input query.

jasmezz commented 1 month ago

I am having a very similar error:

Current behaviour

After submitting an mmseqs taxonomy run, this subcommand is executed (and dies):

aggregatetaxweights mmseqs_database/database tmp1/14824571404584235274/orfs_h_swapped tmp1/14824571404584235274/orfs_tax tmp1/14824571404584235274/orfs_tax_aln SWH_IN_taxonomy/SWH_IN --lca-ranks kingdom,phylum,class,order,family,genus,species --tax-lineage 1 --compressed 1 --threads 12 -v 3

MMseqs output

  Missing key 0 in tax result
  tmp1/14824571404584235274/taxpercontig.sh: line 85: 206297 Aborted                 (core dumped) "$MMSEQS" aggregatetaxweights "${TAX_SEQ_DB}" "${TMP_PATH}/orfs_h_swapped" "${TMP_PATH}/orfs_tax" "${TMP_PATH}/orfs_tax_aln" "${RESULTS}" ${AGGREGATETAX_PAR}
  Error: aggregatetaxweights died

Comment

I know that mmseqs taxonomy classification with GTDB needs at least 900 GB of RAM, so I am not surprised that your process died, @AstrobioMike. Since my error looks very similar (if not identical), I wonder whether even my 950 GB of RAM is not enough.
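
To compare a machine against the RAM figures mentioned in this comment, the installed memory can be read from the kernel. This is a minimal sketch, assuming a Linux system that exposes `/proc/meminfo`; `mem_total_gib` is a hypothetical helper name, and the optional file argument exists only so the function can be tried on a saved copy.

```shell
# mem_total_gib: print total RAM in whole GiB from a meminfo-style file.
# Hypothetical helper for illustration only; defaults to /proc/meminfo (Linux).
mem_total_gib() {
    # The MemTotal line reports its value in kB, so divide twice by 1024.
    awk '/^MemTotal:/ { printf "%d\n", $2 / 1024 / 1024 }' "${1:-/proc/meminfo}"
}

# Usage:
#   mem_total_gib
```

Note that a process terminated by the Linux out-of-memory killer is normally reported by the shell as "Killed", whereas the output above says "Aborted (core dumped)". Memory may therefore not be the whole explanation, although this report alone cannot settle that.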