soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
GNU General Public License v3.0

Prefilter (second round) dies during taxonomic classification with UniRef90 DB #838

Closed sean-workman closed 5 months ago

sean-workman commented 6 months ago

Expected Behavior

Taxonomic classification of contigs in my metagenomic assembly using the UniRef90 database.

Current Behavior

After a first round of prefilter, rescorediagonal is executed, some merge steps run, new tmp directories are created, and then the program dies partway through the second round of prefilter.

Steps to Reproduce (for bugs)

Downloaded the UniRef90 database with wget: wget https://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref90/uniref90.fasta.gz

Decompressed with gunzip, then ran createdb: mmseqs createdb uniref90.fasta uniref90

Augmented with taxonomic information (used --tax-db-mode 0 because createbintaxonomy kept crashing as well): mmseqs createtaxdb uniref90 tmp --tax-db-mode 0
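For reference, on clusters whose compute nodes have no internet access, createtaxdb can also be pointed at pre-downloaded taxonomy files instead of fetching them itself. A hedged sketch (file names are placeholders, not taken from this issue):

```shell
# Sketch, assuming the files were fetched beforehand on a node with internet:
#   taxdump/          - NCBI taxdump (names.dmp, nodes.dmp, merged.dmp, delnodes.dmp)
#   idmapping.subset  - two columns: accession <tab> NCBI taxid
mmseqs createtaxdb uniref90 tmp \
    --ncbi-tax-dump taxdump/ \
    --tax-mapping-file idmapping.subset \
    --tax-db-mode 0
```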

Created database for my query sequences: mmseqs createdb KLEB_PO07_megahit.fasta KLEB_PO07_megahitDB

Ran mmseqs taxonomy on cluster with slurm script:

#!/usr/bin/env bash

#SBATCH --job-name=KLEB_PO07_mmseqs
#SBATCH --cpus-per-task=32
#SBATCH --mem=150G
#SBATCH --time=0-3:00
#SBATCH --output=KLEB_PO07_mmseqs.log
#SBATCH --error=KLEB_PO07_mmseqs.err

module load mmseqs2/15-6f452
taxDB=/home/sdwork/scratch/metagenomics/uniref_db/uniref90

mmseqs taxonomy KLEB_PO07_megahitDB $taxDB KLEB_PO07_megahit_result tmp

MMseqs Output (for bugs)

Full output can be found in this gist.

I also see this output in my error file: tmp/1193166584733320518/tmp_taxonomy/17149912652888480377/tmp_hsp1/10699950925961740214/blastp.sh: line 135: 8379 Bus error (core dumped) $RUNNER "$MMSEQS" prefilter "$INPUT" "$TARGET" "$TMP_PATH/pref_$STEP" $PREFILTER_PAR -s "$SENS"

Context

I created metagenomic assemblies using megahit and metaSPAdes. I am trying to get MMseqs2 working to do taxonomic classification. I am running on Digital Research Alliance of Canada clusters.

Your Environment

Include as many relevant details about the environment you experienced the bug in.

I ran lscpu on a login node and got what is shown below, but the memory and CPUs that I had for the job were specified in the slurm job script shown above.

lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         46 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  32
  On-line CPU(s) list:   0-31
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20GHz
    CPU family:          6
    Model:               79
    Thread(s) per core:  2
    Core(s) per socket:  8
    Socket(s):           2
    Stepping:            1
    CPU(s) scaling MHz:  100%
    CPU max MHz:         3200.0000
    CPU min MHz:         1200.0000
    BogoMIPS:            6384.78
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 invpcid_single intel_ppin ssbd rsb_ctxsw ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a rdseed adx smap intel_pt xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm arat pln pts md_clear spec_ctrl intel_stibp flush_l1d
Virtualization features:
  Virtualization:        VT-x
Caches (sum of all):
  L1d:                   512 KiB (16 instances)
  L1i:                   512 KiB (16 instances)
  L2:                    4 MiB (16 instances)
  L3:                    50 MiB (2 instances)
NUMA:
  NUMA node(s):          2
  NUMA node0 CPU(s):     0-7,16-23
  NUMA node1 CPU(s):     8-15,24-31
sean-workman commented 6 months ago

I tried re-running with 250 GB of RAM requested and 32 threads specified. It is now telling me it would need 717G??

Create directory tmp
taxonomy KLEB_PO07_megahitDB /home/sdwork/scratch/metagenomics/uniref_db/uniref90 KLEB_PO07_megahit_result tmp --threads 32

MMseqs Version:                         GITDIR-NOTFOUND
ORF filter                              1
ORF filter e-value                      100
ORF filter sensitivity                  2
LCA mode                                3
Taxonomy output mode                    0
Majority threshold                      0.5
Vote mode                               1
LCA ranks
Column with taxonomic lineage           0
Compressed                              0
Threads                                 32
Verbosity                               3
Taxon blacklist                         12908:unclassified sequences,28384:other sequences
Substitution matrix                     aa:blosum62.out,nucl:nucleotide.out
Add backtrace                           false
Alignment mode                          1
Alignment mode                          0
Allow wrapped scoring                   false
E-value threshold                       1
Seq. id. threshold                      0
Min alignment length                    0
Seq. id. mode                           0
Alternative alignments                  0
Coverage threshold                      0
Coverage mode                           0
Max sequence length                     65535
Compositional bias                      1
Compositional bias                      1
Max reject                              5
Max accept                              30
Include identical seq. id.              false
Preload mode                            0
Pseudo count a                          substitution:1.100,context:1.400
Pseudo count b                          substitution:4.100,context:5.800
Score bias                              0
Realign hits                            false
Realign score bias                      -0.2
Realign max seqs                        2147483647
Correlation score weight                0
Gap open cost                           aa:11,nucl:5
Gap extension cost                      aa:1,nucl:2
Zdrop                                   40
Seed substitution matrix                aa:VTML80.out,nucl:nucleotide.out
Sensitivity                             2
k-mer length                            0
Target search mode                      0
k-score                                 seq:2147483647,prof:2147483647
Alphabet size                           aa:21,nucl:5
Max results per query                   300
Split database                          0
Split mode                              2
Split memory limit                      0
Diagonal scoring                        true
Exact k-mer matching                    0
Mask residues                           1
Mask residues probability               0.9
Mask lower case residues                0
Minimum diagonal score                  15
Selected taxa
Spaced k-mers                           1
Spaced k-mer pattern
Local temporary path
Rescore mode                            0
Remove hits by seq. id. and coverage    false
Sort results                            0
Mask profile                            1
Profile E-value threshold               0.001
Global sequence weighting               false
Allow deletions                         false
Filter MSA                              1
Use filter only at N seqs               0
Maximum seq. id. threshold              0.9
Minimum seq. id.                        0.0
Minimum score per column                -20
Minimum coverage                        0
Select N most diverse seqs              1000
Pseudo count mode                       0
Min codons in orf                       30
Max codons in length                    32734
Max orf gaps                            2147483647
Contig start mode                       2
Contig end mode                         2
Orf start mode                          1
Forward frames                          1,2,3
Reverse frames                          1,2,3
Translation table                       1
Translate orf                           0
Use all table starts                    false
Offset of numeric ids                   0
Create lookup                           0
Add orf stop                            false
Overlap between sequences               0
Sequence split mode                     1
Header split mode                       0
Chain overlapping alignments            0
Merge query                             1
Search type                             0
Prefilter mode                          0
Exhaustive search mode                  false
Filter results during exhaustive search 0
Strand selection                        1
LCA search mode                         false
Disk space limit                        0
MPI runner
Force restart with latest tmp           false
Remove temporary files                  false

extractorfs KLEB_PO07_megahitDB tmp/6964202514022042695/orfs_aa --min-length 30 --max-length 32734 --max-gaps 2147483647 --contig-start-mode 2 --contig-end-mode 2 --orf-start-mode 1 --forward-frames 1,2,3 --reverse-frames 1,2,3 --translation-table 1 --translate 1 --use-all-table-starts 0 --id-offset 0 --create-lookup 0 --threads 32 --compressed 0 -v 3

[=================================================================] 24.08K 1s 376ms
Time for merging to orfs_aa_h: 0h 0m 0s 504ms
Time for merging to orfs_aa: 0h 0m 0s 706ms
Time for processing: 0h 0m 6s 520ms
prefilter tmp/6964202514022042695/orfs_aa /home/sdwork/scratch/metagenomics/uniref_db/uniref90 tmp/6964202514022042695/orfs_pref --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --seed-sub-mat 'aa:VTML80.out,nucl:nucleotide.out' -s 2 -k 0 --target-search-mode 0 --k-score seq:2147483647,prof:2147483647 --alph-size aa:21,nucl:5 --max-seq-len 65535 --max-seqs 1 --split 0 --split-mode 2 --split-memory-limit 0 -c 0 --cov-mode 0 --comp-bias-corr 1 --comp-bias-corr-scale 1 --diag-score 0 --exact-kmer-matching 0 --mask 1 --mask-prob 0.9 --mask-lower-case 0 --min-ungapped-score 3 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --threads 32 --compressed 0 -v 3

Query database size: 627284 type: Aminoacid
Estimated memory consumption: 717G
Target database size: 187136236 type: Aminoacid
Index table k-mer threshold: 163 at k-mer size 7
Index table: counting k-mers
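Side note on the 717G estimate: the log above shows --split-memory-limit 0 (no cap), so the estimate assumes the whole UniRef90 index is held in memory at once. A hedged workaround is to cap the per-split memory so the target database is processed in chunks that fit the job allocation (the 100G value below is an example, chosen to leave headroom under the slurm --mem request):

```shell
# Sketch: same taxonomy call, but with prefilter memory capped so the
# target DB gets split instead of loaded whole.
mmseqs taxonomy KLEB_PO07_megahitDB $taxDB KLEB_PO07_megahit_result tmp \
    --split-memory-limit 100G --threads 32
```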
sean-workman commented 6 months ago

Trying with the easy-taxonomy workflow got me further, but after two rounds of prefiltering I ended up getting:

Error: Lca died
Error: taxonomy died
Error: Search died

Full MMseqs2 output logfile is here

The gdb output says:

Core was generated by `mmseqs lca /home/sdwork/scratch/metagenomics/uniref_db/uniref90 ez_tmp/88137780'.
Program terminated with signal SIGBUS, Bus error.
#0  0x00002adfa621cbf4 in ?? () from /cvmfs/soft.computecanada.ca/gentoo/2023/x86-64-v3/usr/lib64/libc.so.6
sean-workman commented 5 months ago

I managed to solve my own problem and it ended up being something very silly.

When using the easy-taxonomy workflow and getting to:

Error: Lca died
Error: taxonomy died
Error: Search died

My error output showed that my DB_mapping was empty. It was empty because the awk command in createindex.sh that populates it didn't find any matches between DB.lookup and the taxidmapping. This is because the UniProt IDs in DB.lookup were prepended with UniRef90_. I guess if I had used the full databases workflow that prefix might have been handled, but because I needed to do things manually (compute nodes on the cluster I'm working on have no internet connection) it wasn't.
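To illustrate the mismatch (the column layout below is an assumption for illustration, not taken from the actual createindex.sh): stripping the UniRef90_ prefix from the lookup accessions makes them joinable against a plain UniProt accession-to-taxid mapping.

```shell
# Toy example; a DB.lookup line is assumed to be: numeric-id <tab> accession <tab> file-no.
printf '0\tUniRef90_P0A7G6\t0\n' \
  | awk 'BEGIN { FS = OFS = "\t" } { sub(/^UniRef90_/, "", $2); print }'
# -> 0	P0A7G6	0
```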

Things are working great now! Thanks for this software!

milot-mirdita commented 5 months ago

Sorry, didn't get around to looking at this. Glad it works now. The "intended" way to do this would have been to use the databases workflow to download and create the database.

It has its own handling of UniRef- (and UniProt-) based headers, and should generally be slightly better, since it directly uses the information in the header instead of going through the idmapping.
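The header-based approach works because UniRef FASTA headers already embed the taxon id. A toy sketch with a hypothetical header (the exact extraction in databases.sh differs; this only shows the idea):

```shell
# UniRef headers look roughly like:
#   >UniRef90_P0A7G6 ... Tax=Escherichia coli TaxID=83333 RepID=P0A7G6
# so an "accession <tab> taxid" mapping can be read straight out of the headers.
printf '>UniRef90_P0A7G6 30S ribosomal protein Tax=Escherichia coli TaxID=83333 RepID=P0A7G6\n' \
  | awk '/^>/ && match($0, /TaxID=[0-9]+/) {
      print substr($1, 2) "\t" substr($0, RSTART + 6, RLENGTH - 6)
    }'
# -> UniRef90_P0A7G6	83333
```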

This is the code it executes to make the _mapping: https://github.com/soedinglab/MMseqs2/blob/998c50a01da760713ca2c7580801e94555d23c4d/data/workflow/databases.sh#L476-L483

Afterwards createtaxdb is called to set up the _taxonomy, which basically contains the NCBI taxdump.

sean-workman commented 5 months ago

No worries! It's always a good exercise to figure things out myself. I'm sure you're very busy, and this was a problem of my own making by not using the intended workflow. I did try to use the databases workflow initially, but unfortunately the login nodes with internet access on the cluster I am using don't have the resources to handle the size of the databases I wanted to use.

In the future I'll look to find a better workaround. With metabuli I just downloaded the pre-built database. I don't know if the resources for this are available but perhaps it would be worthwhile to do a similar thing here? Either way, thanks again for providing this excellent resource and good luck with CASP16! :)

milot-mirdita commented 5 months ago

We have also moved to prebuilt DBs for Foldseek. I don't think we would be able to keep up with the two-month release cycle of UniRef/UniProt though, so probably no prebuilt databases for MMseqs2.

Thanks a lot!