Closed sean-workman closed 5 months ago
I tried re-running with 250 GB RAM requested and 32 threads specified. It is now telling me it would need 717 G??
Create directory tmp
taxonomy KLEB_PO07_megahitDB /home/sdwork/scratch/metagenomics/uniref_db/uniref90 KLEB_PO07_megahit_result tmp --threads 32
MMseqs Version: GITDIR-NOTFOUND
ORF filter 1
ORF filter e-value 100
ORF filter sensitivity 2
LCA mode 3
Taxonomy output mode 0
Majority threshold 0.5
Vote mode 1
LCA ranks
Column with taxonomic lineage 0
Compressed 0
Threads 32
Verbosity 3
Taxon blacklist 12908:unclassified sequences,28384:other sequences
Substitution matrix aa:blosum62.out,nucl:nucleotide.out
Add backtrace false
Alignment mode 1
Alignment mode 0
Allow wrapped scoring false
E-value threshold 1
Seq. id. threshold 0
Min alignment length 0
Seq. id. mode 0
Alternative alignments 0
Coverage threshold 0
Coverage mode 0
Max sequence length 65535
Compositional bias 1
Compositional bias 1
Max reject 5
Max accept 30
Include identical seq. id. false
Preload mode 0
Pseudo count a substitution:1.100,context:1.400
Pseudo count b substitution:4.100,context:5.800
Score bias 0
Realign hits false
Realign score bias -0.2
Realign max seqs 2147483647
Correlation score weight 0
Gap open cost aa:11,nucl:5
Gap extension cost aa:1,nucl:2
Zdrop 40
Seed substitution matrix aa:VTML80.out,nucl:nucleotide.out
Sensitivity 2
k-mer length 0
Target search mode 0
k-score seq:2147483647,prof:2147483647
Alphabet size aa:21,nucl:5
Max results per query 300
Split database 0
Split mode 2
Split memory limit 0
Diagonal scoring true
Exact k-mer matching 0
Mask residues 1
Mask residues probability 0.9
Mask lower case residues 0
Minimum diagonal score 15
Selected taxa
Spaced k-mers 1
Spaced k-mer pattern
Local temporary path
Rescore mode 0
Remove hits by seq. id. and coverage false
Sort results 0
Mask profile 1
Profile E-value threshold 0.001
Global sequence weighting false
Allow deletions false
Filter MSA 1
Use filter only at N seqs 0
Maximum seq. id. threshold 0.9
Minimum seq. id. 0.0
Minimum score per column -20
Minimum coverage 0
Select N most diverse seqs 1000
Pseudo count mode 0
Min codons in orf 30
Max codons in length 32734
Max orf gaps 2147483647
Contig start mode 2
Contig end mode 2
Orf start mode 1
Forward frames 1,2,3
Reverse frames 1,2,3
Translation table 1
Translate orf 0
Use all table starts false
Offset of numeric ids 0
Create lookup 0
Add orf stop false
Overlap between sequences 0
Sequence split mode 1
Header split mode 0
Chain overlapping alignments 0
Merge query 1
Search type 0
Prefilter mode 0
Exhaustive search mode false
Filter results during exhaustive search 0
Strand selection 1
LCA search mode false
Disk space limit 0
MPI runner
Force restart with latest tmp false
Remove temporary files false
extractorfs KLEB_PO07_megahitDB tmp/6964202514022042695/orfs_aa --min-length 30 --max-length 32734 --max-gaps 2147483647 --contig-start-mode 2 --contig-end-mode 2 --orf-start-mode 1 --forward-frames 1,2,3 --reverse-frames 1,2,3 --translation-table 1 --translate 1 --use-all-table-starts 0 --id-offset 0 --create-lookup 0 --threads 32 --compressed 0 -v 3
[=================================================================] 24.08K 1s 376ms
Time for merging to orfs_aa_h: 0h 0m 0s 504ms
Time for merging to orfs_aa: 0h 0m 0s 706ms
Time for processing: 0h 0m 6s 520ms
prefilter tmp/6964202514022042695/orfs_aa /home/sdwork/scratch/metagenomics/uniref_db/uniref90 tmp/6964202514022042695/orfs_pref --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --seed-sub-mat 'aa:VTML80.out,nucl:nucleotide.out' -s 2 -k 0 --target-search-mode 0 --k-score seq:2147483647,prof:2147483647 --alph-size aa:21,nucl:5 --max-seq-len 65535 --max-seqs 1 --split 0 --split-mode 2 --split-memory-limit 0 -c 0 --cov-mode 0 --comp-bias-corr 1 --comp-bias-corr-scale 1 --diag-score 0 --exact-kmer-matching 0 --mask 1 --mask-prob 0.9 --mask-lower-case 0 --min-ungapped-score 3 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --threads 32 --compressed 0 -v 3
Query database size: 627284 type: Aminoacid
Estimated memory consumption: 717G
Target database size: 187136236 type: Aminoacid
Index table k-mer threshold: 163 at k-mer size 7
Index table: counting k-mers
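(One way to stay under a 250 GB allocation when the estimate comes out at 717G — a sketch, not something from the thread itself — is the --split-memory-limit option that appears in the parameter listing above; it caps how much of the target index is held in memory at once and splits the database accordingly:)

```shell
# Sketch: cap the prefilter index memory so MMseqs2 splits the target
# database into chunks instead of building one 717G index. 200G leaves
# headroom inside a 250 GB allocation; paths are the ones from the
# taxonomy command at the top of this log.
mmseqs taxonomy KLEB_PO07_megahitDB \
    /home/sdwork/scratch/metagenomics/uniref_db/uniref90 \
    KLEB_PO07_megahit_result tmp \
    --threads 32 --split-memory-limit 200G
```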
Trying with the easy-taxonomy workflow got me further, but after two rounds of prefiltering I ended up getting:
Error: Lca died
Error: taxonomy died
Error: Search died
Full MMseqs2 output logfile is here
The gdb output says:
Core was generated by `mmseqs lca /home/sdwork/scratch/metagenomics/uniref_db/uniref90 ez_tmp/88137780'.
Program terminated with signal SIGBUS, Bus error.
#0 0x00002adfa621cbf4 in ?? () from /cvmfs/soft.computecanada.ca/gentoo/2023/x86-64-v3/usr/lib64/libc.so.6
I managed to solve my own problem and it ended up being something very silly.
When using the easy-taxonomy workflow and getting to:
Error: Lca died
Error: taxonomy died
Error: Search died
My error output showed that my DB_mapping was empty. It was empty because the awk command in the createindex.sh that populates it didn't find any matches between the DB.lookup and the taxidmapping. This is because the UniProt IDs in the DB.lookup were prepended with UniRef90_. I guess if I had used the full databases workflow that might have been removed, but because I needed to do things manually (I'm working on a cluster where compute nodes have no internet connection) it wasn't.
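In case it helps someone else, a minimal sketch of that workaround (not the official script; it assumes the standard .lookup column layout of internal-id, accession, set-id, and a two-column taxidmapping of accession, taxid):

```shell
# Sketch: strip the UniRef90_ prefix from the accessions in DB.lookup so
# they match the plain UniProt accessions in the taxid mapping, then
# emit <internal-id> TAB <taxid> as the DB_mapping file.
awk 'NR==FNR { tax[$1] = $2; next }    # first file: accession -> taxid
     { acc = $2
       sub(/^UniRef90_/, "", acc)      # drop the UniRef90_ prefix
       if (acc in tax) print $1 "\t" tax[acc] }' \
    taxidmapping DB.lookup > DB_mapping
```

Depending on the MMseqs2 version, the resulting mapping may also need to be numerically sorted (e.g. with sort -n) before use.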
Things are working great now! Thanks for this software!
Sorry, I didn't get around to looking at this. Glad it works now. The "intended" way to do this would have been to use the databases workflow to download and create the database.
It has its own handling of UniRef (and UniProt) based headers, and should generally be slightly better, since it directly uses the information in the header instead of going through the idmapping.
This is the code it executes to make the _mapping:
https://github.com/soedinglab/MMseqs2/blob/998c50a01da760713ca2c7580801e94555d23c4d/data/workflow/databases.sh#L476-L483
Afterwards, createtaxdb is called to set up the _taxonomy, which basically contains the NCBI taxdump.
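For reference, the databases-workflow route described above comes down to a single command (run somewhere with internet access; "UniRef90" is one of the names listed by mmseqs databases):

```shell
# Downloads UniRef90, builds the sequence DB, and sets up the
# _mapping/_taxonomy files using the UniRef-aware header handling
# mentioned above.
mmseqs databases UniRef90 uniref90 tmp
```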
No worries! It's always a good exercise to figure things out myself. I'm sure you're very busy, and this was a problem of my own making for not using the intended workflow. I did try to use the databases workflow initially, but unfortunately the login nodes with internet access on the cluster I am using don't have the resources to handle the size of the databases I wanted to use.
In the future I'll look for a better workaround. With metabuli I just downloaded the pre-built database. I don't know if the resources for that are available here, but perhaps it would be worthwhile to do something similar? Either way, thanks again for providing this excellent resource and good luck with CASP16! :)
We have also moved to prebuilt DBs for Foldseek. I don't think we would be able to keep up with the two-month release cycles of UniRef/UniProt, though, so probably no prebuilt databases for MMseqs2.
Thanks a lot!
Expected Behavior
Taxonomic classification of contigs in my metagenomic assembly using the UniRef90 database.
Current Behavior
After a first round of prefilter, rescorediagonal is executed, some merge steps are executed, new tmp directories are created, and the program dies partway through the second round of prefilter.
Steps to Reproduce (for bugs)
Downloaded the UniRef90 database with wget:
wget https://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref90/uniref90.fasta.gz
Decompressed with gunzip, then ran createdb:
mmseqs createdb uniref90.fasta uniref90
Augmented with taxonomic information (used --tax-db-mode 0 because createbintaxonomy kept crashing as well):
mmseqs createtaxdb uniref90 tmp --tax-db-mode 0
Created database for my query sequences:
mmseqs createdb KLEB_PO07_megahit.fasta KLEB_PO07_megahitDB
Ran mmseqs taxonomy on cluster with slurm script:
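(The actual submission script is in the gist linked below; purely as an illustration — every value other than the 250 GB / 32 threads and the taxonomy command from the log above is hypothetical — it would have roughly this shape:)

```shell
#!/bin/bash
#SBATCH --mem=250G          # memory requested in the re-run described above
#SBATCH --cpus-per-task=32
#SBATCH --time=24:00:00     # hypothetical walltime

module load mmseqs2         # hypothetical module name on the cluster

mmseqs taxonomy KLEB_PO07_megahitDB \
    /home/sdwork/scratch/metagenomics/uniref_db/uniref90 \
    KLEB_PO07_megahit_result tmp --threads 32
```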
MMseqs Output (for bugs)
Full output can be found in this gist.
I also see this output in my error file:
tmp/1193166584733320518/tmp_taxonomy/17149912652888480377/tmp_hsp1/10699950925961740214/blastp.sh: line 135: 8379 Bus error (core dumped) $RUNNER "$MMSEQS" prefilter "$INPUT" "$TARGET" "$TMP_PATH/pref_$STEP" $PREFILTER_PAR -s "$SENS"
Context
I created metagenomic assemblies using megahit and metaSPAdes. I am trying to get MMseqs2 working to do taxonomic classification. I am running on Digital Research Alliance of Canada clusters.
Your Environment
I ran lscpu on a login node and got what is shown below, but the memory and CPUs that I had for the job were specified in the slurm job script shown above.