Open jolespin opened 11 months ago
Currently, everything is tailored to the NCBI taxonomy format (taxdump). For GTDB we transform their taxonomy to a names/nodes.dmp format). If your taxonomy is NCBI based, then you can just use their taxdump.
Additionally, you need a mapping file pointing from a database accession to a taxid. For this create a TSV files that contains the same names as you can find in the .lookup
(usually the first word of the FASTA header) and a second column with the taxid.
You can give both of these to createtaxdb
with the --tax-mapping-file
and --ncbi-tax-dump
and it will create the MMseqs2 taxonomy files.
Will the conversion error out if there are missing taxonomic levels and the taxids are alphanumeric instead of numeric?
taxids are numeric. Your tax tree in the names and nodes.dmp needs to have full lineages up to the tree and also not have cycles. The labels themselves are not important. I would suggest to use same same ranks as we assume the standard ranks in a few places.
I have prebuilt MMSEQS databases and I'm adding taxonomy information onto them using custom taxdump and taxon mapping files. This shouldn't be an issue adding taxonomy information to a MMSEQS database post hoc correct?
Yes, you can point createtaxdb
to your existing database with --ncbi-tax-dump
and --tax-mapping-file
as described above. In fact that's how the databases
commands work, they download sequences, create a db with createdb
and then manually none, one or both of the inputs for createtaxdb
and call it.
I have a few missing fields in my taxonomy levels. I'm assuming that's what is causing this error:
easy-taxonomy test_euk.fasta MicroEuk100.eukaryota_odb10 test_results/tax tmp/
MMseqs Version: 14.7e284
ORF filter 0
ORF filter e-value 100
ORF filter sensitivity 2
LCA mode 3
Majority threshold 0.5
Vote mode 1
LCA ranks
Column with taxonomic lineage 0
Compressed 0
Threads 128
Verbosity 3
Taxon blacklist 12908:unclassified sequences,28384:other sequences
Substitution matrix aa:blosum62.out,nucl:nucleotide.out
Add backtrace false
Alignment mode 0
Alignment mode 0
Allow wrapped scoring false
E-value threshold 0.001
Seq. id. threshold 0
Min alignment length 0
Seq. id. mode 0
Alternative alignments 0
Coverage threshold 0
Coverage mode 0
Max sequence length 65535
Compositional bias 1
Compositional bias 1
Max reject 2147483647
Max accept 2147483647
Include identical seq. id. false
Preload mode 0
Pseudo count a substitution:1.100,context:1.400
Pseudo count b substitution:4.100,context:5.800
Score bias 0
Realign hits false
Realign score bias -0.2
Realign max seqs 2147483647
Correlation score weight 0
Gap open cost aa:11,nucl:5
Gap extension cost aa:1,nucl:2
Zdrop 40
Seed substitution matrix aa:VTML80.out,nucl:nucleotide.out
Sensitivity 4
k-mer length 0
k-score seq:2147483647,prof:2147483647
Alphabet size aa:21,nucl:5
Max results per query 300
Split database 0
Split mode 2
Split memory limit 0
Diagonal scoring true
Exact k-mer matching 0
Mask residues 1
Mask residues probability 0.9
Mask lower case residues 0
Minimum diagonal score 15
Selected taxa
Spaced k-mers 1
Spaced k-mer pattern
Local temporary path
Rescore mode 0
Remove hits by seq. id. and coverage false
Sort results 0
Mask profile 1
Profile E-value threshold 0.001
Global sequence weighting false
Allow deletions false
Filter MSA 1
Use filter only at N seqs 0
Maximum seq. id. threshold 0.9
Minimum seq. id. 0.0
Minimum score per column -20
Minimum coverage 0
Select N most diverse seqs 1000
Pseudo count mode 0
Gap pseudo count 10
Min codons in orf 30
Max codons in length 32734
Max orf gaps 2147483647
Contig start mode 2
Contig end mode 2
Orf start mode 1
Forward frames 1,2,3
Reverse frames 1,2,3
Translation table 1
Translate orf 0
Use all table starts false
Offset of numeric ids 0
Create lookup 0
Add orf stop false
Overlap between sequences 0
Sequence split mode 1
Header split mode 0
Chain overlapping alignments 0
Merge query 1
Search type 0
Exhaustive search mode false
Filter results during exhaustive search 0
Strand selection 1
LCA search mode false
Disk space limit 0
MPI runner
Force restart with latest tmp false
Remove temporary files true
Report mode 0
Alignment format 0
Format alignment output query,target,fident,alnlen,mismatch,gapopen,qstart,qend,tstart,tend,evalue,bits
Database output false
First sequence as representative false
Target column 1
Add full header false
Sequence source 0
Database type 0
Shuffle input database true
Createdb mode 1
Write lookup file 0
createdb test_euk.fasta tmp//15404483323509150572/query --dbtype 0 --shuffle 1 --createdb-mode 1 --write-lookup 0 --id-offset 0 --compressed 0 -v 3
Shuffle database cannot be combined with --createdb-mode 0
We recompute with --shuffle 0
Converting sequences
Multiline fasta can not be combined with --createdb-mode 0
We recompute with --createdb-mode 1
Time for merging to query_h: 0h 0m 0s 7ms
Time for merging to query: 0h 0m 0s 7ms
Time for merging to query_h: 0h 0m 0s 5ms
Time for merging to query: 0h 0m 0s 5ms
Database type: Nucleotide
Time for processing: 0h 0m 0s 67ms
Create directory tmp//15404483323509150572/taxonomy_tmp
taxonomy tmp//15404483323509150572/query MicroEuk100.eukaryota_odb10 tmp//15404483323509150572/result tmp//15404483323509150572/taxonomy_tmp --tax-output-mode 2 --remove-tmp-files 1
extractorfs tmp//15404483323509150572/query tmp//15404483323509150572/taxonomy_tmp/7054630145456894787/orfs_aa --min-length 30 --max-length 32734 --max-gaps 2147483647 --contig-start-mode 2 --contig-end-mode 2 --orf-start-mode 1 --forward-frames 1,2,3 --reverse-frames 1,2,3 --translation-table 1 --translate 1 --use-all-table-starts 0 --id-offset 0 --create-lookup 0 --threads 128 --compressed 0 -v 3
[=================================================================] 100.00% 10 0s 71ms
Time for merging to orfs_aa_h: 0h 0m 0s 759ms
Time for merging to orfs_aa: 0h 0m 0s 726ms
Time for processing: 0h 0m 3s 54ms
prefilter tmp//15404483323509150572/taxonomy_tmp/7054630145456894787/orfs_aa MicroEuk100.eukaryota_odb10.idx tmp//15404483323509150572/taxonomy_tmp/7054630145456894787/orfs_pref --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --seed-sub-mat 'aa:VTML80.out,nucl:nucleotide.out' -s 2 -k 0 --k-score seq:2147483647,prof:2147483647 --alph-size aa:21,nucl:5 --max-seq-len 65535 --max-seqs 1 --split 0 --split-mode 2 --split-memory-limit 0 -c 0 --cov-mode 0 --comp-bias-corr 1 --comp-bias-corr-scale 1 --diag-score 0 --exact-kmer-matching 0 --mask 1 --mask-prob 0.9 --mask-lower-case 0 --min-ungapped-score 3 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --threads 128 --compressed 0 -v 3
Index version: 16
Generated by: 14.7e284
ScoreMatrix: VTML80.out
Query database size: 2310 type: Aminoacid
Estimated memory consumption: 7G
Target database size: 713072 type: Aminoacid
Process prefiltering step 1 of 1
k-mer similarity threshold: 145
Starting prefiltering scores calculation (step 1 of 1)
Query db start 1 to 2310
Target db start 1 to 713072
[=================================================================] 100.00% 2.31K 0s 200ms
8.191356 k-mers per position
1212 DB matches per sequence
0 overflows
0 queries produce too many hits (truncated result)
0 sequences passed prefiltering per query sequence
0 median result list length
2301 sequences with 0 size result lists
Time for merging to orfs_pref: 0h 0m 0s 457ms
Time for processing: 0h 0m 1s 487ms
rescorediagonal tmp//15404483323509150572/taxonomy_tmp/7054630145456894787/orfs_aa MicroEuk100.eukaryota_odb10.idx tmp//15404483323509150572/taxonomy_tmp/7054630145456894787/orfs_pref tmp//15404483323509150572/taxonomy_tmp/7054630145456894787/orfs_aln --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --rescore-mode 2 --wrapped-scoring 0 --filter-hits 0 -e 100 -c 0 -a 0 --cov-mode 0 --min-seq-id 0 --min-aln-len 0 --seq-id-mode 0 --add-self-matches 0 --sort-results 0 --db-load-mode 0 --threads 128 --compressed 0 -v 3
Index version: 16
Generated by: 14.7e284
ScoreMatrix: VTML80.out
[=================================================================] 100.00% 2.31K 0s 3ms
Time for merging to orfs_aln: 0h 0m 0s 469ms
Time for processing: 0h 0m 1s 239ms
createsubdb tmp//15404483323509150572/taxonomy_tmp/7054630145456894787/orfs_aln.list tmp//15404483323509150572/taxonomy_tmp/7054630145456894787/orfs_aa tmp//15404483323509150572/taxonomy_tmp/7054630145456894787/orfs_filter --subdb-mode 1 --id-mode 0 -v 3
Time for merging to orfs_filter: 0h 0m 0s 6ms
Time for processing: 0h 0m 0s 35ms
rmdb tmp//15404483323509150572/taxonomy_tmp/7054630145456894787/orfs_filter_h -v 3
Time for processing: 0h 0m 0s 5ms
createsubdb tmp//15404483323509150572/taxonomy_tmp/7054630145456894787/orfs_aln.list tmp//15404483323509150572/taxonomy_tmp/7054630145456894787/orfs_aa_h tmp//15404483323509150572/taxonomy_tmp/7054630145456894787/orfs_filter_h --subdb-mode 1 --id-mode 0 -v 3
Time for merging to orfs_filter_h: 0h 0m 0s 5ms
Time for processing: 0h 0m 0s 23ms
Create directory tmp//15404483323509150572/taxonomy_tmp/7054630145456894787/tmp_taxonomy
taxonomy tmp//15404483323509150572/taxonomy_tmp/7054630145456894787/orfs_filter MicroEuk100.eukaryota_odb10 tmp//15404483323509150572/taxonomy_tmp/7054630145456894787/orfs_tax tmp//15404483323509150572/taxonomy_tmp/7054630145456894787/tmp_taxonomy --tax-output-mode 2 --tax-lineage 0 --alignment-mode 1 -e 1 --max-rejected 5 --max-accept 30 -s 2 --spaced-kmer-mode 1 --min-length 30 --max-length 32734 --orf-start-mode 1 --remove-tmp-files 1
Create directory tmp//15404483323509150572/taxonomy_tmp/7054630145456894787/tmp_taxonomy/7007534916548188170/tmp_hsp1
search tmp//15404483323509150572/taxonomy_tmp/7054630145456894787/orfs_filter MicroEuk100.eukaryota_odb10 tmp//15404483323509150572/taxonomy_tmp/7054630145456894787/tmp_taxonomy/7007534916548188170/first tmp//15404483323509150572/taxonomy_tmp/7054630145456894787/tmp_taxonomy/7007534916548188170/tmp_hsp1 --alignment-mode 1 -e 1 --max-rejected 5 --max-accept 30 -s 2 --spaced-kmer-mode 1 --min-length 30 --max-length 32734 --orf-start-mode 1 --lca-search 1 --remove-tmp-files 1
prefilter tmp//15404483323509150572/taxonomy_tmp/7054630145456894787/orfs_filter MicroEuk100.eukaryota_odb10.idx tmp//15404483323509150572/taxonomy_tmp/7054630145456894787/tmp_taxonomy/7007534916548188170/tmp_hsp1/11550195074827057389/pref_0 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --seed-sub-mat 'aa:VTML80.out,nucl:nucleotide.out' -k 0 --k-score seq:2147483647,prof:2147483647 --alph-size aa:21,nucl:5 --max-seq-len 65535 --max-seqs 300 --split 0 --split-mode 2 --split-memory-limit 0 -c 0 --cov-mode 0 --comp-bias-corr 1 --comp-bias-corr-scale 1 --diag-score 1 --exact-kmer-matching 0 --mask 1 --mask-prob 0.9 --mask-lower-case 0 --min-ungapped-score 15 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --threads 128 --compressed 0 -v 3 -s 2.0
Index version: 16
Generated by: 14.7e284
ScoreMatrix: VTML80.out
Query database size: 9 type: Aminoacid
Estimated memory consumption: 7G
Target database size: 713072 type: Aminoacid
Process prefiltering step 1 of 1
k-mer similarity threshold: 145
Starting prefiltering scores calculation (step 1 of 1)
Query db start 1 to 9
Target db start 1 to 713072
[=================================================================] 100.00% 9 0s 6ms
10.636821 k-mers per position
10265 DB matches per sequence
0 overflows
0 queries produce too many hits (truncated result)
66 sequences passed prefiltering per query sequence
36 median result list length
0 sequences with 0 size result lists
Time for merging to pref_0: 0h 0m 0s 38ms
Time for processing: 0h 0m 0s 251ms
lcaalign tmp//15404483323509150572/taxonomy_tmp/7054630145456894787/orfs_filter MicroEuk100.eukaryota_odb10.idx tmp//15404483323509150572/taxonomy_tmp/7054630145456894787/tmp_taxonomy/7007534916548188170/tmp_hsp1/11550195074827057389/pref_0 tmp//15404483323509150572/taxonomy_tmp/7054630145456894787/tmp_taxonomy/7007534916548188170/first --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' -a 0 --alignment-mode 1 --alignment-output-mode 0 --wrapped-scoring 0 -e 1 --min-seq-id 0 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0 --cov-mode 0 --max-seq-len 65535 --comp-bias-corr 1 --comp-bias-corr-scale 1 --max-rejected 5 --max-accept 30 --add-self-matches 0 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --score-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --corr-score-weight 0 --gap-open aa:11,nucl:5 --gap-extend aa:1,nucl:2 --zdrop 40 --threads 128 --compressed 0 -v 3
Index version: 16
Generated by: 14.7e284
ScoreMatrix: VTML80.out
Compute score only
Query database size: 9 type: Aminoacid
Target database size: 713072 type: Aminoacid
[=================================================================] 100.00% 9 0s 44ms
Time for merging to first: 0h 0m 0s 37ms
128 alignments calculated
88 sequence pairs passed the thresholds (0.687500 of overall calculated)
9.777778 hits per query sequence
Time for processing: 0h 0m 0s 189ms
rmdb tmp//15404483323509150572/taxonomy_tmp/7054630145456894787/tmp_taxonomy/7007534916548188170/tmp_hsp1/11550195074827057389/pref_0 -v 3
Time for processing: 0h 0m 0s 18ms
rmdb tmp//15404483323509150572/taxonomy_tmp/7054630145456894787/tmp_taxonomy/7007534916548188170/tmp_hsp1/11550195074827057389/aln_0 -v 3
Time for processing: 0h 0m 0s 2ms
rmdb tmp//15404483323509150572/taxonomy_tmp/7054630145456894787/tmp_taxonomy/7007534916548188170/tmp_hsp1/11550195074827057389/input_0 -v 3
Time for processing: 0h 0m 0s 2ms
rmdb tmp//15404483323509150572/taxonomy_tmp/7054630145456894787/tmp_taxonomy/7007534916548188170/tmp_hsp1/11550195074827057389/aln_merge -v 3
Time for processing: 0h 0m 0s 2ms
lca MicroEuk100.eukaryota_odb10 tmp//15404483323509150572/taxonomy_tmp/7054630145456894787/tmp_taxonomy/7007534916548188170/first tmp//15404483323509150572/taxonomy_tmp/7054630145456894787/orfs_tax --blacklist '12908:unclassified sequences,28384:other sequences' --tax-lineage 0 --compressed 0 --threads 128 -v 3
MicroEuk100.eukaryota_odb10_mapping is empty. Rerun createtaxdb to recreate taxonomy mapping.
Error: Lca died
Error: taxonomy died
Error: Search died
Is there any way around the missing taxonomy levels?
mapping is empty sounds like something went wrong while creating this tsv files I mentioned.
Could you please write down the steps you took to generate the tax mapping?
I used the following command:
zcat source_taxonomy.tsv.gz | cut -f1,5,6,7,8,9,10 | taxonkit create-taxdump -A 1 -O taxdump/
My file looks like this:
(base) zcat source_taxonomy.tsv.gz | cut -f1,5,6,7,8,9,10 | head
id_source class order family genus species strain
Aalte1 Dothideomycetes Pleosporales Pleosporaceae Alternaria Alternaria alternata
Aaoar1 Dothideomycetes Pleosporales Aaosphaeria Aaosphaeria arxii
Abobi1 Agaricomycetes Polyporales Podoscyphaceae Abortiporus Abortiporus biennis
Abobie1 Agaricomycetes Polyporales Podoscyphaceae Abortiporus Abortiporus biennis
Abscae1 Mucoromycetes Mucorales Cunninghamellaceae Absidia Absidia caerulea
Absrep1 Mucoromycetes Mucorales Cunninghamellaceae Absidia Absidia repens
Acain1 Exobasidiomycetes Exobasidiales Cryptobasidiaceae Acaromyces Acaromyces ingoldii
acanthamoeba_castellanii_str_neff_gca_000313135 Longamoebia Acanthamoebidae Acanthamoeba Acanthamoeba castellanii Acanthamoeba castellanii str. Neff
Acastr1 Lecanoromycetes Acarosporales Acarosporaceae Acarospora Acarospora strigata
However, further down in the file there are certain taxonomic fields that are missing.
Once I created my taxdump files. I used the following to convert my MMSEQS sequence database to a taxonomy database:
mmseqs createtaxdb MicroEuk100.eukaryota_odb10 tmp/ --ncbi-tax-dump taxdump/ --tax-mapping-file taxdump/taxid.map
mmseqs createindex MicroEuk100.eukaryota_odb10 tmp/
How did you creat taxid.map? How does it look like?
head MicroEuk100.eukaryota_odb10_taxdump/taxid.map
NR_2364996 14049961
Colcu1 1188471730
Physo3 553026811
TARA_SOC_28_MAG_00056 67064504
EP00921 1935091368
MMETSP1440 1926664251
Armcep1 1876271683
EP00035 1932583615
Clasp332_1 745078353
Aspsim1_1 1802159425
Okay that might be correct, how does MicroEuk100.eukaryota_odb10.lookup
look like?
The first column of both these files needs to be 1-to-1 mappable with each other (that's what createtaxdb
does internally).
Here's the lookup file:
head MicroEuk100.eukaryota_odb10.lookup
0 bf501444362aba95481a93b6b17ab32c 0
1 b95eacdee5dfcda27848a8e3bb07eae4 0
2 a69c82fb56c07eb33c6c42e14af053e5 0
3 e4d260c9e1de7a1e59725cab6b4b755f 0
4 4e7deb5f4c366b755604bf978f62c243 0
5 abbf860c9ec0950ff9321b18dbf743f4 0
6 2534df79c21b64df7487af20bf606ef7 0
7 3602dc0f725a4ae5ff8b64904b720820 0
8 ec35315607346e9b4b22e6f16d089c5f 0
9 2098e357ea150fc29d5a6358b3f7bb97 0
@milot-mirdita Is that the correct formatting for the file? I suspect that it has to do with some taxonomic levels missing. Could that be throwing the error?
That does look wrong.
Each entry in the .lookup
(e.g. bf501444362aba95481a93b6b17ab32c
, the first one), needs to have an entry in your taxid.map
:
E.g. if this was a human sequence:
bf501444362aba95481a93b6b17ab32c 9606
and so on
And to further clarify: The bf501444362aba95481a93b6b17ab32c
is extracted from the FASTA header. Usually the first word before any whitespace after the >
.
Ok yea this is starting to make a little more sense now. bf501444362aba95481a93b6b17ab32c
refers to the protein id and NR_2364996
refers to the source organism from the two tables.
Have you tried building a MMSEQS taxonomy database from seqkit create-taxdump files by any chance?
I didn't try seqkit for taxonomy database building.
You can also map named based on the .source
files with --tax-mapping-mode 1
. This will match your taxid.map
to the .source
instead of .lookup
. The .source
contains one line for every input file that was given to createdb
. I.e. if each organism was one provided as one FASTA file, it would make it easier to map the taxids.
I'm building a database that I use with MetaEuk and I have taxonomy information but I'm not sure the correct formatting to use. Also, can there be missing fields to use the taxonomy feature? I have from class to strain in most cases.