soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
GNU General Public License v3.0
1.4k stars 195 forks source link

[Question] What are the requirements for adding taxonomy information to a MMSEQS2 database? #779

Open jolespin opened 11 months ago

jolespin commented 11 months ago

I'm building a database that I use with MetaEuk and I have taxonomy information but I'm not sure the correct formatting to use. Also, can there be missing fields to use the taxonomy feature? I have from class to strain in most cases.

milot-mirdita commented 11 months ago

Currently, everything is tailored to the NCBI taxonomy format (taxdump). For GTDB we transform their taxonomy to a names/nodes.dmp format). If your taxonomy is NCBI based, then you can just use their taxdump.

Additionally, you need a mapping file pointing from a database accession to a taxid. For this create a TSV files that contains the same names as you can find in the .lookup (usually the first word of the FASTA header) and a second column with the taxid.

You can give both of these to createtaxdb with the --tax-mapping-file and --ncbi-tax-dump and it will create the MMseqs2 taxonomy files.

jolespin commented 11 months ago

Will the conversion error out if there are missing taxonomic levels and the taxids are alphanumeric instead of numeric?

milot-mirdita commented 11 months ago

taxids are numeric. Your tax tree in the names and nodes.dmp needs to have full lineages up to the tree and also not have cycles. The labels themselves are not important. I would suggest to use same same ranks as we assume the standard ranks in a few places.

jolespin commented 10 months ago

I have prebuilt MMSEQS databases and I'm adding taxonomy information onto them using custom taxdump and taxon mapping files. This shouldn't be an issue adding taxonomy information to a MMSEQS database post hoc correct?

milot-mirdita commented 10 months ago

Yes, you can point createtaxdb to your existing database with --ncbi-tax-dump and --tax-mapping-file as described above. In fact that's how the databases commands work, they download sequences, create a db with createdb and then manually none, one or both of the inputs for createtaxdb and call it.

jolespin commented 10 months ago

I have a few missing fields in my taxonomy levels. I'm assuming that's what is causing this error:

easy-taxonomy test_euk.fasta MicroEuk100.eukaryota_odb10 test_results/tax tmp/

MMseqs Version:                         14.7e284
ORF filter                              0
ORF filter e-value                      100
ORF filter sensitivity                  2
LCA mode                                3
Majority threshold                      0.5
Vote mode                               1
LCA ranks
Column with taxonomic lineage           0
Compressed                              0
Threads                                 128
Verbosity                               3
Taxon blacklist                         12908:unclassified sequences,28384:other sequences
Substitution matrix                     aa:blosum62.out,nucl:nucleotide.out
Add backtrace                           false
Alignment mode                          0
Alignment mode                          0
Allow wrapped scoring                   false
E-value threshold                       0.001
Seq. id. threshold                      0
Min alignment length                    0
Seq. id. mode                           0
Alternative alignments                  0
Coverage threshold                      0
Coverage mode                           0
Max sequence length                     65535
Compositional bias                      1
Compositional bias                      1
Max reject                              2147483647
Max accept                              2147483647
Include identical seq. id.              false
Preload mode                            0
Pseudo count a                          substitution:1.100,context:1.400
Pseudo count b                          substitution:4.100,context:5.800
Score bias                              0
Realign hits                            false
Realign score bias                      -0.2
Realign max seqs                        2147483647
Correlation score weight                0
Gap open cost                           aa:11,nucl:5
Gap extension cost                      aa:1,nucl:2
Zdrop                                   40
Seed substitution matrix                aa:VTML80.out,nucl:nucleotide.out
Sensitivity                             4
k-mer length                            0
k-score                                 seq:2147483647,prof:2147483647
Alphabet size                           aa:21,nucl:5
Max results per query                   300
Split database                          0
Split mode                              2
Split memory limit                      0
Diagonal scoring                        true
Exact k-mer matching                    0
Mask residues                           1
Mask residues probability               0.9
Mask lower case residues                0
Minimum diagonal score                  15
Selected taxa
Spaced k-mers                           1
Spaced k-mer pattern
Local temporary path
Rescore mode                            0
Remove hits by seq. id. and coverage    false
Sort results                            0
Mask profile                            1
Profile E-value threshold               0.001
Global sequence weighting               false
Allow deletions                         false
Filter MSA                              1
Use filter only at N seqs               0
Maximum seq. id. threshold              0.9
Minimum seq. id.                        0.0
Minimum score per column                -20
Minimum coverage                        0
Select N most diverse seqs              1000
Pseudo count mode                       0
Gap pseudo count                        10
Min codons in orf                       30
Max codons in length                    32734
Max orf gaps                            2147483647
Contig start mode                       2
Contig end mode                         2
Orf start mode                          1
Forward frames                          1,2,3
Reverse frames                          1,2,3
Translation table                       1
Translate orf                           0
Use all table starts                    false
Offset of numeric ids                   0
Create lookup                           0
Add orf stop                            false
Overlap between sequences               0
Sequence split mode                     1
Header split mode                       0
Chain overlapping alignments            0
Merge query                             1
Search type                             0
Exhaustive search mode                  false
Filter results during exhaustive search 0
Strand selection                        1
LCA search mode                         false
Disk space limit                        0
MPI runner
Force restart with latest tmp           false
Remove temporary files                  true
Report mode                             0
Alignment format                        0
Format alignment output                 query,target,fident,alnlen,mismatch,gapopen,qstart,qend,tstart,tend,evalue,bits
Database output                         false
First sequence as representative        false
Target column                           1
Add full header                         false
Sequence source                         0
Database type                           0
Shuffle input database                  true
Createdb mode                           1
Write lookup file                       0

createdb test_euk.fasta tmp//15404483323509150572/query --dbtype 0 --shuffle 1 --createdb-mode 1 --write-lookup 0 --id-offset 0 --compressed 0 -v 3

Shuffle database cannot be combined with --createdb-mode 0
We recompute with --shuffle 0
Converting sequences
Multiline fasta can not be combined with --createdb-mode 0
We recompute with --createdb-mode 1
Time for merging to query_h: 0h 0m 0s 7ms
Time for merging to query: 0h 0m 0s 7ms

Time for merging to query_h: 0h 0m 0s 5ms
Time for merging to query: 0h 0m 0s 5ms
Database type: Nucleotide
Time for processing: 0h 0m 0s 67ms
Create directory tmp//15404483323509150572/taxonomy_tmp
taxonomy tmp//15404483323509150572/query MicroEuk100.eukaryota_odb10 tmp//15404483323509150572/result tmp//15404483323509150572/taxonomy_tmp --tax-output-mode 2 --remove-tmp-files 1

extractorfs tmp//15404483323509150572/query tmp//15404483323509150572/taxonomy_tmp/7054630145456894787/orfs_aa --min-length 30 --max-length 32734 --max-gaps 2147483647 --contig-start-mode 2 --contig-end-mode 2 --orf-start-mode 1 --forward-frames 1,2,3 --reverse-frames 1,2,3 --translation-table 1 --translate 1 --use-all-table-starts 0 --id-offset 0 --create-lookup 0 --threads 128 --compressed 0 -v 3

[=================================================================] 100.00% 10 0s 71ms
Time for merging to orfs_aa_h: 0h 0m 0s 759ms
Time for merging to orfs_aa: 0h 0m 0s 726ms
Time for processing: 0h 0m 3s 54ms
prefilter tmp//15404483323509150572/taxonomy_tmp/7054630145456894787/orfs_aa MicroEuk100.eukaryota_odb10.idx tmp//15404483323509150572/taxonomy_tmp/7054630145456894787/orfs_pref --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --seed-sub-mat 'aa:VTML80.out,nucl:nucleotide.out' -s 2 -k 0 --k-score seq:2147483647,prof:2147483647 --alph-size aa:21,nucl:5 --max-seq-len 65535 --max-seqs 1 --split 0 --split-mode 2 --split-memory-limit 0 -c 0 --cov-mode 0 --comp-bias-corr 1 --comp-bias-corr-scale 1 --diag-score 0 --exact-kmer-matching 0 --mask 1 --mask-prob 0.9 --mask-lower-case 0 --min-ungapped-score 3 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --threads 128 --compressed 0 -v 3

Index version: 16
Generated by:  14.7e284
ScoreMatrix:  VTML80.out
Query database size: 2310 type: Aminoacid
Estimated memory consumption: 7G
Target database size: 713072 type: Aminoacid
Process prefiltering step 1 of 1

k-mer similarity threshold: 145
Starting prefiltering scores calculation (step 1 of 1)
Query db start 1 to 2310
Target db start 1 to 713072
[=================================================================] 100.00% 2.31K 0s 200ms

8.191356 k-mers per position
1212 DB matches per sequence
0 overflows
0 queries produce too many hits (truncated result)
0 sequences passed prefiltering per query sequence
0 median result list length
2301 sequences with 0 size result lists
Time for merging to orfs_pref: 0h 0m 0s 457ms
Time for processing: 0h 0m 1s 487ms
rescorediagonal tmp//15404483323509150572/taxonomy_tmp/7054630145456894787/orfs_aa MicroEuk100.eukaryota_odb10.idx tmp//15404483323509150572/taxonomy_tmp/7054630145456894787/orfs_pref tmp//15404483323509150572/taxonomy_tmp/7054630145456894787/orfs_aln --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --rescore-mode 2 --wrapped-scoring 0 --filter-hits 0 -e 100 -c 0 -a 0 --cov-mode 0 --min-seq-id 0 --min-aln-len 0 --seq-id-mode 0 --add-self-matches 0 --sort-results 0 --db-load-mode 0 --threads 128 --compressed 0 -v 3

Index version: 16
Generated by:  14.7e284
ScoreMatrix:  VTML80.out
[=================================================================] 100.00% 2.31K 0s 3ms
Time for merging to orfs_aln: 0h 0m 0s 469ms
Time for processing: 0h 0m 1s 239ms
createsubdb tmp//15404483323509150572/taxonomy_tmp/7054630145456894787/orfs_aln.list tmp//15404483323509150572/taxonomy_tmp/7054630145456894787/orfs_aa tmp//15404483323509150572/taxonomy_tmp/7054630145456894787/orfs_filter --subdb-mode 1 --id-mode 0 -v 3

Time for merging to orfs_filter: 0h 0m 0s 6ms
Time for processing: 0h 0m 0s 35ms
rmdb tmp//15404483323509150572/taxonomy_tmp/7054630145456894787/orfs_filter_h -v 3

Time for processing: 0h 0m 0s 5ms
createsubdb tmp//15404483323509150572/taxonomy_tmp/7054630145456894787/orfs_aln.list tmp//15404483323509150572/taxonomy_tmp/7054630145456894787/orfs_aa_h tmp//15404483323509150572/taxonomy_tmp/7054630145456894787/orfs_filter_h --subdb-mode 1 --id-mode 0 -v 3

Time for merging to orfs_filter_h: 0h 0m 0s 5ms
Time for processing: 0h 0m 0s 23ms
Create directory tmp//15404483323509150572/taxonomy_tmp/7054630145456894787/tmp_taxonomy
taxonomy tmp//15404483323509150572/taxonomy_tmp/7054630145456894787/orfs_filter MicroEuk100.eukaryota_odb10 tmp//15404483323509150572/taxonomy_tmp/7054630145456894787/orfs_tax tmp//15404483323509150572/taxonomy_tmp/7054630145456894787/tmp_taxonomy --tax-output-mode 2 --tax-lineage 0 --alignment-mode 1 -e 1 --max-rejected 5 --max-accept 30 -s 2 --spaced-kmer-mode 1 --min-length 30 --max-length 32734 --orf-start-mode 1 --remove-tmp-files 1

Create directory tmp//15404483323509150572/taxonomy_tmp/7054630145456894787/tmp_taxonomy/7007534916548188170/tmp_hsp1
search tmp//15404483323509150572/taxonomy_tmp/7054630145456894787/orfs_filter MicroEuk100.eukaryota_odb10 tmp//15404483323509150572/taxonomy_tmp/7054630145456894787/tmp_taxonomy/7007534916548188170/first tmp//15404483323509150572/taxonomy_tmp/7054630145456894787/tmp_taxonomy/7007534916548188170/tmp_hsp1 --alignment-mode 1 -e 1 --max-rejected 5 --max-accept 30 -s 2 --spaced-kmer-mode 1 --min-length 30 --max-length 32734 --orf-start-mode 1 --lca-search 1 --remove-tmp-files 1

prefilter tmp//15404483323509150572/taxonomy_tmp/7054630145456894787/orfs_filter MicroEuk100.eukaryota_odb10.idx tmp//15404483323509150572/taxonomy_tmp/7054630145456894787/tmp_taxonomy/7007534916548188170/tmp_hsp1/11550195074827057389/pref_0 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --seed-sub-mat 'aa:VTML80.out,nucl:nucleotide.out' -k 0 --k-score seq:2147483647,prof:2147483647 --alph-size aa:21,nucl:5 --max-seq-len 65535 --max-seqs 300 --split 0 --split-mode 2 --split-memory-limit 0 -c 0 --cov-mode 0 --comp-bias-corr 1 --comp-bias-corr-scale 1 --diag-score 1 --exact-kmer-matching 0 --mask 1 --mask-prob 0.9 --mask-lower-case 0 --min-ungapped-score 15 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --threads 128 --compressed 0 -v 3 -s 2.0

Index version: 16
Generated by:  14.7e284
ScoreMatrix:  VTML80.out
Query database size: 9 type: Aminoacid
Estimated memory consumption: 7G
Target database size: 713072 type: Aminoacid
Process prefiltering step 1 of 1

k-mer similarity threshold: 145
Starting prefiltering scores calculation (step 1 of 1)
Query db start 1 to 9
Target db start 1 to 713072
[=================================================================] 100.00% 9 0s 6ms

10.636821 k-mers per position
10265 DB matches per sequence
0 overflows
0 queries produce too many hits (truncated result)
66 sequences passed prefiltering per query sequence
36 median result list length
0 sequences with 0 size result lists
Time for merging to pref_0: 0h 0m 0s 38ms
Time for processing: 0h 0m 0s 251ms
lcaalign tmp//15404483323509150572/taxonomy_tmp/7054630145456894787/orfs_filter MicroEuk100.eukaryota_odb10.idx tmp//15404483323509150572/taxonomy_tmp/7054630145456894787/tmp_taxonomy/7007534916548188170/tmp_hsp1/11550195074827057389/pref_0 tmp//15404483323509150572/taxonomy_tmp/7054630145456894787/tmp_taxonomy/7007534916548188170/first --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' -a 0 --alignment-mode 1 --alignment-output-mode 0 --wrapped-scoring 0 -e 1 --min-seq-id 0 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0 --cov-mode 0 --max-seq-len 65535 --comp-bias-corr 1 --comp-bias-corr-scale 1 --max-rejected 5 --max-accept 30 --add-self-matches 0 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --score-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --corr-score-weight 0 --gap-open aa:11,nucl:5 --gap-extend aa:1,nucl:2 --zdrop 40 --threads 128 --compressed 0 -v 3

Index version: 16
Generated by:  14.7e284
ScoreMatrix:  VTML80.out
Compute score only
Query database size: 9 type: Aminoacid
Target database size: 713072 type: Aminoacid
[=================================================================] 100.00% 9 0s 44ms
Time for merging to first: 0h 0m 0s 37ms
128 alignments calculated
88 sequence pairs passed the thresholds (0.687500 of overall calculated)
9.777778 hits per query sequence
Time for processing: 0h 0m 0s 189ms
rmdb tmp//15404483323509150572/taxonomy_tmp/7054630145456894787/tmp_taxonomy/7007534916548188170/tmp_hsp1/11550195074827057389/pref_0 -v 3

Time for processing: 0h 0m 0s 18ms
rmdb tmp//15404483323509150572/taxonomy_tmp/7054630145456894787/tmp_taxonomy/7007534916548188170/tmp_hsp1/11550195074827057389/aln_0 -v 3

Time for processing: 0h 0m 0s 2ms
rmdb tmp//15404483323509150572/taxonomy_tmp/7054630145456894787/tmp_taxonomy/7007534916548188170/tmp_hsp1/11550195074827057389/input_0 -v 3

Time for processing: 0h 0m 0s 2ms
rmdb tmp//15404483323509150572/taxonomy_tmp/7054630145456894787/tmp_taxonomy/7007534916548188170/tmp_hsp1/11550195074827057389/aln_merge -v 3

Time for processing: 0h 0m 0s 2ms
lca MicroEuk100.eukaryota_odb10 tmp//15404483323509150572/taxonomy_tmp/7054630145456894787/tmp_taxonomy/7007534916548188170/first tmp//15404483323509150572/taxonomy_tmp/7054630145456894787/orfs_tax --blacklist '12908:unclassified sequences,28384:other sequences' --tax-lineage 0 --compressed 0 --threads 128 -v 3

MicroEuk100.eukaryota_odb10_mapping is empty. Rerun createtaxdb to recreate taxonomy mapping.
Error: Lca died
Error: taxonomy died
Error: Search died

Is there any way around the missing taxonomy levels?

milot-mirdita commented 10 months ago

mapping is empty sounds like something went wrong while creating this tsv files I mentioned.

Could you please write down the steps you took to generate the tax mapping?

jolespin commented 10 months ago

I used the following command:

zcat source_taxonomy.tsv.gz | cut -f1,5,6,7,8,9,10 | taxonkit create-taxdump -A 1 -O taxdump/

My file looks like this:

(base) zcat source_taxonomy.tsv.gz | cut -f1,5,6,7,8,9,10 | head
id_source   class   order   family  genus   species strain
Aalte1  Dothideomycetes Pleosporales    Pleosporaceae   Alternaria  Alternaria alternata
Aaoar1  Dothideomycetes Pleosporales        Aaosphaeria Aaosphaeria arxii
Abobi1  Agaricomycetes  Polyporales Podoscyphaceae  Abortiporus Abortiporus biennis
Abobie1 Agaricomycetes  Polyporales Podoscyphaceae  Abortiporus Abortiporus biennis
Abscae1 Mucoromycetes   Mucorales   Cunninghamellaceae  Absidia Absidia caerulea
Absrep1 Mucoromycetes   Mucorales   Cunninghamellaceae  Absidia Absidia repens
Acain1  Exobasidiomycetes   Exobasidiales   Cryptobasidiaceae   Acaromyces  Acaromyces ingoldii
acanthamoeba_castellanii_str_neff_gca_000313135     Longamoebia Acanthamoebidae Acanthamoeba    Acanthamoeba castellanii    Acanthamoeba castellanii str. Neff
Acastr1 Lecanoromycetes Acarosporales   Acarosporaceae  Acarospora  Acarospora strigata

However, further down in the file there are certain taxonomic fields that are missing.

Once I created my taxdump files. I used the following to convert my MMSEQS sequence database to a taxonomy database:

mmseqs createtaxdb MicroEuk100.eukaryota_odb10 tmp/ --ncbi-tax-dump taxdump/ --tax-mapping-file taxdump/taxid.map
mmseqs createindex MicroEuk100.eukaryota_odb10 tmp/
milot-mirdita commented 10 months ago

How did you creat taxid.map? How does it look like?

jolespin commented 10 months ago
head MicroEuk100.eukaryota_odb10_taxdump/taxid.map
NR_2364996  14049961
Colcu1  1188471730
Physo3  553026811
TARA_SOC_28_MAG_00056   67064504
EP00921 1935091368
MMETSP1440  1926664251
Armcep1 1876271683
EP00035 1932583615
Clasp332_1  745078353
Aspsim1_1   1802159425
milot-mirdita commented 10 months ago

Okay that might be correct, how does MicroEuk100.eukaryota_odb10.lookup look like?

milot-mirdita commented 10 months ago

The first column of both these files needs to be 1-to-1 mappable with each other (that's what createtaxdb does internally).

jolespin commented 10 months ago

Here's the lookup file:

head MicroEuk100.eukaryota_odb10.lookup
0   bf501444362aba95481a93b6b17ab32c    0
1   b95eacdee5dfcda27848a8e3bb07eae4    0
2   a69c82fb56c07eb33c6c42e14af053e5    0
3   e4d260c9e1de7a1e59725cab6b4b755f    0
4   4e7deb5f4c366b755604bf978f62c243    0
5   abbf860c9ec0950ff9321b18dbf743f4    0
6   2534df79c21b64df7487af20bf606ef7    0
7   3602dc0f725a4ae5ff8b64904b720820    0
8   ec35315607346e9b4b22e6f16d089c5f    0
9   2098e357ea150fc29d5a6358b3f7bb97    0
jolespin commented 10 months ago

@milot-mirdita Is that the correct formatting for the file? I suspect that it has to do with some taxonomic levels missing. Could that be throwing the error?

milot-mirdita commented 10 months ago

That does look wrong.

Each entry in the .lookup (e.g. bf501444362aba95481a93b6b17ab32c, the first one), needs to have an entry in your taxid.map:

E.g. if this was a human sequence:

bf501444362aba95481a93b6b17ab32c 9606

and so on

milot-mirdita commented 10 months ago

And to further clarify: The bf501444362aba95481a93b6b17ab32c is extracted from the FASTA header. Usually the first word before any whitespace after the >.

jolespin commented 10 months ago

Ok yea this is starting to make a little more sense now. bf501444362aba95481a93b6b17ab32c refers to the protein id and NR_2364996 refers to the source organism from the two tables.

Have you tried building a MMSEQS taxonomy database from seqkit create-taxdump files by any chance?

milot-mirdita commented 10 months ago

I didn't try seqkit for taxonomy database building.

You can also map named based on the .source files with --tax-mapping-mode 1. This will match your taxid.map to the .source instead of .lookup. The .source contains one line for every input file that was given to createdb. I.e. if each organism was one provided as one FASTA file, it would make it easier to map the taxids.