soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
GNU General Public License v3.0

mmseqs taxonomy based on GTDB + NR viruses + NR eukaryotes #849

Open pbelmann opened 1 month ago

pbelmann commented 1 month ago

Hi

I would like to taxonomically classify my protein sequences using the GTDB taxonomy combined with the NCBI taxonomy of the NR viruses and NR eukaryotes.

Do you have any suggestions on how I could build an MMseqs2 database that combines these three databases and two taxonomies?

My current approach would be to create dmp files for the GTDB according to your description and merge them with the dmp files of the NR, restricted to viruses and eukaryotes.

milot-mirdita commented 3 weeks ago

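Essentially you need the three FASTA files, a TSV file that maps each accession to a taxon ID, and a directory containing the merged dmp files.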

With all of that you can call:

mmseqs createdb gtdb.fasta virus.fasta euks.fasta seqdb
mmseqs createtaxdb seqdb tmp --tax-mapping-file accession_to_taxid.tsv --ncbi-tax-dump path-to-dmp-files-dir/

seqdb will then be a normal taxonomy database.

For the TSV file, check that the accessions in the second column of the seqdb.lookup file (created by createdb) match the accessions in the first column of your TSV file.
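
The accession check described above could be scripted roughly as follows. This is only a sketch: the function name missing_accessions is made up for illustration, and it assumes both files are tab-separated, with accessions in the second column of seqdb.lookup and in the first column of the mapping TSV, as stated above.

```shell
# Hypothetical helper (not part of MMseqs2): print every accession found in
# column 2 of the lookup file that has no entry in column 1 of the mapping TSV.
# Assumes both files are tab-separated.
missing_accessions() {
    lookup_file="$1"
    mapping_file="$2"
    awk -F'\t' '
        # First file on the command line (the mapping TSV): remember its accessions.
        FILENAME == ARGV[1] { mapped[$1] = 1; next }
        # Second file (the lookup): report accessions that were never mapped.
        NF >= 2 && !($2 in mapped) { print $2 }
    ' "$mapping_file" "$lookup_file"
}

# Example call, to be run after createdb:
# missing_accessions seqdb.lookup accession_to_taxid.tsv
```

If the call prints nothing, every accession in seqdb.lookup has a row in the mapping file. Any accession it does print would need a matching row in the TSV before running createtaxdb.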