Open pbelmann opened 5 months ago
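Essentially you need the sequence FASTA files, a TSV file that maps each accession to a taxid, and a directory with the NCBI-style dmp files.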
With all of that you can call:
mmseqs createdb gtdb.fasta virus.fasta euks.fasta seqdb
mmseqs createtaxdb seqdb tmp --tax-mapping-file accession_to_taxid.tsv --ncbi-tax-dump path-to-dmp-files-dir/
seqdb will then be a normal taxonomy database.
For the TSV file, you have to check that the second column of the seqdb.lookup file (it contains the accessions, and the file is created by createdb) matches the accessions in the first column of your TSV file.
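One way to run that check is sketched below. It assumes both files are tab-separated and that the file names match the commands above; the function name missing_accessions is made up for this example and is not part of mmseqs. It prints every accession that appears in seqdb.lookup but has no entry in the mapping file.

```shell
#!/bin/sh
# Sketch only: file names are taken from the commands above, and both
# files are assumed to be tab-separated.
LOOKUP="${1:-seqdb.lookup}"
MAPPING="${2:-accession_to_taxid.tsv}"

# missing_accessions LOOKUP_FILE MAPPING_FILE  (hypothetical helper name)
# Prints each accession from column 2 of the lookup file that does not
# occur in column 1 of the mapping file.
missing_accessions() {
    awk -F '\t' '
        # Records from the first awk argument (the mapping file): remember column 1.
        FILENAME == ARGV[1] { seen[$1] = 1; next }
        # Records from the lookup file: report column 2 if it was never seen.
        !($2 in seen) { print $2 }
    ' "$2" "$1"
}

# Run the check only when both files are present.
if [ -f "$LOOKUP" ] && [ -f "$MAPPING" ]; then
    missing_accessions "$LOOKUP" "$MAPPING"
fi
```

If nothing is printed, every sequence in seqdb has a taxid mapping. Any accession that is printed most likely needs its FASTA header or its TSV entry adjusted so that the two strings are identical.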
Hi
I would like to taxonomically classify my protein sequences based on the GTDB taxonomy combined with the NCBI taxonomy of NR viruses and NR eukaryotes.
Do you have any suggestions on how I could build an MMseqs2 database that combines these three databases and two taxonomies?
My current approach would be to create dmp files for GTDB according to your description and merge them with the dmp files of the NR subset that contains only viruses and eukaryotes.