mtisza1 / Cenote-Taker3

Discover and annotate the virome
MIT License

Custom taxonomy? #8

Closed (pck00 closed 7 months ago)

pck00 commented 7 months ago

Hello, awesome update, loving the speed!

Not really an issue per se, but I have a hmmscan database that I would like to use preferentially for taxonomy calls. It's not a comprehensive database, so ideally this would be used first, and whatever genomes aren't called by the database should still go through the included databases.

What can I do to make that happen?

Paul

mtisza1 commented 7 months ago

Hi Paul,

Thanks for your kind comments!

What you are asking for sounds like a whole new thing. Currently, taxonomy calls in Cenote-Taker 3 are made by an mmseqs2 search (blastp-style) of hallmark genes against a taxonomically labeled protein database.

Doing a similar task with hmms (presumably from MSAs of several proteins) would be complicated from a taxonomical labelling standpoint.

If you could devise a detailed scheme about how this might work, you could submit a reply here and I could consider it in future updates if I think it would add a lot of value. To be honest, I've gone through a long stretch of adding features and fixing bugs on this tool, and, in the near future, I'm mostly going to be doing updates to fix bugs and update databases.

Sorry I couldn't be more positive at this time.

Mike

pck00 commented 7 months ago

Hmm, I'm not stuck on using hmms.

Is it possible to include custom taxonomy by regenerating the mmseqs database after adding my proteins to refseq_virus_prot.fasta and ct3_hallmark_nr_cd90_refseq.prot_taxids.mmseqs_fmt.tsv? I suppose that would still be missing a step where each taxon ID is linked to a taxon name somehow; I can't find that file.

mtisza1 commented 7 months ago

Hey Paul,

So, yes, in theory (I can't officially recommend it) you could add NCBI-derived amino acid sequences to ct3_hallmark_nr_cd90_refseq.faa and the corresponding accession + taxID to ct3_hallmark_nr_cd90_refseq.prot_taxids.mmseqs_fmt.tsv, then run these commands and replace the typical database files:

mmseqs createdb ct3_hallmark_nr_cd90_refseq.faa hallmark_taxdb/ct3_hallmark.taxDB

mmseqs createtaxdb hallmark_taxdb/ct3_hallmark.taxDB tmp --tax-mapping-file ct3_hallmark_nr_cd90_refseq.prot_taxids.mmseqs_fmt.tsv

Keep in mind that only the CT3-identified hallmark genes are queried against the taxDB, and barring some fancy work with taxonkit, you have to use NCBI-derived sequences in order to get associated taxonomy.
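
A minimal sketch of the file-preparation step before those two mmseqs commands. It is not an official workflow and rests on several assumptions: that the mapping file is tab-separated with the accession in column 1 and the NCBI taxID in column 2, and that the accession is the first whitespace-delimited word of each FASTA header. The function names and the files my_proteins.faa and my_proteins.tsv are hypothetical. The sketch checks that every custom protein has a taxID row before appending anything, since a sequence without a mapping row would presumably end up without a taxonomic label.

```shell
#!/bin/sh
# Hypothetical helpers; the assumed file layouts are described in the text above.

# Print every FASTA accession (first word of each header, without ">")
# that has no row in the accession<TAB>taxID mapping file.
# Usage: missing_taxids CUSTOM_FAA CUSTOM_TSV
missing_taxids() {
    awk -F '\t' '
        FILENAME == ARGV[1] { seen[$1] = 1; next }    # first file: mapping rows
        /^>/ {                                         # second file: FASTA headers
            split(substr($0, 2), words, /[ \t]/)
            if (!(words[1] in seen)) print words[1]
        }
    ' "$2" "$1"
}

# Append custom proteins and their taxIDs to the database input files,
# keeping .bak copies of the originals. Nothing is modified if any
# custom protein lacks a taxID row.
# Usage: append_custom DB_FAA DB_TSV CUSTOM_FAA CUSTOM_TSV
append_custom() {
    missing=$(missing_taxids "$3" "$4")
    if [ -n "$missing" ]; then
        echo "No taxID row for: $missing" >&2
        return 1
    fi
    cp "$1" "$1.bak" && cp "$2" "$2.bak" || return 1
    cat "$3" >> "$1" && cat "$4" >> "$2"
}
```

Usage might look like `append_custom ct3_hallmark_nr_cd90_refseq.faa ct3_hallmark_nr_cd90_refseq.prot_taxids.mmseqs_fmt.tsv my_proteins.faa my_proteins.tsv`, followed by the createdb and createtaxdb commands above. The sketch assumes both database files end with a newline; if they do not, the first appended record would be joined onto the last existing line.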

pck00 commented 7 months ago

I can make that work for now, thanks!