Closed pck00 closed 7 months ago
Hi Paul,
Thanks for your kind comments!
What you are asking for sounds like a whole new thing. Currently, taxonomy calls in Cenote-Taker 3 are made by an mmseqs2 search (blastp-style) of hallmark genes against a taxonomically labeled protein database.
Doing a similar task with hmms (presumably from MSAs of several proteins) would be complicated from a taxonomical labelling standpoint.
If you could devise a detailed scheme about how this might work, you could submit a reply here and I could consider it in future updates if I think it would add a lot of value. To be honest, I've gone through a long stretch of adding features and fixing bugs on this tool, and, in the near future, I'm mostly going to be doing updates to fix bugs and update databases.
Sorry I couldn't be more positive at this time.
Mike
Hmm, I'm not stuck on using hmms.
Is it possible to include custom taxonomy by regenerating the mmseqs database after adding my proteins to refseq_virus_prot.fasta and ct3_hallmark_nr_cd90_refseq.prot_taxids.mmseqs_fmt.tsv? I suppose I would still be missing another step where the taxon ID is linked to a taxon name somehow; I can't find that file.
Hey Paul,
So, yes, in theory you could add NCBI-derived amino acid sequences to ct3_hallmark_nr_cd90_refseq.faa and the corresponding accession + taxID to ct3_hallmark_nr_cd90_refseq.prot_taxids.mmseqs_fmt.tsv, then run these commands and replace the typical database files. I can't officially recommend it:
mmseqs createdb ct3_hallmark_nr_cd90_refseq.faa hallmark_taxdb/ct3_hallmark.taxDB
mmseqs createtaxdb hallmark_taxdb/ct3_hallmark.taxDB tmp --tax-mapping-file ct3_hallmark_nr_cd90_refseq.prot_taxids.mmseqs_fmt.tsv
Keep in mind that only the CT3-identified hallmark genes are queried against the taxDB, and barring some fancy work with taxonkit, you have to use NCBI-derived sequences in order to get associated taxonomy.
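Before rebuilding the taxDB, it may be worth checking that every sequence you appended to the .faa file also has a row in the mapping .tsv. The helper below is only a rough sketch and not part of Cenote-Taker 3 or mmseqs2: the function name missing_taxids is made up, and it assumes the accession is the first whitespace-delimited token after ">" in each FASTA header and that the mapping file holds one "accession<TAB>taxID" pair per line.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with Cenote-Taker 3 or mmseqs2).
# Prints every FASTA accession that has no row in the tax-mapping TSV.
# Assumptions: accession = first whitespace-delimited token after '>',
# and the TSV is "accession<TAB>taxid", one pair per line.
missing_taxids() {
    local fasta="$1" mapping="$2"
    awk -F '\t' '
        # First file (the mapping TSV): remember each accession that has a taxid.
        FILENAME == ARGV[1] {
            if ($1 != "" && $2 != "") mapped[$1] = 1
            next
        }
        # Second file (the FASTA): report header accessions without a mapping.
        /^>/ {
            n = split(substr($0, 2), parts, " ")
            if (n > 0 && !(parts[1] in mapped)) print parts[1]
        }
    ' "$mapping" "$fasta"
}
```

Running missing_taxids ct3_hallmark_nr_cd90_refseq.faa ct3_hallmark_nr_cd90_refseq.prot_taxids.mmseqs_fmt.tsv should print nothing if the two files agree. Any accession it prints would need a mapping row (or should be removed from the .faa) before running the createdb and createtaxdb commands above.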
I can make that work for now, thanks!
Hello, awesome update, loving the speed!
Not really an issue per se, but I have a hmmscan database that I would like to use preferentially for taxonomy calls. It's not a comprehensive database, so ideally this would be used first, and whatever genomes aren't called by the database should still go through the included databases.
What can I do to make that happen?
Paul