[FEATURE REQUEST] A command to make new HMM sources from a list of COG IDs.

Ge0rges commented 1 year ago

The need

HMMs are used widely in anvi'o to identify genes for various metrics. Easier access to custom HMMs would let people ask their own questions about what a bacterium is doing, e.g. looking for antimicrobial resistance genes, mobile genes, etc. I have been thinking about this over the past week after discussing ways to make HMMs less convoluted with @ivagljiva.

The solution

Much like the (I think) defunct anvi-script-pfam-accessions-to-hmms-directory, an anvi-script that takes in a list of COG IDs and generates an anvi'o-compatible HMM.

This would also be the occasion to facilitate the generation of HMMs generally. A backend script that takes in multiple FASTA files, one per gene, and creates an HMM could also be made user-facing so that more advanced users could provide protein sequences directly. In conjunction with the new functionality of anvi-run-ncbi-cogs, those sequences could be annotated directly to populate the genes.txt file and then turned into an anvi'o hmm-source.

Finally, this would be the chance to revisit the structure of hmm-source, even if just modestly, for example by putting all the one-line files together, if necessary.

Beneficiaries

Most users of anvio looking at functionality of their communities.

meren commented 1 year ago

So essentially, with this kind of feature anvi'o would be able to take a bunch of gene sequences from anywhere, align them, run hmmbuild, and turn the output into an hmm-source. Am I understanding it correctly?

Much like the (I think) defunct anvi-script-pfam-accessions-to-hmms-directory

I didn't know it was defunct? I just used it a few days ago 😂

Ge0rges commented 1 year ago

Apologies, I think I'm either thinking of a different script or had trouble accessing Pfam and assumed the merge broke the script.

Yes, your understanding is correct.

dspeth commented 1 year ago

Came here for a different issue, but saw this and wanted to drop my two cents. At least for some COGs, the definition is quite broad (see the somewhat frequent "or related enzyme" appended to the function name), and that might cause problems when these are turned into HMMs without further curation. If a feature like this is (primarily) user-facing, that might cause some data analysis issues.

meren commented 1 year ago

Thanks for your input, Daan. That's also my concern: the sensitivity and specificity of the models will be all over the place, because the number of genes and their conservation within COGs are all over the place. I wonder what we could do to make sure people don't reach misleading conclusions with a tool like this, if implemented.

dspeth commented 1 year ago

Agreed. I was thinking along the lines of automated sequence dataset cleanup and wonder if there could be sanity checks applied to the genes retrieved, to see whether it is sensible that they indeed represent homologous proteins. I have no immediate good suggestions, other than a blunt length assessment.
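For illustration, the blunt length assessment mentioned above could look something like the following. This is a minimal sketch, not anything that exists in anvi'o: the function names `read_fasta` and `flag_length_outliers`, and the `max_deviation` threshold of 50% around the median length, are assumptions made up for this example.

```python
from statistics import median


def read_fasta(text):
    """Parse FASTA-formatted text into a {sequence_id: sequence} dict."""
    sequences = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Keep only the first word of the header as the identifier.
            header = line[1:].split()
            current_id = header[0] if header else ""
            sequences[current_id] = ""
        elif current_id is not None:
            sequences[current_id] += line
    return sequences


def flag_length_outliers(sequences, max_deviation=0.5):
    """Return IDs whose length differs from the median length by more
    than `max_deviation` (a fraction of the median; hypothetical default)."""
    if not sequences:
        return []
    median_length = median(len(seq) for seq in sequences.values())
    return [
        seq_id
        for seq_id, seq in sequences.items()
        if abs(len(seq) - median_length) > max_deviation * median_length
    ]
```

A length filter like this only catches fragments and fusions; it says nothing about whether the remaining sequences are actually homologous.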

I tried average pairwise identity the other day, but that wasn't very helpful for the use case I was looking at. All-vs-all pairwise alignment does help identify outlier sequences, as those align to only a small subset of the tested sequences.
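As a rough sketch of that all-vs-all idea: assuming the pairwise search has already been run with some aligner and reduced to (query, target) pairs that passed whatever cutoff was chosen, outliers are the sequences that hit only a small fraction of the others. The function name `flag_poorly_connected` and the `min_fraction` default are hypothetical.

```python
from collections import defaultdict


def flag_poorly_connected(sequence_ids, hits, min_fraction=0.5):
    """Flag sequences that align to fewer than `min_fraction` of the others.

    `hits` is an iterable of (query_id, target_id) pairs taken from an
    all-vs-all search; how hits are produced and filtered is left to the
    caller (an assumption of this sketch).
    """
    ids = set(sequence_ids)
    partners = defaultdict(set)
    for query, target in hits:
        # Ignore self-hits and hits involving unknown sequences.
        if query == target or query not in ids or target not in ids:
            continue
        # Treat a hit as symmetric, whichever direction reported it.
        partners[query].add(target)
        partners[target].add(query)

    others = len(ids) - 1
    if others < 1:
        return []
    return sorted(s for s in ids if len(partners[s]) / others < min_fraction)
```

For example, with four sequences where `d` only aligns to `c`, `d` has 1 of 3 possible partners and is flagged at the default threshold.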

Ge0rges commented 1 year ago

What about clustering with an identity threshold akin to usearch cluster_fast?

Ge0rges commented 1 year ago

Perhaps a reduced scope would be a script that makes an HMM given protein sequences, and their corresponding gene names, at least to begin with.

meren commented 1 year ago

It seems to me that it would be sufficient to have a wrapper for hmmbuild and some downstream code to make the resulting HMM(s) conform to the user-defined hmm-source file structure. A script like anvi-script-pfam-accessions-to-hmms-directory, but named anvi-script-fasta-to-hmms-directory.
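A minimal sketch of what such a wrapper might do, under several assumptions: the input alignments already exist (one per gene), the file names written below follow the hmm-source layout as I understand it from the anvi'o documentation (worth verifying against the current docs), and the function names, the `hmm_origin` parameter, and the default `target` and `noise_cutoff_terms` values are illustrative only. The sketch builds the hmmbuild command rather than running it.

```python
import gzip
from pathlib import Path


def hmmbuild_command(gene_name, alignment_path, hmm_path):
    """Return the argument list for building one profile HMM.

    Follows HMMER's usage `hmmbuild [options] <hmmfile_out> <msafile>`,
    with `-n` setting the model name.
    """
    return ["hmmbuild", "-n", gene_name, str(hmm_path), str(alignment_path)]


def write_hmm_source(directory, hmm_text, genes, hmm_origin, kind, reference,
                     target="AA:GENE", noise_cutoff_terms="-E 1e-12"):
    """Write an hmm-source style directory (assumed layout).

    `hmm_text` is the concatenated text of all models; `genes` is a list
    of (gene_name, accession) tuples. Default values are placeholders.
    """
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)

    # All models go into a single gzip-compressed file.
    with gzip.open(directory / "genes.hmm.gz", "wt") as handle:
        handle.write(hmm_text)

    # Tab-delimited table with one row per model.
    rows = ["gene\taccession\thmmsource"]
    rows += [f"{gene}\t{accession}\t{hmm_origin}" for gene, accession in genes]
    (directory / "genes.txt").write_text("\n".join(rows) + "\n")

    # The remaining files each hold a single line.
    one_liners = {
        "kind.txt": kind,
        "reference.txt": reference,
        "target.txt": target,
        "noise_cutoff_terms.txt": noise_cutoff_terms,
    }
    for filename, value in one_liners.items():
        (directory / filename).write_text(value + "\n")
    return directory
```

In a real script, each command from `hmmbuild_command` would be run with something like `subprocess.run(cmd, check=True)` and the resulting `.hmm` files concatenated before calling `write_hmm_source`.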

Ge0rges commented 8 months ago

Hey @meren,

Just getting back to this. I was looking through scripts I've written in the past that do this and came upon this script.

Would something like this written as a python script be useful at all?

The reason for that question is that, while brainstorming what such a script would look like, I ended up with a simple script that took in lots of arguments (essentially one per HMM source file), and then wondered whether it would be easier for users to do this on their own. This type of script would be more useful if the user could give a .hmm file and needed to specify at most one or two parameters, with anvi'o inferring the rest.
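To illustrate the kind of inference meant here, the rows of genes.txt could in principle be read from the .hmm file itself, since HMMER3 text files carry a NAME line (and optionally an ACC line) per model and end each model with `//`. This is only a sketch: the function name `genes_from_hmm_text`, the `hmm_origin` argument, and the fallback of reusing the name when no accession is present are all assumptions for this example.

```python
def genes_from_hmm_text(hmm_text, hmm_origin):
    """Extract (gene, accession, hmmsource) rows from HMMER3 text.

    `hmm_origin` is the one value the user would still have to supply.
    Models without an ACC line reuse their NAME as the accession
    (a choice made for this sketch).
    """
    rows = []
    name, accession = None, None
    for line in hmm_text.splitlines():
        fields = line.split(None, 1)
        if not fields:
            continue
        if fields[0] == "NAME" and len(fields) == 2:
            name = fields[1].strip()
        elif fields[0] == "ACC" and len(fields) == 2:
            accession = fields[1].strip()
        elif fields[0] == "//":
            # End of one model: emit its row and reset for the next.
            if name is not None:
                rows.append((name, accession or name, hmm_origin))
            name, accession = None, None
    return rows
```

With this, the user would provide the .hmm file plus a source label, and the remaining one-line files could fall back to defaults.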

meren commented 8 months ago

Hey @Ge0rges,

Would something like this written as a python script be useful at all?

Absolutely! Any script with good documentation and example use cases would be useful! We all need these kinds of functionality, and if the script is discoverable, then it becomes a lifesaver :)

Please feel free to go at it and send a PR. I would be more than happy to work with you on this and serve as your guinea pig!

Ge0rges commented 8 months ago

PR submitted.

meren commented 8 months ago

💪
