Closed Ge0rges closed 8 months ago
So essentially, with this kind of feature anvi'o would be able to take a bunch of gene sequences from anywhere, align them, run hmmbuild, and turn the output into an hmm-source, am I understanding it correctly?
Much like the (I think) defunct anvi-script-pfam-accessions-to-hmms-directory
I didn't know it was defunct? I just used it a few days ago 😂
Apologies, I think I'm either thinking of a different script or had trouble accessing Pfam and assumed the merge broke the script.
Yes, your understanding is correct.
Came here for a different issue, but saw this and wanted to drop my two cents. At least for some COGs, the definition is quite broad (see the somewhat frequent "or related enzyme" appendix to the function name), and that might cause problems when these are turned into HMMs without further curation. If a feature like this is (primarily) user facing, that might cause some data analysis issues.
Thanks for your input, Daan. That's also my concern: the sensitivity and specificity of the models will be all over the place, since the number of genes and their conservation within COGs vary so much. I wonder what we could do to make sure people don't reach misleading conclusions with a tool like this, if implemented.
Agreed. I was thinking along the lines of automated sequence dataset cleanup and wonder if there could be sanity checks applied to the genes retrieved, to see whether it is sensible that they indeed represent homologous proteins. I have no immediate good suggestions, other than a blunt length assessment.
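For what it's worth, the blunt length assessment mentioned above could look roughly like this. It is only a sketch: the function name `flag_length_outliers` and the 30% tolerance are made up for illustration and are not part of anvi'o.

```python
from statistics import median


def flag_length_outliers(seqs, tolerance=0.3):
    """Return the IDs of sequences whose length is far from the median.

    seqs      -- dict mapping sequence ID to protein sequence (hypothetical input shape)
    tolerance -- allowed deviation as a fraction of the median length (arbitrary default)
    """
    med = median(len(s) for s in seqs.values())
    return sorted(
        name for name, s in seqs.items()
        if abs(len(s) - med) > tolerance * med
    )
```

Anything flagged here would still need a human look, since a short sequence can be a genuine fragment of a true homolog.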
I tried average pairwise identity the other day, but that wasn't very helpful for the use case I was looking at. All-vs-all pairwise alignment does help identify outlier sequences, as those only align to a small subset of the tested sequences.
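A rough sketch of that all-vs-all idea, to make it concrete. Note the swap: a real run would use a proper aligner, whereas here `difflib.SequenceMatcher` from the Python standard library stands in as a crude similarity score. The function name and both thresholds are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def flag_alignment_outliers(seqs, min_similarity=0.5, min_fraction=0.5):
    """Flag sequences that resemble only a small subset of the others.

    seqs           -- dict mapping sequence ID to sequence (hypothetical input shape)
    min_similarity -- score at or above which two sequences count as "aligning"
    min_fraction   -- a sequence is flagged if it "aligns" to fewer than this
                      fraction of the other sequences
    """
    names = list(seqs)
    if len(names) < 2:
        return []
    hits = {name: 0 for name in names}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            # Stand-in for a pairwise alignment score; NOT a real aligner.
            score = SequenceMatcher(None, seqs[a], seqs[b], autojunk=False).ratio()
            if score >= min_similarity:
                hits[a] += 1
                hits[b] += 1
    others = len(names) - 1
    return sorted(n for n in names if hits[n] / others < min_fraction)
```

The pairwise loop is quadratic in the number of sequences, so this only makes sense for modest gene sets.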
What about clustering with an identity threshold akin to usearch cluster_fast?
Perhaps a reduced scope would be a script that makes an HMM given protein sequences, and their corresponding gene names, at least to begin with.
It seems to me that it would be sufficient to have a wrapper for hmmbuild and some downstream code to make the resulting HMM(s) conform to the user-defined hmm-source file structure. A script like anvi-script-pfam-accessions-to-hmms-directory, but anvi-script-fasta-to-hmms-directory.
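A minimal sketch of what the hmmbuild wrapper part might look like. It assumes HMMER is installed and on the PATH, and that each gene already has a multiple sequence alignment on disk (hmmbuild needs aligned input, so unaligned FASTA would have to go through an aligner first). The function names and the one-alignment-per-gene dict are hypothetical, not an existing anvi'o interface.

```python
import subprocess
from pathlib import Path


def hmmbuild_command(alignment, out_hmm, name):
    """Build the argument list for: hmmbuild -n <name> <hmmfile_out> <msafile>."""
    return ["hmmbuild", "-n", name, str(out_hmm), str(alignment)]


def build_hmms(alignments, work_dir):
    """Run hmmbuild once per gene and return the concatenated HMM text.

    alignments -- dict mapping gene name to the path of its alignment file
                  (hypothetical input shape)
    work_dir   -- directory where the per-gene .hmm files are written
    """
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    parts = []
    for name, alignment in alignments.items():
        out_hmm = work_dir / f"{name}.hmm"
        # check=True raises CalledProcessError if hmmbuild exits with an error.
        subprocess.run(
            hmmbuild_command(alignment, out_hmm, name),
            check=True,
            stdout=subprocess.DEVNULL,
        )
        parts.append(out_hmm.read_text())
    # HMMER profile files can simply be concatenated into one multi-model file.
    return "".join(parts)
```

Passing `-n` keeps the model name under the script's control, which matters later when gene names in the HMM file need to match the accompanying table.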
Hey @meren,
Just getting back to this. I was looking through scripts I've written in the past that do this and came upon this script.
Would something like this written as a python script be useful at all?
The reason for that question is that while brainstorming what such a script would look like, I ended up with a simple script that took in lots of arguments (essentially one per HMM source file), and was then wondering whether it would be easier for users to do this on their own. This type of script would be more useful if the user could give a .hmm file and need to specify at most one or two parameters, with anvi'o inferring the rest.
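To illustrate the "infer the rest" idea: gene names can be read from the NAME lines of a HMMER3 profile file, which leaves only a handful of values for the user to supply. The sketch below assumes the hmm-source directory layout as I remember it from the anvi'o documentation (genes.hmm.gz, genes.txt with gene/accession/hmmsource columns, plus the one-line files kind.txt, target.txt, reference.txt and noise_cutoff_terms.txt), so it should be checked against the current docs. The function names, the default values, and the use of the gene name as a placeholder accession are all assumptions.

```python
import gzip
from pathlib import Path


def gene_names_from_hmm(hmm_text):
    """Collect model names from the NAME header lines of a HMMER3 file."""
    names = []
    for line in hmm_text.splitlines():
        fields = line.split(None, 1)
        if len(fields) == 2 and fields[0] == "NAME":
            names.append(fields[1].strip())
    return names


def write_hmm_source(out_dir, hmm_text, source_label, kind,
                     target="AA:GENE", noise_cutoff="-E 1e-12",
                     reference="unpublished"):
    """Write an hmm-source style directory (layout assumed, see lead-in)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    genes = gene_names_from_hmm(hmm_text)

    with gzip.open(out / "genes.hmm.gz", "wt") as handle:
        handle.write(hmm_text)

    # Placeholder: the gene name doubles as the accession (an assumption).
    rows = ["gene\taccession\thmmsource"]
    rows += [f"{gene}\t{gene}\t{source_label}" for gene in genes]
    (out / "genes.txt").write_text("\n".join(rows) + "\n")

    # The one-line files, each holding a single value.
    one_liners = {
        "kind.txt": kind,
        "target.txt": target,
        "reference.txt": reference,
        "noise_cutoff_terms.txt": noise_cutoff,
    }
    for filename, value in one_liners.items():
        (out / filename).write_text(value + "\n")
    return genes
```

With defaults like these, a user would only have to pass the .hmm contents, a source label, and a kind.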
Hey @Ge0rges,
Would something like this written as a python script be useful at all?
Absolutely! Any script with good documentation and example use cases would be useful! We all need these kinds of functionality, and if the script is discoverable, then it becomes a lifesaver :)
Please feel free to go at it and send a PR. I would be more than happy to work with you on this and serve as your guinea pig!
PR submitted.
💪
The need
HMMs are leveraged widely in anvio to identify genes for various metrics. Enabling easier access to custom HMMs would allow people to more easily ask their own questions about what a bacterium is doing, e.g. looking for antimicrobial resistance genes, looking for mobile genes, etc. I have been thinking about this this week after discussing with @ivagljiva ways to make HMMs less convoluted.
The solution
Much like the (I think) defunct anvi-script-pfam-accessions-to-hmms-directory, an anvi-script which takes in a list of COG IDs and generates an anvio compatible HMM.
This would also be the occasion to facilitate the generation of HMMs generally. A backend script which takes in multiple FASTA files, one per gene, and creates an HMM could also be made user-facing, so that more advanced users could directly provide protein sequences. In conjunction with the new functionality of anvi-run-ncbi-cogs, those could be directly annotated to populate the genes.txt file and turned into an anvio hmm-source.
Finally, this would be the chance to revisit the structure of hmm-source, even if just modestly, to put all the one liner files together, if necessary.
Beneficiaries
Most users of anvio looking at functionality of their communities.