merenlab / anvio

An analysis and visualization platform for 'omics data
http://merenlab.org/software/anvio
GNU General Public License v3.0

Script to generate HMM directory from HMM file. Add Defense Finder Models HMM. #2244

Closed Ge0rges closed 6 months ago

Ge0rges commented 6 months ago

Referencing #2164, here is a script that lets a user turn an HMM file into an anvi'o HMM directory, provided they also specify the source.

In addition, I've added another script that automatically turns mdmparis/defense-finder-models into an anvi'o-compatible HMM directory. Some of the HMM files don't define an accession number, so I use the model name instead in those cases.
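
The accession fallback described above can be sketched roughly as follows. This is an illustration only, not the code in the PR: `iter_model_ids` is a hypothetical helper, and the model names and accessions in the toy input are made up. It relies only on the HMMER3 text format conventions that each model carries a `NAME` line, optionally an `ACC` line, and ends with `//`.

```python
import io


def iter_model_ids(handle):
    """Yield (name, accession) for each model in an HMMER3 text file.

    Hypothetical helper: when a model has no ACC line, its NAME value
    is used in place of the accession.
    """
    name, acc = None, None
    for line in handle:
        if line.startswith("//"):
            # '//' terminates one model record
            if name is not None:
                yield name, acc or name
            name, acc = None, None
            continue
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        if parts[0] == "NAME":
            name = parts[1].strip()
        elif parts[0] == "ACC":
            acc = parts[1].strip()


# Toy input with made-up identifiers; the second model has no ACC line.
example = io.StringIO(
    "HMMER3/f [3.1b2 | February 2015]\n"
    "NAME  ModelA\n"
    "ACC   TOY0001.1\n"
    "LENG  10\n"
    "//\n"
    "HMMER3/f [3.1b2 | February 2015]\n"
    "NAME  ModelB\n"
    "LENG  12\n"
    "//\n"
)
model_ids = list(iter_model_ids(example))
print(model_ids)
```

For the second toy model the name is reported twice, once as the name and once as the stand-in accession.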

meren commented 6 months ago

Dear @Ge0rges, thank you very much for this PR!

Thank you also for making use of the anvio/utils function as much as possible without reimplementing things that are already there. In that vein, I thought the function get_attribute_from_hmm_file in sandbox/anvi-script-gen-defense-finder-models-to-hmm-directory could be a good addition to utils with some help docs in the function header.

I also think sandbox/anvi-script-hmm-to-hmm-directory will be a very useful script to automate a lot of things. I thought the use of the --hmm-list and --hmm-source parameters could be a bit confusing (especially if there are a lot of HMMs with different sources). In such cases we usually define a new 'artifact', like a two-column TAB-delimited file (in this case listing the paths of the models and their sources), that is passed to the program to avoid too much wrangling on the command line. But I think we can wait for an actual need before implementing that.
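
To make the idea of such a two-column file concrete, here is a minimal sketch of how it might be read. Nothing here exists in anvi'o as far as this thread shows: the function name `read_models_by_source`, the column order (path first, source second), and the example paths and source names are all assumptions for illustration.

```python
import csv
import io
from collections import defaultdict


def read_models_by_source(handle):
    """Group model paths by source from a two-column TAB-delimited file.

    Assumed layout (hypothetical): column 1 is the path to an HMM file,
    column 2 is the name of its source. Blank lines and lines starting
    with '#' are skipped.
    """
    by_source = defaultdict(list)
    for row_number, row in enumerate(csv.reader(handle, delimiter="\t"), start=1):
        if not row or row[0].startswith("#"):
            continue
        if len(row) != 2:
            raise ValueError(
                f"line {row_number}: expected 2 TAB-separated columns, found {len(row)}"
            )
        path, source = (field.strip() for field in row)
        by_source[source].append(path)
    return dict(by_source)


# Made-up paths and source names, for illustration only.
example = io.StringIO(
    "models/a.hmm\tsource_one\n"
    "models/b.hmm\tsource_two\n"
    "models/c.hmm\tsource_one\n"
)
grouped = read_models_by_source(example)
print(grouped)
```

A file like this would replace a pair of parallel command-line lists, so each model stays next to its source on a single line.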

There are two things missing in this PR, and it would be excellent to add them if you have time and/or energy. Otherwise I can add them later. The first is new entries under anvio/help/docs/programs for these new scripts, so there is some online help that ties them to the rest of the software ecosystem and people can read it, see some examples, understand their utility, etc. The second is a minimal running example in anvio/tests/run_component_tests_for_metagenomics.sh, so these scripts are tested every night and we learn about it immediately if something breaks. If you don't have energy for these updates let me know, and I'll merge the PR :)

Best wishes, Meren

Ge0rges commented 6 months ago

@meren Thanks for the tips. I updated the code to move get_attribute_from_hmm_file to utils with some error handling. I also added the requested docs, including one for anvi-script-pfam-accession-to-hmm-directory, since I did not find an existing one.

I did not add the commands to the test file as I would rather leave that to you if that's OK.

meren commented 6 months ago

Thank you very much for these updates, @Ge0rges! I am merging your PR and will test them while adding the entries for our component tests :)

I also added your GitHub account as a collaborator to the anvi'o project, so you now have direct write access to the repository (which I hope will make it easier for you to commit changes directly to master when you see fit, or to submit PRs from branches in the repository as well as from your fork).

meren commented 6 months ago

By the way, I'm getting the following error from anvi-script-gen-defense-finder-models-to-hmm-directory -- I didn't look into it but I thought I'd mention since probably it will make immediate sense to you:

$ anvi-script-gen-defense-finder-models-to-hmm-directory
  File "/Users/meren/github/anvio/sandbox/anvi-script-gen-defense-finder-models-to-hmm-directory", line 77
    try:
    ^^^
SyntaxError: expected 'except' or 'finally' block

Ge0rges commented 6 months ago

Hi @meren

Thank you very much for your trust. Sorry about those two bugs, I will push a fix within a couple of hours. I haven't actually set up anvi'o on my Mac yet, so my coding workflow is a bit crap and I clearly forgot to pull my last commit on my test server.

Just to clarify I still would plan to make PRs for any significant change for you to review, but would perhaps push directly small changes to fix minor bugs or typos for example.

meren commented 6 months ago

Just to clarify I still would plan to make PRs for any significant change for you to review, but would perhaps push directly small changes to fix minor bugs or typos for example.

Sure! Everyone who contributes with direct write access does that more or less. Whenever we are uncertain, or feel like it would be better to have other sets of eyes on the code, we send in PRs and ask for reviewer input :)

Working directly with the repo makes contributing much easier nevertheless. My anvi'o setup on my system uses anvio-dev, and it makes it a whole lot easier to fix/update the code or documentation as I work through datasets and so on.

Ge0rges commented 6 months ago

@meren just a heads up, I fixed the try/except block. Also, is the author list case sensitive?
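
For readers following along, the SyntaxError reported earlier in the thread is what Python raises when a `try:` block has no `except` or `finally` clause. The sketch below is not the code from the script: both snippets are made-up minimal examples, and they are only compiled (never executed) to show that the first is rejected and the second is accepted.

```python
# A 'try:' with no 'except' or 'finally' clause (made-up minimal example).
broken = (
    "try:\n"
    "    value = int('42')\n"
    "print(value)\n"
)

# The same snippet with an 'except' clause added.
fixed = (
    "try:\n"
    "    value = int('42')\n"
    "except ValueError:\n"
    "    value = None\n"
    "print(value)\n"
)


def compiles(source):
    """Return True if the source compiles, False if it raises SyntaxError."""
    try:
        compile(source, "<example>", "exec")
        return True
    except SyntaxError:
        return False


broken_ok = compiles(broken)
fixed_ok = compiles(fixed)
print(broken_ok, fixed_ok)
```

Because the error is raised at compile time, the whole script fails before any of its code runs, which matches the traceback above appearing as soon as the program is invoked.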

meren commented 6 months ago

is the author list case sensitive?

lol, yes, unfortunately, and I did the lazy thing -- rather than updating our code, I updated your username :p

And thanks for the fix, @Ge0rges. I was finally able to test anvi-script-gen-defense-finder-models-to-hmm-directory, and ran it on the Infant Gut Dataset just to get an idea of the hits from this collection of models.

Here is what the hits for each model looked like:

image

I was surprised to realize that one of the models, Paris II, was responsible for quite a remarkable number of these hits:

image

It resolves to PF13304. It seems it is in the list because "Several members are annotated as being of the abortive phage resistance system, in which case the family would be acting as the toxin for a type IV toxin-antitoxin resistance system", but in reality it also has significant similarity to your good old ATP hydrolases that are in almost every genome. Indeed, I searched a few genes from the list of hits, and they were involved in protein binding or ATP binding activity.

Well. Long story short, it is working, but perhaps the models are not too specific.
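
One simple way to spot a model that dominates the hits, as described above, is to tally hits per model and look at each model's share of the total. The sketch below uses made-up toy data, not the actual hits from the Infant Gut Dataset, and `hit_fractions` is a hypothetical helper rather than anything in anvi'o.

```python
from collections import Counter


def hit_fractions(model_per_hit):
    """Return (model, hit_count, fraction_of_all_hits), most frequent first."""
    counts = Counter(model_per_hit)
    total = sum(counts.values())
    return [(model, n, n / total) for model, n in counts.most_common()]


# Toy data: one entry per hit, naming the model that produced it.
toy_hits = ["model_x"] * 6 + ["model_y"] * 3 + ["model_z"]
summary = hit_fractions(toy_hits)
for model, n, fraction in summary:
    print(f"{model}\t{n}\t{fraction:.0%}")
```

A model that accounts for a large share of all hits is a natural candidate for a closer look at its specificity.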