merenlab / anvio

An analysis and visualization platform for 'omics data
http://merenlab.org/software/anvio
GNU General Public License v3.0

[FEATURE REQUEST] Different gene caller option(s) for `anvi-gen-contigs-database` #2298

Open ivagljiva opened 2 months ago

ivagljiva commented 2 months ago

A small project to improve anvi'o, based upon feedback/ideas @FlorianTrigodet and I heard from our colleagues at the QIB in Norwich.

The need

There is interest in being able to use alternative gene calling software in addition to prodigal within anvi'o (i.e., instead of having to run gene calling outside of anvi'o and importing external gene calls). We've heard specifically about prodigal-gv, a fork of prodigal with additions that improve gene calling for viruses, and pyrodigal/pyrodigal-gv, which are the respective Python modules for using this software directly in code. However, there could be other gene callers of interest to the community.

The solution

This small project is flexible in scope depending on which gene calling software we want to support and how far you (the developer) want to go with the refactor. Here are some possibilities:

Beneficiaries

All users of anvi'o, but (in the case of prodigal-gv) especially those who work on viruses.

xvazquezc commented 2 months ago

Using pyrodigal as the default, or having it as an option, would be great. There are known bugs in prodigal that will never be addressed (it is no longer being developed) but that have been fixed in pyrodigal, e.g. problems with gene calls on the reverse strand.

FlorianTrigodet commented 2 months ago

Thanks @xvazquezc, I just found out about all the unfixed bugs in prodigal that were fixed in prodigal-gv and pyrodigal/pyrodigal-gv!

Just for documentation, here are some known issues:

We should default to pyrodigal/pyrodigal-gv.

apcamargo commented 2 months ago

pyrodigal-gv is just a tiny layer on top of pyrodigal, so it would be trivial to have a flag that allows the user to disable the additional gene models that are included in pyrodigal-gv. prodigal-gv would be a simpler change from Prodigal and it does include the fixes, but pyrodigal-gv is faster and makes it much easier to get gene data, as you won't have to parse Prodigal/prodigal-gv outputs.

I don't know how multi-threading is managed in anvi'o, but maybe this will be relevant: https://github.com/althonos/pyrodigal/issues/57

> implementing prodigal-gv could be as simple as adding a variable to store either prodigal or prodigal-gv according to user input, and replacing all instances of calling prodigal with this variable. It would use the same driver/parser modules as prodigal uses, and in theory no further changes would be necessary
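For illustration, the idea above (one variable holding the executable name, used wherever prodigal is currently called) could look roughly like the sketch below. This is not anvi'o's actual driver code: `build_gene_calling_command` and `GENE_CALLERS` are hypothetical names, and the sketch assumes prodigal-gv accepts the same command-line flags as Prodigal since it is a fork.

```python
from typing import List, Optional

# Executable name per supported caller. Assumption: prodigal-gv is a fork of
# Prodigal and is invoked with the same flags (-i, -o, -a, -f, -p, -g).
GENE_CALLERS = {"prodigal": "prodigal", "prodigal-gv": "prodigal-gv"}


def build_gene_calling_command(
    gene_caller: str,
    fasta_path: str,
    output_prefix: str,
    translation_table: Optional[int] = None,
) -> List[str]:
    """Return the argument list for the chosen gene caller (hypothetical helper)."""
    if gene_caller not in GENE_CALLERS:
        raise ValueError(f"Unknown gene caller: {gene_caller!r}")

    command = [
        GENE_CALLERS[gene_caller],       # the only part that differs per caller
        "-i", fasta_path,                # input contigs
        "-o", output_prefix + ".gff",    # gene coordinates
        "-a", output_prefix + ".faa",    # protein translations
        "-f", "gff",                     # coordinate output format
    ]
    if translation_table is None:
        command += ["-p", "meta"]        # metagenome mode (a choice made for this sketch)
    else:
        command += ["-g", str(translation_table)]
    return command
```

The returned list can be passed to `subprocess.run` as-is; everything downstream of the command would stay shared between the two callers, apart from any output differences.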

Depending on how you are parsing Prodigal's outputs, you might need to change the parsing code a bit because prodigal-gv includes the genetic code in the outputs (https://github.com/apcamargo/prodigal-gv/commit/120c77947812a5cfc2fe1bad3e6cfe468ca9eb4e). Since having alternative genetic codes was one of the main reasons I developed prodigal-gv in the first place, I decided to make it obvious to users when a model with an alternative code was used.
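One way to make a parser robust to that kind of change is to read the `key=value` attributes of each header into a dictionary instead of relying on their position. The sketch below assumes Prodigal's usual protein FASTA header layout (`name # start # stop # strand # attributes`); `parse_gene_header` is a hypothetical helper, and the `genetic_code` key used in the usage note is only a stand-in, so the exact field name should be checked against the commit linked above.

```python
from typing import Dict, Tuple


def parse_gene_header(header: str) -> Tuple[str, int, int, int, Dict[str, str]]:
    """Split a Prodigal-style FASTA header into coordinates and attributes."""
    # Split from the right so that only the last four ' # ' separators count.
    fields = header.lstrip(">").strip().rsplit(" # ", 4)
    if len(fields) != 5:
        raise ValueError(f"Unexpected header layout: {header!r}")
    name, start, stop, strand, attribute_text = fields

    # Collect attributes by key, so extra or reordered keys do not break parsing.
    attributes: Dict[str, str] = {}
    for pair in attribute_text.split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attributes[key.strip()] = value.strip()

    return name, int(start), int(stop), int(strand), attributes
```

With this shape, a caller can use `attributes.get("genetic_code")` and fall back to whatever default it prefers when the key is absent, so the same code path would handle both Prodigal and prodigal-gv outputs.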

meren commented 2 months ago

Thank you very much for your input, @apcamargo. I will work on this and try to come up with a modular solution.

apcamargo commented 2 months ago

Sure! Let me know if there's anything I can help with.

Another (minor) consequence of changing the gene caller that I just remembered, and that is somewhat related to an issue I opened a few months ago (https://github.com/merenlab/anvio/issues/2195), is that alternative genetic codes are not taken into account in `anvi-gen-variability-profile`. This is an issue even in vanilla Prodigal (which includes translation table 4 in the metagenome mode), but using prodigal-gv and pyrodigal-gv would increase the number of sequences translated with alternative codes (translation table 15).

In my data, I wrote the code to compute pN/pS from scratch (due to the bug in the potential computation I linked above) and, as far as I remember, the effect of alternative genetic codes on pN/pS was negligible. So I don't think this is super important, but it could be good to keep in mind.
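To make the point about genetic codes concrete, here is a small standalone sketch (not anvi'o code; all names are hypothetical) showing how the same codon change is classified differently depending on the translation table. It relies on two NCBI reassignments: table 4 reads TGA as tryptophan instead of stop, and table 15 reads TAG as glutamine instead of stop. Table 11 uses the standard amino acid assignments.

```python
from typing import Dict

BASES = "TCAG"
# Standard amino acid assignments (shared by tables 1 and 11), in NCBI codon
# order: TTT, TTC, TTA, TTG, TCT, ... with the first base varying slowest.
STANDARD_AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Codon reassignments relative to the standard assignments.
REASSIGNMENTS: Dict[int, Dict[str, str]] = {
    11: {},
    4: {"TGA": "W"},   # stop -> Trp
    15: {"TAG": "Q"},  # stop -> Gln
}


def codon_table(table_id: int) -> Dict[str, str]:
    """Build the codon -> amino acid mapping for one translation table."""
    codons = [a + b + c for a in BASES for b in BASES for c in BASES]
    table = dict(zip(codons, STANDARD_AMINO_ACIDS))
    table.update(REASSIGNMENTS[table_id])
    return table


def translate(sequence: str, table_id: int = 11) -> str:
    """Translate complete codons only; a trailing partial codon is ignored."""
    table = codon_table(table_id)
    usable = len(sequence) - len(sequence) % 3
    return "".join(table[sequence[i:i + 3]] for i in range(0, usable, 3))


def is_synonymous(ref_codon: str, alt_codon: str, table_id: int = 11) -> bool:
    """True if both codons encode the same residue under the given table."""
    table = codon_table(table_id)
    return table[ref_codon] == table[alt_codon]
```

For example, `is_synonymous("TGA", "TGG", table_id=11)` is `False` (stop to Trp), while the same change with `table_id=4` is `True`, which is the kind of difference that would feed into a pN/pS calculation.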