Open ivagljiva opened 2 months ago
Using pyrodigal as the default, or having it as an option, would be great. There are known bugs in prodigal that will never be addressed (it is no longer under development) but have been fixed in pyrodigal, e.g. problems with the gene calls on the reverse strand.
Thanks @xvazquezc, I just found out about all the unfixed bugs in prodigal that were fixed in prodigal-gv and pyrodigal/pyrodigal-gv!
Just for documentation, here are some known issues:
We should default to pyrodigal/pyrodigal-gv.
pyrodigal-gv is just a tiny layer on top of pyrodigal, so it would be trivial to have a flag that allows the user to disable the additional gene models that are included in pyrodigal-gv. prodigal-gv would be a simpler change from Prodigal and it does include the fixes, but pyrodigal-gv is faster and it makes it much easier to get gene data, as you won't have to parse Prodigal/prodigal-gv outputs.
I don't know how multi-threading is managed in anvi'o, but maybe this will be relevant: https://github.com/althonos/pyrodigal/issues/57
implementing prodigal-gv could be as simple as adding a variable to store either prodigal or prodigal-gv according to user input, and replacing all instances of calling prodigal with this variable. It would use the same driver/parser modules as prodigal uses, and in theory no further changes would be necessary
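For illustration, the "variable that stores either prodigal or prodigal-gv" idea could look roughly like the sketch below. The function name, the registry dictionary, and the file-path parameters are hypothetical and not taken from the anvi'o codebase. The flags are the usual Prodigal command-line options, and the sketch assumes prodigal-gv accepts the same ones because it is a fork.

```python
import shutil

# Hypothetical registry mapping a user-facing gene caller name to the
# executable that should be run. Not taken from the anvi'o codebase.
SUPPORTED_GENE_CALLERS = {
    "prodigal": "prodigal",
    "prodigal-gv": "prodigal-gv",
}


def build_gene_caller_command(gene_caller, fasta_path, gff_path, proteins_path,
                              check_installed=True):
    """Return the command line (as a list) for the requested gene caller."""
    if gene_caller not in SUPPORTED_GENE_CALLERS:
        raise ValueError(
            f"Unknown gene caller '{gene_caller}'. "
            f"Choose one of: {', '.join(sorted(SUPPORTED_GENE_CALLERS))}"
        )

    binary = SUPPORTED_GENE_CALLERS[gene_caller]

    # Optionally verify that the executable is on PATH before building the command.
    if check_installed and shutil.which(binary) is None:
        raise FileNotFoundError(f"'{binary}' was not found on PATH")

    # Assumption: prodigal-gv takes the same flags as Prodigal, so only the
    # first element of the command differs between the two callers.
    return [
        binary,
        "-i", fasta_path,     # input contigs
        "-f", "gff",          # output format
        "-o", gff_path,       # gene coordinates
        "-a", proteins_path,  # protein translations
        "-p", "meta",         # metagenome mode
    ]
```

Because only the first element of the command changes, the existing driver and parser code would stay the same apart from the output differences discussed in the replies.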
Depending on how you are parsing Prodigal's outputs, you might need to change the parsing code a bit because prodigal-gv includes the genetic code in the outputs (https://github.com/apcamargo/prodigal-gv/commit/120c77947812a5cfc2fe1bad3e6cfe468ca9eb4e). Since having alternative genetic codes was one of the main reasons I developed prodigal-gv in the first place, I decided to make it obvious to users when a model with an alternative code was used.
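One way to make a parser tolerant of an extra field like this is to read the header attributes into a dictionary instead of relying on their position. The sketch below assumes the usual Prodigal protein FASTA header layout (`>id # start # stop # strand # key=value;...`). The function name is hypothetical, and the `genetic_code` key in the usage example is a placeholder, not necessarily the name prodigal-gv writes.

```python
def parse_gene_header(header):
    """Parse a Prodigal-style FASTA header into coordinates and attributes.

    Attributes are stored in a dict, so additional key=value pairs do not
    break the parser.
    """
    fields = header.lstrip(">").strip().split(" # ", 4)
    if len(fields) < 5:
        raise ValueError(f"Unexpected header format: {header!r}")

    gene_id, start, stop, strand, raw_attributes = fields

    # Split "key=value;key=value" pairs; skip empty chunks from trailing ';'.
    attributes = dict(
        item.split("=", 1) for item in raw_attributes.split(";") if "=" in item
    )

    return {
        "gene_id": gene_id,
        "start": int(start),
        "stop": int(stop),
        "strand": int(strand),
        "attributes": attributes,
    }


# Usage: 'genetic_code' is a placeholder key used only for this example.
example = ">contig_1_1 # 3 # 98 # -1 # ID=1_1;partial=00;genetic_code=15"
gene = parse_gene_header(example)
print(gene["attributes"].get("genetic_code", "not reported"))
```

With a lookup like `attributes.get(...)`, the same code can handle both vanilla Prodigal headers and headers that carry the additional field.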
Thank you very much for your input, @apcamargo. I will work on this and try to come up with a modular solution.
Sure! Let me know if there's anything I can help with.
Another (minor) consequence of changing the gene caller that I just remembered, and that is somewhat related to an issue that I opened a few months ago (https://github.com/merenlab/anvio/issues/2195), is that the alternative genetic codes are not taken into account in anvi-gen-variability-profile. This is an issue even in vanilla Prodigal (which includes translation table 4 in the metagenome mode), but using prodigal-gv and pyrodigal-gv would increase the number of sequences translated with alternative codes (translation table 15).
In my data, I wrote the code to compute pN/pS from scratch (due to the bug in the potential computation I linked above) and, as far as I remember, the effect of alternative genetic codes on pN/pS was negligible. So, I don't think this is something super important, but it could be good to keep in mind.
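To make the genetic-code point concrete, here is a small self-contained sketch of how the same codons translate differently under tables 11, 4, and 15. It is not the anvi'o implementation. It builds the standard codon table and applies the sense-codon reassignments defined by NCBI: TGA codes for Trp in table 4, and TAG codes for Gln in table 15. The function name and the `X` placeholder for unknown codons are illustrative choices.

```python
from itertools import product

BASES = "TCAG"
# Amino acids of the standard code in NCBI order (codons TTT, TTC, TTA, TTG, TCT, ...).
STANDARD_AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_CODE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), STANDARD_AAS)
}

# Sense-codon differences relative to the standard code. Table 11 assigns the
# same amino acids as the standard code; it differs only in start codons.
TABLE_OVERRIDES = {
    11: {},
    4: {"TGA": "W"},   # TGA is read as Trp instead of stop
    15: {"TAG": "Q"},  # TAG is read as Gln instead of stop
}


def translate(sequence, table=11):
    """Translate a nucleotide sequence using one of the supported tables."""
    codon_map = {**STANDARD_CODE, **TABLE_OVERRIDES[table]}
    sequence = sequence.upper()
    usable_length = len(sequence) - len(sequence) % 3  # ignore a trailing partial codon
    return "".join(
        codon_map.get(sequence[i:i + 3], "X")  # 'X' for codons with ambiguous bases
        for i in range(0, usable_length, 3)
    )


for table in (11, 4, 15):
    print(table, translate("ATGTGATAG", table=table))
```

A codon-level analysis that always assumes table 11 would treat TGA or TAG as a stop in genes that were actually called with table 4 or 15, which is where synonymous versus non-synonymous classifications could diverge.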
A small project to improve anvi'o, based upon feedback/ideas @FlorianTrigodet and I heard from our colleagues at the QIB in Norwich.
The need
There is interest in being able to use alternative gene calling software in addition to prodigal within anvi'o (i.e., instead of having to run gene calling outside of anvi'o and using external gene calls). We've heard specifically about prodigal-gv, a fork of prodigal that has additions to improve gene calling for viruses, and pyrodigal/pyrodigal-gv, which are the respective Python modules for using these software directly in the code. However, there could be other gene callers of interest to the community.
The solution
This small project is flexible in scope depending on which gene calling software we want to support and how far you (the developer) want to go with the refactor. Here are some possibilities:
- Implementing prodigal-gv could be as simple as adding a variable to store either prodigal or prodigal-gv according to user input, and replacing all instances of calling prodigal with this variable. It would use the same driver/parser modules as prodigal uses, and in theory no further changes would be necessary.
- The pyrodigal options would require changes to how we actually run the gene calling step. We would no longer use a driver program that runs the prodigal binary, but would switch to using the pyrodigal classes directly. Multi-threading and parsing of the results would also have to change to be compatible with those classes (they are thread-safe but it looks like we would still manage the multi-threading on our own).
Beneficiaries
All users of anvi'o, but (in the case of prodigal-gv) especially those who work on viruses.
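As a rough sketch of the "manage the multi-threading on our own" part of the pyrodigal option described under "The solution", a shared gene-finding callable can be mapped over contigs with a standard-library thread pool. The function name and parameters below are hypothetical. The gene finder is passed in as a callable so the sketch does not depend on pyrodigal being installed. With pyrodigal available, that callable would presumably be something like `pyrodigal.GeneFinder(meta=True).find_genes`, which is an assumption based on pyrodigal's documentation and should be checked against the installed version.

```python
from multiprocessing.pool import ThreadPool


def call_genes_in_parallel(contigs, find_genes, num_threads=4):
    """Run a thread-safe gene-finding callable over many contigs.

    contigs:     dict mapping contig name -> nucleotide sequence
    find_genes:  thread-safe callable taking one sequence and returning
                 its gene predictions (hypothetical interface)
    num_threads: number of worker threads in the pool
    """
    names = list(contigs)
    with ThreadPool(num_threads) as pool:
        # pool.map preserves input order, so results line up with names.
        predictions = pool.map(find_genes, [contigs[name] for name in names])
    return dict(zip(names, predictions))


# Usage with a stand-in callable; a real run would pass the gene finder's
# prediction method instead of len.
example_contigs = {"contig_1": "ATGAAATAG", "contig_2": "ATGCCCGGGTAA"}
print(call_genes_in_parallel(example_contigs, find_genes=len, num_threads=2))
```

Keeping the gene finder behind a plain callable would also leave room for other gene callers later, since the threading code does not need to know which library produced the predictions.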