hyattpd / Prodigal

Prodigal Gene Prediction Software
GNU General Public License v3.0

Feature Request: custom translation table #31

Closed jolespin closed 4 years ago

jolespin commented 6 years ago

In future versions, are there plans to set an option to input custom translation tables?

Possibly in the format of:

1. The Standard Code (transl_table=1)

By default all transl_table in GenBank flatfiles are equal to id 1, and this is not shown. When transl_table is not equal to id 1, it is shown as a qualifier on the CDS feature.

    AAs  = FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG
  Starts = ---M------**--*----M---------------M----------------------------
  Base1  = TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG
  Base2  = TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG
  Base3  = TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG

https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi
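
As a rough illustration of how the five-row layout above could be consumed, here is a minimal Go sketch that turns it into a codon lookup. The function name `parseTable` and the constant `standard` are hypothetical and exist only for this example; nothing here reflects how Prodigal reads genetic codes.

```go
package main

import (
	"fmt"
	"strings"
)

// parseTable reads the five-row NCBI layout (AAs, Starts, Base1..Base3) and
// returns codon -> amino acid plus the set of codons marked 'M' in Starts.
// Hypothetical helper for illustration only.
func parseTable(text string) (map[string]byte, map[string]bool, error) {
	rows := map[string]string{}
	for _, line := range strings.Split(text, "\n") {
		key, val, ok := strings.Cut(line, "=")
		if !ok {
			continue // not a "Name = value" row
		}
		rows[strings.TrimSpace(key)] = strings.TrimSpace(val)
	}
	for _, k := range []string{"AAs", "Starts", "Base1", "Base2", "Base3"} {
		if len(rows[k]) != 64 {
			return nil, nil, fmt.Errorf("row %q: want 64 columns, got %d", k, len(rows[k]))
		}
	}
	aa := make(map[string]byte, 64)
	starts := make(map[string]bool)
	for i := 0; i < 64; i++ {
		// Each column spells one codon, read top to bottom across Base1..Base3.
		codon := string([]byte{rows["Base1"][i], rows["Base2"][i], rows["Base3"][i]})
		aa[codon] = rows["AAs"][i]
		if rows["Starts"][i] == 'M' {
			starts[codon] = true
		}
	}
	return aa, starts, nil
}

// The standard code exactly as pasted in the comment above.
const standard = `
    AAs  = FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG
  Starts = ---M------**--*----M---------------M----------------------------
  Base1  = TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG
  Base2  = TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG
  Base3  = TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG
`

func main() {
	aa, starts, err := parseTable(standard)
	if err != nil {
		panic(err)
	}
	fmt.Printf("TGA=%c ATG=%c start codons=%d\n", aa["TGA"], aa["ATG"], len(starts))
}
```

A layout like this is easy to validate, since each row must have exactly 64 columns, which may be an argument for it over a free-form tab-delimited file.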

tseemann commented 6 years ago

@jolespin What is the use case? Do you have data from something not in the official genetic code list?

jolespin commented 6 years ago

I was thinking about the situation where researchers suspected that Gracilibacteria used one of their stop codons as an amino acid during translation and created a custom table. I believe that was the finding that introduced translation table 25, but I may be mistaken. For my data, there was a situation in which I thought one of my draft genomes had a similar property, and I wanted to call the ORFs using a custom genetic code. If it would be easy to implement, I think it could be really beneficial for users studying microbial "dark matter". I'm not sure what the best input format would be but, at first glance, a tab-delimited table could be easy to generate. However, I'm not sure how easy it would be to support custom codes in the actual gene calling on the backend.

hyattpd commented 6 years ago

Maybe, I'll think about it, if it's something multiple people would want. Right now, you can still find most everything since there are enough codes that have different combinations of stop codons. It would only really be an issue for truly weird tables that use non-TAA/TGA/TAG stops. Even 25 would still "work" with 4; it would just mistranslate a codon.
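
The point about table 25 "working" with table 4 can be checked with a small Go sketch. It compares the amino-acid rows of the two tables, which are copied here from the NCBI genetic code list; the helper names (`codonAt`, `stopCodons`, `differences`) are made up for this example.

```go
package main

import "fmt"

// Column order used by the NCBI tables: Base1 varies slowest, Base3 fastest.
const bases = "TCAG"

// Amino-acid rows for NCBI tables 4 and 25.
const (
	table4  = "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
	table25 = "FFLLSSSSYY**CCGWLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
)

// codonAt returns the codon for column i (0..63) of a table row.
func codonAt(i int) string {
	return string([]byte{bases[i/16], bases[(i/4)%4], bases[i%4]})
}

// stopCodons lists the codons marked '*' in an amino-acid row.
func stopCodons(aas string) []string {
	var out []string
	for i := 0; i < len(aas); i++ {
		if aas[i] == '*' {
			out = append(out, codonAt(i))
		}
	}
	return out
}

// differences lists the codons that the two rows translate differently.
func differences(a, b string) []string {
	var out []string
	for i := 0; i < len(a) && i < len(b); i++ {
		if a[i] != b[i] {
			out = append(out, fmt.Sprintf("%s: %c -> %c", codonAt(i), a[i], b[i]))
		}
	}
	return out
}

func main() {
	fmt.Println("stops in 4: ", stopCodons(table4))
	fmt.Println("stops in 25:", stopCodons(table25))
	fmt.Println("4 vs 25:    ", differences(table4, table25))
}
```

Both tables share the same stop codons and differ only at TGA (Trp in table 4, Gly in table 25), so gene boundaries called with table 4 would match while that one codon is translated differently.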

jolespin commented 6 years ago

Supplementary

http://science.sciencemag.org/content/sci/344/6186/909.full.pdf?ijkey=GQIeXHcKFgVWQ&keytype=ref&siteid=sci

Identification of reassigned contigs in assembled metagenome data: Prodigal (19) software has been modified to add one non-standard genetic code, in which TAA is reassigned to Gln.

Main paper for reference

http://science.sciencemag.org/content/sci/suppl/2014/05/21/344.6186.909.DC1/Ivanova.SM.pdf

This JGI paper made a hack to incorporate the ochre codon for dark matter microbes.

hyattpd commented 4 years ago

I've decided to support this in the Go version as a differential from an existing genetic code.
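
One way to picture a "differential from an existing genetic code" is a base table plus a short list of codon reassignments. The Go sketch below is only a guess at the idea, not the Go version's actual design; `applyDiff` and `codonIndex` are hypothetical names, and the base row is NCBI table 11.

```go
package main

import (
	"fmt"
	"strings"
)

// Column order used by the NCBI tables: Base1 varies slowest, Base3 fastest.
const bases = "TCAG"

// Amino-acid row of NCBI table 11, used here as the base code.
const table11 = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

// codonIndex maps a codon such as "TGA" to its column (0..63).
func codonIndex(codon string) (int, error) {
	if len(codon) != 3 {
		return 0, fmt.Errorf("codon %q: want 3 bases", codon)
	}
	idx := 0
	for i := 0; i < 3; i++ {
		p := strings.IndexByte(bases, codon[i])
		if p < 0 {
			return 0, fmt.Errorf("codon %q: unknown base %q", codon, codon[i])
		}
		idx = idx*4 + p
	}
	return idx, nil
}

// applyDiff returns a copy of base with the listed codons reassigned.
// changes maps a codon to its new one-letter amino acid ('*' for stop).
func applyDiff(base string, changes map[string]byte) (string, error) {
	if len(base) != 64 {
		return "", fmt.Errorf("base table: want 64 columns, got %d", len(base))
	}
	row := []byte(base)
	for codon, aa := range changes {
		i, err := codonIndex(strings.ToUpper(codon))
		if err != nil {
			return "", err
		}
		row[i] = aa
	}
	return string(row), nil
}

func main() {
	// Table 11 with TAA read as Gln, the reassignment quoted earlier in this thread.
	derived, err := applyDiff(table11, map[string]byte{"TAA": 'Q'})
	if err != nil {
		panic(err)
	}
	fmt.Println(derived)
}
```

With this shape, a user would only need to state what changes (for example `TAA=Q`) instead of supplying all 64 columns.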

tseemann commented 4 years ago

@hyattpd will the Go version just use the machine readable genetic code tables?

This is in ASN1 text format: https://ftp.ncbi.nih.gov/entrez/misc/data/gc.prt

It has the latest tables:

--
--  Version 4.5
--     Added Cephalodiscidae mitochondrial genetic code 33
--

Ideally the user could provide a custom one in that format:

  name "Mars Rover Microbe" ,
  id 42 ,
  ncbieaa  "FFLLSSSSYYYYCC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
  sncbieaa "--------------*--------------------M----------------------------"
  -- Base1  TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG
  -- Base2  TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG
  -- Base3  TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG
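
A user-supplied entry in that style could be read with a few lines of Go. The sketch below is a simplified line-based reader, not a real ASN.1 parser: it assumes one field per line, ignores `--` comments, and does not handle names that wrap across lines. The `GeneticCode` type and `parseEntry` function are hypothetical.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// GeneticCode holds the fields of one table entry (hypothetical type).
type GeneticCode struct {
	Name   string
	ID     int
	AAs    string // ncbieaa: amino acid per codon column
	Starts string // sncbieaa: start/stop markers per codon column
}

// quoted returns the text between the first and last double quote on a line.
func quoted(s string) (string, bool) {
	i := strings.IndexByte(s, '"')
	j := strings.LastIndexByte(s, '"')
	if i < 0 || j <= i {
		return "", false
	}
	return s[i+1 : j], true
}

// parseEntry reads a single entry written one field per line.
func parseEntry(text string) (GeneticCode, error) {
	var gc GeneticCode
	for _, line := range strings.Split(text, "\n") {
		line = strings.TrimSpace(line)
		if line == "" || strings.HasPrefix(line, "--") {
			continue // blank line or comment
		}
		fields := strings.Fields(line)
		switch fields[0] {
		case "name":
			if v, ok := quoted(line); ok && gc.Name == "" {
				gc.Name = v // keep the first name only
			}
		case "id":
			if len(fields) < 2 {
				return gc, fmt.Errorf("id line has no value: %q", line)
			}
			n, err := strconv.Atoi(strings.TrimSuffix(fields[1], ","))
			if err != nil {
				return gc, fmt.Errorf("bad id in %q: %w", line, err)
			}
			gc.ID = n
		case "ncbieaa":
			gc.AAs, _ = quoted(line)
		case "sncbieaa":
			gc.Starts, _ = quoted(line)
		}
	}
	if len(gc.AAs) != 64 || len(gc.Starts) != 64 {
		return gc, fmt.Errorf("entry %d: ncbieaa and sncbieaa must each have 64 columns", gc.ID)
	}
	return gc, nil
}

// The example entry from the comment above.
const entry = `
  name "Mars Rover Microbe" ,
  id 42 ,
  ncbieaa  "FFLLSSSSYYYYCC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
  sncbieaa "--------------*--------------------M----------------------------"
  -- Base1  TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG
`

func main() {
	gc, err := parseEntry(entry)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d %q TAA=%c\n", gc.ID, gc.Name, gc.AAs[10])
}
```
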

hyattpd commented 4 years ago

That's a good idea.

hyattpd commented 4 years ago

For performance reasons, will probably hard code 1, 4, 11, and 25, and only read genetic codes from this file if the user specifies a weird one (or via a flag that says use the file).
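
The lookup order described here (compiled-in tables first, the file only for unusual ids or when a flag asks for it) might look roughly like the Go sketch below. Everything in it is an assumption made for illustration: the `resolve` function, the `Loader` type, and the `forceFile` parameter are invented names, and the file reader is a stub.

```go
package main

import "fmt"

// builtin holds the amino-acid rows compiled into the binary (NCBI tables 1, 4, 11, 25).
var builtin = map[int]string{
	1:  "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
	4:  "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
	11: "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
	25: "FFLLSSSSYY**CCGWLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
}

// Loader fetches a table by id from an external file such as gc.prt.
// Hypothetical interface; a real reader would parse the file.
type Loader func(id int) (string, error)

// resolve returns the amino-acid row for id. Built-in tables are used unless
// forceFile is set or the id is not compiled in.
func resolve(id int, forceFile bool, load Loader) (string, error) {
	if !forceFile {
		if aas, ok := builtin[id]; ok {
			return aas, nil
		}
	}
	if load == nil {
		return "", fmt.Errorf("table %d is not built in and no file loader was given", id)
	}
	return load(id)
}

func main() {
	// Stub standing in for a real gc.prt reader.
	fromFile := func(id int) (string, error) {
		return "", fmt.Errorf("table %d: file loading is not implemented in this sketch", id)
	}
	aas, err := resolve(11, false, fromFile)
	fmt.Println(aas, err)
	_, err = resolve(33, false, fromFile)
	fmt.Println(err)
}
```

Keeping the file reader behind a function value means the built-in path never touches the file system, and the same `resolve` call covers both the common and the unusual tables.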

mhuntemann commented 4 years ago

@hyattpd on a related note: will the new Go version also give users the option to build their own models? Excited to hear there's a new version on the horizon!

jolespin commented 4 years ago

Sorry I think I missed something. What is the “Go” version exactly?


hyattpd commented 4 years ago

@jolespin The Go programming language (Golang). I'm writing the next version in Go.

@mhuntemann Not sure what's meant by "build your own models"...

mhuntemann commented 4 years ago

Hi @hyattpd, maybe I am remembering it wrong, but I thought that in meta mode Prodigal uses models built from the genomes (and their genes) that were publicly available when you wrote the last version. If that's the case, I assume those models will be updated for the new version, but there are probably people (including us) who have databases with genomes that are not publicly available yet. It would be nice if there was a way to create models from a specific set of currently private genomes that I am interested in and add them to the set of models (the new) Prodigal uses in meta mode. Or am I remembering it incorrectly and it works completely differently? Thanks, Marcel

hyattpd commented 4 years ago

Ah ok, I get it. Yeah, I think I can support that.

mhuntemann commented 4 years ago

Awesome! Thanks for really listening to the community. :-) Really looking forward to the new version. Do you have a rough release roadmap yet?

mhuntemann commented 4 years ago

Btw, if you need beta testers or any kind of feedback on a new version, I am happy to help with that. We are running your version 2.6.3 on several hundred genomes and metagenomes every month, so there's enough data to encounter edge cases, I'd assume. :-)

hyattpd commented 4 years ago

I imagine the way I will implement this is just to have single or multiple modes, and the user just passes in a list of files they've made themselves in addition to or in place of the preset models. I can also give the preset models shorthand ids and provide a complete list, and allow the user to specify whichever of those they want.

tseemann commented 4 years ago

@hyattpd "For performance reasons" you will hard code tables? Modern compilers are amazingly good at optimizing, especially if you use const properly. I would write the first version completely generically, then optimize later.