New version discussion: genetic codes

hyattpd commented 4 years ago

Continuing discussion from here: https://github.com/hyattpd/Prodigal/issues/31

Current proposal:

User can specify a single genetic code.
The genetic codes will be obtained from https://ftp.ncbi.nih.gov/entrez/misc/data/gc.prt
Go version will be able to fetch the most recent version of the above file from NCBI.
User can specify their own genetic codes in the same format.

Metagenomic genetic codes (open for discussion):

User can specify which genetic codes to use (or not use) from the pre-built models via some flag. Files that don't match the code won't be used.
User can add their own models to the metagenomic set, which might add new genetic codes automatically.

mhuntemann commented 4 years ago

This sounds great. Will the new version have an option to build the models from a given set of files or will there be a snippet somewhere in the docs on how to build your own models? (Since as of right now I've no clue about how I would go about doing that.) Thanks!

tseemann commented 4 years ago

I would envision a possibly separate project, Prodigal-models, which uses high quality annotations from Refseq or Ecocyc etc to make models for specific organisms (genera, species, ...). Many of the genomes in Genbank were annotated with GeneMarkS or Prodigal, so important not to create a garbage cycle :)

oschwengers commented 4 years ago

Very nice! Might be trivial and obvious: As compute clusters often don't have an i-net connection, please make the genetic code fetching logic optional upon user request but not mandatory. :-)

kpwilliams commented 2 years ago

A new paper from Sean Eddy's lab discovered and validated four new bacterial genetic codes. I'm numbering them 34-37 (since there are 33 at GenBank now). Could these be incorporated? Since they're sense codon variants of codes 11 and 25, it might not need retraining and be rather simple to fix. Or better, maybe the -g option could accept any user-specified 64-character genetic code string. (EDITED: Replaced asterisks with periods to avoid reformatting) 34 (UBA4682 bacteria) FFLLSSSSYY..CC.WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRMVVVVAAAADDEEGGGG 35 (Peptacetobacter) FFLLSSSSYY..CC.WLLLLPPPPHHQQRRRQIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG 36 (Anaerococcus, UBA4855) FFLLSSSSYY..CC.WLLLLPPPPHHQQRRRWIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG 37 (Absconditabacterales) FFLLSSSSYY..CCGWLLLLPPPPHHQQRRWWIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG

Artoria2e5 commented 9 months ago

The current way sequence.c is set up is a bit of a problem here. Ideally we should replace these ifs with tables that can be swapped out quickly. Something like:

First create a function in bitmap.c: uchar get_3bases(uchar *bm, int ndx). It slices 6 bits from ndx, corresponding to 64 codons. It can probably be a little cleverer than test(bm, ndx) | test(bm,ndx+1) << 1 | ... | test(bm,ndx+5) << 5, but that's implementation detail.

Then we add a new member under struct _training to store the table. A char table[64] will do. The lower 7 bits of table[x] is the ASCII one-letter translation. The top bit can be used to indicate initiation. As a result, we get very simple translation implementations:

int is_start(unsigned char *seq, int n, struct _training *tinf) {
  return tinf->table[get_3bases(seq,n)] >> 7;
}

char amino(unsigned char *seq, int n, struct _training *tinf, int is_init) {
  if(is_start(seq, n, tinf) == 1 && is_init == 1) return 'M';
  return tinf->table[get_3bases(seq,n)] & 0x7f;
}

int is_stop(unsigned char *seq, int n, struct _training *tinf) {
  return (tinf->table[get_3bases(seq,n)] & 0x7f) == '*';
}

.. And of course finally we have to face the real work, which is to parse the prt and build tinf->table.

illustrative code that probably won't work

```c // from is_a() and friends #define P_T 0 #define P_G 1 #define P_C 2 #define P_A 3 #define P_3(x, y, z) ((x) << 4 | (y) << 2 || (z)) // ncbi codon order: tt{t,c,a,g}, tc{t,c,a,g}, ... #define N_T 0 #define N_C 1 #define N_A 2 #define N_G 3 #define N_3(x, y, z) ((x) << 4 | (y) << 2 || (z)) // converts codon order const char ncbi_codon_to_prod[64]; ncbi_codon_to_prod[N_3(N_T, N_T, N_T)] = P_3(P_T, P_T, P_T); /* pretend there are 63 lines more */ void parse_code(char* out_table, const char ncbieaa[65], const char sncbieaa[65]) { char temp_table[64]; memcpy(temp_table, ncbieaa, 64); for (int i = 0; i <= 64; i++) { // this does not work for Blastocrithidia (would have no stops), // but it's not in our scope anyways if (sncbieaa[i] == 'M') temp_table[i] |= 0x80; out_table[ncbi_codon_to_prod[i]] = temp_table[i]; } } parse_code(_tinf->table, "FFLLSSSSYY**CCGWLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG", "---M------**-----------------------M---------------M------------"); ```

hyattpd / Prodigal

New version discussion: genetic codes #62