Open hyattpd opened 4 years ago
This sounds great. Will the new version have an option to build the models from a given set of files or will there be a snippet somewhere in the docs on how to build your own models? (Since as of right now I've no clue about how I would go about doing that.) Thanks!
I would envision a possibly separate project, Prodigal-models, which uses high-quality annotations from RefSeq or EcoCyc etc. to make models for specific organisms (genera, species, ...). Many of the genomes in GenBank were annotated with GeneMarkS or Prodigal, so it is important not to create a garbage cycle :)
Very nice! This might be trivial and obvious: as compute clusters often don't have an internet connection, please make the genetic code fetching logic optional (on user request) rather than mandatory. :-)
A new paper from Sean Eddy's lab discovered and validated four new bacterial genetic codes. I'm numbering them 34-37 (since there are 33 at GenBank now). Could these be incorporated? Since they're sense codon variants of codes 11 and 25, it might not need retraining and be rather simple to fix. Or better, maybe the -g option could accept any user-specified 64-character genetic code string. (EDITED: Replaced asterisks with periods to avoid reformatting)

34 (UBA4682 bacteria)
FFLLSSSSYY..CC.WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRMVVVVAAAADDEEGGGG

35 (Peptacetobacter)
FFLLSSSSYY..CC.WLLLLPPPPHHQQRRRQIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG

36 (Anaerococcus, UBA4855)
FFLLSSSSYY..CC.WLLLLPPPPHHQQRRRWIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG

37 (Absconditabacterales)
FFLLSSSSYY..CCGWLLLLPPPPHHQQRRWWIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG
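For the user-specified code idea, a minimal sketch of validating a 64-character string and translating a codon with it might look like the following. It assumes the string is in the same TCAG codon order as the GenBank tables (which the four strings above follow) and accepts either '*' or '.' for stops. The names valid_code_string, translate_codon, and base_index are hypothetical, not existing Prodigal functions.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: position of a base in TCAG order, or -1 if unknown. */
static int base_index(char b) {
  switch (b) {
    case 'T': case 't': case 'U': case 'u': return 0;
    case 'C': case 'c': return 1;
    case 'A': case 'a': return 2;
    case 'G': case 'g': return 3;
    default: return -1;
  }
}

/* Returns 1 if code is exactly 64 characters, each an uppercase letter
   or a stop marker ('*' or '.'); returns 0 otherwise. */
int valid_code_string(const char *code) {
  size_t i;
  if (code == NULL || strlen(code) != 64) return 0;
  for (i = 0; i < 64; i++) {
    char c = code[i];
    if (!((c >= 'A' && c <= 'Z') || c == '*' || c == '.')) return 0;
  }
  return 1;
}

/* Translates one codon using a validated 64-character code string.
   Returns '*' for stops and 'X' if the codon contains an unknown base. */
char translate_codon(const char *code, const char *codon) {
  int i, idx = 0;
  for (i = 0; i < 3; i++) {
    int b = base_index(codon[i]);
    if (b < 0) return 'X';
    idx = idx * 4 + b;  /* first base is the most significant digit */
  }
  return code[idx] == '.' ? '*' : code[idx];
}
```

With the code 37 string above, translate_codon(code37, "TGA") gives 'G' and translate_codon(code37, "TAA") gives '*'.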
The current way sequence.c is set up is a bit of a problem here. Ideally we should replace these ifs with tables that can be swapped out quickly. Something like:

First create a function in bitmap.c: uchar get_3bases(uchar *bm, int ndx). It slices 6 bits from ndx, corresponding to 64 codons. It can probably be a little cleverer than test(bm, ndx) | test(bm, ndx+1) << 1 | ... | test(bm, ndx+5) << 5, but that's implementation detail.
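One possible shape for get_3bases, as a sketch only: it assumes bit i of the bitmap lives at bm[i >> 3] under mask 1 << (i & 7), that ndx is the bit index of the codon's first bit, and that the buffer has a readable byte after the current one whenever the 6 bits straddle a byte boundary. test_bit and get_3bases_slow are hypothetical stand-ins written here for comparison, not the existing bitmap.c code.

```c
typedef unsigned char uchar;

/* Assumed bit layout: bit ndx is stored in bm[ndx >> 3] at position (ndx & 7). */
static uchar test_bit(const uchar *bm, int ndx) {
  return (uchar)((bm[ndx >> 3] >> (ndx & 0x07)) & 1);
}

/* Reference version: assemble the 6 bits one at a time. */
uchar get_3bases_slow(const uchar *bm, int ndx) {
  uchar out = 0;
  int i;
  for (i = 0; i < 6; i++) out |= (uchar)(test_bit(bm, ndx + i) << i);
  return out;
}

/* Faster version: load one or two bytes, shift, and mask. */
uchar get_3bases(const uchar *bm, int ndx) {
  int byte = ndx >> 3;
  int shift = ndx & 0x07;
  unsigned int word = bm[byte];
  /* With shift > 2 the 6 bits run past the end of this byte. */
  if (shift > 2) word |= (unsigned int)bm[byte + 1] << 8;
  return (uchar)((word >> shift) & 0x3f);
}
```

The 6-bit value follows whatever 2-bit base encoding the bitmap uses, so any lookup table has to be laid out in that same order.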
Then we add a new member under struct _training to store the table. An unsigned char table[64] will do (a plain char may be signed, which would break the top-bit test). The lower 7 bits of table[x] hold the ASCII one-letter translation. The top bit can be used to indicate initiation. As a result, we get very simple translation implementations:
/* The top bit of a table entry flags a start codon. */
int is_start(unsigned char *seq, int n, struct _training *tinf) {
  return (tinf->table[get_3bases(seq, n)] & 0x80) != 0;
}

/* The lower 7 bits hold the one-letter amino acid; an initiating start codon reads as 'M'. */
char amino(unsigned char *seq, int n, struct _training *tinf, int is_init) {
  if(is_start(seq, n, tinf) == 1 && is_init == 1) return 'M';
  return tinf->table[get_3bases(seq, n)] & 0x7f;
}

/* Stop codons are stored as '*'. */
int is_stop(unsigned char *seq, int n, struct _training *tinf) {
  return (tinf->table[get_3bases(seq, n)] & 0x7f) == '*';
}
And of course, finally, we have to face the real work, which is to parse the prt file and build tinf->table.
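As a rough sketch of the table-building half (not the parsing), suppose the amino acid string and the start-codon string have already been pulled out of the prt file as two 64-character strings, with 'M' marking start codons in the second one. That input format is an assumption here, as are the names build_table and START_BIT. The table is filled in the order of the input strings, so it would still need to be permuted to match the index that get_3bases produces.

```c
#include <string.h>

#define START_BIT 0x80  /* top bit of a table entry marks a start codon */

/* Fills table[64] from two 64-character strings:
   aa     - one-letter translation per codon ('*' or '.' for stops)
   starts - 'M' where the codon can initiate, any other character elsewhere
   Returns 0 on success, -1 on malformed input. */
int build_table(unsigned char table[64], const char *aa, const char *starts) {
  int i;
  if (aa == NULL || starts == NULL) return -1;
  if (strlen(aa) != 64 || strlen(starts) != 64) return -1;
  for (i = 0; i < 64; i++) {
    unsigned char c = (unsigned char)aa[i];
    if (c == '.') c = '*';            /* normalize the stop marker */
    if (c & START_BIT) return -1;     /* must fit in the lower 7 bits */
    table[i] = c;
    if (starts[i] == 'M') table[i] |= START_BIT;
  }
  return 0;
}
```

Keeping the start flag in the same byte as the translation means a single lookup answers is_start, is_stop, and amino, at the cost of limiting translations to 7-bit ASCII.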
Continuing discussion from here: https://github.com/hyattpd/Prodigal/issues/31
Current proposal:
Metagenomic genetic codes (open for discussion):