oschwengers / bakta

Rapid & standardized annotation of bacterial genomes, MAGs & plasmids
GNU General Public License v3.0

Addition of custom codon table/improve 'rare' start codons prediction #308

Open DanielleMStevens opened 1 month ago

DanielleMStevens commented 1 month ago

I work on Gram-positive actinobacteria, which frequently encode alternative start codons. I've noticed that some genes are annotated with the wrong start (often too long, so that they comply with an ATG start instead of GTG) or are missed altogether. While a tblastn search and manual correction can fix this, it would be nice to be able to provide a custom codon table or an option for possible alternate starts.

I know Prodigal has a few other codon tables, but most are generic and don't cover many bacterial taxa. Not sure how easy it is to implement, but it would be a massive improvement for annotation accuracy in non-model bacteria. For example, if I calculate start codon usage, or even overall codon usage, from a reference, could it be used as an optional input to improve start site prediction? Or maybe integrate ORFfinder, as it has an option to predict ORFs with alternate starts?
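To make the "calculate start codon usage from a reference" idea concrete, here is a minimal sketch in plain Python. It only tallies the first three bases of each coding sequence. The function name `start_codon_usage` and the example sequences are hypothetical and made up for illustration; this is not an input format that Bakta or Pyrodigal accepts.

```python
from collections import Counter


def start_codon_usage(cds_sequences):
    """Return the relative frequency of each start codon.

    cds_sequences: iterable of CDS nucleotide strings, each given in
    coding orientation (5' to 3'), so the first three bases are the start codon.
    """
    counts = Counter(seq[:3].upper() for seq in cds_sequences if len(seq) >= 3)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {codon: n / total for codon, n in counts.items()}


# Hypothetical reference CDS sequences, for illustration only
reference_cds = ["GTGACCGAGTAA", "ATGAAACGCTGA", "gtgCCGCTGTAG", "TTGGCGAAATAA"]
usage = start_codon_usage(reference_cds)
for codon, freq in sorted(usage.items()):
    print(f"{codon}\t{freq:.2f}")
```

In practice the sequences would come from the CDS features of a trusted reference annotation rather than a hard-coded list.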

Otherwise love this package and all its developments! Thanks!!

oschwengers commented 1 month ago

Hi and thanks for asking, and yes, that would indeed be a very nice feature. For the structural prediction of CDS, Bakta takes advantage of Pyrodigal. Unfortunately, there is currently no support for custom genetic codes, which was recently discussed: https://github.com/althonos/pyrodigal/issues/59

However, we have a student in our lab who is currently working on a script to automate the tblastn approach. The idea is that you provide your genome and a set of protein sequences, which will be mapped onto the six-frame-translated genome sequence. After some filtering, it will output gene locations in GFF3 format, which can then be used in a normal Bakta run via --regions.
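As a rough illustration of that workflow (not the lab's actual script, which is not shown in this thread), the sketch below converts tblastn tabular hits into GFF3 lines. It assumes the default 12-column BLAST tabular output (`-outfmt 6`). The function name, the identity and e-value thresholds, and the example hits are hypothetical. Raw hit coordinates do not necessarily extend to the true start and stop codons, so real filtering would need more care.

```python
def tblastn_hits_to_gff3(tabular_lines, min_identity=90.0, max_evalue=1e-10):
    """Convert tblastn tabular hits (BLAST -outfmt 6) into GFF3 CDS lines.

    Assumed columns: qseqid sseqid pident length mismatch gapopen
    qstart qend sstart send evalue bitscore.
    The thresholds are illustrative assumptions, not recommended values.
    """
    gff = ["##gff-version 3"]
    for line in tabular_lines:
        if not line.strip() or line.startswith("#"):
            continue
        f = line.rstrip("\n").split("\t")
        qseqid, sseqid = f[0], f[1]
        pident, evalue = float(f[2]), float(f[10])
        sstart, send = int(f[8]), int(f[9])
        if pident < min_identity or evalue > max_evalue:
            continue
        # In tblastn output, sstart > send indicates a hit on the reverse strand
        strand = "+" if sstart <= send else "-"
        start, end = min(sstart, send), max(sstart, send)
        attributes = f"ID={qseqid}_{start}_{end}"
        gff.append("\t".join(
            [sseqid, "tblastn", "CDS", str(start), str(end), f[11], strand, "0", attributes]
        ))
    return gff


# Hypothetical hits, for illustration only
hits = [
    "protA\tcontig_1\t98.5\t100\t1\t0\t1\t100\t301\t600\t1e-50\t200",
    "protB\tcontig_1\t95.0\t80\t4\t0\t1\t80\t1240\t1001\t1e-30\t150",
    "protC\tcontig_1\t60.0\t50\t20\t0\t1\t50\t2000\t2150\t1e-5\t40",
]
gff_lines = tblastn_hits_to_gff3(hits)
print("\n".join(gff_lines))
```

The resulting lines could be written to a file and passed to Bakta via --regions. Whether this minimal GFF3 meets everything Bakta expects from a regions file is not verified here.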

DanielleMStevens commented 1 month ago

Yeah, sadly I am aware. A month or two ago I went through many of Prodigal's open and closed issues to see if a fix had ever been made, with no luck. I thought that a custom Prodigal training file plus the --proteins option and file would catch all the unique genes for this group of organisms, but later realized from some downstream analyses that some were missed.

That would be fantastic and should work, as --regions would be trained on the DNA/protein sequence! I am sure you have plenty, but if you need test genomes with known points of failure, I am happy to supply a couple.

Seriously thanks again for building and maintaining this package!