Macromolecular structure and integron annotation, custom features

alexweisberg commented 2 years ago

Is your feature request related to an existing issue or bug? no

Is your new feature related to a general problem? Typically, annotation of large secretion systems (T3SS, T4SS, T6SS, etc.) can be spotty, and gene names differ a lot between organisms. It's also not clear whether these are predicted to be complete structures. Likewise, it's not always clear where integron systems are located and which genes are in cassettes.

Describe the solution you'd like It would be really nice if Bakta could run MacSyFinder (https://github.com/gem-pasteur/macsyfinder) with the TXSScan models (https://github.com/macsy-models/TXSScan) and incorporate the results into the annotation. Running IntegronFinder (https://github.com/gem-pasteur/Integron_Finder) and marking cassette borders in the annotations would be helpful too. Similarly, IS element boundaries could be marked with ISEScan (https://github.com/xiezhq/ISEScan), and prophage regions with DBSCAN-SWA (https://github.com/HIT-ImmunologyLab/DBSCAN-SWA/).

Describe alternatives you've considered I totally understand that not every analysis tool or pipeline could, or even should, be added to Bakta. If adding these as options is too time-consuming or complex, an alternative would be some way to run them separately and then use a Bakta script to update the annotations with their results. This would also be nice for custom annotation of features from an input table file, such as integrated mobile element repeats, Agrobacterium T-DNA borders, etc.
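To make the table-driven idea concrete, here is a minimal sketch of what such a post-processing step might look like for GFF3 output. It is not an existing Bakta script. The function names, the table columns (seqid, type, start, end, strand, name), the `custom` source label, and the `custom_N` IDs are all assumptions chosen for illustration. It also assumes the GFF3 file may end with a `##FASTA` section, in which case new features have to go before it.

```python
import csv
import io
from urllib.parse import quote


def table_to_gff3_lines(table_text, source="custom"):
    """Convert a tab-separated feature table into GFF3 feature lines.

    Assumed (hypothetical) header: seqid, type, start, end, strand, name.
    Coordinates are 1-based and inclusive, as in GFF3.
    """
    lines = []
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    for i, row in enumerate(reader, start=1):
        start, end = int(row["start"]), int(row["end"])
        if start < 1 or end < start:
            raise ValueError(f"bad coordinates in row {i}: {start}-{end}")
        # Fall back to "." when the strand is missing or not recognised.
        strand = row["strand"] if row["strand"] in {"+", "-", "."} else "."
        # Percent-encode characters that are reserved in GFF3 column 9
        # (";", "=", ",", "&", tabs); spaces are left as-is.
        name = quote(row["name"], safe=" ")
        attributes = f"ID=custom_{i};Name={name}"
        lines.append("\t".join([
            row["seqid"], source, row["type"],
            str(start), str(end), ".", strand, ".", attributes,
        ]))
    return lines


def merge_into_gff3(gff3_text, new_lines):
    """Return GFF3 text with new_lines placed before any ##FASTA section."""
    out = []
    inserted = False
    for line in gff3_text.splitlines():
        if line.startswith("##FASTA") and not inserted:
            out.extend(new_lines)
            inserted = True
        out.append(line)
    if not inserted:
        # No embedded sequences: append the features at the end.
        out.extend(new_lines)
    return "\n".join(out) + "\n"
```

With something like this, a table row for a T-DNA border or an integron cassette boundary becomes one extra feature line in the annotation file. The new features are simply appended rather than sorted by position, so a real script would probably also want to sort them and check that each seqid exists in the file.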

Thanks!

oschwengers commented 2 years ago

Hi @alexweisberg, thanks a lot for reaching out with these thorough considerations. We've thought about this a lot, and I'd love to enhance the annotation of T?SS- and MGE-related proteins as well as add annotations for these structures and the MGEs themselves. Unfortunately, it's hard to decide where to stop in the workflow and which tools to integrate. There are tons of different analyses people are conducting, and all of them could improve the overall annotation of a genome. However, integrating all of them is of course not feasible, both in terms of the effort this would require and the increased runtime it would cause.

Therefore, I currently tend to leave these dedicated analyses out of the Bakta workflow. I'll take a deeper look at all the related HMM & covariance models, which might be useful for the pre-computed annotations (db creation). In my experience, though, these models are a great resource for detecting such features but not necessarily for annotating the proteins. I'm sorry that I cannot be of more help here. I will keep this open so others can add their thoughts on this.

Thanks again!

alexweisberg commented 2 years ago

Hi Oliver, thank you for getting back to me on this. I appreciate the thorough reply. I agree that it may be best to leave these as extra analyses outside of Bakta. Many of these programs change quickly, and what might be "best" differs over time.

For the HMM models, I was thinking they would be more useful for annotating specific DNA sites such as dif sites, promoters, T-DNA borders, etc., rather than protein domains. I have recently been working on scripts for updating gbk/gff files with extra elements or annotations, so I think leaving these out of Bakta for now is fine. Thanks!

oschwengers / bakta

Macromolecular structure and integron annotation, custom features #117