Transfer annotations from similar genome

simone-pignotti commented 1 year ago

Hi, thanks for developing this great tool! I believe a great addition to the workflow would be the option to provide a pre-annotated genome as input, which would make it possible to:

1. speed up annotation for batches of similar genomes (e.g. when sequencing mutants of the same reference strain)
2. facilitate manual curation (after curating a reference genome, no need to manually re-apply the same modifications to others sharing the same features)
3. increase annotation consistency across a genome collection (useful for e.g. pangenomic analyses)

While this is partially covered by the --proteins parameter, there are cases like #216 and #245 where this is not sufficient.

Developing the liftover feature from scratch would be rather challenging, but several tools already exist for this (Liftoff, Flo, nf-LO, TOGA) and should, in theory, work on prokaryotic genomes. Bakta could then accept the partially annotated genome resulting from a liftover pipeline and "finish" the annotation process. For example, given an additional input file containing the alignment to the reference genome (e.g. the intermediate minimap2 output from Liftoff), it could extract the unmapped regions, annotate them and "merge" the output.
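To make the "extract unmapped regions" step concrete, here is a minimal sketch in Python, assuming the alignment is available in minimap2's PAF format (tab-separated, with target name, length, start and end in columns 6 to 9). The function name `unmapped_regions` and the `min_length` parameter are hypothetical and not part of Bakta or Liftoff. Sequences without any alignment never appear in a PAF file, so they would have to be added separately.

```python
from collections import defaultdict


def unmapped_regions(paf_lines, min_length=1):
    """Return {target_name: [(start, end), ...]} for target intervals
    that no alignment covers. Coordinates are 0-based and end-exclusive,
    as in PAF. Hypothetical helper, not part of any existing tool."""
    lengths = {}
    covered = defaultdict(list)
    for line in paf_lines:
        if not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        # PAF columns 6-9: target name, target length, target start, target end
        target, target_len = cols[5], int(cols[6])
        start, end = int(cols[7]), int(cols[8])
        lengths[target] = target_len
        covered[target].append((start, end))

    gaps = {}
    for target, intervals in covered.items():
        intervals.sort()
        result = []
        pos = 0  # end of the region covered so far
        for start, end in intervals:
            if start - pos >= min_length:
                result.append((pos, start))
            pos = max(pos, end)
        # trailing gap between the last alignment and the sequence end
        if lengths[target] - pos >= min_length:
            result.append((pos, lengths[target]))
        gaps[target] = result
    return gaps
```

The resulting intervals could then be cut out and annotated de novo, while the covered parts keep the lifted annotation. A larger `min_length` would avoid annotating tiny gaps between adjacent alignments.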

Thanks again, let me know if something isn't clear. I totally understand if this is not something you'd want to support in bakta, but in any case I believe there's definitely an unmet need for it!

oschwengers commented 1 year ago

OK, since more and more users are asking for a feature like this, I've started a new issue #250 to collect ideas and requirements. Feedback and input are highly welcome!

simone-pignotti commented 1 year ago

Great, thank you very much @oschwengers!

oschwengers commented 11 months ago

OK, having had a deeper look into this, and after merging #250, I think we can address some of the use cases you've described above:

2. For a single genome of interest, CDS coordinates can now be provided via --regions. In addition, accompanying functional annotation can be provided via --proteins.
3. This can be achieved by providing a trusted set of proteins with custom annotations via --proteins.

Of note, both options (--regions / --proteins) can be used independently of each other, so you can specify CDS found a priori and/or provide custom functional annotation for certain protein sequences - either de novo predicted or a priori user-provided.
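As a rough illustration of how the two options combine, the following Python sketch assembles a Bakta command line in which --regions and --proteins are each optional. The helper name `build_bakta_command` and all file names are hypothetical; only the flags --db, --regions, --proteins and --output are taken from Bakta's command line interface, and the accepted file formats should be checked against `bakta --help`.

```python
def build_bakta_command(genome, db, regions=None, proteins=None, output=None):
    """Assemble a Bakta command line as an argument list.

    --regions and --proteins are independent of each other, so each is
    only added when a file is given. Hypothetical helper for illustration.
    """
    cmd = ["bakta", "--db", str(db)]
    if regions is not None:
        # a priori known CDS coordinates
        cmd += ["--regions", str(regions)]
    if proteins is not None:
        # trusted protein sequences carrying custom functional annotation
        cmd += ["--proteins", str(proteins)]
    if output is not None:
        cmd += ["--output", str(output)]
    cmd.append(str(genome))
    return cmd


# Example usage (file names are placeholders):
# import subprocess
# subprocess.run(
#     build_bakta_command("mutant.fasta", "db", regions="curated.gff3",
#                         proteins="curated.faa", output="mutant_annotation"),
#     check=True,
# )
```

Looping over such a helper would cover a batch of genomes, although each run would still go through the regular annotation workflow.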

However, of course, this setup is not ideal in batch situations, so for 2. this would currently not work. Also, 1. is not achieved, since this additional user-provided information is fed into the regular annotation workflow.

Skipping the entire de novo gene prediction would require further command line parameters. Furthermore, since even closely related genomes will not have exact gene start/stop matches, one would have to search for these. As you already mentioned, this is a fairly complex task if you do not want to introduce any false positives. You can find some very brief thoughts on this here: https://github.com/oschwengers/bakta/issues/250#issuecomment-1820468623

Hence, I'll keep this issue open, but put it into the backlog for now. Of course, any further comments, ideas or thoughts are highly welcome!

simone-pignotti commented 11 months ago

Thank you very much @oschwengers, this is already very useful and I think it adds more flexibility and interoperability to bakta 👍 With a companion liftover tool and some scripting, 2. could be almost fully solved. This would cover CDS only, although other lifted features could also be merged into bakta's GFF output with a custom script, ideally as part of the official auxiliary scripts.
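A minimal sketch of what such a merge script might look like, using only Python's standard library and assuming both inputs are plain GFF3 with nine tab-separated columns. The function names and the duplicate check (same sequence, type, coordinates and strand) are assumptions for illustration, not an existing Bakta script. Lifted CDS are skipped here because they would already reach Bakta via --regions.

```python
def parse_gff3_features(lines):
    """Yield GFF3 feature rows as 9-column lists, skipping comments,
    directives and anything after a ##FASTA section."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) == 9:
            yield cols


def merge_lifted_features(bakta_lines, lifted_lines, skip_types=("CDS",)):
    """Add lifted features to Bakta's features and return sorted GFF3 rows.

    A lifted feature is dropped if its type is in skip_types or if Bakta
    already reports a feature with identical sequence, type, start, end
    and strand. Hypothetical helper for illustration.
    """
    merged = list(parse_gff3_features(bakta_lines))
    seen = {(c[0], c[2], c[3], c[4], c[6]) for c in merged}
    for cols in parse_gff3_features(lifted_lines):
        key = (cols[0], cols[2], cols[3], cols[4], cols[6])
        if cols[2] in skip_types or key in seen:
            continue
        seen.add(key)
        merged.append(cols)
    # order by sequence id, then start, then end
    merged.sort(key=lambda c: (c[0], int(c[3]), int(c[4])))
    return ["\t".join(c) for c in merged]
```

A real script would also need to handle overlapping but non-identical features, ID collisions and the GFF3 header lines, which this sketch leaves out.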

Is there any official guideline for auxiliary scripts? Like dependency management, coding style, testing needs...

oschwengers commented 11 months ago

Is there any official guideline for auxiliary scripts? Like dependency management, coding style, testing needs...

Not yet, but in a nutshell:

Dependencies: I'm reluctant to introduce any new dependencies unless they are absolutely required and useful. For the sake of simplicity and code maintenance, I'd strictly opt for Python's standard library modules or existing 3rd-party dependencies.
Coding style: I do not believe that my style is special or even better by any means - I love and try to stick to the KISS principle: keep it simple, stupid. Hence, since I do most of the maintenance work, I'd kindly ask you to stick to the coding style that Bakta and its existing auxiliary scripts use.
Testing: For auxiliary scripts, I would not go for distinct CI tests, but thorough hands-on tests using different genomes should be done before submitting a PR. The test/test_genomes.nf Nextflow script automatically downloads and annotates 49 taxonomically diverse genomes. I regularly use this set for broader tests, so that might be a good starting point.

oschwengers / bakta

Transfer annotations from similar genome #247