Open simone-pignotti opened 1 year ago
Ok, since more and more users ask for a feature like this, I've started a new issue #250 collecting ideas and requirements. Feedback and input are highly welcome!
Great, thank you very much @oschwengers!
OK, having a deeper look into this, and merging #250, I think we can address some of the use cases you've described above:
CDS coordinates can now be provided via --regions. In addition, accompanying functional annotation can be provided via --proteins.
Of note, both options (--regions / --proteins) can be used independently from each other, so you can specify an a priori found CDS and/or provide custom functional annotation for certain protein sequences, either de novo predicted or a priori user-provided.
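As a rough illustration of how the two options combine, here is a minimal Python sketch that assembles a bakta command line with either, both, or neither of them. Only --regions and --proteins come from the comment above; the `build_bakta_command` helper, the file names, and the use of `--db` for the database path are assumptions made for this example, so check `bakta --help` before relying on it.

```python
from pathlib import Path
from typing import List, Optional


def build_bakta_command(
    genome: Path,
    db: Path,
    regions: Optional[Path] = None,
    proteins: Optional[Path] = None,
) -> List[str]:
    """Assemble an argument list for bakta (hypothetical helper).

    --regions and --proteins are added independently of each other,
    mirroring the description above. The --db flag is an assumption.
    """
    cmd = ["bakta", "--db", str(db)]
    if regions is not None:
        # A priori known CDS coordinates
        cmd += ["--regions", str(regions)]
    if proteins is not None:
        # Custom functional annotation for selected protein sequences
        cmd += ["--proteins", str(proteins)]
    cmd.append(str(genome))
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch has no side effects.
    print(" ".join(build_bakta_command(
        Path("genome.fna"), Path("db"),
        regions=Path("known_cds.gff3"), proteins=Path("custom.faa"),
    )))
```

In a batch setting such a helper would be called once per genome, which is exactly where the per-genome input files become cumbersome.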
However, of course, this setup is not ideal in batch situations, so this would currently not work for 2. Also, 1. does not hold, since this additional user-provided information is fed into the regular annotation workflow.
Skipping the entire de novo gene prediction would require further command line parameters. Furthermore, since even closely related genomes will not have exact gene start/stop matches, one would have to search for these. As you already mentioned, this is a fairly complex task if you do not want to introduce any false positives. You can find some very brief thoughts on this here: https://github.com/oschwengers/bakta/issues/250#issuecomment-1820468623
Hence, I'll keep this issue open, but put it into the backlog for now. Of course, any further comments, ideas, or thoughts are highly welcome!
Thank you very much @oschwengers, this is already very useful and I think it adds more flexibility and interoperability to bakta 👍 With a companion liftover tool and some scripting, 2. could be almost fully solved (only for CDS, although other lifted features could also be merged into bakta's GFF output with a custom script, ideally part of the official auxiliary scripts).
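To make the "custom script" idea a bit more concrete, here is a minimal Python sketch of merging lifted features into a GFF3 file. It is not an existing bakta auxiliary script: the `merge_gff_features` name is hypothetical, it assumes plain tab-separated GFF3 with an optional `##FASTA` section, and it does no overlap or duplicate handling, which a real script would need.

```python
from typing import List


def _feature_rows(gff_text: str) -> List[List[str]]:
    """Return the tab-split feature lines of a GFF3 text."""
    rows = []
    for line in gff_text.splitlines():
        if line.startswith("##FASTA"):
            break  # embedded sequences follow; no more feature lines
        if line and not line.startswith("#"):
            rows.append(line.split("\t"))
    return rows


def merge_gff_features(bakta_gff: str, lifted_gff: str) -> List[str]:
    """Combine feature lines from both inputs, sorted by seqid, start, end.

    Directives, comments and the FASTA section are dropped; the caller
    is expected to re-add a header when writing the result.
    """
    merged = _feature_rows(bakta_gff) + _feature_rows(lifted_gff)
    # GFF3 columns: 0 = seqid, 3 = start, 4 = end (1-based, inclusive)
    merged.sort(key=lambda row: (row[0], int(row[3]), int(row[4])))
    return ["\t".join(row) for row in merged]
```

Sorting by sequence ID and coordinates keeps the merged file in the order most genome browsers and downstream tools expect.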
Is there any official guideline for auxiliary scripts? Like dependency management, coding style, testing needs...
Not yet, but in a nutshell:
Hi, thanks for developing this great tool! I believe a great addition to the workflow would be the possibility to provide a pre-annotated genome as input, allowing to:
While this is partially covered by the --proteins parameter, there are cases like #216 and #245 where this is not sufficient.
Developing the liftover feature from scratch would be rather challenging, but there are several tools for that (Liftoff, Flo, nf-LO, TOGA) which theoretically should work on prokaryotic genomes. Bakta could then accept the partially annotated genome resulting from a liftover pipeline, and "finish" the annotation process. For example, with an additional input file containing the alignment to the reference genome (e.g. minimap2 intermediary output from Liftoff), it could extract unmapped regions, annotate them and "merge" the output.
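The "extract unmapped regions" step could look roughly like the Python sketch below. It assumes the alignment has already been parsed into intervals on the target sequence (0-based, half-open, all within the sequence length); the `unmapped_regions` function and this interval representation are illustrative choices, not part of bakta or Liftoff.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # 0-based, half-open: [start, end)


def unmapped_regions(seq_len: int, aligned: List[Interval]) -> List[Interval]:
    """Return the parts of [0, seq_len) not covered by any aligned interval.

    Overlapping or nested aligned intervals are handled by sweeping
    through them in sorted order and tracking the furthest covered end.
    """
    gaps: List[Interval] = []
    cursor = 0  # everything before this position is covered or already reported
    for start, end in sorted(aligned):
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < seq_len:
        gaps.append((cursor, seq_len))
    return gaps


if __name__ == "__main__":
    # A 100 bp contig with three aligned blocks, two of which overlap.
    print(unmapped_regions(100, [(10, 30), (20, 50), (80, 100)]))
```

The resulting gaps are the regions that would still need de novo annotation before merging with the lifted features.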
Thanks again, let me know if something isn't clear. I totally understand if this is not something you'd want to support in bakta, but in any case I believe there's definitely an unmet need for it!