add CRISPR spacer output

oschwengers / bakta

Rapid & standardized annotation of bacterial genomes, MAGs & plasmids

GNU General Public License v3.0

add CRISPR spacer output #171

Closed alexweisberg closed 1 year ago

alexweisberg commented 2 years ago

Once again, thanks for this great software; Bakta works really well. I have a feature request regarding the CRISPR annotation. As of now, it looks like Bakta annotates CRISPR arrays as something like the following:

contig1 PILER-CR CRISPR 177465 178532 . ? . ID=PEDKEEMDFJ_60;Name=CRISPR array with 18 repeats of length 28%2C consensus sequence GGACCATCCCCGCTTGCGCGGGGAATAC and spacer length 33;product=CRISPR array with 18 repeats of length 28%2C consensus sequence GGACCATCCCCGCTTGCGCGGGGAATAC and spacer length 33
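For readers who want to work with this record as it stands, here is a minimal sketch of unpacking such a line in Python. It assumes the nine columns are tab-separated (as GFF3 requires) and that the `product` text follows the wording shown above; the function names `parse_gff3_line` and `summarize_crispr` are made up for illustration and are not part of Bakta.

```python
import re
from urllib.parse import unquote

GFF3_COLUMNS = ("seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes")

# Pattern for the product text shown in the example line (an assumption:
# other versions may word this differently).
CRISPR_PRODUCT = re.compile(
    r"CRISPR array with (\d+) repeats of length (\d+), "
    r"consensus sequence ([ACGTN]+) and spacer length (\d+)"
)


def parse_gff3_line(line):
    """Split one GFF3 feature line into a dict and decode its attributes."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    record = dict(zip(GFF3_COLUMNS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    attributes = {}
    for pair in record["attributes"].split(";"):
        if pair:
            key, _, value = pair.partition("=")
            # GFF3 percent-encodes reserved characters, e.g. %2C for a comma.
            attributes[unquote(key)] = unquote(value)
    record["attributes"] = attributes
    return record


def summarize_crispr(record):
    """Pull the numbers out of the product text, or return None if absent."""
    match = CRISPR_PRODUCT.search(record["attributes"].get("product", ""))
    if match is None:
        return None
    return {
        "repeats": int(match.group(1)),
        "repeat_length": int(match.group(2)),
        "consensus": match.group(3),
        "spacer_length": int(match.group(4)),
    }


description = ("CRISPR array with 18 repeats of length 28%2C consensus sequence "
               "GGACCATCCCCGCTTGCGCGGGGAATAC and spacer length 33")
example = "\t".join([
    "contig1", "PILER-CR", "CRISPR", "177465", "178532", ".", "?", ".",
    f"ID=PEDKEEMDFJ_60;Name={description};product={description}",
])
record = parse_gff3_line(example)
summary = summarize_crispr(record)
```

This recovers only the array-level summary (repeat count, repeat length, consensus and spacer length); the individual spacer positions and sequences are not present in the line.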

It would be nice if Bakta also produced an output file (or had an option to include this in the full output) containing the positions and sequence of each CRISPR spacer as well.

This seems to be produced by pilercr: https://www.drive5.com/pilercr/example.htm

But it is not incorporated into the final GFF3 or GBFF files in Bakta. Given that some arrays can be very long, I can see how this would crowd the output. But it would be nice to have a separate file listing these so that we don't have to rerun PILER-CR to get them. Thanks!

oschwengers commented 2 years ago

Thanks @alexweisberg for asking. I totally see your point about providing more detailed CRISPR annotations, since these are available. However, PILER-CR's output format is not the easiest to parse, and there are interesting alternatives out there. Hence, before expanding and re-implementing the PILER-CR parser, I'd rather take a closer look at these alternatives, which might provide even further information, such as the orientation and type of the CRISPR array.

alexweisberg commented 2 years ago

Ah I see. Yes I agree, that is a good idea to consider other options first. Thanks!

oschwengers commented 1 year ago

Hi @alexweisberg, I know this was almost a year ago, but nevertheless I wanted to let you know that I have followed up on your request regarding CRISPR spacers. After checking several alternatives, I came to the conclusion that, unfortunately, the more recent and better alternatives do not seem to be suitable replacements for PILER-CR in Bakta for purely technical reasons, for instance too many dependencies.

Hence, I improved the current PILER-CR parser to extract the positions and sequences of all CRISPR spacers. They are now stored within the JSON, GFF3 and TSV output files. The INSDC file formats (GenBank/EMBL) unfortunately do not have suitable feature keys for this. If I have overlooked something here, please tell me.
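As a rough illustration of how a stored spacer position maps back to sequence, the sketch below slices a contig by 1-based, inclusive coordinates, the convention GFF3 uses. The helper names and the toy contig are hypothetical; the exact field names Bakta uses for spacers in its JSON, GFF3 and TSV files are not shown in this thread, so none are assumed here.

```python
# Translation table for complementing unambiguous DNA bases.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract_feature(contig_seq, start, end, strand="+"):
    """Slice a feature out of a contig using 1-based, inclusive coordinates."""
    if not (1 <= start <= end <= len(contig_seq)):
        raise ValueError(f"coordinates {start}..{end} fall outside the contig")
    subsequence = contig_seq[start - 1:end]
    return reverse_complement(subsequence) if strand == "-" else subsequence


# Toy contig (made up): flank, repeat, spacer, repeat, flank.
toy_contig = "AAAA" + "GGACC" + "TTTGCA" + "GGACC" + "CC"
spacer = extract_feature(toy_contig, 10, 15)
spacer_minus = extract_feature(toy_contig, 10, 15, strand="-")
```

Comparing a sequence extracted this way with the spacer sequence reported in the output files is one way to sanity-check coordinates after downstream editing of a genome.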

I just merged #249, which will be available in the upcoming 1.9.0 release. Thanks for your patience on this. I'll close this for now, but please feel free to add any thoughts or comments, and do not hesitate to re-open it. Best regards!

alexweisberg commented 1 year ago

Hi Oliver,

Thank you for looking into this! I appreciate it. Having them in the GFF3 and TSV output is useful. Perhaps a "misc_feature" in the GenBank file would work? Or a "regulatory" or "repeat_region" feature as described here: https://www.ncbi.nlm.nih.gov/refseq/functionalelements/. This could mark the position of either the spacers themselves or the repeat borders. If not, having them in GFF3 is useful.
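To make the suggestion concrete, here is a hedged sketch of what such entries could look like in a GenBank feature table: a `repeat_region` for one repeat and a `misc_feature` for the adjacent spacer. This is not Bakta output. The coordinates are illustrative only (they assume the first 28 bp repeat starts at the array start from the example above), and the helper `format_feature` is hypothetical. It does not wrap long qualifier lines, which a complete writer would need to do.

```python
# Qualifiers whose values are written without quotes in the feature table.
UNQUOTED_QUALIFIERS = {"rpt_type"}


def format_feature(key, start, end, strand="+", qualifiers=None):
    """Render one feature in GenBank flat-file layout.

    The feature key starts in column 6 and the location in column 22;
    qualifier lines are indented to column 22.
    """
    location = f"{start}..{end}"
    if strand == "-":
        location = f"complement({location})"
    lines = [" " * 5 + key.ljust(16) + location]
    for name, value in (qualifiers or {}).items():
        if name in UNQUOTED_QUALIFIERS:
            text = value
        else:
            # Embedded double quotes are escaped by doubling them.
            text = '"' + value.replace('"', '""') + '"'
        lines.append(" " * 21 + f"/{name}={text}")
    return "\n".join(lines)


# Illustrative coordinates only: first repeat and first spacer of the example array.
repeat_entry = format_feature(
    "repeat_region", 177465, 177492,
    qualifiers={"rpt_type": "direct", "rpt_family": "CRISPR"},
)
spacer_entry = format_feature(
    "misc_feature", 177493, 177525,
    qualifiers={"note": "CRISPR spacer"},
)
print(repeat_entry)
print(spacer_entry)
```

Marking the repeats rather than the spacers has the advantage that `repeat_region` already carries repeat-specific qualifiers, while spacers would have to rely on a free-text `/note`.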

Best, Alex