Thanks @alexweisberg for asking. I totally see your point about providing more detailed CRISPR annotations, since these are available. However, pilercr's output format is not easy to parse, and there are interesting alternatives out there. Hence, before expanding and re-implementing the pilercr parser, I'd rather take a closer look at these alternatives, which might provide even more information, for example the orientation and type of the CRISPR array.
Ah, I see. Yes, I agree, it is a good idea to consider other options first. Thanks!
Hi @alexweisberg, I know this was almost a year ago, but I wanted to let you know that I have followed up on your request regarding CRISPR spacers. After checking several alternatives, I came to the conclusion that, unfortunately, the more recent and better tools are not suitable replacements for PILER-CR in Bakta for purely technical reasons, for instance too many dependencies.
Hence, I improved the current PILER-CR parser to extract the positions and sequences of all CRISPR spacers. They are now stored in the JSON, GFF3 and TSV output files. The INSDC file formats (GenBank/EMBL) unfortunately do not have suitable feature keys for this. If I have overlooked something here, please tell me.
I just merged #249, which will be available in the upcoming 1.9.0 release. Thanks for your patience on this. I'll close this for now, but please feel free to add any thoughts or comments, and do not hesitate to re-open it.
Best regards!
Hi Oliver,
Thank you for looking into this! I appreciate it. Having them in the gff3 and tsv output is useful. Perhaps a “misc_feature” in the GenBank file would work? Or a ‘regulatory’ or ‘repeat_region’ feature as described here: https://www.ncbi.nlm.nih.gov/refseq/functionalelements/. This could mark the position of either the spacers themselves or the repeat borders. If not, having them in gff3 is useful.
Best, Alex
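To make the GenBank suggestion above more concrete, here is a minimal sketch of how such entries might be laid out in an INSDC feature table. It is not Bakta code: `format_feature` is a hypothetical helper, the array coordinates and consensus sequence are taken from the GFF3 line in this issue, and the spacer coordinates are invented purely for illustration. The qualifiers used (`/rpt_family`, `/rpt_type`, `/rpt_unit_seq`, `/note`) are standard INSDC qualifiers, but whether this combination is the right way to represent CRISPR arrays and spacers is an assumption.

```python
def format_feature(key, start, end, qualifiers):
    """Render one feature-table entry in GenBank flat-file layout.

    The feature key starts in column 6 and the location in column 22;
    qualifier lines are indented to column 22 as well.
    `qualifiers` is a list of (name, value, quoted) tuples.
    """
    lines = [" " * 5 + key.ljust(16) + f"{start}..{end}"]
    for name, value, quoted in qualifiers:
        rendered = f'"{value}"' if quoted else str(value)
        lines.append(" " * 21 + f"/{name}={rendered}")
    return "\n".join(lines)


# Whole array as a repeat_region, using the coordinates and consensus
# sequence from the GFF3 example in this issue.
array_entry = format_feature(
    "repeat_region", 177465, 178532,
    [
        ("rpt_family", "CRISPR", True),
        ("rpt_type", "direct", False),
        ("rpt_unit_seq", "ggaccatccccgcttgcgcggggaatac", True),
    ],
)

# One spacer as a misc_feature. These coordinates are HYPOTHETICAL:
# the issue does not list individual spacer positions.
spacer_entry = format_feature(
    "misc_feature", 177493, 177525,
    [("note", "CRISPR spacer", True)],
)

print(array_entry)
print(spacer_entry)
```

Long arrays would produce one `misc_feature` per spacer, so a real implementation would probably want this to be optional.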
Once again, thanks for this great software, bakta works really well. I have a feature request regarding the CRISPR annotation. As of now, it looks like bakta annotates CRISPR arrays as something like the following:
contig1 PILER-CR CRISPR 177465 178532 . ? . ID=PEDKEEMDFJ_60;Name=CRISPR array with 18 repeats of length 28%2C consensus sequence GGACCATCCCCGCTTGCGCGGGGAATAC and spacer length 33;product=CRISPR array with 18 repeats of length 28%2C consensus sequence GGACCATCCCCGCTTGCGCGGGGAATAC and spacer length 33
It would be nice if bakta also produced an output file (or had an option to include in the full output) that contained the positions and sequence of each CRISPR spacer as well.
This seems to be produced by pilercr: https://www.drive5.com/pilercr/example.htm
But it is not incorporated into the final gff3 or gbff files in bakta. Given that some arrays can be very long, I can see how this would crowd the output. But it would be nice to have a separate file listing these, so that we don't have to rerun pilercr to get them. Thanks!
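For reference, here is a minimal sketch of what can currently be recovered from the GFF3 line quoted above. It assumes the product text always follows the wording shown in that example; `parse_crispr_line` and `PRODUCT_RE` are names made up for this sketch and are not part of bakta.

```python
import re
from urllib.parse import unquote

# The example line from this issue. Real GFF3 columns are tab-separated.
GFF3_LINE = "\t".join([
    "contig1", "PILER-CR", "CRISPR", "177465", "178532", ".", "?", ".",
    "ID=PEDKEEMDFJ_60;"
    "Name=CRISPR array with 18 repeats of length 28%2C consensus sequence "
    "GGACCATCCCCGCTTGCGCGGGGAATAC and spacer length 33;"
    "product=CRISPR array with 18 repeats of length 28%2C consensus sequence "
    "GGACCATCCCCGCTTGCGCGGGGAATAC and spacer length 33",
])

# Assumed wording of the product attribute, based only on the example above.
PRODUCT_RE = re.compile(
    r"CRISPR array with (\d+) repeats of length (\d+), "
    r"consensus sequence ([ACGTN]+) and spacer length (\d+)"
)


def parse_crispr_line(line):
    """Extract the array summary from one CRISPR line of the GFF3 output."""
    cols = line.rstrip("\n").split("\t")
    attrs = {}
    for pair in cols[8].split(";"):
        if pair:
            key, _, value = pair.partition("=")
            attrs[key] = unquote(value)  # decodes %2C back to a comma
    match = PRODUCT_RE.search(attrs.get("product", ""))
    if match is None:
        raise ValueError("unexpected product format")
    repeats, repeat_len, consensus, spacer_len = match.groups()
    return {
        "contig": cols[0],
        "start": int(cols[3]),
        "end": int(cols[4]),
        "repeats": int(repeats),
        "repeat_length": int(repeat_len),
        "consensus": consensus,
        "spacer_length": int(spacer_len),
    }


array = parse_crispr_line(GFF3_LINE)
span = array["end"] - array["start"] + 1
nominal = (array["repeats"] * array["repeat_length"]
           + (array["repeats"] - 1) * array["spacer_length"])
print(array)
print(span, nominal)
```

In this example the feature spans 1068 bp, while 18 repeats of 28 bp plus 17 spacers of 33 bp add up to 1065 bp. The summary alone therefore does not pin down each spacer's position, let alone its sequence, which is why the per-spacer output from pilercr would be needed.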