tseemann / prokka

:zap: :aquarius: Rapid prokaryotic genome annotation

long CRISPR repeats #209

Closed haruosuz closed 7 years ago

haruosuz commented 7 years ago

"CRISPR repeats range in size from 24 to 48 base pairs.[56]" (https://en.wikipedia.org/wiki/CRISPR). "Analysis of the current CRISPR database [24] reveals that repeats range from 23- to 50-nt long and have an average length of 31 nt" (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2928866/). However, annotating a Roseovarius genome with Prokka v1.11 produced a much longer CRISPR repeat (2477 bp):

 repeat_region   61467..63943
                 /rpt_family="CRISPR"

ctSkennerton commented 7 years ago

CRISPR regions are composed of many repeat units that are each 23 to 50 bp. Prokka is only annotating the bounds of the CRISPR region rather than each individual repeat unit.

tseemann commented 7 years ago

I am using minced to do the prediction; its author is @ctSkennerton.

I am currently only using the -gff option but I could use -gffFull and annotate the repeat units.

          -gff       Output summary results in gff format containing
                     only the positions of the CRISPR arrays. Default: false
          -gffFull   Output detailed results in gff format containing
                     positions of CRISPR arrays and all repeat units. Default: false
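
To make the difference between the two modes concrete, here is a minimal Python sketch that only builds the command line for either one. The `-gff` and `-gffFull` flags are the ones quoted above; passing the FASTA file (and an optional output file) as positional arguments is an assumption about minced's usage, and `genome.fna` and `build_minced_command` are hypothetical names.

```python
import shlex


def build_minced_command(fasta_path, full=False, output_path=None):
    """Build an argument list for minced (hypothetical helper).

    full=False uses -gff (array bounds only); full=True uses -gffFull
    (arrays plus every repeat unit). Positional input/output arguments
    are an assumption, not something confirmed in this thread.
    """
    cmd = ["minced", "-gffFull" if full else "-gff", fasta_path]
    if output_path is not None:
        cmd.append(output_path)
    return cmd


if __name__ == "__main__":
    # Print the two variants; nothing is executed here.
    print(shlex.join(build_minced_command("genome.fna")))
    print(shlex.join(build_minced_command("genome.fna", full=True)))
```

The list could be handed to `subprocess.run` once minced is on the PATH.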

tseemann commented 7 years ago

Here is the meaning of the different rpt_type values: http://www.insdc.org/controlled-vocabulary-rpttype-qualifier

I found this example in GenBank: http://www.genome.jp/dbget-bin/www_bget?refseq+NC_015970

     repeat_region   46177..46632
                     /inference="COORDINATES: alignment:crt:1.2"
                     /inference="COORDINATES: alignment:pilercr:v1.02"
                     /rpt_family="CRISPR"
                     /rpt_type=direct
                     /rpt_unit_range=46177..46204
                     /rpt_unit_seq="gggtcatccctgcgcgcgcgggagtcgg"

Here is an example of minced -gffFull:

##gff-version 3
gi|384860682|ref|NC_017341.1|   minced:0.2.0    CRISPR  2421118 2421311 4       .       .       ID=CRISPR1
gi|384860682|ref|NC_017341.1|   minced:0.2.0    repeat_unit     2421118 2421140 1       .       .       Parent=CRISPR1;ID=DR1
gi|384860682|ref|NC_017341.1|   minced:0.2.0    repeat_unit     2421174 2421196 1       .       .       Parent=CRISPR1;ID=DR2
gi|384860682|ref|NC_017341.1|   minced:0.2.0    repeat_unit     2421233 2421255 1       .       .       Parent=CRISPR1;ID=DR3
gi|384860682|ref|NC_017341.1|   minced:0.2.0    repeat_unit     2421289 2421311 1       .       .       Parent=CRISPR1;ID=DR4
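
As a rough illustration of how the -gffFull records could be turned into per-array repeat units, here is a minimal Python sketch. It assumes tab-separated GFF3 columns and the `ID=` / `Parent=` attributes shown above; `group_repeat_units` is a hypothetical helper, not Prokka's actual code.

```python
from collections import defaultdict

# The -gffFull example from above, with tab-separated columns.
GFF_TEXT = """\
##gff-version 3
gi|384860682|ref|NC_017341.1|\tminced:0.2.0\tCRISPR\t2421118\t2421311\t4\t.\t.\tID=CRISPR1
gi|384860682|ref|NC_017341.1|\tminced:0.2.0\trepeat_unit\t2421118\t2421140\t1\t.\t.\tParent=CRISPR1;ID=DR1
gi|384860682|ref|NC_017341.1|\tminced:0.2.0\trepeat_unit\t2421174\t2421196\t1\t.\t.\tParent=CRISPR1;ID=DR2
gi|384860682|ref|NC_017341.1|\tminced:0.2.0\trepeat_unit\t2421233\t2421255\t1\t.\t.\tParent=CRISPR1;ID=DR3
gi|384860682|ref|NC_017341.1|\tminced:0.2.0\trepeat_unit\t2421289\t2421311\t1\t.\t.\tParent=CRISPR1;ID=DR4
"""


def parse_attributes(field):
    """Turn 'Parent=CRISPR1;ID=DR1' into {'Parent': 'CRISPR1', 'ID': 'DR1'}."""
    return dict(item.split("=", 1) for item in field.split(";") if item)


def group_repeat_units(gff_text):
    """Return {array_id: {'seqid', 'start', 'end', 'units'}} from GFF3 text."""
    arrays = {}
    units = defaultdict(list)
    for line in gff_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and directives such as ##gff-version
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # not a well-formed GFF3 feature line
        seqid, ftype = cols[0], cols[2]
        start, end = int(cols[3]), int(cols[4])
        attrs = parse_attributes(cols[8])
        if ftype == "CRISPR":
            arrays[attrs["ID"]] = {"seqid": seqid, "start": start, "end": end}
        elif ftype == "repeat_unit":
            units[attrs["Parent"]].append((start, end))
    for array_id, info in arrays.items():
        info["units"] = sorted(units[array_id])
    return arrays


if __name__ == "__main__":
    for array_id, info in group_repeat_units(GFF_TEXT).items():
        lengths = [end - start + 1 for start, end in info["units"]]
        print(array_id, info["start"], info["end"], len(info["units"]), lengths)
```

Each `(start, end)` pair in `units` could then become its own feature, or feed a `/rpt_unit_range` qualifier like the one in the GenBank example.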

And of minced -spacers:

Sequence 'gi|384860682|ref|NC_017341.1|' (2924344 bp)
CRISPR 1   Range: 2421118 - 2421311
POSITION        REPEAT                          SPACER
--------        ----------------------- ----------------------------------
2421118         TGTTGGGGCCCCGCCAACTTGCA CATTATTGTATGCTGACTTTTCGTCACCTTCTG       [ 23, 33 ]
2421174         TGTTGGGGCCCCGTTCCCCAACT TGCATTGTCTGTAGAATTTCTTTTTGAAATTCTCTA    [ 23, 36 ]
2421233         TGTTGGGGCCCCGCCAACTTGCA CATTATTGTAAGCTGACTTTCTGTCAGCTTCTG       [ 23, 33 ]
2421289         TGTTGGGGCCCCGCCAACTTGTA
--------        ----------------------- ----------------------------------
Repeats: 4      Average Length: 23              Average Length: 34
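
For anyone who wants to check the table by hand, here is a small Python sketch that reads the four repeat rows of the -spacers output and confirms that each position plus its repeat and spacer lengths lands on the next position. It assumes whitespace-separated columns as displayed above; `parse_spacers_table` and `is_contiguous` are hypothetical helpers.

```python
# The -spacers example from above, one repeat per row.
SPACERS_TEXT = """\
Sequence 'gi|384860682|ref|NC_017341.1|' (2924344 bp)
CRISPR 1   Range: 2421118 - 2421311
POSITION        REPEAT                          SPACER
--------        ----------------------- ----------------------------------
2421118         TGTTGGGGCCCCGCCAACTTGCA CATTATTGTATGCTGACTTTTCGTCACCTTCTG       [ 23, 33 ]
2421174         TGTTGGGGCCCCGTTCCCCAACT TGCATTGTCTGTAGAATTTCTTTTTGAAATTCTCTA    [ 23, 36 ]
2421233         TGTTGGGGCCCCGCCAACTTGCA CATTATTGTAAGCTGACTTTCTGTCAGCTTCTG       [ 23, 33 ]
2421289         TGTTGGGGCCCCGCCAACTTGTA
--------        ----------------------- ----------------------------------
Repeats: 4      Average Length: 23              Average Length: 34
"""


def parse_spacers_table(text):
    """Return (position, repeat, spacer) tuples; spacer is None on the last row."""
    rows = []
    for line in text.splitlines():
        tokens = line.split()
        if not tokens or not tokens[0].isdigit():
            continue  # headers, separators and the summary line
        position = int(tokens[0])
        repeat = tokens[1]
        # The final repeat of an array has no spacer after it.
        spacer = tokens[2] if len(tokens) > 2 and tokens[2] != "[" else None
        rows.append((position, repeat, spacer))
    return rows


def is_contiguous(rows):
    """Check that position + repeat length + spacer length gives the next position."""
    for (pos, repeat, spacer), (next_pos, _, _) in zip(rows, rows[1:]):
        if spacer is None or pos + len(repeat) + len(spacer) != next_pos:
            return False
    return True


if __name__ == "__main__":
    rows = parse_spacers_table(SPACERS_TEXT)
    print([len(repeat) for _, repeat, _ in rows])
    print([len(spacer) for _, _, spacer in rows if spacer is not None])
    print(is_contiguous(rows))
```

The repeat and spacer lengths it recovers can be compared against the bracketed `[ repeat, spacer ]` values at the end of each row.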

tseemann commented 7 years ago

I have updated Prokka to at least tell you how many repeat units there are. I will look at adding repeat units in v1.13.

     repeat_region   2421118..2421311
                     /note="CRISPR with 4 repeat units"
                     /rpt_family="CRISPR"
                     /rpt_type=direct
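
For illustration only, a feature block like the one above could be assembled from an array's coordinates and its repeat-unit count roughly as follows. This is a Python sketch with a hypothetical `format_crispr_feature` helper, not Prokka's actual implementation; the column widths are simply copied from the example.

```python
def format_crispr_feature(start, end, n_units):
    """Format a repeat_region feature in the flat-file layout shown above.

    The feature key is indented 5 spaces and the location and qualifiers
    start at column 22, matching the example (an assumption about layout).
    """
    key_line = " " * 5 + "repeat_region".ljust(16) + f"{start}..{end}"
    indent = " " * 21
    qualifiers = [
        f'/note="CRISPR with {n_units} repeat units"',
        '/rpt_family="CRISPR"',
        "/rpt_type=direct",
    ]
    return "\n".join([key_line] + [indent + q for q in qualifiers])


if __name__ == "__main__":
    print(format_crispr_feature(2421118, 2421311, 4))
```

The count passed as `n_units` would come from the number of repeat_unit records under the array in the minced -gffFull output.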