oschwengers / bakta

Rapid & standardized annotation of bacterial genomes, MAGs & plasmids
GNU General Public License v3.0

CRISPR detection error or something else? #242

Closed Jigyasa3 closed 10 months ago

Jigyasa3 commented 12 months ago

Hi all, Thanks again for generating an all-in-one annotation pipeline with great documentation!

This is in reference to issue #232. I am getting a similar error, where BAKTA cannot find an existing CRISPR array in some samples. I compared the results with the online version of CRISPRCasFinder and found that a CRISPR array is present in these genomes. Interestingly, in my case, the samples that BAKTA annotated as lacking a CRISPR array but CRISPRCasFinder annotated as containing one did not have Cas proteins. Below are some examples. All the genomes are publicly available on NCBI/IMG, so the results can be reproduced.

genomeID | BAKTA_CRISPRannotation | CRISPRCasFinder_annotation | Cas_present
RS_GCF_000620465.1 | NO | NO |  
BraspOAE829 | NO | YES | NO
BurspOAS925 | NO | YES | NO
RS_GCF_000419565.1 | NO | YES | NO
RS_GCF_000698265.1 | NO | NO |  
RS_GCF_000007565.2 | NO | YES | NO
RS_GCF_000279285.1 | YES | YES | YES <positive control>

So my question is: is there a bug or some filtering feature in BAKTA that produces this output? Looking forward to your reply!

Jigyasa3 commented 12 months ago

Hi,

I have some more updates. I compared the RS_GCF_000279285.1 results from BAKTA and CRISPRCasFinder. BAKTA finds one CRISPR array in this sample (start=602511, stop=603618), while CRISPRCasFinder finds two (screenshot below). I think BAKTA applies a stringent cutoff when detecting CRISPR arrays, and this is probably what leads to the missed annotation. I examined the CRISPR array at start=92144, stop=92347, which BAKTA misses, and this array is associated with the Tn7 system! Any suggestions for decreasing or removing the stringency of CRISPR annotation in BAKTA?

CRISPRCasFinder results:

[Screenshot, 2023-09-22 at 6:49:30 PM]

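The comparison described above can be sketched as a simple interval check. This is only an illustration: the function names are hypothetical, the coordinates are the ones quoted in this comment, and the second CRISPRCasFinder array is left out because its coordinates appear only in the screenshot.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, stop), inclusive coordinates


def overlaps(a: Interval, b: Interval) -> bool:
    """True if the two inclusive intervals share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


def missed_arrays(reference: List[Interval], query: List[Interval]) -> List[Interval]:
    """Arrays in `reference` that no array in `query` overlaps."""
    return [r for r in reference if not any(overlaps(r, q) for q in query)]


# Coordinates quoted in the comment for RS_GCF_000279285.1.
bakta_arrays = [(602511, 603618)]
# Only the array whose coordinates are given in the text is listed here.
crisprcasfinder_arrays = [(92144, 92347)]

missed = missed_arrays(crisprcasfinder_arrays, bakta_arrays)
print(missed)
```

With per-genome lists of array coordinates from both tools, the same check would show which calls are unique to one tool.
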
oschwengers commented 11 months ago

Hi @Jigyasa3, thanks a lot for reaching out and this deeper comparison!

Bakta only executes PILER-CR with default parameters and accepts whatever it predicts. Hence, there are no CRISPR-related filters within Bakta. According to its usage message, the default parameters of PILER-CR are:

Criteria for CRISPR detection, defaults in parentheses:
   -minarray <N>          Must be at least <n> repeats in array (3).
   -mincons <F>           Minimum conservation (0.9).
                            At least N repeats must have identity
                            >= F with the consensus sequence.
                            Value is in range 0 .. 1.0.
                            It is recommended to use a value < 1.0
                            because using 1.0 may suppress true
                            arrays due to boundary misidentification.
   -minrepeat <L>         Minimum repeat length (16).
   -maxrepeat <L>         Maximum repeat length (64).
   -minspacer <L>         Minimum spacer length (8).
   -maxspacer <L>         Maximum spacer length (64).
   -minrepeatratio <R>    Minimum repeat ratio (0.9).
   -minspacerratio <R>    Minimum spacer ratio (0.75).
                            'Ratios' are defined as minlength / maxlength,
                            thus a value close to 1.0 requires lengths to
                            be similar, 1.0 means identical lengths.
                            Spacer lengths sometimes vary significantly, so
                            the default ratio is smaller. As with -mincons,
                            using 1.0 is not recommended.

Parameters for creating local alignments:
   -minhitlength <L>      Minimum alignment length (16).
   -minid <F>             Minimum identity (0.94).
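For anyone who wants to experiment with less stringent settings, a standalone PILER-CR run outside of Bakta could be assembled as in the minimal sketch below. The helper `build_pilercr_cmd` is hypothetical, the file names are placeholders, and the relaxed values are purely illustrative, not recommendations. The `-in`/`-out` options are assumed from PILER-CR's command line, so please check them against the usage output of your installed version.

```python
import shlex
from typing import Dict, List, Optional, Union

# Flag names copied from the usage text above.
KNOWN_FLAGS = {
    "minarray", "mincons", "minrepeat", "maxrepeat", "minspacer",
    "maxspacer", "minrepeatratio", "minspacerratio", "minhitlength", "minid",
}


def build_pilercr_cmd(
    fasta: str,
    report: str,
    overrides: Optional[Dict[str, Union[int, float]]] = None,
) -> List[str]:
    """Build a PILER-CR command line; flags not overridden keep their defaults."""
    cmd = ["pilercr", "-in", fasta, "-out", report]  # -in/-out: assumed options
    for flag, value in sorted((overrides or {}).items()):
        if flag not in KNOWN_FLAGS:
            raise ValueError(f"unknown PILER-CR flag: {flag}")
        cmd += [f"-{flag}", str(value)]
    return cmd


# Illustrative relaxed thresholds (defaults are 0.9 and 0.94).
cmd = build_pilercr_cmd("genome.fna", "crispr.txt", {"mincons": 0.8, "minid": 0.9})
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where `pilercr` is installed. Relaxed thresholds are likely to increase false positives, so any extra arrays should be checked manually.
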

Of course, the various prediction tools (PILER-CR, CRISPRCasFinder, MinCED, etc.) produce slightly different results. Sometimes one is better than another, and vice versa. Hence, it's hard to tell which one is right.

I have the feeling that CRISPRCasFinder is better, and certainly better maintained, than PILER-CR, which makes it a very interesting candidate for Bakta. However, we'd need to add a bunch of additional dependencies, and CRISPRCasFinder also requires its own database that we would need to handle somehow. I'm not saying this is a no-go, but currently I'm a bit reluctant to add all this extra complexity to Bakta.

oschwengers commented 10 months ago

Hi @Jigyasa3, this does not solve your issue, but it is somewhat related: I just merged a new PR, #249, improving the CRISPR information. Bakta now also provides information on the CRISPR spacer sequences. Maybe this is of interest to you.

oschwengers commented 10 months ago

So, having thought about this a bit longer, I think this is a regular case of different tools producing varying outputs. Hence, I guess we cannot do anything about it in principle. As explained, neither PILER-CR nor MinCED is actively maintained anymore, while newer tools are too complex and bring too many dependencies of their own.

Hence, it might be best to run these tools in a dedicated external analysis and to use their results outside of the actual genome annotation process.

I'm sorry that I cannot provide any more help here, so I'll close this for now. But please do not hesitate to reopen this issue or open a new one. Thanks again and best regards!

Jigyasa3 commented 10 months ago

Hi @oschwengers,

Thank you for replying and explaining the background of the PILER-CR CRISPR annotation. I am specifically interested in examining the genetic background of CRISPR arrays. For example, some CRISPR arrays are associated with tns proteins (https://pubmed.ncbi.nlm.nih.gov/34845024/). But if a CRISPR array is missed by BAKTA, the region gets annotated as protein-coding genes instead. My question is: if I use CRISPRCasFinder to annotate CRISPR arrays, can I still use BAKTA to examine the genomic neighborhood of the region of interest? Would the ORFs be correct?

Jigyasa3 commented 10 months ago

Hi @oschwengers ,

I can answer my own question here: BAKTA's ORF annotation preserves the gene neighborhood of the CRISPR array even when PILER-CR cannot find the array.

Thanks for a great annotation software!
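For readers who want to combine an externally predicted CRISPR array with Bakta's gene annotations, the neighborhood lookup could look roughly like the sketch below. Everything here is an assumption for illustration: the `Feature` type, the `neighbours` helper, the 5 kb flank and the example genes are hypothetical, and parsing Bakta's output files into such a list is left out. Only the array coordinates (92144..92347) come from this thread.

```python
from typing import List, NamedTuple


class Feature(NamedTuple):
    start: int
    stop: int
    name: str


def neighbours(features: List[Feature], array_start: int, array_stop: int,
               flank: int = 5000) -> List[Feature]:
    """Features overlapping the array region extended by `flank` bp on each side."""
    lo, hi = array_start - flank, array_stop + flank
    hits = [f for f in features if f.start <= hi and f.stop >= lo]
    return sorted(hits, key=lambda f: f.start)


# Hypothetical features standing in for parsed Bakta annotations.
features = [
    Feature(120000, 121000, "distant gene (hypothetical)"),
    Feature(92100, 92400, "ORF overlapping the array (hypothetical)"),
    Feature(88000, 90100, "upstream gene (hypothetical)"),
]

# Array coordinates reported by CRISPRCasFinder earlier in this thread.
nearby = neighbours(features, 92144, 92347)
for f in nearby:
    print(f.start, f.stop, f.name)
```

Any ORF that overlaps the array itself, like the second hypothetical feature, is worth a manual check, since it may be a spurious call within the repeat region.
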