pinellolab / CRISPResso2

Analysis of deep sequencing data for rapid and intuitive interpretation of genome editing experiments
Other
270 stars 94 forks source link

window size quantification clarification #11

Closed bbsos closed 5 years ago

bbsos commented 5 years ago

Hi, I've noticed a discrepancy in the window quantification information, where different default window sizes are noted. I have CRISPResso 2.025 installed and it says default window size is 1. However in the online manual it says default window size is 2 bp (1 upstream, 1 downstream). I have used crispresso 1 previously and remember total window size in bp was double the inputted value (2 bp window size would be 4 bp total, 2 on each side). In crispresso 2 it looks like this parameter was changed. This is important since we would like to keep the same window sizes for our analyzes that were run before crispesso 2 was released. I am thinking the --help documentation is incorrect, is this correct?

When looking at CRISPResso --help [CRISPresso version 2.0.25] -w QUANTIFICATION_WINDOW_SIZE ............... (default: 1)

In the crispresso 2 online manual: -w or --quantification_window_size ............... Default size is 2bp, which is 1bp up- and 1bp downstream from the quantification window center. (default: 2)

Any clarification is much appreciated, Thanks! Brandon

bbsos commented 5 years ago

thanks for the update! TLDR is for the -w QUANTIFICATION_WINDOW_SIZE parameter; "Default is 1, 1bp on each side of the cleavage position for a total length of 2bp. (default: 1)"

kaanokay commented 7 months ago

Dear @kclem

Thank you for developing that method.

I have been running CRISPResso2 with WGS mode for WGS ONT data.

-f argument is taking a text file where there is +-50bp locations from sgRNA start and end location. Once I gave only exact sgRNA location to that text file, CRISPResso2 did not work properly. I wonder why it did not work and I wonder whether I'm detecting knockout efficiency on target site properly.

When I specified -w argument to 20, this gave me different %percent of modified reads (75%) than its default setting (62.5%).

I wonder how can I get the most accurate number of reads with modification with -f and -w arguments?

What should be location of sgRNAs for -f argument? and what should be window size -w argument to count number of reads with modifications?

Another question is that how can I run more than one bam files by specifying different name for each to create bam file specific folders where results will be saved in.

Does CRISPResso2 take account sequencing error in target region? Some reads with modification might have sequencing errors. How can I distinguish sequencing errors and CRISPR events?

Any recommendation is appreciated.

All the best.

kclem commented 7 months ago

Hi @kaanokay,

Thank you for developing that method.

And thank you for using CRISPResso!

When I specified -w argument to 20, this gave me different %percent of modified reads (75%) than its default setting (62.5%).

I'm guessing WGS failed when you only provided the sgRNA start and end location because by default CRISPResso discards the first and last 15bp of regions because they may contain sequencing errors (see the --exclude_bp_from_left and --exclude_bp_from_right parameters), and if you only provide the ~20bp sgRNA position, the entire region is excluded from quantification. I guess you should have gotten an error message with something along these lines. Now that you provide +/- 50bp this problem is resolved.

The -w argument specifies the size of the quantification window. Only insertions/deletions/substitutions that overlap this window will cause a read to be counted as 'modified'. The rationale is that insertions/deletions/substitutions outside this window are likely due to sequencing errors or other noise instead of CRISPR editing which should only happen at the predicted cut site. Ideally, the quantification window is centered at the predicted cut position and is as small as possible. By default, it is 1bp, meaning 1bp to the left and right of the predicted cut position. As you set -w 20, this will increase the quantification window size so that insertions/deletions/substitutions away from the cleavage position now make the read count as 'modified'. Because this is ONT data, this is likely due to errors in sequencing that are increasing your %modified reads. I would use the default 1bp window unless you suspect that editing may occur not at the predicted cleavage position.

I wonder how can I get the most accurate number of reads with modification with -f and -w arguments?

What should be location of sgRNAs for -f argument? and what should be window size -w argument to count number of reads with modifications?

I would use the default -w 1 parameter. If your ONT reads cover the region well, you can leave the +/-50bp window for your -f file. Otherwise, you can go down to +/- 20bp and set --exclude_bp_from_left 0 --exclude_bp_from_right 0. For the -f file, provide the sgRNA spacer sequence excluding the PAM sequence (but the PAM on the right of the provided sequence).

Another question is that how can I run more than one bam files by specifying different name for each to create bam file specific folders where results will be saved in.

Are you referring to the bam containing the input reads? CRISPresso expects all input to be in one bam file. You can change the output name by using the parameters -o and -n.

Does CRISPResso2 take account sequencing error in target region? Some reads with modification might have sequencing errors. How can I distinguish sequencing errors and CRISPR events?

Yes, CRISPResso uses the quantification window to reduce the effect of sequencing errors. You can also run CRISPRessoCompare with a control/unedited sample to control for sequencing errors within the quantification window.

I hope that helps! Feel free to reach out if you have any other questions or problems!

kaanokay commented 7 months ago

Dear @kclem

Thank you so much for detailed answers. Appreciated for all. I got your points which make sense to me.

I have control sample and Kmt2d knockout sample. Once I calculate number of reads with modification in control sample by using +-50bp up and downstream of Kmt2d sgRNA location, I noticed that control sample have some reads with modification on target site of Kmt2d sgRNA 1. I'm assuming that this is an error rather than to be any CRISPR events. In that case, do you have any recommendation for cutoff of number of reads with modification to distinguish which event is sequencing error or CRISPR event.

CRISPRessoWGS_modification_summary

All the best.

kclem commented 7 months ago

I like to look at the allele plots (plot 9) to see the alleles that are produced at each site. Can you click on the "Kmt2d.sgRNA.1" report and upload the plot 9? Perhaps there is a hard-to-sequence region or something else that may be making the reads 'modified'. Feel free to email me if you'd prefer - k.clement@utah.edu

kaanokay commented 7 months ago

That's Kmt2d.sgRNA.1 plot 9:

Screenshot from 2024-02-14 18-02-29

Colelyman commented 7 months ago

@kaanokay as an aside, if you would like to have the nucleotide labels show on these plots you can downgrade the matplotlib to version 3.7.2 or earlier. We are still trying to figure out why the labels don't show up in later versions of matplotlib...

kaanokay commented 7 months ago

@Colelyman, thank you for suggestion, I'd like to downgrade it and try again.

I have mice data where FVB and B6 mouse genomes mixed, bam file aligned to only B6 reference. Perhaps, that modified reads because of indel/SNP differences between B6 and FVB genomes? Is that make sense? If CRISPResso2 is handling sequencing errors, those modified reads may be not sequencing errors instead differences between two mouse genomes.

kclem commented 7 months ago

@kaanokay If you look at the allele plot, you can see that 2 or 3bp deletions (gray blocks) are pretty common at the left and right parts of this sequence away from the cut site, so the deletions at the cut site could just be due to noise. Do you have UMIs or other tags to let you know whether these are real reads or just sequencing errors?

kaanokay commented 7 months ago
  1. I'm unsure whether I have those UMIs reporting sequencing errors, but I'll check them ASAP.

  2. Should the predicted cut site always be within the location of the sgRNA? Isn't it possible that the cut site is somewhere outside of the sgRNA? I ask this because CRISPR-related mutations can also be seen in locations other than sgRNA location (as I know). The fact that the window size determined by the -w parameter is so limited may mean that we may miss some CRISPR events. I understand that restricting the window size prevents modified reads from being obtained due to sequencing errors, but considering that CRISPR creates unique mutations and assuming that some of them can occur outside the location of the sgRNA, I think such a strict window size will lead us to skip some CRISPR events.

  3. I've been trying to understand which reads contain CRISPR events for a while now. I use samtools mpileup for this. I think the 5th column of mpileup can give us information about reads with CRISPR events, but I do not fully understand how to count these reads from mpileup output. I think characters like ^ and $ are reads with CRISPR events, and I use these characters when counting reads. I am thinking of calculating knockout efficiency by using BCFtools and comparing the mutation frequencies of control and knockout samples. Do you think this would be a feasible approach? Any suggestions on this would be great.

  4. When I checked number of read in target site by IGV, I noticed that number of total reads in IGV and the number of total reads CRISPResso2 reported were different. CRISPResso2 is reporting less number of reads than manual IGV visualization in target site. The samples aligned by minimap2, is that make any difference compare to BWA or bowtie in terms of alignment of reads and sure counting of reads by CRISPResso2? Am I missing some reads in CRISPResso2 because of minimap2 alignment? Is that make sense?

All the best.

Colelyman commented 7 months ago

@kaanokay

Response to 2, if you would like to expand the quantification window outside of the sgRNA location you can use the -qwc (--quantification_window_coordinates) parameter to specify any area of the amplicon where you would like mutations to be counted as edits. Obviously, this could lead to false positives due to sequencing error, but if you have a control those can be mitigated using CRISPRessoCompare.

Response to 3, I haven't used mpileup in years, so I can't help with what the ^ and $ characters mean. But seems like you may be able to achieve a similar result with CRISPRessoCompare. I think with the mpileup approach you would still need some notion of what an edit is and what is a sequencing error (essentially replicating the notion of the quantification window implemented in CRISPResso2).

Response to 4, if you have any of the filtering parameters set that could be a reason CRISPResso2 is reporting less reads. Furthermore, CRISPResso2 could also be filtering reads that have low homology to your amplicon. If you would like to test the latter, you can try and set --amplicon_min_alignment_score 0 (where the default is 60, representing 60% homology) so that CRISPResso won't do any of that filtering.

kaanokay commented 7 months ago

@Colelyman,

I'm planning to restrict target region to 1 or 2bp up and downstream of putative predicted cut site coordinate. If I do not do that, I'm feeling like my pipeline will yield high noise because of both sequencing error and mixed mice genome. I'm suprised that counting modified reads is tightly restricted to 1-2 bp up and down of predicted cut site. It nice to know that.

I'll set up --amplicon_min_alignment_score argument to 0 to able to catch all possible reads in target site.

Thank you so much for quick answering!

All the best.

kclem commented 7 months ago

Also, note that CRISPRessoWGS only selects reads that cover the entire region of interest (e.g. the entire 40bp if using a region size of +/- 20bp in your -f file. For this reason, IGV may show that there are more reads that start or end alignment in the middle of each region.