pinellolab / CRISPResso2

Analysis of deep sequencing data for rapid and intuitive interpretation of genome editing experiments

Substitutions and N's outside quantification window counted in alleles frequency table #356

Open GreenSeaBug opened 10 months ago

GreenSeaBug commented 10 months ago

Hello,

I have -w 1 and -wc -3, yet substitutions and N's outside the quantification window seem to be counted as edited in the allele frequency table. For example, in the image below, none of the sequences have indels within the 2 bp quantification window, and the substitutions / N's are at least 19 bp away from the cut site...

image

What is going on here? Is there a way to exclude those reads from the analysis?

Everything else in the analysis worked as expected.

Thanks for your help.

kclem commented 10 months ago

Hi @GreenSeaBug,

Thanks for using CRISPResso, and sorry about the confusion with the allele display of N's and substitutions.

The allele plot will show substitutions and N's outside of the quantification window as different alleles, but they won't make the corresponding reads count as 'modified' or 'edited'. The allele plot is only for visualization, and if we were to collapse substitutions or N's to only show a single unedited allele it would not be an accurate representation of the data.

If you open the text file associated with the allele plot (e.g. Alleles_frequency_table_around_sgRNA_GAG...txt) you will see a table whose rows correspond to the alleles in your allele plot. This table includes a column 'Unedited', which is set to 'False' for reads that are 'Modified'. Rows that contain N's or substitutions only outside the quantification window should be set to 'True', meaning that although the allele sequence differs from the reference sequence, the read is not classified as 'modified'.
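As a rough way to check this in your own table, here is a minimal sketch that sums the read percentages by the 'Unedited' flag. It assumes the table is tab-delimited and has a '%Reads' column alongside 'Unedited'; the function name and the example rows are made up for illustration and are not CRISPResso output.

```python
import csv
import io


def split_by_unedited(table_text):
    """Sum the %Reads column separately for Unedited=True and Unedited=False rows."""
    totals = {"True": 0.0, "False": 0.0}
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    for row in reader:
        flag = row["Unedited"].strip()
        totals[flag] = totals.get(flag, 0.0) + float(row["%Reads"])
    return totals


# Hypothetical rows: the second allele has a substitution outside the
# quantification window, so it differs from the reference but stays Unedited=True.
example_table = (
    "Aligned_Sequence\tReference_Sequence\tUnedited\t%Reads\n"
    "ACTGAG\tACTGAG\tTrue\t80.0\n"
    "TCTGAG\tACTGAG\tTrue\t12.0\n"
    "AC-GAG\tACTGAG\tFalse\t8.0\n"
)

unedited_totals = split_by_unedited(example_table)
print(unedited_totals)
```

To use it on a real file, read the file contents into a string and pass that in place of example_table.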

If you'd like, you can annotate all the unmodified alleles with a parameter, for example --annotate_wildtype_allele **.

If you still think there is a problem, could you upload the allele table and provide the command you used to run CRISPResso, as well as the alleles you believe are problematic?

GreenSeaBug commented 10 months ago

Thank you for the reply. That all makes sense. However, it seems the data in the .txt file do not match what is shown in the alleles visualisation plot.

For example, in the table below it says that 83.86% are edited with a -2 bp deletion, but the visualisation plot shows that 88.86% are unedited (perfectly match the reference in the quantification window).

image

Does this seem strange or am I missing something obvious?

Also, is there any way to exclude from the analysis reads with substitutions or N's outside the quantification window?

kclem commented 10 months ago

The numbers differ (88.86% vs 83.86%) because alleles with the same visible sequence are collapsed into a single allele for plotting. That is, 88.86% of reads have the sequence shown in the plot, but those alleles can't be collapsed in the table because they differ (SNPs or N's) outside the plotting window.

For example, imagine a sample with these reads in the allele frequency table:

ACTGAG - 80%
TCTGAG - 12%
AC-GAG - 8%

If the plotting window were the 2nd to 5th bases, the first two alleles would be collapsed so the alleles plotted would be:

CTGA - 92%
C-GA - 8%
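The collapsing in this toy example can be sketched in a few lines. This is only an illustration of the idea, not CRISPResso's plotting code; the function name and the 0-based slice bounds are assumptions.

```python
from collections import defaultdict


def collapse_alleles(alleles, start, end):
    """Merge alleles that look identical inside the window [start, end) and sum their percentages."""
    collapsed = defaultdict(float)
    for sequence, percent in alleles:
        collapsed[sequence[start:end]] += percent
    return dict(collapsed)


# Reads from the example above, as (aligned sequence, % of reads).
table_alleles = [("ACTGAG", 80.0), ("TCTGAG", 12.0), ("AC-GAG", 8.0)]

# The 2nd to 5th bases correspond to the 0-based slice [1:5].
plotted = collapse_alleles(table_alleles, 1, 5)
print(plotted)  # {'CTGA': 92.0, 'C-GA': 8.0}
```

The first two alleles differ only at the first base, which lies outside the window, so they merge into one plotted allele at 92%.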

I'm not sure what you mean by excluding the reads with substitutions or N's. Do you mean that they would be collapsed in the allele plots, so that the N or substitution would visually be replaced by the base in the reference sequence? If so, I'd be wary of doing that because it doesn't represent the underlying data.

If you don't want substitutions or N's to make a read count as 'Modified', you can use the flag --ignore_substitutions.

GreenSeaBug commented 10 months ago

OK, that makes sense and explains the discrepancy in percentages. However, I don't think it explains why the table says 83.86% of reads are edited with a -2 bp deletion, while the plot says 88.86% are unedited WT. What do you think?

As for excluding reads with substitutions or N's, no I am not wanting to collapse those reads in the allele plot to visually replace the substitutions and N's. I agree that would not be a good idea. Nor am I wanting to prevent reads with substitutions or N's within the quantification window from being classified as modified. Rather, I am wondering if it is possible to exclude these reads from the analysis entirely, that is, filter them out. In my case, and I would think in a lot of cases, they are just sequencing errors or reads derived from chimeric amplicons that are an artefact of PCR.

kclem commented 10 months ago

I assumed the plot showing 88% unedited was away from your quantification window - is that not the case?

You can exclude reads with N by filtering them out before CRISPResso analysis, and passing CRISPResso your filtered reads. Here's a script to filter reads based on the presence of a specific sequence: filterReadsOnSequencePresence.py - try running with --exclude_seq N.
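For reference, the kind of pre-filtering described here can be sketched as below. This is not the linked script, just a minimal stand-in that assumes an uncompressed FASTQ with exactly four lines per record; the function name is hypothetical and the records are made up.

```python
def filter_fastq_records(lines, exclude_seq="N"):
    """Keep only 4-line FASTQ records whose sequence line lacks exclude_seq."""
    kept = []
    for i in range(0, len(lines) - 3, 4):
        header, sequence, separator, quality = lines[i:i + 4]
        if exclude_seq not in sequence:
            kept.extend([header, sequence, separator, quality])
    return kept


# Two made-up reads; the second contains an N and is dropped.
fastq_lines = [
    "@read1", "ACTGAG", "+", "IIIIII",
    "@read2", "ACTNAG", "+", "IIII#I",
]

filtered_lines = filter_fastq_records(fastq_lines)
print("\n".join(filtered_lines))
```

For paired-end data, both mates of a pair would need to be dropped together, which this sketch does not handle.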

GreenSeaBug commented 10 months ago

Yes, the part of the plot that I showed is away from the quantification window. Here is the quantification window...

image

As you can see, nothing is modified. So I don't understand why the table says almost everything is modified (mostly -2 bp deletion).

Thank you for the script for the N's! Is there a way to also exclude reads with substitutions outside the quantification window?

kclem commented 10 months ago

Is the plot above the entire quantification window? If you look at the entire quantification window you should be able to see the 2bp deletion. If you'd prefer not to post here you can email me at k.clement@utah.edu.

For filtering, if you run with '--write_detailed_allele_table', CRISPResso will add an 'all_substitution_positions' column to the table in the 'Alleles_frequency_table.zip' file. You can then keep only the alleles where this column is empty ('[]').
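A minimal sketch of that filter, assuming the unzipped table is tab-delimited and that an allele without substitutions has the literal text [] in 'all_substitution_positions'. The function name and the example rows are hypothetical; only the column name comes from the description above.

```python
import csv
import io


def rows_without_substitutions(table_text):
    """Return the rows whose all_substitution_positions column is the empty list '[]'."""
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    return [
        row for row in reader
        if row["all_substitution_positions"].strip() == "[]"
    ]


# Made-up rows: the second allele has a substitution at position 0.
detailed_table = (
    "Aligned_Sequence\tall_substitution_positions\t%Reads\n"
    "ACTGAG\t[]\t80.0\n"
    "TCTGAG\t[0]\t12.0\n"
    "AC-GAG\t[]\t8.0\n"
)

clean_rows = rows_without_substitutions(detailed_table)
print([row["Aligned_Sequence"] for row in clean_rows])
```

Note that this drops alleles with a substitution anywhere in the read, so the percentages of the remaining alleles would need to be recomputed against the filtered total.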

GreenSeaBug commented 10 months ago

The plot above includes more than the entire quantification window. I have -w set to the default, 1. So the quantification window is 2 bp. As you can see there are no edits either side of the quantification window centre. There is no -2 bp deletion. So it seems like a complete mismatch with the allele frequency table .txt file.

OK thank you for the tip on substitutions.