pinellolab / CRISPResso2

Analysis of deep sequencing data for rapid and intuitive interpretation of genome editing experiments

Why is WGS mode recommended for Nanopore data? #340

Closed francoiskroll closed 10 months ago

francoiskroll commented 10 months ago

I see in previous issues that you currently recommend the WGS mode to analyse Nanopore data using CRISPResso2.

May I ask why?

Thank you for the hard work!

kclem commented 10 months ago

Hi @francoiskroll, Thanks for using CRISPResso2!

Briefly, we recommend WGS mode for two reasons:

1) Users are generally interested in a specific region (e.g. the CRISPR target site) and in the edits that occur there, so they want to see the exact allele sequences that were produced. Unfortunately, long-read data becomes unwieldy to visualize at the allele level, so CRISPRessoWGS trims long reads down to the region of interest for easy quantification and visualization.

2) Our biologically-aware global alignment algorithm is guaranteed to produce alignments where indels overlap the predicted cut site (as long as the cut site parameters are correctly provided), but it takes a substantial amount of time and memory (O(n^2)) to run on long reads. Other aligners are fast and reasonably accurate; however, because they do not perform global alignment, there is a chance that indels are aligned to incorrect locations, resulting in incorrect quantification. The tradeoff for this accuracy is time and memory: global alignment is reasonable for short reads, but it becomes too time- or memory-intensive for long reads.
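To make the O(n^2) cost in point 2 concrete, here is a minimal sketch of a plain Needleman-Wunsch global alignment score in Python. It is not CRISPResso2's aligner (which is described above as biologically aware and cut-site aware); the function names and the scoring values (`match`, `mismatch`, `gap`) are assumptions chosen only for illustration. The point is the table size: one cell per pair of reference and read positions.

```python
def dp_cells(ref_len, read_len):
    """Number of cells in a full global-alignment table."""
    return (ref_len + 1) * (read_len + 1)


def global_alignment_score(ref, read, match=1, mismatch=-1, gap=-2):
    """Plain Needleman-Wunsch score with a linear gap penalty.

    Illustrative only: the scoring values are assumptions, and this is
    not the cut-site-aware algorithm used by CRISPResso2.
    """
    rows, cols = len(ref) + 1, len(read) + 1
    score = [[0] * cols for _ in range(rows)]
    # First column and first row: the prefix is aligned entirely to gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Every remaining cell is filled once, so work and memory grow with
    # len(ref) * len(read).
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if ref[i - 1] == read[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align the two bases
                score[i - 1][j] + gap,       # gap in the read
                score[i][j - 1] + gap,       # gap in the reference
            )
    return score[-1][-1]


if __name__ == "__main__":
    print(global_alignment_score("ACGT", "AGT"))
    print(dp_cells(100, 100))      # trimmed read vs. 100 bp window
    print(dp_cells(10000, 10000))  # 10 kb read vs. 10 kb reference
```

Under these assumptions, a 100 bp read against a 100 bp window needs about ten thousand cells, while a 10 kb read against a 10 kb reference needs about one hundred million, which is the scaling behind the time and memory concern.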

We're working on tools for long read analysis, with the intent of providing users with accurate quantification and intuitive visualization of large and small edits. If you have any suggestions, or would like to be involved in the development process, drop me a line - k.clement@utah.edu.

francoiskroll commented 10 months ago

Thank you for the detailed answer.

Sorry, I think I'm missing something.

From point 2, I understand that the WGS mode uses a suboptimal aligner to save compute. Is this correct? That is, ideally one would use the same aligner as in standard mode to ensure that the indels are placed in the most likely position.

But in #341, you recommend analysing a ~100 bp window centered on the cut site. Presumably, CRISPResso2 trims the reads to that window before running any compute-intensive analysis. Therefore, do we still need to worry about compute after trimming?

Intuitively my logic would be: 1) keep long reads, have to use suboptimal aligner; or 2) trim them so we can use best aligner. Am I missing something?

Thank you for the invite to contribute. I will try to set up something that works for my dataset, then happy to share any solution I find.

Excellent to hear CRISPResso for Nanopore/long reads is in the pipeline. I'd be happy to serve as beta tester at some point.

kclem commented 10 months ago

Sorry for the confusion. CRISPRessoWGS currently uses the same aligner as CRISPResso. CRISPRessoWGS first trims reads to 100bp, then runs CRISPResso on the trimmed reads.
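As a rough illustration of the window being described, the sketch below computes a 100 bp window centered on a cut site and extracts that stretch of a reference sequence. It is a hedged sketch, not CRISPRessoWGS code: `window_around_cut`, `trim_reference`, and the 0-based half-open coordinate convention are assumptions made for this example, and trimming actual reads to the window would additionally require their alignment to the reference.

```python
def window_around_cut(cut_site, ref_len, window=100):
    """Return (start, end) of a window centered on cut_site.

    Coordinates are 0-based and half-open (an assumption for this sketch),
    and the window is clamped to the ends of the reference.
    """
    half = window // 2
    start = max(0, cut_site - half)
    end = min(ref_len, cut_site + half)
    return start, end


def trim_reference(ref, cut_site, window=100):
    """Extract the reference sequence inside the window around the cut site."""
    start, end = window_around_cut(cut_site, len(ref), window)
    return ref[start:end]


if __name__ == "__main__":
    ref = "ACGT" * 2500  # hypothetical 10 kb reference
    print(window_around_cut(5000, len(ref)))
    print(len(trim_reference(ref, 5000)))
```

Once the region is this small, the quadratic alignment cost applies to roughly 100 bp sequences rather than to the full long read.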

Alternatively, you could run CRISPResso on the entire long read, but that is probably too time- and memory-intensive, so we recommend running CRISPRessoWGS.

Does that make sense?