pinellolab / CRISPResso2

Analysis of deep sequencing data for rapid and intuitive interpretation of genome editing experiments
Other
256 stars 91 forks source link

Confused about sequence directions in prime editing parameters #426

Open francoiskroll opened 2 months ago

francoiskroll commented 2 months ago

In case it's helpful, I think some of the documentation on Prime editing parameters could be clarified. Do let me know if it's all confusion on my side, it may very well be!


--prime_editing_pegRNA_spacer_seq, Documentation says "The sequence should be given in the RNA 5'->3' order, so for Cas9, the PAM would be on the right side of the given sequence.". If the gRNA binds the Forward 5'–3' genome strand (i.e. is complementary to it), I cannot see how one can give the sequence in 5'–3' RNA direction and at the same time have the PAM on the right. I think the PAM has to be on the left in this case? Here is an example: Screenshot 2024-04-24 at 19 44 54 The way I read the Documentation, CRISPResso2 wants GACCTGTAGACGGTATGTGT in this case (my blue sequence, read from left to right), but that returns

ERROR: The prime editing pegRNA spacer sequence appears to be given in the 3'->5' order. The prime editing pegRNA spacer sequence (--prime_editing_pegRNA_spacer_seq) must be given in the RNA 5'->3' order.

I do not think that is correct, the RNA 5'–3' sequence really is GACCTG... here. After trial and error, I found CRISPResso2 is happy with CTGGACATCTGCCATACACA, but that is not the RNA 5'–3' sequence with the PAM on the right, as can be seen in my cartoon.


--prime_editing_pegRNA_extension_seq has a similar issue I think. In my example, 5'–3' RNA direction (as requested by Documentation) is CCGTCTACAGGTCTGCGCTGCTTCG (my yellow sequence, read from right to left), but that returns

ERROR: The pegRNA spacer aligns to the pegRNA extension sequence in 3'->5' direction. The prime editing pegRNA spacer sequence (--prime_editing_pegRNA_spacer_seq) must be given in the RNA 5'->3' order, and the pegRNA extension sequence (--prime_editing_pegRNA_extension_seq) must be given in the 5'->3' order. In other words, the pegRNA spacer sequence should be found in the given reference sequence, and the reverse complement of the pegRNA extension sequence should be found in the reference sequence.


--prime_editing_pegRNA_scaffold_seq: from what I can tell, the example ("ends with GGCACCGAGUCGGUGC") is indeed in 5'–3' RNA direction. Is CRISPResso2 completely agnostic to U vs T? This example made me doubt whether I was meant to input actual RNA sequence (which adds to the confusion from above as CRISPResso2 does not in fact accept the real RNA sequence in 5'–3' direction).


I think the most logical would be to ask for the RNA sequence in 5'–3' direction for every sequence, but you probably do not want to edit the code about directions etc. (probably wise). Alternatively, I think the way to go would probably be to put some version of that last error message as Documentation.

I hope it's helpful! Thanks for the amazing work.

kclem commented 2 months ago

Thanks @francoiskroll for the feedback.

You're using Cas9, right? I think the PAM should be on the 3' end of your spacer sequence, right? https://images.contentstack.io/v3/assets/blte41c17d7f8dda379/bltcfb92f996bceea35/5e97816a89fc5535892fa74e/prime_editing_process_mpg.svg?format=webply&quality=90&width=1920

We had tried to make it so the sequence input to CRISPResso was the same as the sequences used when ordering your pegRNAs. But due to variation in how people design pegRNAs, it's become a little more complicated.

The critical sentence I always use is "The pegRNA spacer sequence should be found in the given reference sequence, and the reverse complement of the pegRNA extension sequence should be found in the reference sequence." This works when I'm targeting a change to the upper/positive/forward strand.

All U's get substituted to T's internally.

Do you have suggestions for what we could include in the manual, or in the errors that would make the input format more clear?

francoiskroll commented 2 months ago

Haa, my cartoon was wrong! Thanks for making me notice. Here is a corrected/more precise version.

Screenshot 2024-04-25 at 10 52 42

Ok, makes a bit more sense now. PAM is indeed on the RNA 3' side of the spacer. I guess that would be "PAM on the right" if I was showing the 3'–5' Reverse genome strand on top (or reading my screen upside down like a bat), is that what the Documentation means?

What still puzzles me is: if I now give UGUGUAUGGCAGAUGUCCAG (blue RNA sequence, read right to left) it still complains (spacer sequence must be given in the RNA 5'->3' order). I still think that error message is incorrect in this case; or am I still getting something wrong? Say I try to buy that crRNA from IDT, it gives me AltR1 – rUrG rUrGrU rArUrG [...] – AltR2, so I think that is really the 5'–3' RNA sequence?

Yes, I think that critical sentence in Documentation would help. Sounds like it would be agnostic to which strand people use as reference etc.

Thanks for the discussion!

kclem commented 2 months ago

Thanks @francoiskroll,

We generally have thought of prime editing in this orientation from Wikipedia in which the 5' end of the pegRNA with the spacer pairing with the bottom strand and the PBS and RT template binding to the top strand.

In this orientation, the reference sequence would be the unmodified sequence, and the changes encoded in the pegRNA extension sequence would be directly incorporated into the reference sequence. E.g. if you are trying to introduce an A>TCG change on the top strand, the reference would include an 'A' at the site, and the prime-edited sequences would include a 'TCG' (as opposed to thinking in the opposite strand where the reference would show 'T' and the prime-edited would show 'CGA', which seems more bat-like to me personally🦇).

We'll update the documentation with this sentence, and hopefully that will make things clearer for the community.

francoiskroll commented 2 months ago

Argh, sorry, I have another one to bother you with...

CRISPResso --fastq_r1 ./R1.fastq.gz --fastq_r2 ./R2.fastq.gz --amplicon_seq ATTCAGTCAGGACGGCAAACTGGATGGTGCCATGGCCCTGAATGCGGTCTGGAGGGTTGCGCACATAGCAGTTGTAGATGCCCTCGTCTGTCAGCTGCACGTCCGAGATGGTGATGGAAAGGTCGTTCTTATCCAGGTTTCCAGTAAACATGACTCGCTCTCCAAATCGACTGGGCTTCAAGGGTATCATCTTATAGGTGTTC --amplicon_name test --prime_editing_pegRNA_spacer_seq TGGAAAGGTCGTTCTTATCC --prime_editing_pegRNA_extension_seq ACTGGAAACGTGGATAAGAACGAC --prime_editing_pegRNA_scaffold_seq GTTTAAGAGCTATGCTGGAAACAGCATAGCAAGTTTAAATAAGGCTAGTCCGTTATCAACTTGAAAAAGTGGCACCGAGTCGGTGC --output_folder ./crispressoOut

Throws "ERROR: [...] In other words, the pegRNA spacer sequence should be found in the given reference sequence, and the reverse complement of the pegRNA extension sequence should be found in the reference sequence. [...]", but looks to me like it's all matching your criteria:

CAGCTGCACGTCCGAGATGGTGATGGAAAGGTCGTTCTTATCCAGGTTTCCAGTAAA # amplicon, trimmed for display
-----------------------TGGAAAGGTCGTTCTTATCC-------------- # spacer
------------------------------GTCGTTCTTATCCACGTTTCCAGT--- # reverse-complement of extension sequence I gave

Here, giving the reverse-complement of the extension sequence, so GTCGTTCTTATCCACGTTTCCAGT (in command below) instead of ACTGGAAACGTGGATAAGAACGAC (in first command above) works:

CRISPResso --fastq_r1 ./R1.fastq.gz --fastq_r2 ./R2.fastq.gz --amplicon_seq ATTCAGTCAGGACGGCAAACTGGATGGTGCCATGGCCCTGAATGCGGTCTGGAGGGTTGCGCACATAGCAGTTGTAGATGCCCTCGTCTGTCAGCTGCACGTCCGAGATGGTGATGGAAAGGTCGTTCTTATCCAGGTTTCCAGTAAACATGACTCGCTCTCCAAATCGACTGGGCTTCAAGGGTATCATCTTATAGGTGTTC --amplicon_name test --prime_editing_pegRNA_spacer_seq TGGAAAGGTCGTTCTTATCC --prime_editing_pegRNA_extension_seq GTCGTTCTTATCCACGTTTCCAGT --prime_editing_pegRNA_scaffold_seq GTTTAAGAGCTATGCTGGAAACAGCATAGCAAGTTTAAATAAGGCTAGTCCGTTATCAACTTGAAAAAGTGGCACCGAGTCGGTGC --output_folder ./crispressoOut

But contradicts the instruction, that extension sequence is found in the amplicon.


EDIT: to add to the mystery, if I put a slightly longer version of the extension sequence, but still in the direction as in my first command above:

ACTGGAAACGTGGATAAGAACGAC--- # first extension sequence
ACTGGAAACGTGGATAAGAACGACCTT # new extension sequence

So command looking like:

CRISPResso --fastq_r1 ./R1.fastq.gz --fastq_r2 ./R2.fastq.gz --amplicon_seq ATTCAGTCAGGACGGCAAACTGGATGGTGCCATGGCCCTGAATGCGGTCTGGAGGGTTGCGCACATAGCAGTTGTAGATGCCCTCGTCTGTCAGCTGCACGTCCGAGATGGTGATGGAAAGGTCGTTCTTATCCAGGTTTCCAGTAAACATGACTCGCTCTCCAAATCGACTGGGCTTCAAGGGTATCATCTTATAGGTGTTC --amplicon_name test --prime_editing_pegRNA_spacer_seq TGGAAAGGTCGTTCTTATCC --prime_editing_pegRNA_extension_seq ACTGGAAACGTGGATAAGAACGACCTT --prime_editing_pegRNA_scaffold_seq GTTTAAGAGCTATGCTGGAAACAGCATAGCAAGTTTAAATAAGGCTAGTCCGTTATCAACTTGAAAAAGTGGCACCGAGTCGGTGC --output_folder ./crispressoOut

Then it works! How odd. Am I doing something stupid?

kclem commented 2 months ago

Hi @francoiskroll,

Yes, I need to probably make this a little more intuitive.

When I run into these problems, I run with the --debug flag which will print more information out about the alignments and checking process.

We expect that the spacer will align to the extension sequence in the forward orientation, and suspect that something may be wrong with the input if it aligns to the reverse. In your case, I get this:

INFO  @ Fri, 26 Apr 2024 10:49:40:
         pegRNA spacer vs extension_seq alignment:
Forward (correct orientation):
TGGAAAGGTCGTTCTTATCC-----------
-------GTCGTTCTTATCCACGTTTCCAGT
Score: 41.935
Reverse (incorrect orientation):
--TGGAAAGGTCGTTCTTATCC--
ACTGGAAACGTGGATAAGAACGAC
Score: 50.0

As you can see, the alignment is correct in the Forward (correct orientation) but the score is lower than the Reverse (incorrect orientation) so it throws this error.

If you're sure you're providing the parameters in the correct orientation, you can get around this by providing the parameter --prime_editing_override_sequence_checks

It looks like we may need to weight mismatches more strongly negatively in this check.

francoiskroll commented 2 months ago

Thanks for looking into it. Clearly you got the diagnosis exactly right.

Yes, in this case you can see the Reverse alignment does not make much sense but for some reason it scores highly. Maybe you also need to penalise gaps less strongly? That is, I think we always expect the spacer & extension sequence to align with a shift (cf. that Forward alignment), but the current scoring seems to prefer many mismatches over gaps. Maybe I'm just repeating what you said.

I'll run with --prime_editing_override_sequence_checks for now (after doing my own manual check). Thank you for your help.