Closed Davood-Norouzi closed 1 year ago
Hi Davood,
Thank you for showing interest in our research. We measured the efficiency of over 300,000 prime editing guide RNAs (pegRNAs). However, each model was trained on a specific dataset.
For example, the DeepPrime (base) model was trained on approximately 250K data points. You can find all the information by downloading Supplementary Table 2 from our paper (Cell, 2023).
We are delighted that our data is being used for educational purposes, and we hope it proves helpful. If you need any further information, feel free to open an additional issue or email me at gsyu93@gmail.com, and I will do my best to assist you.
I will close this issue here.
Have a great day!
Best regards, Goosang
Thanks for your kind and swift response!
Could you please let me know what some of the columns in Table S2 stand for? I am not sure what these columns represent: Tm1, Tm2, Tm2new, Tm3, Tm4, TmD, nGCcnt1, nGCcnt2, nGCcnt3, fGCcont1, fGCcont2, fGCcont3, MFE3, MFE4, and Fold. Apologies for the long list. I am also unsure why some Tm values are negative.
Looking forward to hearing from you, -Davood
Each feature represents the following information:
Tm1 = Tm of PBS
Tm2 = Tm of Target DNA region corresponding to RT template
Tm3 = Tm of Reverse transcribed cDNA and PAM-opposite DNA strand
Tm4 = Tm of RT template region and reverse transcribed cDNA
TmD = delta Tm; Tm3 - Tm2
nGCcnt1 = GC count of PBS
nGCcnt2 = GC count of RTT
nGCcnt3 = GC count of RT-PBS
fGCcont1 = GC content of PBS
fGCcont2 = GC content of RTT
fGCcont3 = GC content of RT-PBS
MFE3 = MFE of RT-PBS + PolyT sequence
MFE4 = MFE of Spacer sequence
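The GC count and GC content features above can be illustrated with a minimal sketch. The sequences are made up, the function names are hypothetical, and two things are assumptions rather than anything confirmed in this thread: that GC content is reported as a percentage, and that RT-PBS is the RTT followed directly by the PBS.

```python
def gc_count(seq: str) -> int:
    """Number of G or C bases in a nucleotide sequence (case-insensitive)."""
    s = seq.upper()
    return s.count("G") + s.count("C")


def gc_content(seq: str) -> float:
    """GC content as a percentage of sequence length (assumed scale)."""
    return 100.0 * gc_count(seq) / len(seq) if seq else 0.0


# Hypothetical example sequences, not taken from Table S2.
pbs = "GCATCGGAT"
rtt = "ATGCCATTGACA"
rt_pbs = rtt + pbs  # assumed layout: RTT followed by PBS

features = {
    "nGCcnt1": gc_count(pbs),
    "nGCcnt2": gc_count(rtt),
    "nGCcnt3": gc_count(rt_pbs),
    "fGCcont1": gc_content(pbs),
    "fGCcont2": gc_content(rtt),
    "fGCcont3": gc_content(rt_pbs),
}
print(features)
```

The values in Table S2 may have been computed with a different scale or a different sequence definition, so treat this only as a way to read the column names.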
Thanks a lot! Are Tm values re-centered around a mean? Is that why they can be negative?
For very short sequences with low GC content, the calculated Tm value can be negative.
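To see how a negative Tm can arise, here is a small sketch using the simple GC-based approximation Tm = 64.9 + 41 * (GC - 16.4) / N. This formula is only an assumption chosen for illustration: the thread does not say which Tm method was used for Table S2, and the approximation is normally meant for longer oligonucleotides. The function name `tm_basic` and both sequences are hypothetical.

```python
def tm_basic(seq: str) -> float:
    """Rough Tm estimate in deg C: 64.9 + 41 * (GC - 16.4) / N.

    Illustrative only; not necessarily the method behind Table S2.
    """
    s = seq.upper()
    n = len(s)
    if n == 0:
        raise ValueError("sequence must not be empty")
    gc = s.count("G") + s.count("C")
    return 64.9 + 41.0 * (gc - 16.4) / n


short_at_rich = "ATTAT"                   # 5 nt, no G or C
longer_gc_rich = "GCGGCCGCTAGCGGCCGCTA"   # 20 nt, 16 G or C

print(tm_basic(short_at_rich))    # negative, roughly -69.6
print(tm_basic(longer_gc_rich))   # positive, roughly 64.1
```

Because the GC term is divided by the length N, a short sequence with few G or C bases pushes the estimate well below zero, with no re-centering involved.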
When training the DeepPrime model, all data was first normalized, and batch normalization steps were also included between neural network layers.
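The kind of normalization is not specified here. As a sketch only, assuming a per-feature z-score (subtract the mean, divide by the standard deviation), the step could look like the following; `standardize` and the example values are hypothetical.

```python
from statistics import mean, pstdev


def standardize(values):
    """Z-score a list of numbers: (x - mean) / population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant feature carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical raw Tm values, including a negative one.
tm_values = [-12.0, 8.5, 31.0, 44.5]
tm_scaled = standardize(tm_values)
print(tm_scaled)
```

Under this assumption, the scaled values are centered on zero, so roughly half of them are negative regardless of the sign of the raw values. Negative values in the raw table are a separate matter from this scaling.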
Hi,
I enjoyed reading your paper in Cell. The high-throughput analysis of >300,000 guide RNAs with deep learning is impressive.
Is there any chance you could share the prime editing efficiency data for the 338,996 pegRNA-target pairs, so that I could test my own ML models on it? It is for educational purposes for our summer interns. I would really appreciate it if you could share your data.
Thank you, -Davood