hkimlab / DeepPrime

Source codes and examples for DeepPrime

Sharing the training data #3

Closed Davood-Norouzi closed 1 year ago

Davood-Norouzi commented 1 year ago

Hi,

I enjoyed reading your paper in Cell. The high-throughput analysis of >300,000 guide RNAs with deep learning is impressive.

Is there any chance you could share the prime editing efficiency data for the 338,996 pegRNA-target pairs, so that I can test my own ML models on it? It is for educational purposes, for our summer interns. I would really appreciate it if you could share your data.

Thank you, -Davood

Goosang-Yu commented 1 year ago

Hi Davood,

Thank you for showing interest in our research. We measured the efficiency of over 300,000 prime editing guide RNAs (pegRNAs). However, each model was trained on a specific dataset.

For example, the DeepPrime (base) model was trained on approximately 250K data points. You can find all of this information by downloading Supplementary Table 2 from our paper (Cell, 2023).

We are delighted that our data is being used for educational purposes, and we hope it proves helpful. If you need any further information, feel free to open an additional issue or email me at gsyu93@gmail.com, and I will do my best to assist you.

I will close this issue here.

Have a great day!

Best regards, Goosang

Davood-Norouzi commented 1 year ago

Thanks for your kind and swift response!

Could you please let me know what some of the columns in Table S2 stand for? I am not sure I understand what these columns represent: Tm1, Tm2, Tm2new, Tm3, Tm4, TmD, nGCcnt1, nGCcnt2, nGCcnt3, fGCcont1, fGCcont2, fGCcont3, MFE3, MFE4, and Fold. Apologies for the long list. I am also confused as to why some Tm values are negative.

Looking forward to hearing from you, -Davood

Goosang-Yu commented 1 year ago

Each feature represents the following information:

Tm1 = Tm of PBS
Tm2 = Tm of Target DNA region corresponding to RT template
Tm3 = Tm of Reverse transcribed cDNA and PAM-opposite DNA strand
Tm4 = Tm of RT template region and reverse transcribed cDNA
TmD = delta Tm; Tm3 - Tm2

nGCcnt1 = GC count of PBS
nGCcnt2 = GC count of RTT
nGCcnt3 = GC count of RT-PBS

fGCcont1 = GC content of PBS
fGCcont2 = GC content of RTT
fGCcont3 = GC content of RT-PBS

MFE3 = MFE of RT-PBS + PolyT sequence
MFE4 = MFE of Spacer sequence
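As a rough illustration of the GC features listed above, here is a minimal sketch of how the count and content columns could be computed. The sequences are made up, the function names are hypothetical, and two things are assumptions rather than details confirmed in this thread: that GC content is expressed as a percentage, and that RT-PBS is the RTT followed by the PBS.

```python
def gc_count(seq: str) -> int:
    """Number of G and C bases in a sequence (case-insensitive)."""
    seq = seq.upper()
    return seq.count("G") + seq.count("C")


def gc_content(seq: str) -> float:
    """GC content as a percentage of sequence length (assumed unit)."""
    if not seq:
        return 0.0
    return 100.0 * gc_count(seq) / len(seq)


# Hypothetical example sequences, not taken from Table S2.
pbs = "GCATCGTAGC"
rtt = "ATTGCCATAGGCTA"
rt_pbs = rtt + pbs  # assumed layout of the RT-PBS region

features = {
    "nGCcnt1": gc_count(pbs),
    "nGCcnt2": gc_count(rtt),
    "nGCcnt3": gc_count(rt_pbs),
    "fGCcont1": gc_content(pbs),
    "fGCcont2": gc_content(rtt),
    "fGCcont3": gc_content(rt_pbs),
}

for name, value in features.items():
    print(name, value)
```

If the values in Table S2 are fractions rather than percentages, dropping the factor of 100 would match them instead.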

Davood-Norouzi commented 1 year ago

Thanks a lot! Are Tm values re-centered around a mean? Is that why they can be negative?

Goosang-Yu commented 1 year ago

For very short sequences with low GC content, the calculated Tm value can be negative.
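To make this concrete, here is a small sketch using a common textbook approximation, Tm = 64.9 + 41 * (GC - 16.4) / N. This is not necessarily the method used to compute the Tm columns in Table S2 (the thread does not say which method was used), and the formula is normally quoted for oligos longer than about 13 nt. It is shown only to illustrate how an empirical Tm formula can drop below zero for a short, AT-rich sequence. The function name and sequences are hypothetical.

```python
def basic_tm(seq: str) -> float:
    """Approximate melting temperature in degrees Celsius.

    Uses the simple GC-based formula 64.9 + 41 * (GC - 16.4) / N.
    Illustrative only; not the DeepPrime feature calculation.
    """
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        raise ValueError("sequence must not be empty")
    gc = seq.count("G") + seq.count("C")
    return 64.9 + 41.0 * (gc - 16.4) / n


short_at_rich = "ATTAT"   # 5 nt, no G or C
long_gc_rich = "GC" * 10  # 20 nt, all G or C

print(f"{short_at_rich}: {basic_tm(short_at_rich):.1f}")
print(f"{long_gc_rich}: {basic_tm(long_gc_rich):.1f}")
```

The short AT-rich sequence comes out well below zero, while the longer GC-rich one is clearly positive. A negative value here is an artifact of extrapolating the formula and just indicates a duplex that would not be stable.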

When training the DeepPrime model, all data was first normalized, and batch normalization steps were also included between neural network layers.
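For anyone reproducing this step with their own models, below is a minimal sketch of one common form of feature normalization, z-score standardization. The thread does not say which normalization DeepPrime uses, so this choice, the function name, and the input values are all assumptions for illustration.

```python
from statistics import mean, pstdev


def zscore(values: list[float]) -> list[float]:
    """Standardize values to zero mean and unit (population) variance."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All values identical: nothing to scale, return zeros.
        return [0.0] * len(values)
    return [(v - mu) / sigma for v in values]


# Hypothetical raw Tm values for three pegRNAs.
raw_tm = [10.0, 20.0, 30.0]
normalized = zscore(raw_tm)
print(normalized)
```

After this kind of normalization, values below the mean become negative regardless of their raw sign, which is separate from the raw Tm values themselves being negative for short, low-GC sequences.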