Question regarding PRIDICT Sequences Format and Source Code for Retraining

SophiaNt97 commented 1 year ago

I am currently working on reproducing the results of the following work: 'Predicting the efficiency of prime editing guide RNAs in human cells.' To run the PRIDICT tool, the format of the sequence used as input should have the following characteristics: 'xxxxxxxxx(a/g)xxxxxxxxxx.' A minimum of 100 bases up and downstream of the brackets are needed. Unchanged edit-flanking bases should be placed outside of the brackets (e.g., xxxT(a/g)Cxxx instead of xxx(TAC/TGC)xxx).
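For anyone checking their own inputs against these rules, here is a minimal sketch of a validator. It is not part of PRIDICT: the function name `check_pridict_input`, the `min_flank` parameter, and the restriction to replacement-style edits of the form (original/edited) are all assumptions made for illustration.

```python
import re

# Hypothetical helper, not PRIDICT code. Matches <upstream>(<original>/<edited>)<downstream>.
_FORMAT = re.compile(r"([ACGT]+)\(([ACGT]+)/([ACGT]+)\)([ACGT]+)", re.IGNORECASE)


def check_pridict_input(seq, min_flank=100):
    """Return a list of problems found in a replacement-style input string."""
    m = _FORMAT.fullmatch(seq.strip())
    if m is None:
        return ["sequence is not of the form xxx(a/g)xxx"]

    upstream, original, edited, downstream = (g.upper() for g in m.groups())
    problems = []

    # At least min_flank bases are expected on each side of the brackets.
    if len(upstream) < min_flank:
        problems.append(f"only {len(upstream)} bases upstream of the brackets")
    if len(downstream) < min_flank:
        problems.append(f"only {len(downstream)} bases downstream of the brackets")

    # Unchanged flanking bases belong outside the brackets:
    # xxxT(a/g)Cxxx rather than xxx(TAC/TGC)xxx.
    if original[0] == edited[0] or original[-1] == edited[-1]:
        problems.append("unchanged bases inside the brackets; move them outside")

    return problems


if __name__ == "__main__":
    example = "A" * 100 + "(a/g)" + "C" * 100
    print(check_pridict_input(example))
```

An empty list means the string passed all three checks; the shorter library 2 entries would be reported as having too few flanking bases.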

I have noticed that in library 2 of the article's datasets, the PRIDICT sequence format column (Excel sheet) does not follow this criterion. Could you tell me whether the sequences are extended during the training process? If they are, would it be possible to share the source code for retraining the tool?

Thank you for your time!

mathinic commented 1 year ago

Hi Sophia,

You've made a sharp observation regarding the unusual PRIDICT sequence format in the Excel sheet from library 2. That column dates from an earlier stage of the project, and the PRIDICT sequence format is also not what we use for training the model.

Here's some context: we introduced the (artificial) 100 bp limit later, after the training phase, to make sure that users provide enough sequence context to search for suitable PAMs. That's why the column in library 2 is shorter: the extra context was not needed there because a suitable PAM was close to the edit. If you check out the code in pridict_pegRNA_design.py, you can see that we use the PRIDICT format (xxx(A/T)xxx) only to look for suitable PAMs and then to design the needed features for each pegRNA. After this, the features (and not the PRIDICT format) are used as input to the model for predicting efficiency (see function deeppridict on line 317).
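To illustrate why the flanking context matters only for the PAM search, here is a rough sketch of that first step. It is not the logic in pridict_pegRNA_design.py: the function names, the `max_nick_to_edit` cutoff, and the simplifications (SpCas9 NGG PAMs only, forward strand only, replacement-style edits only) are assumptions made for illustration.

```python
import re


def split_pridict_format(seq):
    """Split xxx(a/g)xxx into (upstream, original, edited, downstream), upper case."""
    m = re.fullmatch(r"([ACGT]+)\(([ACGT]+)/([ACGT]+)\)([ACGT]+)",
                     seq.strip(), re.IGNORECASE)
    if m is None:
        raise ValueError("expected the form xxx(a/g)xxx")
    return tuple(g.upper() for g in m.groups())


def forward_pam_candidates(seq, max_nick_to_edit=30):
    """List forward-strand NGG PAMs whose nick site lies at or before the edit.

    Hypothetical, simplified search: reverse-strand PAMs are ignored and
    max_nick_to_edit is an arbitrary illustrative cutoff.
    """
    upstream, original, _edited, downstream = split_pridict_format(seq)
    wild_type = upstream + original + downstream
    edit_start = len(upstream)

    hits = []
    # Zero-width lookahead so that overlapping NGG sites are all reported.
    for m in re.finditer(r"(?=[ACGT]GG)", wild_type):
        pam_start = m.start()
        if pam_start < 20:
            continue  # no room for a 20-nt spacer upstream of the PAM
        nick = pam_start - 3  # SpCas9 nicks 3 bp upstream of the PAM
        distance = edit_start - nick
        if 0 <= distance <= max_nick_to_edit:
            hits.append({
                "pam_start": pam_start,
                "pam": wild_type[pam_start:pam_start + 3],
                "spacer": wild_type[pam_start - 20:pam_start],
                "nick_to_edit": distance,
            })
    return hits


if __name__ == "__main__":
    example = "A" * 90 + "TGG" + "A" * 7 + "(a/g)" + "C" * 100
    for hit in forward_pam_candidates(example):
        print(hit)
```

In this sketch the bracket notation is consumed entirely by the search; what would go on to a model are quantities derived from each hit, such as the spacer and the nick-to-edit distance, which matches the point that the features rather than the PRIDICT format are the model input.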

Regarding the training code: we are currently working on an update that will include the raw training code, but it still needs some refactoring to make it usable for others.

Feel free to reach out with any further questions.

Best, Nicolas

PeihengLu commented 5 months ago

Hi! Quick follow-up: is there any update on the training code?

Thanks! Peiheng

mathinic commented 3 months ago

Hi Peiheng, sorry for the late reply! We have added cleaned-up training code for the updated model, PRIDICT2.0 (access it here). You'll find a few notebooks there and can run the training workflow within them. Hope this helps!

Best, Nicolas

uzh-dqbm-cmi / PRIDICT, issue #6