Closed SophiaNt97 closed 3 months ago
Hi Sophia,
You've made a sharp observation regarding the unusual PRIDICT sequence format in the Excel sheet from library 2. That column dates from an earlier stage of the project, and the PRIDICT sequence format is not what we use to train the model.
Here's some context:
We introduced the (artificial) 100 bp limit later, after the training phase, to make sure that we get enough sequence context from users to look for suitable PAMs. The sequences in the library 2 column are shorter because that extra context was not needed there (a suitable PAM was already close to the edit).
If you check out the code in pridict_pegRNA_design.py, you can see that we use the PRIDICT format (xxx(A/T)xxx) only to look for suitable PAMs and then design the needed features for each pegRNA. After this, the features (and not the PRIDICT format) are used as input to the model for predicting efficiency (see function deeppridict on line 317).
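To illustrate the idea of scanning the sequence around an edit for suitable PAMs, here is a minimal sketch. It is not the code from pridict_pegRNA_design.py: the function name `find_ngg_pams`, the `window` parameter, the assumption of an SpCas9-style NGG PAM, and the restriction to the forward strand are all illustrative choices made for this example.

```python
import re


def find_ngg_pams(seq, edit_pos, window=25):
    """Return 0-based start positions of forward-strand NGG PAMs
    whose first base lies within `window` bases of `edit_pos`.

    Hypothetical helper for illustration only; assumes an NGG PAM
    and ignores the reverse strand.
    """
    seq = seq.upper()
    # The lookahead makes the match zero-width, so overlapping PAMs are all found.
    starts = [m.start() for m in re.finditer(r"(?=[ACGT]GG)", seq)]
    return [s for s in starts if abs(s - edit_pos) <= window]


if __name__ == "__main__":
    example = "ACGTTGGACCAGGT"
    print(find_ngg_pams(example, edit_pos=7))
```

A short flank leaves few or no candidate positions inside the window, which is the reason for asking users for generous sequence context.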
Regarding your question about the training code, we are currently working on an update to include additional raw code for the training itself, but this still needs some refactoring to make it usable for others.
Feel free to reach out with any further questions.
Best, Nicolas
Hi! Quick follow-up: is there any update on the training code?
Thanks! Peiheng
I am currently working on reproducing the results of the following work: 'Predicting the efficiency of prime editing guide RNAs in human cells.' To run the PRIDICT tool, the format of the sequence used as input should have the following characteristics: 'xxxxxxxxx(a/g)xxxxxxxxxx.' A minimum of 100 bases up and downstream of the brackets are needed. Unchanged edit-flanking bases should be placed outside of the brackets (e.g., xxxT(a/g)Cxxx instead of xxx(TAC/TGC)xxx).
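The input rules quoted above can be checked with a small script. The following is a minimal sketch, not part of the PRIDICT code base: the function name `parse_pridict`, the returned dictionary keys, and the `min_flank` parameter are hypothetical, and only the replacement form `xxx(orig/edit)xxx` is handled.

```python
import re

# Matches left flank, original bases, edited bases, right flank.
PRIDICT_PATTERN = re.compile(
    r"^([ACGT]*)\(([ACGT]*)/([ACGT]*)\)([ACGT]*)$", re.IGNORECASE
)


def parse_pridict(seq, min_flank=100):
    """Parse a replacement-style PRIDICT string and normalize it.

    Hypothetical helper for illustration only.
    """
    m = PRIDICT_PATTERN.match(seq.strip())
    if m is None:
        raise ValueError("expected the form xxx(orig/edit)xxx")
    left, orig, edit, right = m.groups()
    # Move unchanged leading bases out of the brackets into the left flank.
    while orig and edit and orig[0].upper() == edit[0].upper():
        left += orig[0]
        orig, edit = orig[1:], edit[1:]
    # Move unchanged trailing bases out of the brackets into the right flank.
    while orig and edit and orig[-1].upper() == edit[-1].upper():
        right = orig[-1] + right
        orig, edit = orig[:-1], edit[:-1]
    return {
        "normalized": f"{left}({orig}/{edit}){right}",
        "left_len": len(left),
        "right_len": len(right),
        "long_enough": len(left) >= min_flank and len(right) >= min_flank,
    }


if __name__ == "__main__":
    print(parse_pridict("AAA(TAC/TGC)GGG")["normalized"])
```

Running a library 2 entry through such a check would report `long_enough` as false whenever a flank is shorter than 100 bases, which is the mismatch described in this question.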
I have noticed that in library 2 of the article's datasets, the PRIDICT sequence format given in the Excel sheet does not meet these criteria. Could you tell me whether the sequences are extended during the training process? If they are, would it be possible to send me the source code to retrain the tool?
Thank you for your time!