uzh-dqbm-cmi / PRIDICT

Prime editing guide RNA prediction
https://pridict.it/
MIT License

output dir not functioning #5

Closed JAMKuttan closed 9 months ago

JAMKuttan commented 1 year ago

Hey there again. I got one more for you.

If I pass --output-dir in my call, it doesn't write the output to that location on my computer. It also doesn't make a folder in the directory I am in. Maybe I am misunderstanding how this is supposed to be used?

mathinic commented 1 year ago

Hi again!

Did you create the folder already before running the command? The tool doesn't currently create the folder if it doesn't exist. I have updated the README to mention that the folder needs to be created first.

I ran the command python pridict_pegRNA_design.py batch --input-fname design.csv --use_5folds --nicking --output-dir ./new_predictions after creating the new_predictions folder and can confirm that it successfully saved the output in there.
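Since the tool does not create a missing output folder, one way to avoid the problem is to create the folder from Python before invoking the script. The helper below is a minimal sketch, not part of PRIDICT; the name `ensure_output_dir` is hypothetical.

```python
from pathlib import Path


def ensure_output_dir(path):
    """Create the directory (and any missing parents) if it does not exist.

    Hypothetical helper, not part of PRIDICT. Returns the directory as a Path.
    """
    out = Path(path)
    # exist_ok=True makes the call safe to repeat on an existing folder.
    out.mkdir(parents=True, exist_ok=True)
    return out
```

Calling `ensure_output_dir("./new_predictions")` before running the batch command with `--output-dir ./new_predictions` has the same effect as creating the folder by hand.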

JAMKuttan commented 1 year ago

Okay that is really good to know. I am having a new issue now. It seems to not want to produce any results anymore. Anytime I run this command I end up with an empty predictions folder.

python pridict_pegRNA_design.py batch --input-fname design.csv --use_5folds --nicking

Calculating features took 21.9 seconds to run.
Deep model took 69.8 seconds to run.
-- Exception occured -- Length of values (2287) does not match length of index (459)
1it [00:14, 14.69s/it]
1it [00:19, 19.98s/it]

Calculating features took 22.4 seconds to run.
Deep model took 73.1 seconds to run.
-- Exception occured -- Length of values (2291) does not match length of index (459)
1it [00:12, 12.50s/it]

Calculating features took 22.0 seconds to run.
Deep model took 74.2 seconds to run.
-- Exception occured -- Length of values (2289) does not match length of index (459)
1it [00:04, 4.62s/it]
1it [00:01, 1.86s/it]

Calculating features took 33.5 seconds to run.
Deep model took 65.8 seconds to run.
-- Exception occured -- Length of values (3052) does not match length of index (612)
<<< joined row computation process
<<< joined row computation process
<<< joined row computation process
<<< joined row computation process
<<< joined row computation process

JAMKuttan commented 1 year ago

So, to follow up on the comment above: I did a bunch of testing with different combinations. I believe there is something wrong with "--use_5folds". What exactly does that algorithm do that the normal one doesn't? When I remove that flag from my command, I get the CSV files that I expect.
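The wording of the exception in the log matches the ValueError that pandas raises when a column is assigned a sequence whose length differs from the number of rows in the DataFrame. The snippet below is a minimal sketch of that general failure mode using made-up data; it does not reproduce PRIDICT's internals, and the column names are assumptions.

```python
import pandas as pd

# Hypothetical data: 3 rows in the table, but 5 values to assign.
df = pd.DataFrame({"pegRNA": ["peg0", "peg1", "peg2"]})
scores = [0.1, 0.2, 0.3, 0.4, 0.5]

try:
    # Assigning a list whose length differs from the index length fails.
    df["score"] = scores
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

In the log above, the value count is several times the index length, which would be consistent with predictions from multiple folds being assigned to a table sized for one, though the thread does not confirm the cause.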

mathinic commented 1 year ago

Is this the same design.csv file you used earlier? It's peculiar because it worked seamlessly with the previous command on my system (python pridict_pegRNA_design.py batch --input-fname design.csv --use_5folds --nicking --output-dir ./new_predictions). The --use_5folds option enables the algorithm to aggregate predictions from all 5 trained folds, theoretically enhancing the accuracy of the final prediction. However, in practice, the option is seldom necessary (and I myself rarely use it) as the variation between different folds is negligible (refer to Fig. 2b for Spearman correlations) and it consumes more time. It's worth mentioning that the default setting employs fold 1, which is consistent with the approach on the pridict.it website.

Nevertheless, the error should not occur, and I will take a closer look at how to resolve it! Thanks for bringing it up!
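To illustrate what aggregating predictions across folds can look like, here is a minimal sketch that averages per-pegRNA scores over several folds. It assumes the aggregate is a plain mean, which the thread does not state; `aggregate_folds` is a hypothetical name and not a PRIDICT function.

```python
from statistics import fmean


def aggregate_folds(fold_predictions):
    """Average per-pegRNA scores across folds.

    fold_predictions: one list of scores per fold, all in the same order.
    Assumes a plain mean; the real tool may aggregate differently.
    """
    lengths = {len(scores) for scores in fold_predictions}
    if len(lengths) != 1:
        # Folds that disagree on the number of pegRNAs cannot be aligned.
        raise ValueError("every fold must score the same number of pegRNAs")
    # zip(*...) groups the scores of each pegRNA across all folds.
    return [fmean(per_peg) for per_peg in zip(*fold_predictions)]
```

The length check matters because the averaged result must have exactly one value per pegRNA, the same shape a single fold would produce.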

JAMKuttan commented 1 year ago

I am pretty sure it is the same input file. Here is the file again. design.csv

But that is good to know. I am fine with just using the regular algorithm, especially if in practice you don't see much variation. That is also probably reassuring in a way.

JAMKuttan commented 1 year ago

Another question for you. How difficult would it be to alter your code to design guides for enzymes other than SpCas9?

mathinic commented 1 year ago

You could try to tweak the code for other enzymes by modifying the editorcharacteristics function on line 187 in pridict_pegRNA_design.py (for example, changing PAM to '(?=G)'). However, since it’s exclusively trained on SpCas9 and hasn't been validated for other enzymes, the reliability of the results is unknown. Keep in mind that, for instance, the modified model might inaccurately assess the efficiency of altering n in NGn in an NG-Cas9 context, because in the training dataset (NGG) this position-change was positively associated with editing.
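The suggested PAM value '(?=G)' is a regular-expression lookahead. A lookahead matches without consuming characters, so overlapping candidate sites are all reported. The sketch below shows this with Python's `re` module; `find_pam_starts` is a hypothetical helper, the sequence is made up, and '(?=GG)' is used here only as an assumed stand-in for an NGG-style pattern, not as a quote from the PRIDICT source.

```python
import re


def find_pam_starts(sequence, pam_pattern):
    """Return the 0-based start position of every match of pam_pattern.

    Hypothetical helper for illustration; not taken from PRIDICT.
    """
    return [m.start() for m in re.finditer(pam_pattern, sequence.upper())]


seq = "AGGGT"  # made-up sequence

gg_lookahead = find_pam_starts(seq, "(?=GG)")  # overlapping GG sites
gg_consuming = find_pam_starts(seq, "GG")      # non-overlapping matches only
g_lookahead = find_pam_starts(seq, "(?=G)")    # relaxed pattern from the thread

print(gg_lookahead, gg_consuming, g_lookahead)
```

Relaxing the pattern increases the number of candidate positions, but as noted above, the model's scores for those extra sites have not been validated.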