schmidt73 opened this issue 3 years ago
Hi @schmidt73 - if your guides are targeting coding sequences, the easiest/quickest solution might be to use the web tool at https://crispr.ml/ . Searching for a gene there should give you all the valid guides targeting that gene, and if you click on a guide you can download a CSV with all of its near-match off-target sequences and precomputed model scores.
If you want to generate scores for arbitrary (possibly non-coding) guides, or for lots of guides (not feasible to search for all of them by hand), your best bet is to set up the model. This should be doable without the raw data, as long as you have the pickled versions of the processed data in the `tmp` directory (these are all stored using Git LFS, so you'll need to install that first if you don't have it already; this should really be in the README but it isn't currently).
Once you have those files, and you've installed all the Python dependencies listed here, you should be able to run this code snippet from a script or Python shell: https://github.com/microsoft/Elevation#guide-sequence-prediction . That should give you the general idea of how you can apply the trained model to your own guide sequences. It shouldn't take long to run (a minute or two at most) - if it takes longer, there's probably an issue loading the saved data.
Let me know if you run into issues or have questions. It would be nice if this process was less of a pain, but this repo isn't really being actively maintained (all of us have moved on to other positions or other projects) so we unfortunately haven't addressed the usability much.
@jjc2718 - thanks for the help. I am running the model on my own set of about 2 million (gRNA, off-targets) pairs. So I think it will be difficult to do that all through the web portal.
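My plan is to batch the pairs through the Python API rather than score them all in one call. A generic sketch of what I have in mind (`score_batch` here is just a stand-in for whatever prediction call ends up working, not an actual Elevation function):

```python
def batched(pairs, batch_size=10000):
    """Yield successive slices from a list of (guide, offtarget) pairs."""
    for start in range(0, len(pairs), batch_size):
        yield pairs[start:start + batch_size]

def score_all(pairs, score_batch, batch_size=10000):
    """Score pairs in batches; `score_batch` is a placeholder for the
    model's prediction call and is assumed to return one score per pair."""
    results = []
    for batch in batched(pairs, batch_size):
        results.extend(score_batch(batch))
    return results
```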
I think everything should work properly; however, I'm running into a dependency conflict that I can't resolve:
```shell
ERROR: azimuth 2.0 has requirement scikit-learn<0.18,>=0.17.1
ERROR: elevation 1.0.0 has requirement scikit-learn>=0.18
```
That is, azimuth requires the scikit-learn version to be less than 0.18, while elevation requires it to be greater than or equal to 0.18.
Should I open a new issue for this bug? I know that this repo isn't actively maintained, so if it is too much of a hassle, I can forego trying to compare against it directly.
Hope all is well.
Got it - yeah, 2 million pairs is definitely too many to do by hand.
> That is, azimuth requires the scikit-learn version to be less than 0.18, while elevation requires it to be greater than or equal to 0.18.
I don't think you should need azimuth unless you're re-training the model, so if you download scikit-learn 0.18 separately (e.g. through pip or conda) that may work.
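In case it's useful, here's a quick way to see why no single version can satisfy both packages. This is a hand-rolled sketch of the two version constraints (just for illustration, not a substitute for pip's own resolver):

```python
def version_tuple(v):
    """Turn a version string like '0.18.1' into (0, 18, 1) for comparison."""
    return tuple(int(part) for part in v.split("."))

def satisfies_elevation(v):
    """elevation requires scikit-learn >= 0.18."""
    return version_tuple(v) >= (0, 18)

def satisfies_azimuth(v):
    """azimuth requires scikit-learn >= 0.17.1 and < 0.18."""
    return (0, 17, 1) <= version_tuple(v) < (0, 18)
```

The two ranges don't overlap, so pip can never satisfy both; dropping azimuth (since you're not retraining) sidesteps the conflict entirely.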
If I remember correctly, the exact order of installations in the "install/develop" section of the README is what worked for me (which is why I listed them there rather than just including a conda `environment.yml` or something simpler like that). I haven't tried to set things up from scratch recently, though - I'll take a look today.
> Should I open a new issue for this bug? I know that this repo isn't actively maintained, so if it is too much of a hassle, I can forego trying to compare against it directly.
Sure, feel free to create an issue. We were aware of the conflicting requirements, but never had the time to fix things when we were trying to get the software out.
Perfect. I'll try to get it running without Azimuth then - you're correct in assuming I don't want to retrain the model.
Thanks for the help again.
Sounds good!
In case it helps, I spent some time this afternoon going through the installation process on my own. It is definitely more complicated than I remember, but I was able to get the model up and running starting from a clean Conda environment on my Linux system. I attached a list of the commands I ran and the order in which I ran them (this unfortunately matters) to set up the environment.
You'll also need to download the `.pkl` files in `tmp`, `elevation/saved_models`, and `tests/fixtures`. These can either be downloaded using `git lfs` when you clone the repo, or directly from GitHub (click on "view raw" and they'll download).
Once you do both of those things, the tests should run successfully (using, e.g., `nosetests tests` from the root directory), and you should be able to run the sample code in the README. You'll want to use the `linear-raw-stacker` output to rank your guide/off-target pairs (higher value = higher likelihood of off-target activity); `CFD` was an older method we were comparing against.
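So once you have one `linear-raw-stacker` score per pair, ranking is just a descending sort. A minimal sketch with made-up example scores (the sequences and values below are purely illustrative):

```python
def rank_pairs(scored_pairs):
    """Sort (guide, offtarget, score) tuples so the pairs most likely
    to show off-target activity come first (higher score = more activity)."""
    return sorted(scored_pairs, key=lambda t: t[2], reverse=True)

# Hypothetical scored pairs:
pairs = [("GACTGACT", "GACTGACA", 0.12),
         ("TTGACCGA", "TTGCCCGA", 0.87),
         ("CCGATTGA", "CCGTTTGA", 0.33)]
```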
Hope this helps - let me know if you have questions.
Okay, so everything installed correctly. However, I am unable to run either the test suite or the sample code.
The problem stems from the Excel files no longer being available for download.
Here is the stacktrace when running the sample code:
```
from get_or_compute reading cached pickle /home/schmidt73/Desktop/Elevation/tmp/base_model.pkl
elevation/util.py:42: UserWarning: Failed to load /home/schmidt73/Desktop/Elevation/tmp/base_model.pkl
  warn("Failed to load %s" % file)
elevation/util.py:43: UserWarning: Recomputing. This may take a while...
  warn("Recomputing. This may take a while...")
Received option CV=False, so I'm training using all of the data
running AdaBoost, order 1 for final
Launching 8 jobs with 3 MKL threads each
reading and featurizing CD33 data...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "elevation/cmds/predict.py", line 46, in __init__
    self.base_model = self.get_base_model()
  File "elevation/cmds/predict.py", line 88, in get_base_model
    force_compute=force_compute
  File "elevation/util.py", line 45, in get_or_compute
    result = fargpair[0](*fargpair[1])
  File "elevation/prediction_pipeline.py", line 62, in train_base_model
    set_target_fn=set_target_elevation, pam_audit=False, length_audit=False)
  File "/home/schmidt73/miniconda3/envs/elevation/lib/python2.7/site-packages/azimuth/model_comparison.py", line 325, in run_models
    Y, feature_sets, target_genes, learn_options, num_proc = setup_function(test=test, order=order, learn_options=partial_learn_opt, pam_audit=pam_audit, length_audit=length_audit) # TODO precompute features for all orders, as this is repated for each model
  File "elevation/model_comparison.py", line 58, in setup_elevation
    data, Y, target_genes = elevation.load_data.load_cd33(learn_options)
  File "elevation/load_data.py", line 206, in load_cd33
    data_filt = pandas.read_excel(data_file_filt, index_col=[0], parse_cols=range(1,9))
  File "/home/schmidt73/miniconda3/envs/elevation/lib/python2.7/site-packages/pandas/io/excel.py", line 191, in read_excel
    io = ExcelFile(io, engine=engine)
  File "/home/schmidt73/miniconda3/envs/elevation/lib/python2.7/site-packages/pandas/io/excel.py", line 249, in __init__
    self.book = xlrd.open_workbook(io)
  File "/home/schmidt73/miniconda3/envs/elevation/lib/python2.7/site-packages/xlrd/__init__.py", line 394, in open_workbook
    f = open(filename, "rb")
IOError: [Errno 2] No such file or directory: '/home/schmidt73/Desktop/Elevation/CRISPR/data/offtarget/CD33_data_postfilter.xlsx'
```
Because the files are no longer available from Nature, perhaps you could just attach them to this issue? I think once I have them, everything will be resolved.
Thank you for your help.
Sure - here are the Excel files I have:
- STable 19 FractionActive_dlfc_lookup.xlsx
- Supplementary Table 10.xlsx
- CD33_data_postfilter.xlsx
- nbt.3117-S2.xlsx
- STable 18 CD33_OffTargetdata.xlsx
I'm not sure why it's trying to recompute parts of the model (the warnings at the top of your output), but give it a try with those files in the `CRISPR/data/offtarget` directory and see if the tests pass.
If you can't get them to pass, we can try to figure out why it's not able to load the precomputed pickle files, but I think the output should be the same either way.
Worked for me, thanks. Just note that the 'STable 19 FractionActive_dlfc_lookup.xlsx' file downloads as 'STable.19.FractionActive_dlfc_lookup.xlsx' and needs to be renamed to match.
I was having trouble running the script `CRISPR/download_data.sh`. In debugging, it turns out that the tables are no longer hosted at the URL in the script. See: https://images.nature.com/original/nature-assets/nbt/journal/v33/n2/extref/nbt.3117-S2.xlsx
Is it necessary to download these tables to run the model on my own (gRNA, off-targets) pairs? All I would like to do is run the model on my own data; I have no interest in replicating the results from the paper.