microsoft / Elevation

End-to-end guide design for CRISPR/Cas9 with machine learning
MIT License

Supplementary tables not available #6

Open schmidt73 opened 3 years ago

schmidt73 commented 3 years ago

I was having trouble running the script CRISPR/download_data.sh. While debugging, I found that the tables are no longer hosted at the URI in the script.

See: https://images.nature.com/original/nature-assets/nbt/journal/v33/n2/extref/nbt.3117-S2.xlsx

Is it necessary to download these tables to run the model on my own (gRNA, off-target) pairs? All I would like to do is run the model on my own data. I have no interest in replicating the results from the paper.

jjc2718 commented 3 years ago

Hi @schmidt73 - if your guides are targeting coding sequences, the easiest/quickest solution might be to use the web tool at https://crispr.ml/ . Searching for a gene there should give you all the valid guides targeting that gene, and if you click on a guide you can download a CSV with all of its near-match off-target sequences and precomputed model scores.

If you want to generate scores for arbitrary (possibly non-coding) guides, or for more guides than it's feasible to search for by hand, your best bet is to set up the model locally. This should be doable without the raw data, as long as you have the pickled versions of the processed data in the tmp directory. These files are stored using Git LFS, so you'll need to install that first if you don't have it already (this should really be in the README, but currently it isn't).

Once you have those files, and you've installed all the Python dependencies listed here, you should be able to run the code snippet at https://github.com/microsoft/Elevation#guide-sequence-prediction from a script or Python shell. That should give you a general idea of how to apply the trained model to your own guide sequences. It shouldn't take long to run (a minute or two at most); if it takes longer, there's probably an issue loading the saved data.
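For reference, the README example looks roughly like the sketch below (this is from memory, and the 23-mer guide + PAM sequences are made-up placeholders, not real guides):

```python
# Rough sketch of the README's guide sequence prediction example;
# the sequences are made-up placeholders (the repo targets Python 2.7).
from elevation.cmds.predict import Predict

wildtype = ['GACGCATAAAGATGAGACGCTGG']   # intended target site (guide + PAM)
offtarget = ['GACGCATAAAGATGTGACGCTGG']  # candidate off-target site

p = Predict()  # loads the saved models; this is the slow step
preds = p.execute(wildtype, offtarget)
print(preds)   # scores keyed by model, e.g. 'linear-raw-stacker' and 'CFD'
```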

Let me know if you run into issues or have questions. It would be nice if this process was less of a pain, but this repo isn't really being actively maintained (all of us have moved on to other positions or other projects) so we unfortunately haven't addressed the usability much.

schmidt73 commented 3 years ago

@jjc2718 - thanks for the help. I am running the model on my own set of about 2 million (gRNA, off-target) pairs, so I think it would be difficult to do that all through the web portal.

I think everything should work properly; however, I'm running into an unresolvable dependency conflict:

```shell
ERROR: azimuth 2.0 has requirement scikit-learn<0.18,>=0.17.1
ERROR: elevation 1.0.0 has requirement scikit-learn>=0.18
```

That is, azimuth requires a scikit-learn version below 0.18, while elevation requires version 0.18 or newer.

Should I open a new issue for this bug? I know that this repo isn't actively maintained, so if it is too much of a hassle, I can forgo trying to compare against it directly.

Hope all is well.

jjc2718 commented 3 years ago

Got it - yeah, 2 million pairs is definitely too many to do by hand.

> That is, azimuth requires a scikit-learn version below 0.18, while elevation requires version 0.18 or newer.

I don't think you should need azimuth unless you're re-training the model, so if you install scikit-learn 0.18 separately (e.g. through pip or conda), that may work.
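After installing, a quick sanity check on which version actually ended up in the environment:

```python
# Confirm the active scikit-learn version; elevation wants >= 0.18.
import sklearn
print(sklearn.__version__)
```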

If I remember correctly, the exact order of installations in the "install/develop" section of the README is what worked for me (which is why I listed them there rather than just including a conda environment.yml or something simpler like that). I haven't tried to set things up from scratch recently, though - I'll take a look today.

> Should I open a new issue for this bug? I know that this repo isn't actively maintained, so if it is too much of a hassle, I can forgo trying to compare against it directly.

Sure, feel free to create an issue. We were aware of the conflicting requirements, but never had the time to fix things when we were trying to get the software out.

schmidt73 commented 3 years ago

Perfect. I'll try to get it running without Azimuth then - you're correct in assuming I don't want to retrain the model.

Thanks for the help again.

jjc2718 commented 3 years ago

Sounds good!

In case it helps, I spent some time this afternoon going through the installation process myself. It is definitely more complicated than I remembered, but I was able to get the model up and running from a clean Conda environment on my Linux system. I've attached a list of the commands I ran to set up the environment, in the order I ran them (the order unfortunately matters).

install_commands.txt

You'll also need to download the .pkl files in tmp, elevation/saved_models, and tests/fixtures. These can either be fetched with Git LFS when you clone the repo, or downloaded directly from GitHub (click "View raw" on each file and it will download).
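One pitfall worth checking for (my own suggestion, not anything built into the repo): if Git LFS wasn't set up when you cloned, those .pkl files will be small text pointer stubs rather than real pickles, and loading them will fail. A quick way to check:

```python
# Detect Git LFS pointer stubs: real pickles are binary, while LFS pointers
# are small text files starting with "version https://git-lfs".
from __future__ import print_function  # repo environment is Python 2.7
import glob

paths = (glob.glob('tmp/*.pkl')
         + glob.glob('elevation/saved_models/*.pkl')
         + glob.glob('tests/fixtures/*.pkl'))
for path in paths:
    with open(path, 'rb') as f:
        head = f.read(40)
    if head.startswith(b'version https://git-lfs'):
        print('LFS pointer only (run git lfs pull):', path)
    else:
        print('looks ok:', path)
```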

Once you do both of those things, the tests should run successfully (e.g. by running `nosetests tests` from the root directory), and you should be able to run the sample code in the README. You'll want to use the linear-raw-stacker output to rank your guide/off-target pairs (a higher value means a higher predicted likelihood of off-target activity); CFD is an older method we were comparing against.
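For your 2 million pairs, the batch flow would look something like this sketch (same assumed Predict interface as the snippet above, with placeholder sequences):

```python
# Sketch: score (guide, off-target) pairs in batch, then rank by the
# linear-raw-stacker score. Sequences are placeholders; in practice you'd
# load your ~2M pairs from a file.
from __future__ import print_function  # repo environment is Python 2.7
from elevation.cmds.predict import Predict

pairs = [
    ('GACGCATAAAGATGAGACGCTGG', 'GACGCATAAAGATGTGACGCTGG'),
    ('GACGCATAAAGATGAGACGCTGG', 'GACGCATAAACATGAGACGCTGG'),
]

wildtype = [g for g, _ in pairs]
offtarget = [o for _, o in pairs]

preds = Predict().execute(wildtype, offtarget)
scores = preds['linear-raw-stacker']

# Higher score = higher predicted off-target activity.
for (guide, off), score in sorted(zip(pairs, scores),
                                  key=lambda x: x[1], reverse=True):
    print(score, guide, off)
```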

Hope this helps - let me know if you have questions.

schmidt73 commented 3 years ago

Okay, so everything installed correctly. However, I am unable to run either the test suite or the sample code.

The problem stems from the Excel files no longer being available for download.

Here is the stack trace from running the sample code:

```
from get_or_compute reading cached pickle /home/schmidt73/Desktop/Elevation/tmp/base_model.pkl
elevation/util.py:42: UserWarning: Failed to load /home/schmidt73/Desktop/Elevation/tmp/base_model.pkl
  warn("Failed to load %s" % file)
elevation/util.py:43: UserWarning: Recomputing. This may take a while...
  warn("Recomputing. This may take a while...")
Received option CV=False, so I'm training using all of the data
running AdaBoost, order 1 for final
Launching 8 jobs with 3 MKL threads each
reading and featurizing CD33 data...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "elevation/cmds/predict.py", line 46, in __init__
    self.base_model = self.get_base_model()
  File "elevation/cmds/predict.py", line 88, in get_base_model
    force_compute=force_compute
  File "elevation/util.py", line 45, in get_or_compute
    result = fargpair[0](*fargpair[1])
  File "elevation/prediction_pipeline.py", line 62, in train_base_model
    set_target_fn=set_target_elevation, pam_audit=False, length_audit=False)
  File "/home/schmidt73/miniconda3/envs/elevation/lib/python2.7/site-packages/azimuth/model_comparison.py", line 325, in run_models
    Y, feature_sets, target_genes, learn_options, num_proc = setup_function(test=test, order=order, learn_options=partial_learn_opt, pam_audit=pam_audit, length_audit=length_audit) # TODO precompute features for all orders, as this is repated for each model
  File "elevation/model_comparison.py", line 58, in setup_elevation
    data, Y, target_genes = elevation.load_data.load_cd33(learn_options)
  File "elevation/load_data.py", line 206, in load_cd33
    data_filt = pandas.read_excel(data_file_filt, index_col=[0], parse_cols=range(1,9))
  File "/home/schmidt73/miniconda3/envs/elevation/lib/python2.7/site-packages/pandas/io/excel.py", line 191, in read_excel
    io = ExcelFile(io, engine=engine)
  File "/home/schmidt73/miniconda3/envs/elevation/lib/python2.7/site-packages/pandas/io/excel.py", line 249, in __init__
    self.book = xlrd.open_workbook(io)
  File "/home/schmidt73/miniconda3/envs/elevation/lib/python2.7/site-packages/xlrd/__init__.py", line 394, in open_workbook
    f = open(filename, "rb")
IOError: [Errno 2] No such file or directory: '/home/schmidt73/Desktop/Elevation/CRISPR/data/offtarget/CD33_data_postfilter.xlsx'
```

Because the files are no longer available from Nature, perhaps you could just attach them to this issue? I think once I have them, everything will be resolved.

Thank you for your help.

jjc2718 commented 3 years ago

Sure - here are the Excel files I have:

STable 19 FractionActive_dlfc_lookup.xlsx
Supplementary Table 10.xlsx
CD33_data_postfilter.xlsx
nbt.3117-S2.xlsx
STable 18 CD33_OffTargetdata.xlsx

I'm not sure why it's trying to recompute parts of the model (the warnings at the top of your output), but give it a try with those files in the CRISPR/data/offtarget directory and see if the tests pass.
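If it's useful, here's a quick check that the files ended up where load_data.py looks for them (filenames from the list above, directory from your traceback):

```python
# Confirm the Excel files are present in CRISPR/data/offtarget.
from __future__ import print_function  # repo environment is Python 2.7
import os

expected = [
    'CD33_data_postfilter.xlsx',
    'nbt.3117-S2.xlsx',
    'STable 19 FractionActive_dlfc_lookup.xlsx',
    'Supplementary Table 10.xlsx',
    'STable 18 CD33_OffTargetdata.xlsx',
]
for name in expected:
    path = os.path.join('CRISPR', 'data', 'offtarget', name)
    print('ok' if os.path.exists(path) else 'MISSING', path)
```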

If you can't get them to pass, we can try to figure out why it's not able to load the precomputed pickle files, but I think the output should be the same either way.

udiland commented 2 years ago

Worked for me, thanks. Just note that the 'STable 19 FractionActive_dlfc_lookup.xlsx' file downloads as 'STable.19.FractionActive_dlfc_lookup.xlsx' and needs to be renamed back.