microsoft / Elevation

End-to-end guide design for CRISPR/Cas9 with machine learning
MIT License

Installation instruction needs some updates, + questions #2

Open purplerainf opened 6 years ago

purplerainf commented 6 years ago

Hi. I installed Elevation on a virtual machine and would like to share my experience.

The installation instructions are detailed and easy to follow, but a few issues are not mentioned.

1) csh is not supported. I was able to install and run Elevation using the bash shell, in a virtual machine. In particular, the 'source activate' command does not work in csh.
2) azure and azure-storage are required to run guideseq.py.
3) pymysql and azimuth are required by dsNickFury, so I had to run the following commands after creating the Python 3 environment:
   dependencies/anaconda2/envs/dsNickFury/bin/pip install pymysql
   dependencies/anaconda2/envs/dsNickFury/bin/pip install azimuth
4) Sometimes the 'git clone' command does not copy the pickle files properly, so I checked the size of every pickle file and manually downloaded the ones that were damaged.
5) There are a few typos:
   "dsNickFury/dependencies/anaconda2/bin/conda create -n dsNickFury python==3"
   should be
   "dsNickFury/dependencies/anaconda2/bin/conda create -n dsNickFury python=3"
   and
   "roc_data, roc_Y_bin, roc_Y_vals = elevation.load_data.load_HauesslerFig2()"
   should be
   "roc_data, roc_Y_bin, roc_Y_vals = elevation.load_data.load_HauesslerFig2(1)"
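The size check from item 4 can be scripted. Below is a minimal sketch, not part of Elevation: the function name `find_suspicious_pickles` and the 1024-byte threshold are my own assumptions, and a small file is only a hint that the download was incomplete, not proof.

```python
import os


def find_suspicious_pickles(root, min_bytes=1024):
    """Return sorted (path, size) pairs for .pkl files under root smaller than min_bytes.

    min_bytes is an arbitrary, assumed threshold; adjust it to the expected file sizes.
    """
    suspicious = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".pkl"):
                path = os.path.join(dirpath, name)
                size = os.path.getsize(path)
                if size < min_bytes:
                    suspicious.append((path, size))
    return sorted(suspicious)
```

Calling `find_suspicious_pickles(".")` from the top of the cloned repository lists the files worth downloading again by hand.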

After that, I was able to run the example scripts successfully, but I still don't fully understand the example. Below is the output of the Aggregation Prediction Example. I printed the first 10 (wildtype, offtarget, prediction) tuples.

/home/yeuy/anaconda2/envs/elevation/lib/python2.7/site-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
/home/yeuy/anaconda2/envs/elevation/lib/python2.7/site-packages/sklearn/grid_search.py:43: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. This module will be removed in 0.20.
  DeprecationWarning)
loading hauessler version 1
from get_or_compute reading cached pickle /home/yeuy/tools/git/Elevation/tmp/base_model.pkl
from get_or_compute reading cached pickle /home/yeuy/tools/git/Elevation/tmp/guideseq_data.pkl
from get_or_compute reading cached pickle /home/yeuy/tools/git/Elevation/tmp/gspred.pkl
from get_or_compute reading cached pickle /home/yeuy/tools/git/Elevation/tmp/cd33.pkl
from get_or_compute reading cached pickle /home/yeuy/tools/git/Elevation/tmp/calibration_models.pkl
Time spent loading pickles: 23.1609959602
Time spent parsing input: 0.00233101844788
predict_elevation allocating 28 cores
start_range=0, end_range=99
predict_elevation: 0.00 perc. done (0 of 100 using block_size=10000)
Time spent in base model predict(): 2.81073284149
train data set size is N=709
/home/yeuy/anaconda2/envs/elevation/lib/python2.7/site-packages/sklearn/linear_model/coordinate_descent.py:1082: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
/home/yeuy/anaconda2/envs/elevation/lib/python2.7/site-packages/sklearn/linear_model/coordinate_descent.py:484: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Fitting data with very small alpha may cause precision problems.
  ConvergenceWarning)
Time spent in stacked_predictions: 2.33333802223
('TGGATGGAGGAATGAGGAGTTGG', 'AGGAAGGATGACTGAGGAGTGAG', ['CFD=[0.01819363]', 'linear-raw-stacker=1.6875747876411826e-07'])
('GGTGAGTGAGTGTGTGCGTGTGG', 'CGTGTGTGCGTGTGTGCGTGTGG', ['CFD=[0.14842301]', 'linear-raw-stacker=0.00013041367286805436'])
('GGTGAGTGAGTGTGTGCGTGTGG', 'TGTGTATGAGTGTGTGGGTGTAG', ['CFD=[0.00554565]', 'linear-raw-stacker=1.6679010889233097e-07'])
('GCCTCCCCAAAGCCTGGCCAGGG', 'GCTTCCCCAGTGCCTGGACATGG', ['CFD=[0.06328074]', 'linear-raw-stacker=4.70918702105513e-06'])
('GACCCCCTCCACCCCGCCTCCGG', 'GAGCCACTGCACCCAGCCTCTAG', ['CFD=[0.01655889]', 'linear-raw-stacker=3.011665011162303e-07'])
('GGTGAGTGAGTGTGTGCGTGTGG', 'GGTTTGTGTGTGTGTGTGTGTGG', ['CFD=[0.03702479]', 'linear-raw-stacker=7.715443862857503e-07'])
('GACTTGTTTTCATTGTTCTCAGG', 'GATTTGTGTTGATTGTTGTCAGG', ['CFD=[0.01680556]', 'linear-raw-stacker=3.4906015350337787e-06'])
('AAATGAGAAGAAGAGGCACAGGG', 'AAAGGTGAAGAAGGGACACAAAG', ['CFD=[0.05401235]', 'linear-raw-stacker=4.83568059751788e-07'])
('CCAGTGAGTAGAGCGGAGGCAGG', 'CCAGTGAGGAGAGAGGGAGCAGG', ['CFD=[0.02647059]', 'linear-raw-stacker=6.776440644019873e-07'])
('AAATGAGAAGAAGAGGCACAGGG', 'AAAAGAAAAGAAGAGGAATATGG', ['CFD=[0.1025641]', 'linear-raw-stacker=4.048623509979716e-06'])
[0.50926125]
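To compare the CFD and linear-raw-stacker values pair by pair, the printed tuples can be parsed back into numbers. This is a rough sketch that assumes each tuple is printed exactly in the format shown in the output; `parse_prediction` is a helper name I made up, not an Elevation function.

```python
import ast


def parse_prediction(line):
    """Parse one printed (wildtype, offtarget, scores) tuple into a dict."""
    wildtype, offtarget, scores = ast.literal_eval(line)
    parsed = {"wildtype": wildtype, "offtarget": offtarget}
    for entry in scores:
        name, _, value = entry.partition("=")
        # CFD is printed as a one-element array such as "[0.01819363]",
        # so strip the brackets before converting to float.
        parsed[name] = float(value.strip("[]"))
    return parsed


# First tuple from the output, copied verbatim.
line = ("('TGGATGGAGGAATGAGGAGTTGG', 'AGGAAGGATGACTGAGGAGTGAG', "
        "['CFD=[0.01819363]', 'linear-raw-stacker=1.6875747876411826e-07'])")
row = parse_prediction(line)
print(row["CFD"], row["linear-raw-stacker"])
```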

My questions are:

1) What does the final score 0.50926125 mean? Does a higher score mean lower off-target activity?
2) In the example, every wildtype (i.e. the guide RNA sequence) is different, yet the scores are aggregated. I don't think this is a typical situation. For general use, I guess I should pass the same wildtype with different offtargets, for example (WT1 - OT1), (WT1 - OT2), (WT1 - OT3), ... Am I correct?
3) I guess the 'isgenic' parameter is a 1/0 numpy array indicating whether each offtarget lies in a coding region. Am I correct?
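To make questions 2 and 3 concrete, this is the input layout I have in mind. It is only a sketch of my guess and does not call Elevation: `pair_guide_with_offtargets` is a made-up helper, the sequences are taken from the output above, and the `isgenic` flags are placeholders rather than real annotations (a plain list stands in for the numpy array).

```python
def pair_guide_with_offtargets(guide, offtargets):
    """Repeat one guide so it lines up element-wise with its off-targets."""
    offtargets = list(offtargets)
    wildtype = [guide] * len(offtargets)
    return wildtype, offtargets


guide = "GGTGAGTGAGTGTGTGCGTGTGG"
wildtype, offtarget = pair_guide_with_offtargets(
    guide,
    [
        "CGTGTGTGCGTGTGTGCGTGTGG",
        "TGTGTATGAGTGTGTGGGTGTAG",
        "GGTTTGTGTGTGTGTGTGTGTGG",
    ],
)

# Placeholder flags only: 1 would mean "in a coding region" under my guess in question 3.
isgenic = [0] * len(offtarget)

for wt, ot, flag in zip(wildtype, offtarget, isgenic):
    print(wt, ot, flag)
```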

I look forward to hearing from you. Thank you very much.