Open ghost opened 2 years ago
And, just as an addendum, the error/output that I get is:
(test-env) benjamin@benjamin-T460:~/owen/GPCR_LigandClassify$ python GPCR_LigandClassify.py --input_file sample_input.csv --output_file output.csv --n_rows_to_read 1200
/home/benjamin/anaconda3/envs/test-env/lib/python2.7/site-packages/sklearn/ensemble/weight_boosting.py:29: DeprecationWarning: numpy.core.umath_tests is an internal NumPy module and should not be imported. It will be removed in a future NumPy release.
from numpy.core.umath_tests import inner1d
Using TensorFlow backend.
PANDAS version 0.22.0
########################
The following Python libraries are required to run the models:
* Python 2.7 (tested with the Anaconda distribution on Linux: Mint 19.1 on a local PC, and 3.10.0-957.12.2.el7.x86_64 GNU/Linux on Compute Canada clusters. Running the models on macOS may be cumbersome because of recent XGBoost updates. We did not test prediction on Windows.)
* DeepChem 1.x (Requires RDKit)
* Pandas (Prediction is tested with Pandas 0.22)
* Tensorflow 1.3
* Keras
* XGBoost
* ScikitLearn
########################
In addition to the input file (see below), the following files must exist in the running directory for the script to run:
* dl_model_fp.json
* dl_model_fp.h5
* mlp_rdkit_classify_fp.sav
* xgb_rdkit_classify_fp.sav
* rfc_rdkit_classify_fp.sav
* svm_rdkit_classify_fp.sav
* coded_gpcr_list.csv
NB: The rfc_rdkit_classify_fp.sav, svm_rdkit_classify_fp.sav & mlp_rdkit_classify_fp.sav models are required only if the [--ignore_rf_svm argument] option in the script is set to False (True is the default behaviour).
These models are not deposited in the GitHub repository because of size limits; to obtain them, send a direct request to mmahmed@ualberta.ca & kbarakat@ualberta.ca
########################
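As an aside, before launching the script it can help to verify that the auxiliary files listed above are actually present in the running directory. A minimal sketch (file names taken from the list above; the split into required vs. optional follows the NB about --ignore_rf_svm):

```python
import os

# Files from the script's banner; the three .sav models below REQUIRED are
# only needed when --ignore_rf_svm is set to False.
REQUIRED = [
    "dl_model_fp.json",
    "dl_model_fp.h5",
    "xgb_rdkit_classify_fp.sav",
    "coded_gpcr_list.csv",
]
OPTIONAL = [
    "mlp_rdkit_classify_fp.sav",
    "rfc_rdkit_classify_fp.sav",
    "svm_rdkit_classify_fp.sav",
]

def missing_files(names, directory="."):
    """Return the subset of `names` not found as files in `directory`."""
    return [n for n in names if not os.path.isfile(os.path.join(directory, n))]

print("Missing required files:", missing_files(REQUIRED))
```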
Welcome to GPCR_LigandClassify. This is how you can use the program and its models to make novel predictions; we hope you find these predictions useful for your task:
python GPCR_LigandClassify.py --input_file input.csv --output_file output.csv [--n_rows_to_read <INTEGER>] [--mwt_lower_bound <FLOAT>] [--mwt_upper_bound <FLOAT>] [--logp_lower_bound <FLOAT>] [--logp_upper_bound <FLOAT>] [--ignore_rf_svm <True/False>]
########################
The input & output file name arguments are mandatory. The --n_rows_to_read argument determines how many rows to read from the input CSV file (default 9999999999 rows); the rest are optional, with defaults matching the dataset used for model training.
########################
The --ignore_rf_svm argument will skip the RF and SVM models, which are quite large; this is suitable in case of limited computational resources, particularly memory. Default is True (ignore the Random Forest and SVM models).
########################
Please note that today's date string will be appended to the output file name.
########################
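For reference, the date suffix matches the output file name shown further down (output_2022-10-07.csv), i.e. an ISO date inserted before the extension. A sketch of how such a name can be built (the helper name is mine, not the script's):

```python
import datetime
import os

def dated_output_name(path, today=None):
    """Insert an ISO date (YYYY-MM-DD) before the file extension,
    e.g. output.csv -> output_2022-10-07.csv."""
    today = today or datetime.date.today()
    root, ext = os.path.splitext(path)
    return "{}_{}{}".format(root, today.isoformat(), ext)

print(dated_output_name("output.csv", datetime.date(2022, 10, 7)))
# output_2022-10-07.csv
```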
Please note that the script will only save ligands where all models predictions agree.
########################
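A consensus filter like the one described (keep only ligands where all model predictions agree) can be sketched in pandas; the column names and data here are hypothetical stand-ins, not the script's real ones:

```python
import pandas as pd

# Hypothetical prediction columns; the real script uses names such as
# prediction_class_dl_model_fp_prediction, prediction_class_xgb_fp_prediction.
df = pd.DataFrame({
    "pred_dl":  ["5HT1A", "D2", "M1"],
    "pred_xgb": ["5HT1A", "D2", "D3"],
})

pred_cols = ["pred_dl", "pred_xgb"]
# Keep rows where every prediction column holds the same value.
agree = df[pred_cols].nunique(axis=1) == 1
consensus = df[agree]
print(consensus)
```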
For the input file, please keep the same format as the attached sample input file. For data coming from a different source, you can populate the rest of the columns with fake data.
With the exception of the SMILES column, the other columns may be left blank (not recommended).
########################
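For example, a minimal input row can be written with a real SMILES string and placeholder values elsewhere. The column names below are inferred from the Index([...]) printout further down in the log; treat them as an assumption:

```python
import csv

# Column names inferred from the Index([...]) printout in the log;
# only 'smiles' needs real data, the rest may be placeholders.
columns = ["index", "drugbank_id", "name", "smiles",
           "pubchem_substance_id", "drug_groups"]

with open("sample_input.csv", "w") as fh:
    writer = csv.DictWriter(fh, fieldnames=columns)
    writer.writeheader()
    writer.writerow({
        "index": 0,
        "drugbank_id": "FAKE0001",          # placeholder
        "name": "aspirin",
        "smiles": "CC(=O)Oc1ccccc1C(=O)O",  # the only column that must be real
        "pubchem_substance_id": "",
        "drug_groups": "",
    })
```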
For the models and auxiliary files, please visit the following github repository:
https://github.com/mmagithub/GPCR_LigandClassify
########################
inputfile: /home/benjamin/owen/GPCR_LigandClassify/sample_input.csv
outputfile: /home/benjamin/owen/GPCR_LigandClassify/output_2022-10-07.csv
GPCR_LigandClassify.py:207: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
drug_bank_df_selected_cols.dropna(subset = ['smiles'],inplace=True)
Featurizing sample 0
[21:04:14] Explicit valence for atom # 2 O, 3, is greater than permitted
RDKit ERROR: [21:04:14] Explicit valence for atom # 2 O, 3, is greater than permitted
[21:04:17] Explicit valence for atom # 0 N, 4, is greater than permitted
RDKit ERROR: [21:04:17] Explicit valence for atom # 0 N, 4, is greater than permitted
Featurizing sample 1000
[21:04:30] Explicit valence for atom # 2 O, 3, is greater than permitted
RDKit ERROR: [21:04:30] Explicit valence for atom # 2 O, 3, is greater than permitted
[21:04:31] Explicit valence for atom # 0 N, 4, is greater than permitted
RDKit ERROR: [21:04:31] Explicit valence for atom # 0 N, 4, is greater than permitted
/home/benjamin/anaconda3/envs/test-env/lib/python2.7/site-packages/rdkit/Chem/PandasTools.py:302: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
lambda smiles: _MolPlusFingerprint(Chem.MolFromSmiles(smiles)))
GPCR_LigandClassify.py:221: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
drug_bank_df_selected_cols_featurized_filtered.dropna(inplace=True)
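These SettingWithCopyWarning messages are harmless here, but for reference they usually disappear if the slice is taken as an explicit copy before mutating it in place. A sketch with toy data:

```python
import pandas as pd

df = pd.DataFrame({"smiles": ["CCO", None], "name": ["ethanol", "unknown"]})

# Slicing and then mutating in place triggers SettingWithCopyWarning:
#   selected = df[["smiles", "name"]]
#   selected.dropna(subset=["smiles"], inplace=True)
# Taking an explicit copy makes ownership unambiguous:
selected = df[["smiles", "name"]].copy()
selected.dropna(subset=["smiles"], inplace=True)
print(len(selected))  # 1
```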
2022-10-07 21:04:34.752815: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
Loaded model from disk
dl_model_fp prediction made
/home/benjamin/anaconda3/envs/test-env/lib/python2.7/site-packages/sklearn/preprocessing/label.py:151: DeprecationWarning: The truth value of an empty array is ambiguous. Returning False, but in future this will result in an error. Use `array.size > 0` to check that an array is not empty.
if diff:
xgb_fp prediction made
Index([u'index', u'drugbank_id', u'name', u'smiles', u'pubchem_substance_id', u'drug_groups', u'prediction_class_dl_model_fp_prediction', u'prediction_class_dl_model_fp_prediction_proba', u'prediction_class_xgb_fp_prediction', u'prediction_class_xgb_fp_prediction_proba', u'Unnamed: 0', u'gpcr_name', u'first_seg', u'second_seg', u'gpcr_binding_encoded'], dtype='object')
Traceback (most recent call last):
File "GPCR_LigandClassify.py", line 308, in <module>
merged_predictions_selcols = merged_predictions_fullcols[['drugbank_id', 'name', u'pubchem_substance_id', 'drug_groups', 'prediction_class_dl_model_fp_prediction', 'prediction_class_dl_model_fp_prediction_proba', 'prediction_class_mlp_fp_prediction', 'prediction_class_mlp_fp_prediction_proba', 'prediction_class_xgb_fp_prediction', 'prediction_class_xgb_fp_prediction_proba', 'prediction_class_rfc_fp_prediction', 'prediction_class_rfc_fp_prediction_proba', 'prediction_class_svm_fp_prediction', 'prediction_class_svm_fp_prediction_proba', 'first_seg','gpcr_binding_encoded']]
File "/home/benjamin/anaconda3/envs/test-env/lib/python2.7/site-packages/pandas/core/frame.py", line 2133, in __getitem__
return self._getitem_array(key)
File "/home/benjamin/anaconda3/envs/test-env/lib/python2.7/site-packages/pandas/core/frame.py", line 2177, in _getitem_array
indexer = self.loc._convert_to_indexer(key, axis=1)
File "/home/benjamin/anaconda3/envs/test-env/lib/python2.7/site-packages/pandas/core/indexing.py", line 1269, in _convert_to_indexer
.format(mask=objarr[mask]))
KeyError: "[u'prediction_class_mlp_fp_prediction'\n u'prediction_class_mlp_fp_prediction_proba'\n u'prediction_class_rfc_fp_prediction'\n u'prediction_class_rfc_fp_prediction_proba'\n u'prediction_class_svm_fp_prediction'\n u'prediction_class_svm_fp_prediction_proba'] not in index"
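The KeyError is consistent with --ignore_rf_svm defaulting to True: the MLP/RFC/SVM prediction columns are never created, but the column selection at line 308 still asks for them. A defensive workaround (a hypothetical sketch, not the authors' code) is to select only the requested columns that actually exist:

```python
import pandas as pd

# Toy frame standing in for merged_predictions_fullcols; only the DL and
# XGB prediction columns exist when --ignore_rf_svm is True.
merged = pd.DataFrame({
    "drugbank_id": ["DB0001"],
    "prediction_class_dl_model_fp_prediction": ["5HT1A"],
    "prediction_class_xgb_fp_prediction": ["5HT1A"],
})

wanted = [
    "drugbank_id",
    "prediction_class_dl_model_fp_prediction",
    "prediction_class_mlp_fp_prediction",   # absent when MLP/RF/SVM skipped
    "prediction_class_xgb_fp_prediction",
]

# Keep only the requested columns that are present,
# instead of raising KeyError on the missing ones.
present = [c for c in wanted if c in merged.columns]
selected = merged[present]
print(list(selected.columns))
```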
Hey! I've been trying to run this script, but I am running into some trouble, which I suspect may be due to a mismatch in library versions.
* DeepChem 1.x (Requires RDKit)
* Pandas (Prediction is tested with Pandas 0.22)
* Tensorflow 1.3
* Keras
* XGBoost
* ScikitLearn
The libraries that I have installed (via pip) are:
I am running the script in a conda venv.
Thank you very much for your time. I appreciate that even with this repository inactive, you are still helping the community :)