Closed mattvan83 closed 4 years ago
I've been wanting to add Linear SVM and Naive Bayes to the supported classifiers; adding this to the todo list :).
scikit-learn does provide probabilities for some classifiers, so I could try to output them as well.
Thanks for your suggestions and bug reports.
I tried to add Linear SVM in the algorithms.py code, then updated config_neuropredict.py to add the new classifier. However, when launching the command line I still got the error:
usage: neuropredict [-h] [-m META_FILE] [-o OUT_DIR] [-f FS_SUBJECT_DIR]
[-y PYRADIGM_PATHS [PYRADIGM_PATHS ...]]
[-u USER_FEATURE_PATHS [USER_FEATURE_PATHS ...]]
[-d DATA_MATRIX_PATHS [DATA_MATRIX_PATHS ...]]
[-a ARFF_PATHS [ARFF_PATHS ...]] [-p POSITIVE_CLASS]
[-t TRAIN_PERC] [-n NUM_REP_CV]
[-k NUM_FEATURES_TO_SELECT]
[-sg [SUB_GROUPS [SUB_GROUPS ...]]]
[-g {none,light,exhaustive}]
[-is {median,mean,most_frequent,raise}]
[-fs {selectkbest_mutual_info_classif,selectkbest_f_classif,variancethreshold}]
[-e {randomforestclassifier,extratreesclassifier,decisiontreeclassifier,svm,xgboost}]
[-z MAKE_VIS] [-c NUM_PROCS] [--po PRINT_OPT_DIR] [-v]
neuropredict: error: argument -e/--classifier: invalid choice: 'linearsvc' (choose from 'randomforestclassifier', 'extratreesclassifier', 'decisiontreeclassifier', 'svm', 'xgboost')
I have certainly missed one place but where?
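For context, the "invalid choice" message is produced by argparse itself: an option declared with a fixed `choices` list rejects any value that is not in that list, whether or not the rest of the code already knows how to build the classifier. The sketch below reproduces the behaviour with a hypothetical parser; `CLASSIFIER_CHOICES`, `make_parser` and `accepts` are illustrative names, not neuropredict's actual code, and where the real list lives in neuropredict is not shown in this thread.

```python
import argparse

# Hypothetical stand-in for the list of supported classifier names
# (copied from the usage message above; the real list may live elsewhere).
CLASSIFIER_CHOICES = ['randomforestclassifier', 'extratreesclassifier',
                      'decisiontreeclassifier', 'svm', 'xgboost']


def make_parser(choices):
    """Build a tiny parser whose -e option only accepts the given names."""
    parser = argparse.ArgumentParser(prog='neuropredict-sketch')
    parser.add_argument('-e', '--classifier', choices=choices,
                        default=choices[0])
    return parser


def accepts(choices, name):
    """Return True if the parser accepts `name` for -e, False otherwise."""
    try:
        make_parser(choices).parse_args(['-e', name])
        return True
    except SystemExit:
        # argparse prints "invalid choice" to stderr and exits on error.
        return False


rejected_before = not accepts(CLASSIFIER_CHOICES, 'linearsvc')
accepted_after = accepts(CLASSIFIER_CHOICES + ['linearsvc'], 'linearsvc')
```

So the place to look is whichever list is passed as `choices` for the `-e/--classifier` option.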
I have made some progress since the previous message: I managed to launch my own build of neuropredict with LinearSVC, but got the following error message, apparently linked to multiprocessing:
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/homes_unix/mvanhoutte/Soft/anaconda3/envs/neuropredMV/lib/python3.6/multiprocessing/pool.py", line 119, in worker
result = (True, func(*args, **kwds))
File "/homes_unix/mvanhoutte/Soft/anaconda3/envs/neuropredMV/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
return list(map(*args))
File "/netapp/vol1_homeunix/mvanhoutte/Soft/neuropredict/neuropredict/rhst.py", line 690, in holdout_trial_compare_datasets
average='weighted')
File "/homes_unix/mvanhoutte/Soft/anaconda3/envs/neuropredMV/lib/python3.6/site-packages/sklearn/metrics/ranking.py", line 355, in roc_auc_score
sample_weight=sample_weight)
File "/homes_unix/mvanhoutte/Soft/anaconda3/envs/neuropredMV/lib/python3.6/site-packages/sklearn/metrics/base.py", line 76, in _average_binary_score
return binary_metric(y_true, y_score, sample_weight=sample_weight)
File "/homes_unix/mvanhoutte/Soft/anaconda3/envs/neuropredMV/lib/python3.6/site-packages/sklearn/metrics/ranking.py", line 327, in _binary_roc_auc_score
sample_weight=sample_weight)
File "/homes_unix/mvanhoutte/Soft/anaconda3/envs/neuropredMV/lib/python3.6/site-packages/sklearn/metrics/ranking.py", line 622, in roc_curve
y_true, y_score, pos_label=pos_label, sample_weight=sample_weight)
File "/homes_unix/mvanhoutte/Soft/anaconda3/envs/neuropredMV/lib/python3.6/site-packages/sklearn/metrics/ranking.py", line 402, in _binary_clf_curve
assert_all_finite(y_score)
File "/homes_unix/mvanhoutte/Soft/anaconda3/envs/neuropredMV/lib/python3.6/site-packages/sklearn/utils/validation.py", line 72, in assert_all_finite
_assert_all_finite(X.data if sp.issparse(X) else X, allow_nan)
File "/homes_unix/mvanhoutte/Soft/anaconda3/envs/neuropredMV/lib/python3.6/site-packages/sklearn/utils/validation.py", line 56, in _assert_all_finite
raise ValueError(msg_err.format(type_err, X.dtype))
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/homes_unix/mvanhoutte/Soft/anaconda3/envs/neuropredMV/bin/neuropredict", line 11, in <module>
load_entry_point('neuropredict', 'console_scripts', 'neuropredict')()
File "/netapp/vol1_homeunix/mvanhoutte/Soft/neuropredict/neuropredict/__main__.py", line 11, in main
run_workflow.cli()
File "/netapp/vol1_homeunix/mvanhoutte/Soft/neuropredict/neuropredict/run_workflow.py", line 1049, in cli
grid_search_level, classifier, feat_select_method)
File "/netapp/vol1_homeunix/mvanhoutte/Soft/neuropredict/neuropredict/run_workflow.py", line 1024, in prepare_and_run
options_path=options_path)
File "/netapp/vol1_homeunix/mvanhoutte/Soft/neuropredict/neuropredict/rhst.py", line 422, in run
cv_results = pool.map(partial_func_holdout, range(num_repetitions))
File "/homes_unix/mvanhoutte/Soft/anaconda3/envs/neuropredMV/lib/python3.6/multiprocessing/pool.py", line 288, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "/homes_unix/mvanhoutte/Soft/anaconda3/envs/neuropredMV/lib/python3.6/multiprocessing/pool.py", line 670, in get
raise self._value
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
The output stalled at the parallelization step:
Python 3.6.7
SGE recognized, job set up with 35 slots.
Running neuropredict 0.5+34.g220af55.dirty
Requested features for analysis:
get_data_matrix from /netapp/vol2_agewell/pro/IMAP/imap_mvh/CAT12/pet/Analyses/ML/All/CN_vs_MCI_vs_AD/fdg/pons/metaROI.csv
get_data_matrix from /netapp/vol2_agewell/pro/IMAP/imap_mvh/CAT12/pet/Analyses/ML/All/CN_vs_MCI_vs_AD/fdg/pons/metaROI_split.csv
get_data_matrix from /netapp/vol2_agewell/pro/IMAP/imap_mvh/CAT12/pet/Analyses/ML/All/CN_vs_MCI_vs_AD/fdg/pons/HCP_parcellation.csv
Ignoring imputation strategy chosen, as no missing data were found!
Data import is done.
Requested processing for the following subgroups:
CN,MCI
CN,AD
MCI,AD
--------------------------------------------------------------------------------
Processing subgroup : CN,MCI (1/3)
--------------------------------------------------------------------------------
SGE recognized, job set up with 35 slots.
Training percentage : 0.8
Number of CV repetitions : 250
Classifier chosen : linearsvc
Feature selection chosen : variancethreshold
Level of grid search : exhaustive
Number of processors : 35
Saving the results to
/netapp/vol2_agewell/pro/IMAP/imap_mvh/CAT12/pet/Analyses/ML/All/CN_vs_MCI_vs_AD/fdg/pons/linearsvc/binary/CN_MCI
-------------------------
All datasets contain:
86 samples, 2 classes, 2 features
Class CN : 71 samples
Class MCI : 15 samples
-------------------------
Estimated chance accuracy : 0.500
Different classes in the training set are stratified to match the smallest class!
Parallelizing the repetitions of CV with 35 processes ...
Do you have an idea about that?
The error is linked to the LinearSVC classifier from liblinear. If I use SVC(kernel='linear') from libsvm, it works!
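One way to narrow this down outside neuropredict is to fit both classifiers on the same split and check that the scores handed to `roc_auc_score` are finite, since the traceback shows that function rejecting its `y_score` input. This is only a sketch on synthetic data: the dataset, the split and the helper `checked_auc` are assumptions for illustration, and it does not reproduce what rhst.py actually passes to the metric.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC, LinearSVC

# Synthetic two-class data with the same shape as in the log (86 x 2).
X, y = make_classification(n_samples=86, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, stratify=y, random_state=0)


def checked_auc(clf):
    """Fit clf, verify its decision scores are finite, then compute AUC."""
    clf.fit(X_train, y_train)
    scores = clf.decision_function(X_test)
    if not np.all(np.isfinite(scores)):
        raise ValueError("classifier produced non-finite decision scores")
    return roc_auc_score(y_test, scores)


auc_liblinear = checked_auc(LinearSVC(max_iter=10000))
auc_libsvm = checked_auc(SVC(kernel="linear"))
```

If the check raises for one classifier only, the problem is in the scores that classifier produces rather than in the input files.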
Congrats on being able to customize your own version of neuropredict. This is awesome! Great job.
It appears the error has to do with: ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
Can you check your input CSV files to ensure there are no NaNs, Infs, missing values, etc.?
Also, can you push the changes you made to your fork, so I can take a look and see if there are any potential mistakes there?
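A quick check of this kind can be sketched with NumPy. The helper name `report_non_finite` is hypothetical, and the sketch assumes a purely numeric, comma-separated file; a real feature file with a header row or an ID column would need `skip_header` or `usecols` adjusted accordingly.

```python
import io

import numpy as np


def report_non_finite(csv_source, delimiter=','):
    """Return the (row, column) indices of NaN, Inf or missing entries."""
    # genfromtxt fills empty fields with NaN, so missing values are caught too.
    data = np.atleast_2d(np.genfromtxt(csv_source, delimiter=delimiter))
    return np.argwhere(~np.isfinite(data))


# In-memory example with one missing value and one infinity.
demo = io.StringIO("1.0,2.0,3.0\n4.0,,6.0\ninf,8.0,9.0\n")
bad_cells = report_non_finite(demo)
```

An empty result means every cell parsed to a finite number.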
I have checked my input CSV files and there are no NaNs, Infs or missing values.
I will push the changes I've made to my fork. I have made the following changes to algorithms.py and config_neuropredict.py: train_class_sizes is now an argument of clf_builder, so that dual optimization can be switched on or off depending on the number of features relative to the number of training samples, and clf_builder was also changed to solve a reproducibility problem. You can check the introduction of LinearSVC, based on the liblinear implementation, which failed, at lines 451 to 455.
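The dual/primal choice described above can be sketched as follows. The scikit-learn documentation recommends `dual=False` when there are more samples than features; the helper `build_linear_svc` and its arguments are hypothetical, not the actual `clf_builder` code from the fork, and passing `random_state` is only an assumption about how the reproducibility problem might be addressed.

```python
from sklearn.svm import LinearSVC


def build_linear_svc(n_train_samples, n_features, random_state=None):
    """Pick the primal or dual liblinear formulation from the data shape."""
    # Solve the dual problem only when features outnumber training samples.
    use_dual = n_train_samples <= n_features
    return LinearSVC(dual=use_dual, random_state=random_state, max_iter=10000)


# Illustrative sizes only: few features vs. many features.
primal_clf = build_linear_svc(n_train_samples=30, n_features=2, random_state=0)
dual_clf = build_linear_svc(n_train_samples=30, n_features=100, random_state=0)
```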
The libsvm linear SVC and LogisticRegression worked, but the feature-importance plots were empty. Any suggestions?
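One possible explanation, offered as an assumption rather than a diagnosis of neuropredict: in scikit-learn, tree-based classifiers expose `feature_importances_`, whereas linear models expose `coef_` instead, so code that reads only the former finds nothing for a linear classifier. A hypothetical helper that handles both cases might look like this:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression


def importance_scores(fitted_clf):
    """Return one score per feature, or None if the model exposes neither."""
    if hasattr(fitted_clf, 'feature_importances_'):
        return np.asarray(fitted_clf.feature_importances_)
    if hasattr(fitted_clf, 'coef_'):
        # Absolute weights, averaged over the rows of coef_.
        return np.abs(fitted_clf.coef_).mean(axis=0)
    return None


X, y = make_classification(n_samples=60, n_features=4, random_state=0)
forest_scores = importance_scores(
    RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y))
logreg_scores = importance_scores(LogisticRegression().fit(X, y))
```

Using absolute coefficients as an importance measure is itself a design choice, and it is only meaningful when the features are on comparable scales.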
Hi Matt,
the upcoming version (#51, https://github.com/raamana/neuropredict/pull/51) should solve many of the issues you identified. Thanks for the feedback, testing and usage; I appreciate it.
Happy holidays! :)
Hi Pradeep,
That is great news! Thanks for the update.
Happy holidays :)
Matthieu
(Reply sent by email to Pradeep Reddy Raamana's message of Mon, 16 Dec 2019, 18:34.)
Hi Pradeep,
After running a linear SVC with neuropredict, is there a way to use the saved .pickle files to recover the vector orthogonal to the optimal-margin hyperplane (the weights associated with each feature)?
Thanks for helping.
Best regards, Matthieu
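In plain scikit-learn, a fitted linear SVC stores that vector in `coef_` and the offset in `intercept_`, and both survive a pickle round trip. The sketch below uses a toy dataset and pickles the estimator directly; how neuropredict lays out its own .pickle files, and whether they contain the fitted estimator at all, is not shown in this thread, so that part is an assumption.

```python
import pickle

import numpy as np
from sklearn.svm import SVC

# Toy data separable along the first feature.
X = np.array([[0., 0.], [0., 1.], [2., 0.], [2., 1.]])
y = np.array([0, 0, 1, 1])

clf = SVC(kernel='linear').fit(X, y)

# Round trip through pickle, standing in for a saved results file.
restored = pickle.loads(pickle.dumps(clf))

w = restored.coef_.ravel()           # one weight per feature
b = restored.intercept_[0]           # offset of the hyperplane
unit_normal = w / np.linalg.norm(w)  # direction orthogonal to the hyperplane
```

For a binary linear SVC, `X @ w + b` reproduces `decision_function(X)`, which is a quick way to confirm that the right vector was recovered.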
The feature importance data saved by neuropredict is very similar to that (if I understand you correctly); take a look at the CSV output files and the PDF plot.
Hi @mattvan83, if you are still working on this, give the latest version a try and let me know if your problems haven't been resolved. I'll close this for now, and let's start a new issue if that doesn't work.
Hi @raamana ,
I would like to add a LinearSVC classifier based on the liblinear implementation. Does the current implementation of neuropredict require predictions to be based on probability values? I ask because LinearSVC does not support predicting probabilities.
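For reference, the missing method can be confirmed directly, and scikit-learn offers `CalibratedClassifierCV` as one way to obtain probability estimates from a classifier that only provides `decision_function`. This is a sketch on synthetic data; whether neuropredict needs probabilities, and whether calibration would suit its workflow, is left open here.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=100, n_features=5, random_state=0)

base = LinearSVC(max_iter=10000)
has_proba = hasattr(base, 'predict_proba')  # LinearSVC has no such method

# Wrap the classifier so its decision scores are mapped to probabilities
# using internal cross-validation (3 folds here, an arbitrary choice).
calibrated = CalibratedClassifierCV(base, cv=3).fit(X, y)
proba = calibrated.predict_proba(X)
```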