Add new classifier: need of probability output?

mattvan83 commented 4 years ago

Hi @raamana ,

I would like to add the LinearSVC classifier, based on the liblinear implementation. Does the current implementation of neuropredict require predictions to be based on probability values? I ask because LinearSVC does not support probability prediction.
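
For context, a minimal sketch of the behaviour described here: scikit-learn's `LinearSVC` exposes `decision_function` but has no `predict_proba`, and wrapping it in `CalibratedClassifierCV` is one common way to obtain probability estimates. The toy data from `make_classification` and all variable names are illustrative assumptions; this says nothing about what neuropredict itself requires.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Toy binary problem (illustrative data, not from this thread).
X, y = make_classification(n_samples=100, n_features=5, random_state=0)

svc = LinearSVC(dual=False, random_state=0).fit(X, y)

# LinearSVC offers signed distances to the hyperplane, not probabilities.
has_proba = hasattr(svc, "predict_proba")
scores = svc.decision_function(X)

# One option: calibrate the decision scores into probabilities.
calibrated = CalibratedClassifierCV(LinearSVC(dual=False, random_state=0), cv=3)
calibrated.fit(X, y)
proba = calibrated.predict_proba(X)
```

Metrics such as ROC AUC only need a continuous score, so the raw `decision_function` output is often enough when calibrated probabilities are not required.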

raamana commented 4 years ago

I've been wanting to add Linear SVM and Naive Bayes to the supported classifiers; adding this to the to-do list :).

scikit-learn does provide probabilities for some classifiers. I could try to output them as well.

Thanks for your suggestions and bug reports.

mattvan83 commented 4 years ago

I tried to add Linear SVM in the algorithms.py code, then updated config_neuropredict.py to add the new classifier. However, when launching the command line I still got this error:

usage: neuropredict [-h] [-m META_FILE] [-o OUT_DIR] [-f FS_SUBJECT_DIR]
                    [-y PYRADIGM_PATHS [PYRADIGM_PATHS ...]]
                    [-u USER_FEATURE_PATHS [USER_FEATURE_PATHS ...]]
                    [-d DATA_MATRIX_PATHS [DATA_MATRIX_PATHS ...]]
                    [-a ARFF_PATHS [ARFF_PATHS ...]] [-p POSITIVE_CLASS]
                    [-t TRAIN_PERC] [-n NUM_REP_CV]
                    [-k NUM_FEATURES_TO_SELECT]
                    [-sg [SUB_GROUPS [SUB_GROUPS ...]]]
                    [-g {none,light,exhaustive}]
                    [-is {median,mean,most_frequent,raise}]
                    [-fs {selectkbest_mutual_info_classif,selectkbest_f_classif,variancethreshold}]
                    [-e {randomforestclassifier,extratreesclassifier,decisiontreeclassifier,svm,xgboost}]
                    [-z MAKE_VIS] [-c NUM_PROCS] [--po PRINT_OPT_DIR] [-v]
neuropredict: error: argument -e/--classifier: invalid choice: 'linearsvc' (choose from 'randomforestclassifier', 'extratreesclassifier', 'decisiontreeclassifier', 'svm', 'xgboost')

I have certainly missed one place, but where?
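
The message above is argparse's standard rejection of a value that is missing from an argument's `choices` list, so the list the parser is built from has to include the new name. A minimal sketch of the mechanism; `CLASSIFIER_CHOICES` and `build_parser` are hypothetical names, not neuropredict's actual code:

```python
import argparse

# Hypothetical list of accepted classifier names; adding 'linearsvc' here
# is what makes the parser accept it.
CLASSIFIER_CHOICES = ['randomforestclassifier', 'svm', 'linearsvc']


def build_parser(choices):
    """Build a parser whose -e/--classifier option is limited to `choices`."""
    parser = argparse.ArgumentParser(prog='demo')
    parser.add_argument('-e', '--classifier', choices=choices, default=choices[0])
    return parser


# Accepted, because 'linearsvc' is in the list the parser was built with.
args = build_parser(CLASSIFIER_CHOICES).parse_args(['-e', 'linearsvc'])

# Rejected: argparse prints "invalid choice" and raises SystemExit.
try:
    build_parser(['randomforestclassifier', 'svm']).parse_args(['-e', 'linearsvc'])
    rejected = False
except SystemExit:
    rejected = True
```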

mattvan83 commented 4 years ago

I have got further since the previous message. I managed to launch my own build of neuropredict with LinearSVC, but got the following error message, apparently linked to multiprocessing:

multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/homes_unix/mvanhoutte/Soft/anaconda3/envs/neuropredMV/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/homes_unix/mvanhoutte/Soft/anaconda3/envs/neuropredMV/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/netapp/vol1_homeunix/mvanhoutte/Soft/neuropredict/neuropredict/rhst.py", line 690, in holdout_trial_compare_datasets
    average='weighted')
  File "/homes_unix/mvanhoutte/Soft/anaconda3/envs/neuropredMV/lib/python3.6/site-packages/sklearn/metrics/ranking.py", line 355, in roc_auc_score
    sample_weight=sample_weight)
  File "/homes_unix/mvanhoutte/Soft/anaconda3/envs/neuropredMV/lib/python3.6/site-packages/sklearn/metrics/base.py", line 76, in _average_binary_score
    return binary_metric(y_true, y_score, sample_weight=sample_weight)
  File "/homes_unix/mvanhoutte/Soft/anaconda3/envs/neuropredMV/lib/python3.6/site-packages/sklearn/metrics/ranking.py", line 327, in _binary_roc_auc_score
    sample_weight=sample_weight)
  File "/homes_unix/mvanhoutte/Soft/anaconda3/envs/neuropredMV/lib/python3.6/site-packages/sklearn/metrics/ranking.py", line 622, in roc_curve
    y_true, y_score, pos_label=pos_label, sample_weight=sample_weight)
  File "/homes_unix/mvanhoutte/Soft/anaconda3/envs/neuropredMV/lib/python3.6/site-packages/sklearn/metrics/ranking.py", line 402, in _binary_clf_curve
    assert_all_finite(y_score)
  File "/homes_unix/mvanhoutte/Soft/anaconda3/envs/neuropredMV/lib/python3.6/site-packages/sklearn/utils/validation.py", line 72, in assert_all_finite
    _assert_all_finite(X.data if sp.issparse(X) else X, allow_nan)
  File "/homes_unix/mvanhoutte/Soft/anaconda3/envs/neuropredMV/lib/python3.6/site-packages/sklearn/utils/validation.py", line 56, in _assert_all_finite
    raise ValueError(msg_err.format(type_err, X.dtype))
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/homes_unix/mvanhoutte/Soft/anaconda3/envs/neuropredMV/bin/neuropredict", line 11, in <module>
    load_entry_point('neuropredict', 'console_scripts', 'neuropredict')()
  File "/netapp/vol1_homeunix/mvanhoutte/Soft/neuropredict/neuropredict/__main__.py", line 11, in main
    run_workflow.cli()
  File "/netapp/vol1_homeunix/mvanhoutte/Soft/neuropredict/neuropredict/run_workflow.py", line 1049, in cli
    grid_search_level, classifier, feat_select_method)
  File "/netapp/vol1_homeunix/mvanhoutte/Soft/neuropredict/neuropredict/run_workflow.py", line 1024, in prepare_and_run
    options_path=options_path)
  File "/netapp/vol1_homeunix/mvanhoutte/Soft/neuropredict/neuropredict/rhst.py", line 422, in run
    cv_results = pool.map(partial_func_holdout, range(num_repetitions))
  File "/homes_unix/mvanhoutte/Soft/anaconda3/envs/neuropredMV/lib/python3.6/multiprocessing/pool.py", line 288, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/homes_unix/mvanhoutte/Soft/anaconda3/envs/neuropredMV/lib/python3.6/multiprocessing/pool.py", line 670, in get
    raise self._value
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

The output stopped at the parallelization step:


Python 3.6.7
SGE recognized, job set up with 35 slots.
Running neuropredict 0.5+34.g220af55.dirty

Requested features for analysis:
get_data_matrix from /netapp/vol2_agewell/pro/IMAP/imap_mvh/CAT12/pet/Analyses/ML/All/CN_vs_MCI_vs_AD/fdg/pons/metaROI.csv
get_data_matrix from /netapp/vol2_agewell/pro/IMAP/imap_mvh/CAT12/pet/Analyses/ML/All/CN_vs_MCI_vs_AD/fdg/pons/metaROI_split.csv
get_data_matrix from /netapp/vol2_agewell/pro/IMAP/imap_mvh/CAT12/pet/Analyses/ML/All/CN_vs_MCI_vs_AD/fdg/pons/HCP_parcellation.csv
Ignoring imputation strategy chosen, as no missing data were found!

Data import is done.

Requested processing for the following subgroups:
CN,MCI
CN,AD
MCI,AD

--------------------------------------------------------------------------------
Processing subgroup : CN,MCI (1/3)
--------------------------------------------------------------------------------
SGE recognized, job set up with 35 slots.
Training percentage      : 0.8
Number of CV repetitions : 250
Classifier chosen        : linearsvc
Feature selection chosen : variancethreshold
Level of grid search     : exhaustive
Number of processors     : 35
Saving the results to 
  /netapp/vol2_agewell/pro/IMAP/imap_mvh/CAT12/pet/Analyses/ML/All/CN_vs_MCI_vs_AD/fdg/pons/linearsvc/binary/CN_MCI

-------------------------
All datasets contain:

86 samples, 2 classes, 2 features
Class  CN : 71 samples
Class MCI : 15 samples
-------------------------

Estimated chance accuracy : 0.500

Different classes in the training set are stratified to match the smallest class!
Parallelizing the repetitions of CV with 35 processes ...

Do you have any idea what causes this?

mattvan83 commented 4 years ago

The error is linked to the LinearSVC classifier from liblinear. If I use SVC(kernel='linear') from libsvm, it works!
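
Since the traceback ends in `roc_auc_score` rejecting non-finite scores, one way to narrow this down is to inspect what each classifier hands to the metric. A minimal sketch on toy data (the data and variable names are assumptions, and this does not reproduce the failure above): `LinearSVC` provides `decision_function` scores, while `SVC(kernel='linear', probability=True)` also provides `predict_proba`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC, LinearSVC

# Toy binary problem (illustrative data, not from this thread).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# liblinear-based model: continuous scores come from decision_function.
lin = LinearSVC(dual=False, random_state=0).fit(X_tr, y_tr)
lin_scores = lin.decision_function(X_te)

# libsvm-based model: probabilities are available when probability=True.
svc = SVC(kernel='linear', probability=True, random_state=0).fit(X_tr, y_tr)
svc_scores = svc.predict_proba(X_te)[:, 1]

# roc_auc_score raises ValueError on NaN/Inf, so check the scores first.
for name, scores in [('LinearSVC', lin_scores), ('SVC', svc_scores)]:
    assert np.all(np.isfinite(scores)), f'{name} produced non-finite scores'

auc_lin = roc_auc_score(y_te, lin_scores)
auc_svc = roc_auc_score(y_te, svc_scores)
```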

raamana commented 4 years ago

Congrats on being able to customize your own version of neuropredict. This is awesome! Great job.

It appears the error has to do with: ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

Can you check your input CSV files to ensure there are no NaNs, Infs, missing values, etc.?
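
A check along these lines could look like the sketch below. It assumes a plain comma-separated file whose first row is a header and whose first column is a sample ID, which may not match the actual layout of the files above; `find_bad_cells` is a hypothetical helper.

```python
import csv
import io
import math


def find_bad_cells(handle, id_columns=1):
    """Return (row, column, value) for cells that are empty, non-numeric,
    NaN or infinite. Assumes one header row and `id_columns` leading
    non-numeric columns (both are assumptions about the file layout)."""
    reader = csv.reader(handle)
    header = next(reader)
    bad = []
    for row_num, row in enumerate(reader, start=2):  # line 1 is the header
        for name, cell in zip(header[id_columns:], row[id_columns:]):
            try:
                ok = math.isfinite(float(cell))
            except ValueError:
                ok = False  # empty or non-numeric cell
            if not ok:
                bad.append((row_num, name, cell))
    return bad


# Small in-memory example; pass open('your_file.csv', newline='') for a real file.
sample = io.StringIO("id,roi1,roi2\ns1,0.5,1.2\ns2,nan,0.7\ns3,0.9,\ns4,inf,1.0\n")
bad_cells = find_bad_cells(sample)
```

An empty list means every numeric cell parsed to a finite value.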

raamana commented 4 years ago

Also, can you push the changes you made to your fork, so I can take a look and see if there are any potential mistakes there?

mattvan83 commented 4 years ago

I have checked my input CSV files and there are no NaNs, Infs or missing values.

I will push the changes I've made to my fork. I have made the following changes to algorithms.py and config_neuropredict.py:

Add LinearSVC (liblinear or libsvm) and LogisticRegression
Add train_class_sizes as an argument of clf_builder, to choose whether to enable dual optimization according to the number of features relative to the number of training samples
Add a random_state parameter to each clf_builder to fix the reproducibility problem
Add these new classifiers to the defaults in the neuropredict configuration file
Add these new classifiers to the list in the feature importance function
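
The `dual` choice in the second item can be sketched as follows. scikit-learn's documentation recommends `dual=False` when there are more samples than features; `build_linear_svc` and its signature are hypothetical and are not the `clf_builder` code from the fork.

```python
from sklearn.svm import LinearSVC


def build_linear_svc(n_train_samples, n_features, random_state=None):
    """Hypothetical builder: solve the primal problem when samples outnumber
    features, otherwise the dual; fix random_state for reproducibility."""
    use_dual = n_train_samples <= n_features
    return LinearSVC(dual=use_dual, random_state=random_state)


# Illustrative sizes: many more training samples than features.
clf = build_linear_svc(n_train_samples=68, n_features=2, random_state=42)
```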

You can see where I introduced the liblinear-based LinearSVC, the one that failed, at lines 451 to 455.

The libsvm-based linear SVC and LogisticRegression worked, but the feature importance plots were empty. Any suggestions?

raamana commented 4 years ago

Hi Matt,

The upcoming version (#51, https://github.com/raamana/neuropredict/pull/51) should solve many of the issues you identified. Thanks for the feedback, testing and usage; I appreciate it.

Happy holidays! :)

mattvan83 commented 4 years ago

Hi Pradeep,

That is great news! Thanks for the update.

Happy holidays :)

Matthieu


mattvan83 commented 4 years ago

Hi Pradeep,

After running a linear SVC with neuropredict, is there a way to recover, from the saved .pickle files, the vector orthogonal to the optimal margin hyperplane (the weights associated with each feature)?

Thanks for your help.

Best regards, Matthieu


raamana commented 4 years ago

The feature importance data saved by neuropredict is very similar to that (if I understand you correctly). Take a look at the CSV output files and the PDF plot.
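
For reference, in scikit-learn a fitted linear model stores the vector normal to the separating hyperplane in its `coef_` attribute, and that attribute survives pickling. A minimal sketch on toy data; whether neuropredict's saved .pickle files contain a fitted estimator at all is an assumption not confirmed in this thread.

```python
import pickle

import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Toy binary problem (illustrative data, not from this thread).
X, y = make_classification(n_samples=100, n_features=4, random_state=0)
clf = LinearSVC(dual=False, random_state=0).fit(X, y)

# Round-trip through pickle, as one would when reloading a saved model.
restored = pickle.loads(pickle.dumps(clf))

# For a binary problem coef_ has shape (1, n_features): one weight per feature.
weights = restored.coef_.ravel()
bias = restored.intercept_[0]

# The decision function is the affine map X @ weights + bias.
manual_scores = X @ weights + bias
```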

raamana commented 4 years ago

Hi @mattvan83, if you are still working on this, give the latest version a try and let me know if your problems haven't been resolved. I'll close this for now, and let's start a new issue if that doesn't work.

raamana / neuropredict

Add new classifier: need of probability output? #48