rasbt / mlxtend

A library of extension and helper modules for Python's data analysis and machine learning libraries.
https://rasbt.github.io/mlxtend/

stacking_cv_regression.py should use X, y = check_X_y(X, y, accept_sparse=['csc', 'csr'], dtype=None) #562

Open tjhgit opened 5 years ago

tjhgit commented 5 years ago

Currently, stacking_cv_regression.py uses X, y = check_X_y(X, y, accept_sparse=['csc', 'csr']), which checks for numeric input. However, if the first-layer models perform the categorical encoding themselves, it is perfectly fine for X not to be numeric but rather an np.array of dtype object.

So please change the check to: X, y = check_X_y(X, y, accept_sparse=['csc', 'csr'], dtype=None)
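For illustration, a minimal sketch of the difference (the toy data below is made up): with the default dtype='numeric', check_X_y tries to coerce every entry to a number, whereas dtype=None preserves an object-dtype array for the level-1 models to handle.

from sklearn.utils import check_X_y
import numpy as np

# Hypothetical object-dtype feature matrix mixing raw categorical
# strings with numbers, before any encoding step
X = np.array([['red', 1.0], ['blue', 2.0], ['red', 3.0]], dtype=object)
y = np.array([0.5, 1.5, 2.5])

# Default dtype='numeric': a numeric conversion is attempted, which
# fails (or warns, depending on the scikit-learn version) on the strings
# check_X_y(X, y, accept_sparse=['csc', 'csr'])

# dtype=None: the dtype is preserved while the other checks still run
X_checked, y_checked = check_X_y(X, y, accept_sparse=['csc', 'csr'],
                                 dtype=None)
print(X_checked.dtype)  # object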

rasbt commented 5 years ago

Thanks for the note! I think the current implementation should work fine though:

from sklearn.utils import check_X_y
import numpy as np

# A string-valued (dtype '<U32') feature matrix and a numeric target
X = np.random.random((10, 2)).astype(str)
y = np.ones(10)
check_X_y(X, y, accept_sparse=['csc', 'csr'])

returns

(array([['0.9346316090185622', '0.16042444137263268'],
        ['0.8509128681888157', '0.04362697943384508'],
        ['0.9017565031657032', '0.3581573277097798'],
        ['0.17439346265779732', '0.5833740568279369'],
        ['0.12902336862557318', '0.06860327739703476'],
        ['0.8507041677034654', '0.35158528908301556'],
        ['0.6883309548232583', '0.42166588958658446'],
        ['0.6854723415869861', '0.3736871105678704'],
        ['0.8142315708984', '0.7559914053700448'],
        ['0.14472053378019012', '0.07451959739328617']], dtype='<U32'),
 array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]))

However, based on the FutureWarning:

/Users/sebastian/miniconda3/lib/python3.7/site-packages/sklearn/utils/validation.py:532: FutureWarning: Beginning in version 0.22, arrays of bytes/strings will be converted to decimal numbers if dtype='numeric'. It is recommended that you convert the array to a float dtype before using it in scikit-learn, for example by using your_array = your_array.astype(np.float64). FutureWarning)

It may stop working in the next release. What would your recommendation be for handling that type of input where you have object-dtype data? One could simply get rid of check_X_y, but maybe you have an alternative in mind?

EDIT: Oops, I overlooked your suggestion ... I thought dtype=None was already the default. It makes sense then to make that change.

pwl commented 5 years ago

Shouldn't it be check_X_y(X, y, accept_sparse=['csc', 'csr'], force_all_finite='allow-nan')? The NaNs in X should be handled by the models, right?

rasbt commented 5 years ago

Thanks for the note. Yes, the idea is that the models should take care of that. One might argue that the same is true for np.inf. So, we could change it to force_all_finite=False (but maybe add a user warning if there is an np.nan or np.inf in the input).
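A minimal sketch of what that could look like (the helper name _check_inputs is made up, not mlxtend's actual code):

import warnings
import numpy as np
from scipy import sparse
from sklearn.utils import check_X_y

def _check_inputs(X, y):
    # Accept NaN/inf but warn the user; missing-value handling is
    # left to the level-1 models
    X, y = check_X_y(X, y, accept_sparse=['csc', 'csr'],
                     dtype=None, force_all_finite=False)
    values = X.data if sparse.issparse(X) else X
    if np.issubdtype(np.asarray(values).dtype, np.floating) and \
            not np.isfinite(values).all():
        warnings.warn('X contains NaN or inf; these values are passed '
                      'through unchanged to the level-1 models.')
    return X, y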

pwl commented 5 years ago

Good point; best not to discard any kind of data a priori.

There's another issue: check_X_y also converts X and y to NumPy arrays when they are passed as a DataFrame or a Series, and I couldn't figure out an option to disable that behavior. In the end, the easiest fix for me for the time being was to comment out the check.
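For reference, a minimal reproduction of that conversion (toy column names):

import pandas as pd
from sklearn.utils import check_X_y

X_df = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [4.0, 5.0, 6.0]})
y_s = pd.Series([0.0, 1.0, 0.0])

X_checked, y_checked = check_X_y(X_df, y_s)
print(type(X_checked), type(y_checked))
# <class 'numpy.ndarray'> <class 'numpy.ndarray'>; column names are lost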

Also, the StackingClassifier has no such check at all; perhaps both stacking classifiers should have the same checks in place?

rasbt commented 5 years ago

Yeah, I think it may not be necessary to recast the inputs to NumPy arrays ... depending on the behavior of the estimators used within the stacking classifier.

Regarding the inconsistency between the StackingCVClassifier and the StackingClassifier: that's probably because they were implemented independently and have since been modified inconsistently with respect to each other. Some time ago we started unifying them in terms of (Python) class inheritance, but that hasn't been completed yet.

rchaves33 commented 4 years ago

> Thanks for the note. Yes, the idea is that the models should take care of that. One might argue that the same is true for np.inf. So, we could change it to force_all_finite=False (but maybe add a user warning if there is an np.nan or np.inf in the input).

I ran into this issue in stacking_cv_classification.py, which has the same check_X_y call. One of my level-1 classifiers is an XGBClassifier, which can handle NaNs in the X input, but since my features contain many NaNs, this check was failing and .fit was raising an unnecessary error. Adding the argument force_all_finite=False resolved the issue.
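A minimal sketch of that failure mode (toy data, not the actual feature matrix):

import numpy as np
from sklearn.utils import check_X_y

X = np.array([[1.0, np.nan], [2.0, 3.0]])
y = np.array([0, 1])

# The default settings reject the NaN with a ValueError:
# check_X_y(X, y, accept_sparse=['csc', 'csr'])

# With force_all_finite=False the NaN passes through, so level-1
# estimators that handle missing values natively (e.g. XGBoost) can fit
X_checked, y_checked = check_X_y(X, y, accept_sparse=['csc', 'csr'],
                                 force_all_finite=False)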

Are you open to a PR for this?

(Just curious: the recent scikit-learn implementation of StackingClassifier does not seem to call check_X_y at all. Perhaps it can be avoided altogether?)

rasbt commented 4 years ago

> Are you open to a PR for this?

Yes, I am open to it, though I think I already addressed this via #606. You are welcome to take another look if there are still issues.

rchaves33 commented 4 years ago

@rasbt Thanks for pointing me to that issue; I hadn't seen it, and it does resolve my error.

Unfortunately, in my firewalled environment I'm restricted to official pip releases, and the PR from #606 came after the current v0.17 on PyPI. Any idea when v0.18 will be released, or is there a possibility of releasing a v0.17.1 with the various changes since? That would be awesome.

As for this issue, it can probably be closed, unless @tjhgit wants to re-introduce the check_X_y call (but now with force_all_finite=False) for the sake of the other checks it performs. From my side, it's fine to remove it altogether, as it currently is in master.

rasbt commented 4 years ago

Just catching up with notifications: I made a 0.17.1 release ~2 weeks ago that should include that fix :).