tjhgit opened this issue 5 years ago
Thanks for the note! I think the current implementation should work fine though:
```python
from sklearn.utils import check_X_y
import numpy as np

X = np.random.random((10, 2)).astype(str)
y = np.ones(10)
check_X_y(X, y, accept_sparse=['csc', 'csr'])
```
returns
```
(array([['0.9346316090185622', '0.16042444137263268'],
        ['0.8509128681888157', '0.04362697943384508'],
        ['0.9017565031657032', '0.3581573277097798'],
        ['0.17439346265779732', '0.5833740568279369'],
        ['0.12902336862557318', '0.06860327739703476'],
        ['0.8507041677034654', '0.35158528908301556'],
        ['0.6883309548232583', '0.42166588958658446'],
        ['0.6854723415869861', '0.3736871105678704'],
        ['0.8142315708984', '0.7559914053700448'],
        ['0.14472053378019012', '0.07451959739328617']], dtype='<U32'),
 array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]))
```
However, based on the `FutureWarning`:

```
/Users/sebastian/miniconda3/lib/python3.7/site-packages/sklearn/utils/validation.py:532: FutureWarning: Beginning in version 0.22, arrays of bytes/strings will be converted to decimal numbers if dtype='numeric'. It is recommended that you convert the array to a float dtype before using it in scikit-learn, for example by using your_array = your_array.astype(np.float64).
  FutureWarning)
```
it may stop working in the next release. What would your recommendation be for handling that type of input where you have object-dtype data? One could simply get rid of `check_X_y`, but maybe you have an alternative in mind?
EDIT: Ooops, I overlooked your suggestion ... I thought dtype=None
was already the default. Makes sense then to make that change.
Shouldn't it be `check_X_y(X, y, accept_sparse=['csc', 'csr'], force_all_finite='allow-nan')`? The NaNs in `X` should be handled by the models, right?
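A minimal sketch of the difference (assuming scikit-learn and NumPy are installed; note that newer scikit-learn versions rename `force_all_finite` to `ensure_all_finite`):

```python
import numpy as np
from sklearn.utils import check_X_y

X = np.array([[1.0, np.nan], [3.0, 4.0]])
y = np.array([0, 1])

# The default (force_all_finite=True) rejects NaN with a ValueError:
try:
    check_X_y(X, y, accept_sparse=['csc', 'csr'])
    raise AssertionError("expected a ValueError for NaN input")
except ValueError:
    pass

# 'allow-nan' lets NaN pass through to the estimators (np.inf is still rejected):
X_checked, y_checked = check_X_y(
    X, y, accept_sparse=['csc', 'csr'], force_all_finite='allow-nan'
)
print(np.isnan(X_checked).sum())  # the NaN survives validation
```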
Thanks for the note. Yes, the idea is that the models should take care of that. One might argue that this is also true for `np.inf`. So, we could change it to `force_all_finite=False` (but maybe add a user warning if there is an `np.nan` or `np.inf` in the input).
Good point, best not to discard any kind of data a priori.
There's another issue: somehow `check_X_y` also converts `X` and `y` to NumPy arrays when passing a DataFrame or a Series, and I couldn't figure out an option to disable this behavior. In the end, the easiest fix for me for the time being was to comment out the check.
Also, the `StackingClassifier` has no check whatsoever; perhaps both stacking classifiers should have the same checks in place?
Yeah, I think it may not be necessary to recast the inputs to NumPy arrays ... depending on the behavior of the estimators used within the stacking classifier.
Regarding the inconsistency between the StackingCVClassifier and the StackingClassifier, that's probably because they were implemented independently and have been modified inconsistently with respect to each other. Some time ago, we tried to unify them in terms of (Python) class inheritance but haven't done so yet.
I ran into this issue in stacking_cv_classification.py, which has the same `check_X_y` call. One of my level-1 classifiers is an `XGBClassifier`, which can handle NaNs in the `X` input, but since my features contain many NaNs, this check was failing and `.fit` was raising an unnecessary error. Adding the arg `force_all_finite=False` resolved the issue.
Are you open to a PR for this?
(Just curious: the recent sklearn implementation, `StackingClassifier`, does not seem to have a call to `check_X_y` at all. Perhaps it can be avoided altogether?)
Are you open to a PR for this?
Yes, I am open to it. I think I already addressed that via #606 though, but you are welcome to take another look if there are still issues.
@rasbt Thanks for pointing me to that issue; I hadn't seen it, and it does resolve my error.
Unfortunately, in my firewalled env I'm restricted to using official pip releases. The MR from #606 came after the current v0.17 on PyPI. Any idea when v0.18 will be released, or is there a possibility of releasing v0.17.1 with the various changes since? That would be awesome.
As for this issue, it can probably be closed, unless @tjhgit wants to re-introduce the `check_X_y` call (but now with `force_all_finite=False`) for the other checks it performs. From my side, it's fine to remove it altogether, as it currently is in master.
Just catching up with notifications: I made a 0.17.1 version ~2 weeks back that should have that fix :).
Currently, stacking_cv_regression.py uses

```python
X, y = check_X_y(X, y, accept_sparse=['csc', 'csr'])
```

which checks for numeric input. However, if the first-layer models perform a categorical encoding, it is perfectly fine not to have numeric input but rather an np.array of dtype object.
So please change the check to:

```python
X, y = check_X_y(X, y, accept_sparse=['csc', 'csr'], dtype=None)
```
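A short sketch of why `dtype=None` matters here (assuming scikit-learn and NumPy are installed): with the default `dtype='numeric'`, `check_X_y` tries to coerce string/object input to floats and fails, while `dtype=None` leaves the dtype untouched.

```python
import numpy as np
from sklearn.utils import check_X_y

X = np.array([["red", "small"], ["blue", "large"]], dtype=object)
y = np.array([0, 1])

# dtype=None preserves the object dtype instead of coercing to float:
X_checked, y_checked = check_X_y(X, y, accept_sparse=['csc', 'csr'], dtype=None)
print(X_checked.dtype)  # object
```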