@martinskogholt I believe this is something that can already be accomplished by combining py-earth with scikit-learn. Specifically, an Earth model with enable_pruning=False should be placed in a Pipeline with a cross-validating feature selector of some kind. From the documentation, it looks like RFECV can be made equivalent to a cross-validated version of the pruning pass with the correct choice of arguments.
I'm going to close this issue, but please comment on whether the above solution meets your needs, and if it doesn't, please reopen. Also, feel free to ask further questions in case the above explanation is unclear or incomplete.
@jcrudy
@martinskogholt and I are working on this.
I have tried using the following approach:
from pyearth import Earth
from sklearn.datasets.samples_generator import make_regression
from sklearn.feature_selection import RFECV

X, y = make_regression(n_features=2)

model = RFECV(
    estimator=Earth(
        enable_pruning=False,
    ),
    cv=3,
)
model.fit(X, y)
This gives me the following error:
IndexError Traceback (most recent call last)
<ipython-input-79-5e605f1006f7> in <module>
12 )
13
---> 14 model.fit(X, y)
~/.pyenv/versions/3.5.6/lib/python3.5/site-packages/sklearn/feature_selection/rfe.py in fit(self, X, y, groups)
512 scores = parallel(
513 func(rfe, self.estimator, X, y, train, test, scorer)
--> 514 for train, test in cv.split(X, y, groups))
515
516 scores = np.sum(scores, axis=0)
~/.pyenv/versions/3.5.6/lib/python3.5/site-packages/sklearn/feature_selection/rfe.py in <genexpr>(.0)
512 scores = parallel(
513 func(rfe, self.estimator, X, y, train, test, scorer)
--> 514 for train, test in cv.split(X, y, groups))
515
516 scores = np.sum(scores, axis=0)
~/.pyenv/versions/3.5.6/lib/python3.5/site-packages/sklearn/feature_selection/rfe.py in _rfe_single_fit(rfe, estimator, X, y, train, test, scorer)
30 X_test, y_test = _safe_split(estimator, X, y, test, train)
31 return rfe._fit(
---> 32 X_train, y_train, lambda estimator, features:
33 _score(estimator, X_test[:, features], y_test, scorer)).scores_
34
~/.pyenv/versions/3.5.6/lib/python3.5/site-packages/sklearn/feature_selection/rfe.py in _fit(self, X, y, step_score)
206 if step_score:
207 self.scores_.append(step_score(estimator, features))
--> 208 support_[features[ranks][:threshold]] = False
209 ranking_[np.logical_not(support_)] += 1
210
IndexError: index 3 is out of bounds for axis 1 with size 2
When replacing the Earth model with any other model from scikit-learn, it works as expected.
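For instance, here is a minimal sketch of the same call with a plain scikit-learn estimator (LinearRegression, chosen only for illustration), which runs without the error:
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_features=2)

# Same RFECV configuration as above, but wrapping a stock scikit-learn estimator.
model = RFECV(
    estimator=LinearRegression(),
    cv=3,
)
model.fit(X, y)
print(model.support_)  # boolean mask of the features RFECV kept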
@timocb Thanks for following up. I'm not sure what's going on in your code above, but it is probably an issue with py-earth's scikit-learn compatibility. I'll open a separate issue for it.
Fortunately, this shouldn't impact what you need to do. Here is what I was suggesting more explicitly:
from sklearn.pipeline import Pipeline
from pyearth import Earth
from sklearn.feature_selection.rfe import RFECV
from sklearn.linear_model.base import LinearRegression
from sklearn.metrics.scorer import r2_scorer
from sklearn.datasets.samples_generator import make_regression

X, y = make_regression()

model = Pipeline([('earth', Earth(enable_pruning=False)),
                  ('rfecv', RFECV(LinearRegression(),
                                  cv=4,
                                  scoring=r2_scorer))])
model.fit(X, y)
The Earth model is in a Pipeline with the RFECV, while a LinearRegression is used as the inner model for recursive feature elimination. With the r2_scorer, this configuration should be equivalent, or at least very similar, to a version of the pruning pass that uses cross-validation to score terms.
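Once the pipeline is fit, the outcome of the cross-validated elimination can be inspected through the standard RFECV attributes. A small sketch (note that the columns the RFECV step sees are the basis-function outputs of Earth.transform, not the raw input features):
rfecv = model.named_steps['rfecv']
print(rfecv.n_features_)  # number of basis columns retained after elimination
print(rfecv.support_)     # boolean mask over the columns produced by Earth.transform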
In the R earth package there is the possibility to use pruning method "cv". Rather than using the GCV error, this calculates the mean out-of-fold RSQ for each number of terms included by the forward pass and selects the number of terms with the highest mean out-of-fold RSQ. This greatly reduces MARS's tendency to overfit when the model is used for forecasting purposes.
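For comparison, the RFECV step in the pipeline above already records a mean out-of-fold score for each candidate number of terms, which is close in spirit to what pmethod="cv" does in the R earth package. A sketch of inspecting those scores, assuming the pipeline from the previous comment has been fit (grid_scores_ holds the per-subset mean CV scores in the scikit-learn versions of that era; newer releases expose the same information through cv_results_):
import numpy as np

rfecv = model.named_steps['rfecv']
# One mean out-of-fold R^2 per candidate number of retained terms,
# ordered from the smallest subset up to the full set of forward-pass terms.
mean_cv_r2 = np.asarray(rfecv.grid_scores_)
print(mean_cv_r2)
# RFECV keeps the subset with the best mean score, so this is the
# cross-validated choice of term count:
print(rfecv.n_features_)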