scikit-learn-contrib / py-earth

A Python implementation of Jerome Friedman's Multivariate Adaptive Regression Splines
http://contrib.scikit-learn.org/py-earth/
BSD 3-Clause "New" or "Revised" License

Add cross-validation based pruning method #188

Closed martinskogholt closed 5 years ago

martinskogholt commented 5 years ago

In the R earth package it is possible to use the pruning method "cv". Rather than using the GCV error, this computes the mean out-of-fold RSq for each number of terms included by the forward pass and selects the number of terms with the highest mean out-of-fold RSq. This greatly reduces MARS's tendency to overfit when it is used for forecasting.
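As a rough illustration of the same idea in Python (assuming py-earth's Earth estimator cooperates with scikit-learn's cross_val_score; the candidate sizes and fold count below are arbitrary), one could fit unpruned models of increasing max_terms and keep the size with the highest mean out-of-fold R²:

import numpy as np

from pyearth import Earth
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=5, noise=1.0, random_state=0)

# Score each candidate model size by its mean out-of-fold R^2,
# mirroring what pmethod="cv" does in the R earth package.
candidate_sizes = list(range(2, 21))
mean_scores = []
for n_terms in candidate_sizes:
    model = Earth(max_terms=n_terms, enable_pruning=False)
    mean_scores.append(cross_val_score(model, X, y, cv=5, scoring='r2').mean())

best_size = candidate_sizes[int(np.argmax(mean_scores))]
print("selected number of terms:", best_size)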

jcrudy commented 5 years ago

@martinskogholt I believe this is something that can already be accomplished by combining py-earth with scikit-learn. Specifically, an Earth model with enable_pruning=False should be placed in a Pipeline with a cross-validating feature selector of some kind. From the documentation, it looks like the RFECV can be made equivalent to a cross-validated version of the pruning pass with the correct choice of arguments.

I'm going to close this issue, but please comment on whether the above solution meets your needs, and if it doesn't please reopen. Also, feel free to ask further questions in case the above explanation is unclear or incomplete.

timocb commented 5 years ago

@jcrudy

@martinskogholt and I are working on this.

I have tried using the following approach:

from pyearth import Earth
from sklearn.datasets.samples_generator import make_regression
from sklearn.feature_selection import RFECV

X, y = make_regression(n_features=2)

# Wrap an unpruned Earth model directly in a cross-validated feature eliminator
model = RFECV(
    estimator=Earth(
        enable_pruning=False,
    ),
    cv=3,
)

model.fit(X, y)

This gives me the following error:

IndexError                                Traceback (most recent call last)
<ipython-input-79-5e605f1006f7> in <module>
     12 )
     13 
---> 14 model.fit(X, y)

~/.pyenv/versions/3.5.6/lib/python3.5/site-packages/sklearn/feature_selection/rfe.py in fit(self, X, y, groups)
    512         scores = parallel(
    513             func(rfe, self.estimator, X, y, train, test, scorer)
--> 514             for train, test in cv.split(X, y, groups))
    515 
    516         scores = np.sum(scores, axis=0)

~/.pyenv/versions/3.5.6/lib/python3.5/site-packages/sklearn/feature_selection/rfe.py in <genexpr>(.0)
    512         scores = parallel(
    513             func(rfe, self.estimator, X, y, train, test, scorer)
--> 514             for train, test in cv.split(X, y, groups))
    515 
    516         scores = np.sum(scores, axis=0)

~/.pyenv/versions/3.5.6/lib/python3.5/site-packages/sklearn/feature_selection/rfe.py in _rfe_single_fit(rfe, estimator, X, y, train, test, scorer)
     30     X_test, y_test = _safe_split(estimator, X, y, test, train)
     31     return rfe._fit(
---> 32         X_train, y_train, lambda estimator, features:
     33         _score(estimator, X_test[:, features], y_test, scorer)).scores_
     34 

~/.pyenv/versions/3.5.6/lib/python3.5/site-packages/sklearn/feature_selection/rfe.py in _fit(self, X, y, step_score)
    206             if step_score:
    207                 self.scores_.append(step_score(estimator, features))
--> 208             support_[features[ranks][:threshold]] = False
    209             ranking_[np.logical_not(support_)] += 1
    210 

IndexError: index 3 is out of bounds for axis 1 with size 2

When I replace the Earth model with any other scikit-learn estimator, it works as expected.
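For example, swapping in an ordinary LinearRegression (an arbitrary choice here, just to show the contrast) runs without error:

from sklearn.datasets import make_regression
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_features=2)

# The same wrapper runs without error when the inner estimator is a
# plain scikit-learn regressor instead of Earth.
model = RFECV(estimator=LinearRegression(), cv=3)
model.fit(X, y)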

jcrudy commented 5 years ago

@timocb Thanks for following up. I'm not sure what's going on in your code above, but it is probably an issue with py-earth's scikit-learn compatibility. I'll open a separate issue for it.

Fortunately, this shouldn't impact what you need to do. Here is what I was suggesting more explicitly:

from sklearn.pipeline import Pipeline
from pyearth import Earth
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression

X, y = make_regression()

# The unpruned Earth model generates the candidate basis functions; RFECV
# then eliminates terms based on the cross-validated R^2 of an inner
# linear model fit to those terms.
model = Pipeline([('earth', Earth(enable_pruning=False)),
                  ('rfecv', RFECV(LinearRegression(),
                                  cv=4,
                                  scoring='r2'))])

model.fit(X, y)

The Earth model is in a Pipeline with the RFECV, while a LinearRegression is used as the inner model for recursive feature elimination. With R² as the scoring metric, this configuration should be equivalent, or at least very similar, to a version of the pruning pass that uses cross-validation to score terms.
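If it helps, here is a quick way to see what the elimination kept (just a sketch; lining the mask up with individual basis functions assumes the columns produced by the Earth transform follow the same order as its summary):

# Basis functions produced by the unpruned forward pass
print(model.named_steps['earth'].summary())

# Boolean mask over the transformed columns: True for terms RFECV kept
print(model.named_steps['rfecv'].support_)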