scikit-learn / scikit-learn

scikit-learn: machine learning in Python
https://scikit-learn.org
BSD 3-Clause "New" or "Revised" License

Classification + regression? #5418

Closed: naught101 closed this issue 8 years ago

naught101 commented 8 years ago

Is there currently a way to create a pipeline that classifies the data, then does individual regressions on the data in each class (perhaps using different features)?

This has been done before, e.g. using a SOM + linear regression (Hsu et al., 2002).

If it isn't already possible, would something like this be worth including in sklearn? Or should I just make a separate project?

Hsu, Kuo-lin, Hoshin V. Gupta, Xiaogang Gao, Soroosh Sorooshian, and Bisher Imam (2002). Self-Organizing Linear Output Map (SOLO): An Artificial Neural Network Suitable for Hydrologic Modeling and Analysis. Water Resources Research 38 (12): 38-1–38-17. http://doi.wiley.com/10.1029/2001WR000795, accessed November 28, 2013.

amueller commented 8 years ago

What do you mean by classifying and then doing a regression? SOMs are unsupervised, right? So you want to cluster the data and then run a separate regression model on each cluster?

There is no ready-made solution for that in scikit-learn. In particular, if you want to use different features for the two steps, this doesn't really fit into the scikit-learn interface. With the same features, though, you could easily write such a model using scikit-learn.

amueller commented 8 years ago

I think the method is too specialized to add it to scikit-learn, so closing.

jnothman commented 8 years ago

For clusterers supporting predict, it's not complicated to produce something like the following (untested):

import numpy as np

from sklearn.base import BaseEstimator, clone
from sklearn.utils import safe_mask


class ModelByCluster(BaseEstimator):
    """Cluster the data, then fit a separate estimator to each cluster."""

    def __init__(self, clusterer, estimator):
        self.clusterer = clusterer
        self.estimator = estimator

    def fit(self, X, y):
        # Assign each training sample to a cluster.
        self.clusterer_ = clone(self.clusterer)
        clusters = self.clusterer_.fit_predict(X)
        n_clusters = len(np.unique(clusters))
        # Fit one clone of the estimator per cluster.
        self.estimators_ = []
        for c in range(n_clusters):
            mask = clusters == c
            est = clone(self.estimator)
            est.fit(X[safe_mask(X, mask)], y[mask])
            self.estimators_.append(est)
        return self

    def predict(self, X):
        clusters = self.clusterer_.predict(X)
        y_tmp = []
        idx = []
        for c, est in enumerate(self.estimators_):
            mask = clusters == c
            idx.append(np.flatnonzero(mask))
            y_tmp.append(est.predict(X[safe_mask(X, mask)]))
        # Stitch the per-cluster predictions back into the original row order.
        y_tmp = np.concatenate(y_tmp)
        idx = np.concatenate(idx)
        y = np.empty_like(y_tmp)
        y[idx] = y_tmp
        return y
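
Usage might look like this (a minimal sketch; KMeans, Ridge, and the toy data are just illustrative choices, not part of the recipe):

import numpy as np

from sklearn.cluster import KMeans
from sklearn.linear_model import Ridge

# Made-up data: two groups with different linear relationships to y.
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
y = np.concatenate([X[:50].dot([1.0, -1.0]), X[50:].dot([2.0, 0.5])])

model = ModelByCluster(KMeans(n_clusters=2, random_state=0), Ridge())
model.fit(X, y)
print(model.predict(X)[:5])
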
naught101 commented 8 years ago

Thanks @jnothman, that's exactly the kind of thing I was talking about. I guess I could extend something like that to split the features for clustering and regression, if necessary.
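
E.g., a rough sketch of such an extension (untested; the cluster_cols/regress_cols parameters are hypothetical names, not anything from sklearn):

import numpy as np

from sklearn.base import BaseEstimator, clone


class ModelByClusterSplit(BaseEstimator):
    """Like ModelByCluster, but clusters on one column subset and
    regresses on another."""

    def __init__(self, clusterer, estimator, cluster_cols, regress_cols):
        self.clusterer = clusterer
        self.estimator = estimator
        self.cluster_cols = cluster_cols  # hypothetical: columns used for clustering
        self.regress_cols = regress_cols  # hypothetical: columns used for regression

    def fit(self, X, y):
        X = np.asarray(X)
        # Cluster on one feature subset.
        self.clusterer_ = clone(self.clusterer)
        clusters = self.clusterer_.fit_predict(X[:, self.cluster_cols])
        # Regress on the other subset, one estimator per cluster.
        self.estimators_ = []
        for c in range(len(np.unique(clusters))):
            mask = clusters == c
            est = clone(self.estimator)
            est.fit(X[mask][:, self.regress_cols], y[mask])
            self.estimators_.append(est)
        return self

    def predict(self, X):
        X = np.asarray(X)
        clusters = self.clusterer_.predict(X[:, self.cluster_cols])
        y = np.empty(X.shape[0])
        for c, est in enumerate(self.estimators_):
            mask = clusters == c
            if mask.any():  # skip clusters with no samples in X
                y[mask] = est.predict(X[mask][:, self.regress_cols])
        return y
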

jnothman commented 8 years ago

Now a gist: https://gist.github.com/jnothman/566ebde618ec18f2bea6