Closed naught101 closed 9 years ago
What do you mean by classifying and then doing a regression? SOMs are unsupervised, right? So you want to cluster the data and then run a separate regression model on each cluster?
There is no ready-made solution for that in scikit-learn. In particular, if you want to use different features for the clustering and the regression steps, that doesn't really fit into the scikit-learn interface. Using the same features for both, you could easily write a model with scikit-learn that does that.
I think the method is too specialized to add it to scikit-learn, so closing.
For clusterers supporting predict, it's not complicated to produce something like (untested):
import numpy as np
from sklearn.base import BaseEstimator, clone
from sklearn.utils import safe_mask


class ModelByCluster(BaseEstimator):
    """Fit one copy of `estimator` per cluster found by `clusterer`."""

    def __init__(self, clusterer, estimator):
        self.clusterer = clusterer
        self.estimator = estimator

    def fit(self, X, y):
        y = np.asarray(y)
        self.clusterer_ = clone(self.clusterer)
        clusters = self.clusterer_.fit_predict(X)
        # One estimator per cluster label seen during training.
        self.estimators_ = {}
        for c in np.unique(clusters):
            mask = clusters == c
            est = clone(self.estimator)
            est.fit(X[safe_mask(X, mask)], y[mask])
            self.estimators_[c] = est
        return self

    def predict(self, X):
        clusters = self.clusterer_.predict(X)
        y_tmp = []
        idx = []
        for c, est in self.estimators_.items():
            mask = clusters == c
            if not mask.any():
                continue  # no samples were assigned to this cluster
            idx.append(np.flatnonzero(mask))
            y_tmp.append(est.predict(X[safe_mask(X, mask)]))
        # Put the per-cluster predictions back in the original sample order.
        y_tmp = np.concatenate(y_tmp)
        idx = np.concatenate(idx)
        y = np.empty_like(y_tmp)
        y[idx] = y_tmp
        return y
Thanks @jorthman, that's exactly the kind of thing I was talking about. I guess I could extend something like that to split the features for clustering and regression, if necessary.
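For what it's worth, a rough sketch of that split-feature extension might look like the following. This is only an illustration under some assumptions: the class name `SplitFeatureModelByCluster` and the `cluster_cols` / `regress_cols` parameters are made up here and are not part of scikit-learn, `X` is assumed to be a dense 2-D array, `y` is assumed to be 1-D, and `KMeans` stands in for the SOM since scikit-learn does not ship one.

```python
import numpy as np
from sklearn.base import BaseEstimator, clone
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression


class SplitFeatureModelByCluster(BaseEstimator):
    """Hypothetical sketch: cluster on some columns, regress on others."""

    def __init__(self, clusterer, estimator, cluster_cols, regress_cols):
        self.clusterer = clusterer
        self.estimator = estimator
        self.cluster_cols = cluster_cols  # columns passed to the clusterer
        self.regress_cols = regress_cols  # columns passed to each regressor

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        self.clusterer_ = clone(self.clusterer)
        labels = self.clusterer_.fit_predict(X[:, self.cluster_cols])
        # One regressor per cluster label seen during training.
        self.estimators_ = {}
        for c in np.unique(labels):
            mask = labels == c
            est = clone(self.estimator)
            est.fit(X[mask][:, self.regress_cols], y[mask])
            self.estimators_[c] = est
        return self

    def predict(self, X):
        X = np.asarray(X)
        labels = self.clusterer_.predict(X[:, self.cluster_cols])
        # NaN marks samples whose cluster had no training data.
        y_pred = np.full(X.shape[0], np.nan)
        for c, est in self.estimators_.items():
            mask = labels == c
            if mask.any():
                y_pred[mask] = est.predict(X[mask][:, self.regress_cols])
        return y_pred


# Toy data: column 0 decides the regime, column 1 drives the response.
rng = np.random.default_rng(0)
regime = np.repeat([0.0, 10.0], 50) + rng.normal(scale=0.1, size=100)
x = rng.uniform(-1, 1, size=100)
y = np.where(regime < 5, 2 * x, -3 * x + 5)
X = np.column_stack([regime, x])

model = SplitFeatureModelByCluster(
    clusterer=KMeans(n_clusters=2, n_init=10, random_state=0),
    estimator=LinearRegression(),
    cluster_cols=[0],
    regress_cols=[1],
)
pred = model.fit(X, y).predict(X)
print("max abs error:", np.abs(pred - y).max())
```

Passing column indices to the constructor keeps a single `X` going through `fit` and `predict`, so the estimator can still be used with the usual scikit-learn tooling. Whether that is the right interface for a real use case is a judgment call.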
Is there currently a way to create a pipeline that classifies the data, then does individual regressions on the data in each class (perhaps using different features)?
This has been done before, e.g. using a SOM + linear regression (Hsu et al., 2002).
If it isn't already possible, would something like this be worth including in sklearn? Or should I just make a separate project?
Hsu, Kuo-lin, Hoshin V. Gupta, Xiaogang Gao, Soroosh Sorooshian, and Bisher Imam. 2002. Self-Organizing Linear Output Map (SOLO): An Artificial Neural Network Suitable for Hydrologic Modeling and Analysis. Water Resources Research 38(12): 38-1–38-17. http://doi.wiley.com/10.1029/2001WR000795, accessed November 28, 2013.