From the Bayesian perspective there's another way of doing this that might be better. I certainly think you're pointing in a good direction, but we should discuss some details.
Just so I understand, the goal is to have a set of estimators going into the ensemble that will be trained? If so, why not "take an intelligent average" of all these trained estimators?
My thinking here is that you probably want to incorporate the accuracy of each model when making a prediction. Suppose you have three models, some of them more accurate than others, each making a prediction.
These three models can be trained independently, but if we know the typical residual error of each model then we might use this for a "more intelligent reweighting".
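A minimal sketch of that kind of reweighting (the data is made up, and inverse-MSE weights are just one possible choice):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up holdout predictions from three already-trained models.
y_true = np.array([3.0, 5.0, 4.0, 6.0])
preds = [np.array([2.9, 5.2, 4.1, 5.8]),   # fairly accurate
         np.array([3.5, 4.1, 4.9, 6.7]),   # less accurate
         np.array([1.0, 8.0, 2.0, 9.0])]   # poor

# Weight each model by the inverse of its typical residual error.
errors = np.array([mean_squared_error(y_true, p) for p in preds])
weights = (1.0 / errors) / (1.0 / errors).sum()

# The more accurate models dominate the blended prediction.
blended = np.average(np.vstack(preds), axis=0, weights=weights)
```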
Is this in line with what you were proposing?
> the goal is to have a set of estimators going into the ensemble that will be trained? If so, why not "take an intelligent average" of all these trained estimators?
Yes, exactly. The process that I describe is exactly a method to take an intelligent average of the predictions of those estimators.
The `OLSEnsembler` should not touch the estimators, though. It should use clones of them to generate the `y_est` arrays.
> Is this in line with what you were proposing?
Yes, it is, I think. Unfortunately, I'm not used to the Bayesian way of thinking. But I'll explain my idea in more detail.
The ideal scenario would be a linear combination of the predictions that matches the expected `y` target. Suppose we have `n` estimators and `y` has length `m`; then we want `Y_est * w = y`, where `Y_est` is the `m x n` matrix of predictions and `w` holds the weights. This linear system rarely has an exact solution, and this is why it's so hard to overfit in this case. But it's possible to make `||y - Y_est*w||` as close to 0 as possible (OLS).
And then, when using `w` as weights for a weighted average, we avoid outputting probabilities out of the [0, 1] range, for instance. It also works as another constraint, as if we also imposed `w.sum() == 1`.
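As a minimal sketch of that formulation (using `np.linalg.lstsq` directly and synthetic arrays, just to illustrate):

```python
import numpy as np

rng = np.random.RandomState(0)
y = rng.rand(100)                                    # target of length m
Y_est = np.column_stack([y + 0.1 * rng.randn(100),   # predictions of two
                         y + 0.3 * rng.randn(100)])  # imperfect estimators

# OLS without intercept: find w minimizing ||Y_est @ w - y||
w, *_ = np.linalg.lstsq(Y_est, y, rcond=None)

# np.average divides by w.sum(), which acts like the w.sum() == 1 constraint
blend = np.average(Y_est, axis=1, weights=w)
```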
Have I missed something?
Ah. Now I see what you mean. I like the idea ... I wonder though. It feels like the list of estimators that produce a prediction can be reweighted by a `LinearRegression` ... which is good base behavior ... but why not allow the user to supply the model that does the reweighting?
Let's discuss this some more ... (it's an interesting idea!) but there are many sides to consider ...
... why not make a classifier that selects the best predictor? Or reweight all the models? If you have an estimator that can "detect" which of the input models performs best in certain regions ... that might be incredibly useful.
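A rough sketch of that "detector" idea (synthetic data; the model choices and the nearest-neighbour selector are arbitrary assumptions, not a concrete proposal):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.rand(200, 2)
y = np.sin(4 * X[:, 0]) + X[:, 1]

models = [LinearRegression().fit(X, y),
          DecisionTreeRegressor(max_depth=3).fit(X, y)]

# Label each sample with the index of the model that errs least on it.
errors = np.abs(np.column_stack([m.predict(X) for m in models]) - y[:, None])
best_model = errors.argmin(axis=1)

# The "detector": a classifier that learns which model to trust where.
selector = KNeighborsClassifier().fit(X, best_model)

# Route each new sample to the model the selector picks for it.
X_new = rng.rand(5, 2)
choice = selector.predict(X_new)
all_preds = np.column_stack([m.predict(X_new) for m in models])
y_pred = all_preds[np.arange(len(X_new)), choice]
```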
That's an interesting approach, but it has a very different mechanic to it. Maybe we have enough material for two separate ensemblers?
I have some questions about that approach:
How can you define what a good region is? If the user is trying to maximize AUCROC, using residuals to characterize the quality of a region may not work. For instance:
```python
>>> from sklearn.metrics import roc_auc_score as auc
>>> from sklearn.metrics import mean_squared_error as mse
>>> truth = [1, 0]
>>> out1 = [.002, .001]
>>> out2 = [.501, .502]
>>> auc(truth, out1), auc(truth, out2)
(1.0, 0.0)
```
According to the AUCROC, `out1` is better than `out2`:
```python
>>> mse(truth, out1), mse(truth, out2)
(0.4980025, 0.2505025)
```
But `out1` has higher error residuals than `out2`. Thus, the user would have to provide their own scorer. This is also a problem for the idea that I proposed! Maybe my idea doesn't work for classification problems!
What if the quality of certain predictions has more to do with the estimator's input than with its output? I mean, an estimator can work better for certain regions of the input hyperspace, even though the output for different regions can be very similar numerically.
Right! I understand, we're talking about a heuristic and by definition it's not supposed to be optimal. But I'm raising such questions because quantifying the quality of predictions by "regions" is something that I've been thinking about for a while.
There's an interesting thing here in terms of ... is the regression doing interpolation (in which case my gut says that the classification might work well) or is it doing extrapolation (in which case my gut thinks regression might work better as an ensemble).
it's PyData Amsterdam right now, so I might comment a bit more later :)
one thing to consider ... can we come up with an actual example that is solved with the approaches discussed here?
> one thing to consider ... can we come up with an actual example that is solved with the approaches discussed here?
Great idea. Let's test it out first!
```python
import numpy as np
from sklearn.metrics import mean_absolute_error as mae
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.model_selection import train_test_split, KFold
from sklearn.datasets import fetch_california_housing
from sklearn.base import clone

N_EXPS = 40
N_FOLDS = 5

X, y = fetch_california_housing(return_X_y=True)
ests = [ExtraTreesRegressor(n_estimators=5),
        RandomForestRegressor(n_estimators=5)]

count1, count2, count3 = 0, 0, 0
for i in range(N_EXPS):
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=i)

    # Baseline: each estimator trained on the full training set
    predictions = [est.fit(X_train, y_train).predict(X_test) for est in ests]
    scores = [mae(y_test, prediction) for prediction in predictions]
    best_wo_ensemble = min(scores)

    # Out-of-fold predictions on the training set, one column per estimator
    Y_est = np.empty((X_train.shape[0], len(ests)))
    for train_index, test_index in KFold(n_splits=N_FOLDS).split(X_train):
        X_train_, X_test_ = X_train[train_index], X_train[test_index]
        y_train_ = y_train[train_index]
        ests_ = [clone(est) for est in ests]  # not really required for testing
        for j, est in enumerate(ests_):
            Y_est[test_index, j] = est.fit(X_train_, y_train_).predict(X_test_)

    # OLS weights: minimize ||Y_est @ w - y_train|| without an intercept
    weights = LinearRegression(fit_intercept=False).fit(Y_est, y_train).coef_

    # "Dumb" ensemble: plain mean; "smart" ensemble: weighted average
    ensembled_any = mae(y_test, np.array(predictions).T.mean(axis=1))
    ensembled_smart = mae(y_test, np.average(np.array(predictions).T,
                                             weights=weights, axis=1))

    if ensembled_any < best_wo_ensemble:
        count1 += 1
    if ensembled_smart < ensembled_any:
        count2 += 1
    if ensembled_smart < best_wo_ensemble:
        count3 += 1

print('dumb ensemble better than no ensemble: {}/{}'.format(count1, N_EXPS))
print('weighted ensemble better than dumb ensemble: {}/{}'.format(count2, N_EXPS))
print('weighted ensemble better than no ensemble: {}/{}'.format(count3, N_EXPS))
```
Output:

```
dumb ensemble better than no ensemble: 40/40
weighted ensemble better than dumb ensemble: 19/40
weighted ensemble better than no ensemble: 40/40
```
Ehhh... not as interesting as I thought? Maybe I used models that were too similar? Maybe I shouldn't have used all the features like that? I don't know! But now we have some code to play with!
A variant of this got pushed. https://github.com/koaning/scikit-lego/pull/253
Will close this issue for now. Doing a cleanup.
This feature is somewhat similar to https://github.com/koaning/scikit-lego/issues/23, but it can work for regressors as well. I'll explain it using a top-down approach.
First, the constructor's signature:
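Something along these lines (the parameter names are taken from the rest of this description; the defaults shown are only illustrative):

```python
class OLSEnsembler:
    # Hypothetical signature; the defaults here are illustrative assumptions.
    def __init__(self, estimators, n_folds=5, method='predict',
                 weights_power=1):
        ...
```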
But let's suppose we're in a regression context and we're going to use `method='predict'` instead, because it's easier to explain with regressions. So, we call something like:
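```python
# Hypothetical usage of the proposed class (ests is a list of estimators)
ols_ensembler = OLSEnsembler(estimators=ests, n_folds=5, method='predict')
ols_ensembler.fit(X, y)
```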
The goal of `fit` is to find a good set of weights for the ensemble. The `fit` method splits `X` and `y` in `n_folds` partitions and computes `y_est_i` for each estimator `i` iteratively, `1/n_folds` of it in each iteration.

Now we have the predictions `y_est_1, ..., y_est_n` and we want to find `w_1, ..., w_n` that minimize `err` in:

`err = ||sum(w_i * y_est_i) / sum(w_i) - y||`

which is equivalent to minimizing `err_` in:

`err_ = ||Y_est * w - y||`

where `Y_est` is the matrix whose columns are `y_est_1, ..., y_est_n` and `w` is the column vector that contains `w_1, ..., w_n`.
And this is why I named it `OLSEnsembler`: it uses ordinary least squares to find `w`.

Now, keeping it simple, we can solve that minimization problem with an `lr = LinearRegression(fit_intercept=False)` estimator by calling `w = lr.fit(Y_est, y).coef_`.
The `weights_power` parameter is used to increase (or decrease) the relative value of the estimators with higher coefficients: `w = w**weights_power`.
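For example (made-up weights, just to show the effect of squaring them):

```python
import numpy as np

w = np.array([0.6, 0.3, 0.1])
w2 = w ** 2                # weights_power = 2
print(w2 / w2.sum())       # [0.783 0.196 0.022] -> sharper contrast than w
```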
When we call `ols_ensembler.predict(X)`, it returns the weighted average of the estimators' predictions: `sum(w_i * est_i.predict(X)) / sum(w_i)`.

For classification problems using `method='predict_proba'`, the process is similar but repeated `c-1` times, where `c` is the number of classes. I can clarify this part if necessary.
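Roughly, something like this (a sketch with `np.linalg.lstsq` standing in for the `LinearRegression` fit; treating the last class as the complement to 1 is one option):

```python
import numpy as np

def fit_class_weights(probas, Y):
    """probas[i] is the (m, c) predict_proba output of estimator i on the
    out-of-fold samples; Y is the one-hot (m, c) encoding of y."""
    c = Y.shape[1]
    weights = []
    for k in range(c - 1):                       # repeated c-1 times
        Y_est = np.column_stack([p[:, k] for p in probas])
        w, *_ = np.linalg.lstsq(Y_est, Y[:, k], rcond=None)
        weights.append(w)
    return weights  # the last class's probability can come from the complement
```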