From the Bayesian perspective there's another way of doing this that might be better. I certainly think you're pointing in a good direction, but we should discuss some details.
Just so I understand, the goal is to have a set of estimators going into the ensemble that will be trained? If so, why not "take an intelligent average" of all these trained estimators?
My thinking here is that you probably want to incorporate the accuracy of each model when making a prediction. Suppose you have three models, some of them more accurate than others, each making a prediction.
These three models can be trained independently, but if we know the typical residual error of each model then we might use this for a "more intelligent reweighting".
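A minimal sketch of that kind of reweighting (the data is made up, and inverse-MSE weights are just one possible choice):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up holdout predictions from three already-trained models.
y_true = np.array([3.0, 5.0, 4.0, 6.0])
preds = [np.array([2.9, 5.2, 4.1, 5.8]),   # fairly accurate
         np.array([3.5, 4.1, 4.9, 6.7]),   # less accurate
         np.array([1.0, 8.0, 2.0, 9.0])]   # poor

# Weight each model by the inverse of its typical residual error.
errors = np.array([mean_squared_error(y_true, p) for p in preds])
weights = (1.0 / errors) / (1.0 / errors).sum()

# The more accurate models dominate the blended prediction.
blended = np.average(np.vstack(preds), axis=0, weights=weights)
```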
Is this in line with what you were proposing?
> the goal is to have a set of estimators going into the ensemble that will be trained? If so, why not "take an intelligent average" of all these trained estimators?
Yes, exactly. The process that I describe is exactly a method to take an intelligent average of the predictions of those estimators.
The `OLSEnsembler` should not touch the estimators, though. It should use clones of them to generate the `y_est` arrays.
> Is this in line with what you were proposing?
Yes, it is, I think. Unfortunately, I'm not used to the Bayesian way of thinking. But I'll explain my idea in more detail.
The ideal scenario would be a linear combination of the predictions that matches the expected `y` target. Suppose we have `n` estimators and `y` has length `m`; then we want `Y_est * w = y`, where `Y_est` is the `m x n` matrix of predictions and `w` holds the weights. This linear system rarely has an exact solution, and this is why it's so hard to overfit in this case. But it's possible to make `||y - Y_est*w||` as close to 0 as possible (OLS).
And then, when using `w` as weights for a weighted average, we avoid outputting probabilities out of the [0, 1] range, for instance. It also works as another constraint, as if we also imposed `w.sum() == 1`.
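As a minimal sketch of that formulation (using `np.linalg.lstsq` directly and synthetic arrays, just to illustrate):

```python
import numpy as np

rng = np.random.RandomState(0)
y = rng.rand(100)                                    # target of length m
Y_est = np.column_stack([y + 0.1 * rng.randn(100),   # predictions of two
                         y + 0.3 * rng.randn(100)])  # imperfect estimators

# OLS without intercept: find w minimizing ||Y_est @ w - y||
w, *_ = np.linalg.lstsq(Y_est, y, rcond=None)

# np.average divides by w.sum(), which acts like the w.sum() == 1 constraint
blend = np.average(Y_est, axis=1, weights=w)
```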
Have I missed something?
Ah. Now I see what you mean. I like the idea ... I wonder though. It feels like the list of estimators that produce a prediction can be reweighted by a `LinearRegression` ... which is good base behavior ... but why not allow the user to supply the model that does the reweighting?
Let's discuss this some more ... (it's an interesting idea!) but there are many sides to consider ...
... why not make a classifier that selects the best predictor? Or reweight all the models? If you have an estimator that can "detect" which of the input models performs best in certain regions ... that might be incredibly useful.
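A rough sketch of that "detector" idea (synthetic data; the model choices and the nearest-neighbour selector are arbitrary assumptions, not a concrete proposal):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.rand(200, 2)
y = np.sin(4 * X[:, 0]) + X[:, 1]

models = [LinearRegression().fit(X, y),
          DecisionTreeRegressor(max_depth=3).fit(X, y)]

# Label each sample with the index of the model that errs least on it.
errors = np.abs(np.column_stack([m.predict(X) for m in models]) - y[:, None])
best_model = errors.argmin(axis=1)

# The "detector": a classifier that learns which model to trust where.
selector = KNeighborsClassifier().fit(X, best_model)

# Route each new sample to the model the selector picks for it.
X_new = rng.rand(5, 2)
choice = selector.predict(X_new)
all_preds = np.column_stack([m.predict(X_new) for m in models])
y_pred = all_preds[np.arange(len(X_new)), choice]
```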
That's an interesting approach, but it has a very different mechanic to it. Maybe we have enough material for two separate ensemblers?
I have some questions about that approach:
How can you define what a good region is? If the user is trying to maximize AUCROC, using residuals to characterize the quality of a region may not work. For instance:
```python
>>> from sklearn.metrics import roc_auc_score as auc
>>> from sklearn.metrics import mean_squared_error as mse
>>> truth = [1, 0]
>>> out1 = [.002, .001]
>>> out2 = [.501, .502]
>>> auc(truth, out1), auc(truth, out2)
(1.0, 0.0)
```
According to the AUCROC, `out1` is better than `out2`:
```python
>>> mse(truth, out1), mse(truth, out2)
(0.4980025, 0.2505025)
```
But `out1` has higher error residuals than `out2`. Thus, the user would have to provide their own scorer. This is also a problem for the idea that I proposed! Maybe my idea doesn't work for classification problems!
What if the quality of certain predictions has more to do with the estimator's input than with its output? I mean, an estimator can work better for certain regions of the input hyperspace, even though the output for different regions can be very similar numerically.
Right! I understand, we're talking about a heuristic and by definition it's not supposed to be optimal. But I'm raising such questions because quantifying the quality of predictions by "regions" is something that I've been thinking about for a while.
There's an interesting thing here in terms of ... is the regression doing interpolation (in which case my gut says that the classification might work well) or is it doing extrapolation (in which case my gut thinks regression might work better as an ensemble).
it's PyData Amsterdam right now, so I might comment a bit more later :)
one thing to consider ... can we come up with an actual example that is solved with the approaches discussed here?
> one thing to consider ... can we come up with an actual example that is solved with the approaches discussed here?
Great idea. Let's test it out first!
```python
import numpy as np
from sklearn.metrics import mean_absolute_error as mae
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.model_selection import train_test_split, KFold
from sklearn.datasets import fetch_california_housing
from sklearn.base import clone

N_EXPS = 40
N_FOLDS = 5

X, y = fetch_california_housing(return_X_y=True)
ests = [ExtraTreesRegressor(n_estimators=5),
        RandomForestRegressor(n_estimators=5)]

count1, count2, count3 = 0, 0, 0
for i in range(N_EXPS):
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=i)

    # Baseline: each estimator trained on the full training set
    predictions = [est.fit(X_train, y_train).predict(X_test) for est in ests]
    scores = [mae(y_test, prediction) for prediction in predictions]
    best_wo_ensemble = min(scores)

    # Out-of-fold predictions on the training set, one column per estimator
    Y_est = np.empty((X_train.shape[0], len(ests)))
    for train_index, test_index in KFold(n_splits=N_FOLDS).split(X_train):
        X_train_, X_test_ = X_train[train_index], X_train[test_index]
        y_train_ = y_train[train_index]
        ests_ = [clone(est) for est in ests]  # not really required for testing
        for j, est in enumerate(ests_):
            Y_est[test_index, j] = est.fit(X_train_, y_train_).predict(X_test_)

    # OLS weights: minimize ||Y_est @ w - y_train|| without an intercept
    weights = LinearRegression(fit_intercept=False).fit(Y_est, y_train).coef_

    # "Dumb" ensemble: plain mean; "smart" ensemble: weighted average
    ensembled_any = mae(y_test, np.array(predictions).T.mean(axis=1))
    ensembled_smart = mae(y_test, np.average(np.array(predictions).T,
                                             weights=weights, axis=1))

    if ensembled_any < best_wo_ensemble:
        count1 += 1
    if ensembled_smart < ensembled_any:
        count2 += 1
    if ensembled_smart < best_wo_ensemble:
        count3 += 1

print('dumb ensemble better than no ensemble: {}/{}'.format(count1, N_EXPS))
print('weighted ensemble better than dumb ensemble: {}/{}'.format(count2, N_EXPS))
print('weighted ensemble better than no ensemble: {}/{}'.format(count3, N_EXPS))
```
Output:

```
dumb ensemble better than no ensemble: 40/40
weighted ensemble better than dumb ensemble: 19/40
weighted ensemble better than no ensemble: 40/40
```
Ehhh... not as interesting as I thought? Maybe I used models that were too similar? Maybe I shouldn't have used all the features like that? I don't know! But now we have some code to play with!
A variant of this got pushed. https://github.com/koaning/scikit-lego/pull/253
Will close this issue for now. Doing a cleanup.
This feature is somewhat similar to https://github.com/koaning/scikit-lego/issues/23, but it can work for regressors as well. I'll explain it using a top-down approach.
First, the constructor's signature:
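Something along these lines (the parameter names are taken from the rest of this description; the defaults shown are only illustrative):

```python
class OLSEnsembler:
    # Hypothetical signature; the defaults here are illustrative assumptions.
    def __init__(self, estimators, n_folds=5, method='predict',
                 weights_power=1):
        ...
```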
But let's suppose we're in a regression context and we're going to use `method='predict'` instead, because it's easier to explain with regressions. So, we call something like:
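```python
# Hypothetical usage of the proposed class (ests is a list of estimators)
ols_ensembler = OLSEnsembler(estimators=ests, n_folds=5, method='predict')
ols_ensembler.fit(X, y)
```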
The goal of `fit` is to find a good set of weights for the ensemble. The `fit` method splits `X` and `y` in `n_folds` partitions and computes `y_est_i` for each estimator `i` iteratively, `1/n_folds` of it in each iteration.

Now we have the predictions `y_est_1, ..., y_est_n` and we want to find `w_1, ..., w_n` that minimize `err` in:

`err = ||sum(w_i * y_est_i) / sum(w_i) - y||`

which is equivalent to minimizing `err_` in:

`err_ = ||Y_est * w - y||`

where `Y_est` is the matrix whose columns are `y_est_1, ..., y_est_n` and `w` is the column vector that contains `w_1, ..., w_n`.
And this is why I named it `OLSEnsembler`: it uses ordinary least squares to find `w`.

Now, keeping it simple, we can solve that minimization problem with an `lr = LinearRegression(fit_intercept=False)` estimator by calling `w = lr.fit(Y_est, y).coef_`.
The `weights_power` parameter is used to increase (or decrease) the relative value of the estimators with higher coefficients: `w = w**weights_power`.
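For example (made-up weights, just to show the effect of squaring them):

```python
import numpy as np

w = np.array([0.6, 0.3, 0.1])
w2 = w ** 2                # weights_power = 2
print(w2 / w2.sum())       # [0.783 0.196 0.022] -> sharper contrast than w
```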
When we call `ols_ensembler.predict(X)`, it returns the weighted average of the estimators' predictions: `sum(w_i * est_i.predict(X)) / sum(w_i)`.

For classification problems using `method='predict_proba'`, the process is similar but repeated `c-1` times, where `c` is the number of classes. I can clarify this part if necessary.
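Roughly, something like this (a sketch with `np.linalg.lstsq` standing in for the `LinearRegression` fit; treating the last class as the complement to 1 is one option):

```python
import numpy as np

def fit_class_weights(probas, Y):
    """probas[i] is the (m, c) predict_proba output of estimator i on the
    out-of-fold samples; Y is the one-hot (m, c) encoding of y."""
    c = Y.shape[1]
    weights = []
    for k in range(c - 1):                       # repeated c-1 times
        Y_est = np.column_stack([p[:, k] for p in probas])
        w, *_ = np.linalg.lstsq(Y_est, Y[:, k], rcond=None)
        weights.append(w)
    return weights  # the last class's probability can come from the complement
```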