sktime / sktime

A unified framework for machine learning with time series
https://www.sktime.net
BSD 3-Clause "New" or "Revised" License

[ENH] evaluate should accept ensembler separate from forecasters #3884

Open RNKuhns opened 1 year ago

RNKuhns commented 1 year ago

Is your feature request related to a problem? Please describe.

In the forecasting module, the evaluate function currently lets a user use time series cross-validation to compare different forecasters.

These can be individual forecasters or an ensemble forecaster. It is commonplace in forecasting that appropriately defined ensembles of forecasters produce better forecasts than the individual forecasters. The evaluate function currently allows this comparison to be made by passing an ensemble forecaster alongside the individual forecasters it is compared against.

However, this is inefficient in commonplace forecast evaluation use cases. For example, fitting several forecasters and then comparing the results among the individual forecasters and different ensembles of those forecasters. In the simple case where the forecasters are compared against a single ensemble, each forecaster is currently fit and predicted twice per evaluation window (once on its own and once inside the ensemble forecaster). If multiple ensembles are compared, the duplication grows further. Even the simple case can be slow if any of the individual forecasters is slow to fit/predict, or if there are many evaluation windows requiring refitting.

Describe the solution you'd like

sktime should provide the ability to ensemble forecasts without fitting the underlying models (i.e., given some forecasts, apply the ensembler to those forecasts directly, rather than accepting the training data, fitting the models, making the forecasts, and then ensembling them).

The evaluate function should make use of this functionality via a parameter that accepts an ensembler or list of ensemblers. evaluate would then report results for the individual forecasters and for the specified ensembles; within a given evaluation window, the individual forecasters' predictions can be stored temporarily and reused by the provided ensemblers. This would let users perform commonplace forecast evaluation more efficiently: in practice, users should almost always compare their forecasts against one or more challenger models, and also check whether combinations of the model forecasts are better than the individual models.
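To make the proposed flow concrete, here is a minimal sketch (the `evaluate_with_ensembles` name, the `ensemblers` argument, and the mean-based ensemblers are hypothetical, not existing sktime API): each individual forecaster is fit and predicted once per window, the predictions are cached, and each ensembler is applied to the cached predictions.

```python
# Hypothetical sketch of evaluate with an ``ensemblers`` argument (not sktime API).
# Each forecaster is fit/predicted once per window; ensembles reuse the cached predictions.
import pandas as pd
from sklearn.metrics import mean_absolute_error


def evaluate_with_ensembles(forecasters, ensemblers, y, cv):
    """forecasters: dict name -> sktime forecaster,
    ensemblers: dict name -> callable mapping a DataFrame of predictions to a Series,
    cv: an sktime splitter yielding iloc train/test indices."""
    rows = []
    for i, (train_idx, test_idx) in enumerate(cv.split(y)):
        y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
        fh = range(1, len(y_test) + 1)

        # fit/predict each individual forecaster exactly once for this window
        preds = {}
        for name, forecaster in forecasters.items():
            forecaster = forecaster.clone()
            forecaster.fit(y_train, fh=fh)
            preds[name] = forecaster.predict()
            rows.append(
                {"window": i, "model": name,
                 "mae": mean_absolute_error(y_test, preds[name])}
            )

        # apply each ensembler to the cached predictions, no refitting of components
        pred_df = pd.DataFrame(preds)
        for name, ensembler in ensemblers.items():
            y_ens = ensembler(pred_df)
            rows.append(
                {"window": i, "model": name,
                 "mae": mean_absolute_error(y_test, y_ens)}
            )

    return pd.DataFrame(rows)
```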

Describe alternatives you've considered

The alternative is the status quo, but as currently constituted, comparing ensembles of forecasters involves inefficiency in performing a standard forecast evaluation.

Additional context

I'd propose adding a method to each ensemble forecaster that lets the ensembling be applied to external forecasts. For the current ensemblers this would require a refactor that separates the application of the ensembling approach from the fitting/predicting of the underlying models (into a method that the ensembler also uses when a user calls fit/predict on it). But this seems like an easier way to reuse code than separating the ensembling out into a separate class or function.
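A toy sketch of what that refactor could look like (class and method names are illustrative only, not existing sktime API): the aggregation rule lives in a single private method, which both the normal predict path and the proposed external-forecast path call.

```python
# Hypothetical refactor sketch (names are illustrative, not sktime API):
# the aggregation step is factored out so it can be reused on external forecasts.
import pandas as pd


class SimpleMeanEnsembler:
    """Toy stand-in for an sktime ensemble forecaster."""

    def __init__(self, forecasters):
        # forecasters: list of (name, forecaster) pairs
        self.forecasters = forecasters

    def _aggregate(self, pred_df: pd.DataFrame) -> pd.Series:
        # the ensembling rule lives in exactly one place
        return pred_df.mean(axis=1)

    def fit(self, y, fh=None):
        self.forecasters_ = [(name, f.clone().fit(y, fh=fh)) for name, f in self.forecasters]
        return self

    def predict(self, fh=None):
        # usual path: predict with the fitted component forecasters, then aggregate
        preds = {name: f.predict(fh=fh) for name, f in self.forecasters_}
        return self._aggregate(pd.DataFrame(preds))

    def ensemble_predictions(self, pred_df: pd.DataFrame) -> pd.Series:
        # proposed path: aggregate externally produced forecasts, no fitting involved
        return self._aggregate(pred_df)
```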

fkiraly commented 1 year ago

> For example, fitting several forecasters and then comparing the results among the individual forecasters and different ensembles of those forecasters. In the simple case where the forecasters are compared against a single ensemble, each forecaster is currently fit and predicted twice per evaluation window (once on its own and once inside the ensemble forecaster). If multiple ensembles are compared, the duplication grows further.

Yes, I do get the efficiency/duplication argument here.

I think there are two points to discuss:

* whether the use case as stated is the right workflow
* whether we can solve this without coupling, i.e., without adding more args to the evaluate function

I'll go through these separately.

fkiraly commented 1 year ago

is the use case as stated the right workflow?

Depending on what you mean, it might lead to "overfitting in evaluation".

Suppose we have 6 component forecasters that we could ensemble over, and 2 ways to aggregate. That gives us $(2^6 - 1) \cdot 2 = 126$ combinations to evaluate.

Let's say the choice of candidate has no influence at all on the error metric, i.e., the metric values of the 126 candidates are independently and identically distributed.

Then the "best" model corresponds to the maximum order statistic of the sample of 126, i.e., its metric has cdf $F'(x) = F(x)^{126}$, where $F$ is the cdf of the metric for a single candidate (assuming wlog that bigger is better).

This skews the result pretty far up. For instance, if we start with a uniform distribution on $[0,1]$ (let's say, an R-squared), the point $x$ where $F'(x) = \frac{1}{2}$ is $x = 0.5^{\frac{1}{126}} \approx 0.995$; that is, the top performer looks like a pretty good model, scoring 0.99-something, even though it performs no better than anything else.
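A quick simulation (a minimal sketch using plain NumPy) confirms the order-statistic calculation:

```python
# Sanity check of the order-statistic argument: median of the maximum of
# 126 iid Uniform(0, 1) metric values.
import numpy as np

n_candidates = (2**6 - 1) * 2            # 126 ensemble/aggregation combinations
analytic_median = 0.5 ** (1 / n_candidates)

rng = np.random.default_rng(0)
best = rng.uniform(size=(100_000, n_candidates)).max(axis=1)
print(n_candidates, analytic_median, np.median(best))
# -> 126, ~0.9945, simulated median close to 0.9945
```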

fkiraly commented 1 year ago

What is more "sound" imo is to put the ensemble in a multiplexer or other tuner, and tune which models are included. Generalizations include weighted ensemble methods, with the weights determined after a single fit of all component models, such as AutoEnsembleForecaster, which is currently in sktime.
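For concreteness, a sketch of the multiplexer route using MultiplexForecaster inside ForecastingGridSearchCV (assuming the current sktime API; import paths may differ between versions, and the choice of NaiveForecaster/ThetaForecaster on the airline data is just for illustration):

```python
# Sketch: tune *which* forecaster/ensemble is used via a multiplexer,
# instead of evaluating every combination by hand.
from sktime.datasets import load_airline
from sktime.forecasting.compose import EnsembleForecaster, MultiplexForecaster
from sktime.forecasting.model_selection import (
    ExpandingWindowSplitter,
    ForecastingGridSearchCV,
)
from sktime.forecasting.naive import NaiveForecaster
from sktime.forecasting.theta import ThetaForecaster

y = load_airline()

candidates = MultiplexForecaster(
    forecasters=[
        ("naive", NaiveForecaster(sp=12)),
        ("theta", ThetaForecaster(sp=12)),
        ("ensemble", EnsembleForecaster(
            [("naive", NaiveForecaster(sp=12)), ("theta", ThetaForecaster(sp=12))]
        )),
    ]
)

cv = ExpandingWindowSplitter(initial_window=60, step_length=12, fh=[1, 2, 3])
gscv = ForecastingGridSearchCV(
    forecaster=candidates,
    cv=cv,
    param_grid={"selected_forecaster": ["naive", "theta", "ensemble"]},
)
gscv.fit(y, fh=[1, 2, 3])
print(gscv.best_params_)  # which candidate won the backtest
```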

I do agree that if you want to do the exhaustive search, it does end up being wasteful, so maybe what you want is a tuner and not a change to the evaluate function? E.g., something similar to AutoEnsembleForecaster or the online_learning module?

Having said that, it may be worth having a variant of the evaluate function to use inside a tuner? For that, you would need access to predictions that you can then ensemble and compute metrics of.
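A sketch of one existing hook in that direction, assuming evaluate's return_data option behaves as in current sktime (per-window predictions come back in the results frame and can be ensembled post hoc; column names such as "y_pred" may change):

```python
# Sketch: get per-window predictions out of evaluate via return_data=True,
# then ensemble them post hoc without refitting the component forecasters.
import pandas as pd
from sktime.datasets import load_airline
from sktime.forecasting.model_evaluation import evaluate
from sktime.forecasting.model_selection import ExpandingWindowSplitter
from sktime.forecasting.naive import NaiveForecaster
from sktime.forecasting.theta import ThetaForecaster

y = load_airline()
cv = ExpandingWindowSplitter(initial_window=60, step_length=12, fh=[1, 2, 3])

results = {
    name: evaluate(forecaster=f, y=y, cv=cv, return_data=True)
    for name, f in [("naive", NaiveForecaster(sp=12)), ("theta", ThetaForecaster(sp=12))]
}

for window in range(len(results["naive"])):
    # collect the cached per-window predictions of each forecaster
    preds = pd.concat(
        {name: res.loc[window, "y_pred"] for name, res in results.items()}, axis=1
    )
    y_ens = preds.mean(axis=1)  # mean-ensembled forecast, no refitting needed
    print(window, y_ens.values)
```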

fkiraly commented 1 year ago

Regarding the point on coupling:

My main question is: can we solve this without adding more args to the evaluate function?

Some thoughts: