sktime / sktime

A unified framework for machine learning with time series
https://www.sktime.net
BSD 3-Clause "New" or "Revised" License

[BUG] evaluate returns incorrect results with multiple instances #4952

Closed davidgilbertson closed 11 months ago

davidgilbertson commented 11 months ago

Describe the bug For a single fold of the passed CV, the evaluate function appears to calculate the score separately for each instance and then average those per-instance scores.

I think this is wrong: it should first concatenate the predictions from all the instances and then calculate the score.

Note that this is fine with a score like MSE, but not RMSE.
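
To make the discrepancy concrete, here's a minimal numpy sketch (toy residuals, not taken from my actual data) showing that the mean of per-instance RMSEs is not the RMSE of the pooled residuals, while the two coincide for MSE when instances have equal length:

import numpy as np

# Residuals for two equal-length instances, deliberately on different scales
errors_a = np.array([1.0, 1.0, 1.0, 1.0])
errors_b = np.array([10.0, 0.0, 0.0, 0.0])

# Per-instance RMSE, then averaged across instances
rmse_a = np.sqrt(np.mean(errors_a ** 2))   # 1.0
rmse_b = np.sqrt(np.mean(errors_b ** 2))   # 5.0
averaged = np.mean([rmse_a, rmse_b])       # 3.0

# Pooled RMSE over the concatenated residuals of both instances
pooled = np.sqrt(np.mean(np.concatenate([errors_a, errors_b]) ** 2))  # ~3.61

# For MSE the two agree here: mean(1.0, 25.0) == 13.0 == pooled MSE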

To Reproduce The below two snippets should in theory produce the same results:

Manual train/test split, fit, predict:

y_train, y_test, X_train, X_test = temporal_train_test_split(
    X=df.drop(columns=["Target"]),
    y=df[["Target"]],
    test_size=horizon,
)

pipe.fit(
    fh=fh,
    y=y_train,
    X=X_train,
)

y_pred = pipe.predict(X=X_test)

score = rmsle(y_test, y_pred)

Using evaluate and a splitter with just 1 fold:

eval_results = evaluate(
    forecaster=pipe,
    cv=ExpandingGreedySplitter(folds=1, horizon=horizon),
    X=df.drop(columns=["Target"]),
    y=df[["Target"]],
    scoring=make_forecasting_scorer(rmsle, name="RMSLE"),
)
eval_score = eval_results.test_RMSLE.mean()

And they do match if there's only one instance, but with more instances they start to diverge.

Either I'm missing an option that lets me bypass this behaviour, or this is quite troublesome, as it produces incorrect results.

Problem code is here, where it vectorizes an evaluator in the same way it vectorizes a forecaster: https://github.com/sktime/sktime/blob/f4815d7fcd15e4d8936dce770066062e36ed82a8/sktime/performance_metrics/forecasting/_classes.py#L264-L282

Versions

System:
    python: 3.10.8 (main, Oct 12 2022, 19:14:26) [GCC 9.4.0]
    executable: /home/davidg/.virtualenvs/learning/bin/python
    machine: Linux-5.15.90.1-microsoft-standard-WSL2-x86_64-with-glibc2.31

Python dependencies:
    pip: 23.2
    sktime: 0.21.0
    sklearn: 1.3.0
    skbase: 0.4.6
    numpy: 1.23.5
    scipy: 1.10.1
    pandas: 2.0.3
    matplotlib: 3.7.0
    joblib: 1.3.1
    statsmodels: 0.13.5
    numba: None
    pmdarima: 2.0.3
    tsfresh: None
    tensorflow: None
    tensorflow_probability: None
fkiraly commented 11 months ago

Hm, interesting.

To preface my reply: I think "incorrect" is always relative to an expectation - I fully agree that yours is "a" reasonable expectation of what could/should happen here. I think, though, that the status quo default is also reasonable to expect, so we now have two reasonable expectations: pooling across instances, and averaging the per-instance scores.

I have two questions, one to drill down why you have a strong preference for the pooling, and one regarding options to add the functionality.

what makes you expect pooling?

In sklearn as well as in sktime, metrics have the multioutput argument, which controls how metrics are handled across variables. The default here is uniform_average.

Question: do you equally think that univariate metrics should pool across variables, rather than average? If yes, can you elaborate? If no, what makes it different from instances?
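
For reference, a small sketch of the multioutput behaviour I mean (illustrative numbers only; sklearn's mean_squared_error supports the same argument):

import numpy as np
from sklearn.metrics import mean_squared_error

# Two variables (columns) on very different scales
y_true = np.array([[1.0, 100.0], [2.0, 110.0], [3.0, 120.0]])
y_pred = np.array([[1.5, 90.0], [2.5, 105.0], [2.5, 130.0]])

per_variable = mean_squared_error(y_true, y_pred, multioutput="raw_values")      # one MSE per column
uniform_avg = mean_squared_error(y_true, y_pred, multioutput="uniform_average")  # mean of the per-column MSEs

print(per_variable)  # [ 0.25 75.  ]
print(uniform_avg)   # 37.625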

how to address

I can think of a few ways to add the pooled option:

more questions

In the pooling option, how would you address unequal length series?

kenn289 commented 11 months ago

It seems like you've identified a potential issue in the evaluate function within the sktime library, specifically with the vectorized version of _evaluate. The function appears to calculate the score for each instance separately, which can lead to incorrect results for certain metrics like RMSE that require aggregating across all instances.

To fix this issue, you can try modifying the _evaluate_vectorized function in the sktime library to concatenate the predictions from all instances before calculating the score. Here's an example of how the function can be modified:

from sktime.utils.data_processing import concat_nested_arrays

... (other code)

def _evaluate_vectorized(self, y_true, y_pred, **kwargs):
    """Vectorized version of _evaluate.

    Runs _evaluate for all instances in y_true, y_pred,
    and returns results in a hierarchical pandas.DataFrame.

    Parameters
    ----------
    y_true : VectorizedDF
    y_pred : VectorizedDF
        non-time-like instances of y_true, y_pred must be identical
    """
    # Concatenate predictions from all instances
    y_pred_concatenated = concat_nested_arrays(y_pred)

    eval_result = y_true.vectorize_est(
        estimator=self.clone(),
        method="_evaluate",
        varname_of_self="y_true",
        args={**kwargs, "y_pred": y_pred_concatenated},
        colname_default=self.name,
    )
    # Aggregate the results across instances (e.g., take the mean for RMSE)
    eval_result = eval_result.groupby(level=0).mean()

    return eval_result

... (other code)

fkiraly commented 11 months ago

@kenn289, I hid your reply as it may be generated by a chatbot and refers to modules that don't exist. It is also not advisable to modify sktime internals locally, for reasons of stability, so your suggested workaround (even if it worked) would not be recommended.

davidgilbertson commented 11 months ago

Oh, interesting. I try not to assume "incorrect" when something doesn't meet my expectations, but I thought this was an unintentional mistake. I wasn't aware that some folks would expect to get the average of the individual instance scores.

do you equally think that univariate metrics should pool across variables, rather than average?

I don't have any opinion about pooling across multiple variables (I've never worked with multivariate problems), but I wouldn't think that pooling values of different variables (with potentially different units/scales/concepts) makes much sense.

If no, what makes it different from instances?

With one variable over multiple instances you're measuring the same thing in all cases. Temperature, web traffic, product sales, whatever. To me, looking at per-instance scores makes sense, pooling everything together and getting a score also makes sense, but aggregating once (per instance) then aggregating the aggregates (for the total) seems prone to being skewed by outliers.

A more concrete answer is that right now I'm doing a Kaggle competition that has multiple instances, and the score calculated by Kaggle is 'pooled'. Before moving to sktime, using scikit-learn, I was also getting pooled results, and after moving to sktime but still using my own scoring function, I was getting pooled results. Only when switching to evaluate did I start getting different numbers (although I see now this occurs with all built-in sktime metrics).

Just looking at the differences on some true and pred values (multi-index, multi-instance series) I have in my environment, the scores are entirely different. I suspect some users would be surprised by this. It wasn't clear to me from the docs that the sktime variants will give different results to the sklearn equivalents.

import functools

import numpy as np
from sklearn.metrics import mean_squared_error
from sktime.performance_metrics.forecasting import MeanSquaredError

# true, pred: multi-index (instance, time) series from my environment

def plain_rmse(a, b):
    return np.sqrt(np.mean(np.square(a - b)))

sktime_rmse = MeanSquaredError(square_root=True)
sklearn_rmse = functools.partial(mean_squared_error, squared=False)

sktime_score = sktime_rmse(true, pred)
sktime_score2 = sktime_rmse(true.reset_index(drop=True), pred.reset_index(drop=True))
plain_score = plain_rmse(true, pred)
sklearn_score = sklearn_rmse(true, pred)

print(sktime_score)   # 79.49239876952231
print(sktime_score2)  # 184.6367338511932
print(plain_score)    # 184.6367338511932
print(sklearn_score)  # 184.6367338511932

In the pooling option, how would you address unequal length series?

I'm not sure what you mean. If the data is provided as a series (with multi-index), doesn't simply comparing the values give the pooled result? Or do you mean if one instance only has two values and another has 2,000 then not pooling effectively gives more weight to the values in the small instance, which could be viewed as pooling giving more weight to instances with more data. That's perhaps relevant to consider when choosing between pooled/not pooled, but does it need to be addressed in sktime code, or am I not following your question? With that said, I've never been in a situation where my prediction horizon was different for different instances.
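
To illustrate that weighting trade-off with made-up residuals (a sketch, not from my data):

import numpy as np

# Instance A has only 2 residuals, instance B has 2,000
errors_a = np.array([10.0, 10.0])
errors_b = np.full(2000, 1.0)

rmse_a = np.sqrt(np.mean(errors_a ** 2))  # 10.0
rmse_b = np.sqrt(np.mean(errors_b ** 2))  # 1.0

# Averaging per-instance scores gives the tiny instance half the weight
per_instance_average = np.mean([rmse_a, rmse_b])                       # 5.5

# Pooling weights every observation equally, so the long instance dominates
pooled = np.sqrt(np.mean(np.concatenate([errors_a, errors_b]) ** 2))   # ~1.05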

how to address

multilevel="pooled" seems sensible.

BTW multilevel is missing from the docstring of MeanSquaredError. Also, is 'level' meant to be synonymous with 'instance'? The docs seem to flick between "level", "instance" and occasionally "hierarchy unit" (none of which are in the glossary).

fkiraly commented 11 months ago

But wouldn't think that pooling values of different variables (with potentially different units/scales/concepts) makes much sense.

Interesting - I think this argument also applies to hierarchical data. E.g., think of hierarchies where you have different products and sales per day. These will typically also live on very different scales, e.g., dishwashers (sold once in a while) vs apples (sold in large numbers).

different results to the sklearn equivalents

That's a good point - the sklearn scores ignore the index, always.

multilevel="pooled" seems sensible.

ok, let's do that.

Also, is 'level' meant to be synonymous with 'instance'? The docs seem to flick between "level", "instance" and occasionally "hierarchy unit" (none of which are in the glossary).

The most common term differs, depending on whether we are looking at the panel or the hierarchical setting. This should perhaps be in the glossary, yes. Have you seen the "time series" notebook from the last conference tutorials? Does that address the "terms" question?

fkiraly commented 11 months ago

Or do you mean if one instance only has two values and another has 2,000 then not pooling effectively gives more weight to the values in the small instance, which could be viewed as pooling giving more weight to instances with more data.

Yes

That's perhaps relevant to consider when choosing between pooled/not pooled, but does it need to be addressed in sktime code, or am I not following your question?

Hm, one could argue that this should be weighted by length. I see, though, how this is an unusual case.

fkiraly commented 11 months ago

So, @davidgilbertson, funny story.

Turns out you can already use the pooled logic by specifying multilevel="uniform_average_time", but the docstrings didn't tell the user (partly since no one has picked up the long-open "please fix metric docstrings" issue, which is still open...)

I realize multilevel="uniform_average_time" might also be an unintuitive name for this - should we change it to "pooled"?

PR that ensures we at least document and tests the status quo is here: https://github.com/sktime/sktime/pull/4960
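
For anyone finding this later, a minimal sketch of how the existing option would be used (assuming hierarchical y_true / y_pred with an (instance, time) MultiIndex):

from sktime.performance_metrics.forecasting import MeanSquaredError

# Pooled RMSE: computed once over the concatenated (instance, time) rows
pooled_rmse = MeanSquaredError(square_root=True, multilevel="uniform_average_time")

# Current default: RMSE per instance, then averaged across instances
averaged_rmse = MeanSquaredError(square_root=True, multilevel="uniform_average")

# score_pooled = pooled_rmse(y_true, y_pred)
# score_averaged = averaged_rmse(y_true, y_pred)

The same metric object can be passed to evaluate via the scoring argument, which should give the pooled per-fold scores described above.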

davidgilbertson commented 11 months ago

That is a funny story! I think pooled would be a better name; whether or not it's worth the breaking change, I'll leave to others.

Speaking of 'panel' or 'hierarchical' settings, I never quite understood the purpose of the distinction. Is there a reason there isn't just the concept of "a hierarchy of instances with one or more levels"? Or to put it another way, is there any reason that the "hierarchical" concept/logic/code would fall apart if there was only one level in the hierarchy (plus the date)?

Have you seen the "time series" notebook from the last conference tutorials? Does that address the "terms" question?

I skimmed this; is that what you're referring to? https://github.com/sktime/sktime-tutorial-pydata-london-2023/blob/main/notebooks/02_panel_tasks.ipynb. Regardless, I think I get what all the terms mean; I was just pointing out that it's confusing to have multiple words where one would do (e.g. a user might overlook a setting called 'levels' because they're looking for something about instances).

fkiraly commented 11 months ago

whether or not it's worth the breaking change I'll leave to others.

If adding it as an alias, it is not a breaking change. We can slowly deprecate the odd name then...

Speaking of 'panel' or 'hierarchical' settings, I never quite understood the purpose of the distinction. Is there a reason there isn't just the concept of "a hierarchy of instances with one or more levels"?

It is more on the mechanical level than on the scientific/conceptual.

Specifically, this is about mtype rather than scitype (if you are familiar with the internal data representation logic; if not, I'll explain below).

The issue is, there are about a dozen representations for panel data (flat / single-level hierarchies) floating around out there, including 3D numpy arrays, lists of pd.DataFrame, nested pd.DataFrame with pd.Series in it, and so on. In particular, the 3D array representation is very popular. See datatypes._panel._registry for all the machine types.

For general hierarchical data, we do not have that problem - there are fewer commonly used representations out there, and most are pd.DataFrame based afaik (with the main distinction being whether the hierarchy index is part of the main data frame or not).

So this is more of a side effect of panel and hierarchical having distinct data representations in the ecosystem - but as you say, I agree: from a scientific/abstract data type viewpoint, a panel should just be a 1-level hierarchical (and a series a 0-level hierarchical).
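
For illustration, a small sketch of the two representations using the public sktime.datatypes utilities (treat this as a sketch, not canonical usage):

import numpy as np
import pandas as pd
from sktime.datatypes import check_is_mtype, convert_to

# A flat panel: 2 instances x 3 time points, as a pd.DataFrame with an (instance, time) MultiIndex
idx = pd.MultiIndex.from_product([["a", "b"], range(3)], names=["instance", "time"])
panel_df = pd.DataFrame({"value": np.arange(6.0)}, index=idx)
print(check_is_mtype(panel_df, "pd-multiindex"))  # True - the "pd-multiindex" Panel mtype

# The same data in the popular 3D numpy representation: (instances, variables, time points)
panel_3d = convert_to(panel_df, to_type="numpy3D")
print(panel_3d.shape)  # (2, 1, 3)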

I skimmed this, is that what you're referring to?

No, this: https://github.com/sktime/sktime-tutorial-europython-2023/blob/main/notebooks/02_timeseries.ipynb

I think I get what all the terms mean, was just pointing out that it's confusing to have multiple words where one would do

Yes, I know - concrete suggestions on how to streamline this without too many breaking changes are very much appreciated.

davidgilbertson commented 11 months ago

I'm not sure I've got any concrete suggestions that aren't obvious already. But I do spend a lot of time thinking about terminology :) so here are some thoughts:

I don't have any comments on the actual code implementation, I assume having multiple scitypes for hierarchical is not even worth thinking about changing.

fkiraly commented 11 months ago

on level vs hierarchy, if these are essentially interchangeable

Hm, not 100%. "level" is the term for pd.DataFrame index "levels". Only when we use them to represent hierarchical data do they correspond to nodes or levels of a hierarchy.

It adds a layer of cognitive overhead that seems unnecessary.

I see, perhaps it's worth going through the tutorial with this in mind?

davidgilbertson commented 11 months ago

Darn it, that was a brain typo, I meant "level vs instance" (vs node)

fkiraly commented 11 months ago

Darn it, that was a brain typo, I meant "level vs instance" (vs node)

Ah, I see - yes, these live in the abstract sphere, that's a very valid point.