unit8co / darts

A python library for user-friendly forecasting and anomaly detection on time series.
https://unit8co.github.io/darts/
Apache License 2.0
7.56k stars 829 forks

[Question] Why CatBoostModel raises a CatBoostError "All train targets are equal" in the included scenarios? #2124

Closed fmerinocasallo closed 5 months ago

fmerinocasallo commented 6 months ago

I am using CatBoostModel, and the following code snippet works without any issues:

import contextlib
import datetime
import os

import pandas as pd
from darts.timeseries import TimeSeries
from darts.models import CatBoostModel

def fcast(series, future_cov, output_chunk_length, lags_future_covariates):
    SPLIT_DATE = datetime.datetime(2022, 1, 1)

    train, valid = series.split_before(pd.Timestamp(SPLIT_DATE))
    model = CatBoostModel(
        output_chunk_length=output_chunk_length,
        lags=[-1],
        lags_future_covariates=lags_future_covariates,
    )

    with open(os.devnull, "w") as devnull, contextlib.redirect_stdout(devnull):
        model.fit(series=train, future_covariates=future_cov)

    return (
        model.predict(n=12, series=train, future_covariates=future_cov)
        .pd_series().round().clip(lower=0).abs()
    )

if __name__ == "__main__":
    idx = pd.date_range("2021-01-01", periods=24, freq="MS")

    series = TimeSeries.from_series(
        pd.Series(
            [
                0., 0., 0., 0., 0., 0., 0., 0., 0., 13., 26., 32.,
                17., 12., 12., 5., 10., 18., 19., 27., 10., 9., 11., 15.,
            ],
            index=idx,
            name="sales",
        )
    )

    future_cov = TimeSeries.from_series(
        pd.Series(
            [
                0., 0., 0., 0., 0., 0., 0., 0., 0., 0.9032258, 1., 1.,
                1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
            ],
            index=idx,
            name="availability",
        )
    )

    fcast(
        series,
        future_cov,
        output_chunk_length=3,
        lags_future_covariates=[0, 1, 2],
    )

:warning: However, if I replace the original definition of series:

    series = TimeSeries.from_series(
        pd.Series(
            [
                0., 0., 0., 0., 0., 0., 0., 0., 0., 13., 26., 32.,
                17., 12., 12., 5., 10., 18., 19., 27., 10., 9., 11., 15.,
            ],
            index=idx,
            name="sales",
        )
    )

with one that adds a zero at the beginning of the series and drops the final value (15):

    series = TimeSeries.from_series(
        pd.Series(
            [
                0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 13., 26.,
                32., 17., 12., 12., 5., 10., 18., 19., 27., 10., 9., 11.,
            ],
            index=idx,
            name="sales",
        )
    )

I get the following error:

Exception has occurred: CatBoostError
/src/catboost/catboost/libs/metrics/metric.cpp:6487: All train targets are equal
  File "/home/fmerinocasallo/test_catboost_minimal.py", line 21, in fcast
    model.fit(series=train, future_covariates=future_cov)
  File "/home/fmerinocasallo/test_catboost_minimal.py", line 54, in <module>
    fcast(
_catboost.CatBoostError: /src/catboost/catboost/libs/metrics/metric.cpp:6487: All train targets are equal

The same CatBoostError is raised if I call fcast with (output_chunk_length=4 and lags_future_covariates=[0, 1, 2, 3]) instead of (output_chunk_length=3 and lags_future_covariates=[0, 1, 2]).

If this is the expected behavior, could anyone explain why such minor differences result in this error? :thinking:

PS: Please, let me know if this is not the right place to post this issue regarding CatBoost :pray:

dennisbader commented 6 months ago

Hi @fmerinocasallo and thanks for raising this issue.

I tried to reproduce it, but for me your code snippet runs fine.

So I assume this might be a catboost (version?) issue.

My env:

fmerinocasallo commented 6 months ago

Thanks once again for your reply, @dennisbader, and apologies for not including the details of my working env in my original message :sweat_smile: I was using:

After updating python and darts to match your environment (python==3.10.13 and darts==0.27.1), I am still getting the very same CatBoostError :thinking:

I have defined a new conda environment using the following environment.yml:

name: test-catboost
dependencies:
- python==3.10.13
- u8darts-all==0.27.1
- catboost==1.2

and once again I am still getting the very same CatBoostError :exploding_head:

Is there something else I can check to solve this issue?

Update: I have created a shareable Google Colab Notebook to showcase this issue and explore potential solutions.

dennisbader commented 6 months ago

Ah sorry, my bad. I thought the first code snippet was the erroneous one.

The failing example results in the Y array shown below. You can see that the first column (the model that predicts the first point in output_chunk_length) has zeros only, hence all target values for training are the same. It seems that CatBoost doesn't accept that, as it would not learn anything.

[Image: Y array for the failing example]

For the working example, all columns in the Y array have zeros, and at least one other value.

[Image: Y array for the working example]
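
The two Y arrays can be approximated without darts or CatBoost. The following is only a rough sketch: it assumes that, with lags=[-1], the first training point serves only as an input and every later window of output_chunk_length points becomes one row of Y, and that the future covariates do not restrict the usable range. The helper names target_matrix and constant_columns are hypothetical and not part of darts.

```python
import numpy as np


def target_matrix(train, output_chunk_length, max_lag=1):
    """Rows are training samples, columns are forecast steps 1..output_chunk_length."""
    values = np.asarray(train, dtype=float)
    # Assumption: the first `max_lag` points are only used as lagged inputs,
    # so they never appear as the first target of a sample.
    return np.lib.stride_tricks.sliding_window_view(values[max_lag:], output_chunk_length)


def constant_columns(y):
    """Indices of the columns of y that contain a single unique value."""
    return [j for j in range(y.shape[1]) if np.unique(y[:, j]).size == 1]


# First 12 months (the part before the 2022-01-01 split) of both series.
working_train = [0.0] * 9 + [13.0, 26.0, 32.0]
failing_train = [0.0] * 10 + [13.0, 26.0]

print(constant_columns(target_matrix(working_train, 3)))  # []
print(constant_columns(target_matrix(failing_train, 3)))  # [0]
print(constant_columns(target_matrix(failing_train, 4)))  # [0, 1]
```

Under these assumptions, only the failing series produces a constant first column, and with output_chunk_length=4 the second column is constant as well.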

fmerinocasallo commented 6 months ago

Thanks for your reply, @dennisbader :smile: Let me check that I have understood the issue, that my assumptions are correct, and that my conclusions are valid :thinking:

  1. Each row of the Y array stores the i-th sub-sample based on the output_chunk_length during the training period.
  2. CatBoost does not accept any scenario where the first column of the Y array is filled with zeros.

Therefore:

Here are some examples of the sub-samples from series1 associated with different values of input_chunk_length and output_chunk_length, and their validity based on CatBoost requirements:

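[Image: example sub-samples from series1 for different values of input_chunk_length and output_chunk_length]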

dennisbader commented 6 months ago

Hi @fmerinocasallo. I made some clarifications to what you wrote.

  1. Each row of the Y array stores the outputs/targets of the i-th sub-sample, based on output_chunk_length, during the training period. Each column stores the j-th output time step of every sub-sample i, where j runs from 1 to output_chunk_length.
  2. CatBoost does not accept any scenario where any column of the Y array has only one unique value.

CatBoost is not complaining about zeros, but about there being only a single unique value in at least one of the Y columns.

This problem should also just vanish if you increase your training set size.

We also recently added a notebook for the regression models that explains the lagged data extraction in more detail. You can find some visualizations here and here.
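
To see how much extra training data the failing series would need, here is a small sketch that searches for the shortest training prefix in which no target column is constant. It rests on the same simplifying assumption as the thread's setup (a single target lag of -1, covariates not limiting the sample range), and min_train_length is a hypothetical helper, not a darts function.

```python
import numpy as np

# The failing series from the original post (24 monthly values).
FAILING = [0.0] * 10 + [13.0, 26.0, 32.0, 17.0, 12.0, 12.0, 5.0,
                        10.0, 18.0, 19.0, 27.0, 10.0, 9.0, 11.0]


def has_constant_column(train, output_chunk_length, max_lag=1):
    """True if any forecast step sees only one unique target value."""
    values = np.asarray(train, dtype=float)[max_lag:]
    y = np.lib.stride_tricks.sliding_window_view(values, output_chunk_length)
    return any(np.unique(col).size == 1 for col in y.T)


def min_train_length(series, output_chunk_length, max_lag=1):
    """Shortest prefix of `series` with no constant target column, or None."""
    shortest = max_lag + output_chunk_length  # yields exactly one sample
    for n in range(shortest, len(series) + 1):
        if not has_constant_column(series[:n], output_chunk_length, max_lag):
            return n
    return None


print(min_train_length(FAILING, 3))  # 13
print(min_train_length(FAILING, 4))  # 14
```

If the sketch matches what darts does internally, moving the split date one month later (13 training points) would be enough for output_chunk_length=3, and two months later for output_chunk_length=4.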

fmerinocasallo commented 5 months ago

Thanks for your clarifications once again, @dennisbader! :ok_hand:

An interesting example that takes into account the requirement you mentioned (every column in the Y array must contain more than one unique value) would be one in which we replace the latest problematic definition of series:

    series = TimeSeries.from_series(
        pd.Series(
            [
                0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 13., 26.,
                32., 17., 12., 12., 5., 10., 18., 19., 27., 10., 9., 11.,
            ],
            index=idx,
            name="sales",
        )
    )

with one that replaces just the second zero of the series with a non-zero value such as 1:

    series = TimeSeries.from_series(
        pd.Series(
            [
                0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 13., 26.,
                32., 17., 12., 12., 5., 10., 18., 19., 27., 10., 9., 11.,
            ],
            index=idx,
            name="sales",
        )
    )
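
A quick way to sanity-check this modified series is to count the unique values in each target column of its 12-month training part. This is a plain-Python sketch that assumes one leading point is reserved as the lags=[-1] input and that column j of Y is simply the training series shifted by j steps; it does not call darts or CatBoost.

```python
# First 12 months of the modified series (second value changed from 0 to 1).
train = [0.0, 1.0] + [0.0] * 8 + [13.0, 26.0]
output_chunk_length = 3
max_lag = 1  # assumption: one point is consumed by lags=[-1]

# Number of (input, output) samples that fit into the training series.
n_samples = len(train) - max_lag - output_chunk_length + 1

# Column j holds the j-th forecast step of every sample.
columns = [
    train[max_lag + j : max_lag + j + n_samples]
    for j in range(output_chunk_length)
]
unique_counts = [len(set(col)) for col in columns]

print(unique_counts)  # [2, 2, 3]
```

Every column has at least two unique values here, so under these assumptions the "All train targets are equal" check should no longer trigger for this series.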