Closed fmerinocasallo closed 5 months ago
Hi @fmerinocasallo and thanks for raising this issue.
I tried to reproduce it, but for me your code snippet runs fine.
So I assume this might be a catboost (version?) issue.
My env:
Thanks once again for your reply, @dennisbader, and apologies for not including the details of my working env in my original message :sweat_smile: I was using:
After updating python and darts to match your environment (python==3.10.13 and darts==0.27.1), I am still getting the very same CatBoostError. :thinking:
I have defined a new conda environment using the following environment.yml:
name: test-catboost
dependencies:
- python==3.10.13
- u8darts-all==0.27.1
- catboost==1.2
and I am still getting the very same CatBoostError. :exploding_head:
Is there something else I can check to solve this issue?
Update: I have created a shareable Google Colab Notebook to showcase this issue and explore potential solutions.
Ah sorry, my bad. I thought the first code snippet was the erroneous one.
The failing example results in the Y array shown below:
You can see that the first column (the model that predicts the first point in output_chunk_length) has zeros only, hence all target values for training are the same. It seems like catboost doesn't accept that, as it would not learn anything.
For the working example, all columns in the Y array have zeros, and at least one other value.
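To make the column argument concrete, here is a minimal NumPy sketch of how such a target matrix can be laid out and checked. It is not the darts implementation: `target_matrix`, `constant_columns`, and the toy values are hypothetical, and it only assumes that each row holds `output_chunk_length` consecutive target values.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def target_matrix(values, output_chunk_length):
    """Stack every run of `output_chunk_length` consecutive targets as one row."""
    y = np.asarray(values, dtype=float)
    return sliding_window_view(y, output_chunk_length)


def constant_columns(Y):
    """Return the indices of the columns that hold a single unique value."""
    return [j for j in range(Y.shape[1]) if np.unique(Y[:, j]).size == 1]


# Hypothetical training targets: six leading zeros, then two sales values.
toy = [0., 0., 0., 0., 0., 0., 13., 26.]
Y = target_matrix(toy, output_chunk_length=3)

print(Y.shape)              # (6, 3)
print(constant_columns(Y))  # [0]: the first column contains zeros only
```

A column index in that list corresponds to an output step whose training targets are all identical, which is the situation described above.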
Thanks for your reply, @dennisbader :smile: Let's see if I have understood the issue, my assumptions are correct and my conclusions are valid :thinking:
Each row of the Y array stores the i-th sub-sample based on the output_chunk_length during the training period.
CatBoost raises an error if any column of the Y array is filled with zeros.
Therefore:
For series1, or any series adding zeros to the beginning of series1, CatBoost does not accept output_chunk_length > 3 but does allow any value of input_chunk_length.
Here are some examples of the sub-samples from series1 associated with different values of input_chunk_length and output_chunk_length, and their validity based on CatBoost requirements:
Hi @fmerinocasallo. I made some clarifications to what you wrote.
Each row of the Y array stores the i-th sub-sample outputs/targets based on output_chunk_length during the training period. Each column stores the j-th output time step per sub-sample i. j goes from 1 to output_chunk_length.
CatBoost is not complaining about zeros, but that there is only a single unique value in at least one of the y columns.
This problem should also just vanish if you increase your training set size.
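As a rough illustration of why a longer training set helps, the sketch below checks a few training lengths on a made-up series. The helper `has_constant_target_column` and the values are hypothetical; the only assumption is that column j of the target matrix is a contiguous slice of the training targets shifted by j steps.

```python
import numpy as np


def has_constant_target_column(values, output_chunk_length):
    """True if any output step sees only one unique target value."""
    y = np.asarray(values, dtype=float)
    n_rows = len(y) - output_chunk_length + 1
    if n_rows < 1:
        raise ValueError("series is shorter than output_chunk_length")
    # Column j of the target matrix is the slice y[j : j + n_rows].
    return any(
        np.unique(y[j:j + n_rows]).size == 1
        for j in range(output_chunk_length)
    )


# Hypothetical series: six leading zeros followed by non-zero sales.
series = [0., 0., 0., 0., 0., 0., 13., 26., 32., 17.]

for n_train in (7, 8, 9, 10):
    flag = has_constant_target_column(series[:n_train], output_chunk_length=3)
    print(n_train, flag)
# 7 True
# 8 True
# 9 False
# 10 False
```

Once the training slice is long enough for the first non-zero value to reach the first column, no column is constant any more.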
We also recently added a notebook for the regression models that explains the lagged data extraction in more detail. You can find some visualizations here and here.
Thanks for your clarifications once again, @dennisbader! :ok_hand:
An interesting example taking into account the requirement you mentioned (every column in the Y array must contain more than one unique value, which here means at least one non-zero value) would be one in which we replace the latest problematic definition of series:
series = TimeSeries.from_series(
    pd.Series(
        [
            0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 13., 26.,
            32., 17., 12., 12., 5., 10., 18., 19., 27., 10., 9., 11.,
        ],
        index=idx,
        name="sales",
    )
)
with one replacing just the second zero of the series by a non-zero value such as 1:
series = TimeSeries.from_series(
    pd.Series(
        [
            0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 13., 26.,
            32., 17., 12., 12., 5., 10., 18., 19., 27., 10., 9., 11.,
        ],
        index=idx,
        name="sales",
    )
)
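For what it is worth, a quick NumPy check of the two definitions could look like the sketch below. It assumes the model is trained on the first 12 values only with output_chunk_length=4; that split is my assumption and is not shown in this thread. The helper `unique_counts_per_column` is hypothetical and does not call darts or CatBoost.

```python
import numpy as np


def unique_counts_per_column(values, output_chunk_length):
    """Number of unique target values seen by each output step."""
    y = np.asarray(values, dtype=float)
    n_rows = len(y) - output_chunk_length + 1
    # Column j of the target matrix is the slice y[j : j + n_rows].
    return [
        int(np.unique(y[j:j + n_rows]).size)
        for j in range(output_chunk_length)
    ]


# First 12 values of each definition (assumed training slice).
zeros_only = [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 13., 26.]
with_one = [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 13., 26.]

print(unique_counts_per_column(zeros_only, 4))  # [1, 1, 2, 3]
print(unique_counts_per_column(with_one, 4))    # [2, 2, 2, 3]
```

Under that assumption, the first definition leaves the first two columns with a single unique value, while the single 1 in second position is enough to give every column at least two.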
I am using CatBoostModel and the following code snippet does work without any issues:
:warning: However, if I replace the original definition of series:
with one including an additional zero at the beginning of the series and without the last 15:
I get the following error:
The same CatBoostError is raised if I call fcast with (output_chunk_length=4 and lags_future_covariates=[0, 1, 2, 3]) instead of (output_chunk_length=3 and lags_future_covariates=[0, 1, 2]).
If this is the expected behavior, could anyone explain why such minor differences result in this error? :thinking:
PS: Please, let me know if this is not the right place to post this issue regarding CatBoost :pray: