sb-ai-lab / LightAutoML

Fast and customizable framework for automatic ML model creation (AutoML)
https://developers.sber.ru/portal/products/lightautoml
Apache License 2.0
1.08k stars, 47 forks

How to use TimeSeriesIterator? #152

Closed fingoldo closed 1 month ago

fingoldo commented 1 month ago

Question

My dataset is ordered by time, and the usual K-fold cross-validation results in poor test performance. How do I use time-series-based cross-validation? I noticed there is a TimeSeriesIterator in LAMA, but I could not find an example of using it anywhere.
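As background for the question, the idea behind time-ordered validation can be sketched in plain Python. This is a generic expanding-window splitter written for illustration only; the function name `expanding_window_splits` is hypothetical, and the code is not LightAutoML's TimeSeriesIterator and may differ from how that class forms its folds.

```python
from typing import Iterator, List, Sequence, Tuple


def expanding_window_splits(
    timestamps: Sequence, n_splits: int
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_idx, valid_idx) pairs of row positions.

    Rows are ordered by timestamp and cut into n_splits + 1 blocks.
    Fold k trains on the first k blocks and validates on block k + 1,
    so validation rows are never earlier than training rows.
    """
    if n_splits < 1:
        raise ValueError("n_splits must be at least 1")
    # Row positions sorted from oldest to newest.
    order = sorted(range(len(timestamps)), key=lambda i: timestamps[i])
    n_blocks = n_splits + 1
    if len(order) < n_blocks:
        raise ValueError("not enough rows for the requested number of splits")
    block = len(order) // n_blocks
    for k in range(1, n_blocks):
        train_idx = order[: k * block]
        # The last fold absorbs any remainder rows.
        end = (k + 1) * block if k < n_splits else len(order)
        valid_idx = order[k * block : end]
        yield train_idx, valid_idx
```

Unlike shuffled K-fold, no fold here is trained on rows that are later than the rows it is scored on, which is usually why time-ordered data needs a dedicated splitter.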

I tried


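import lightautoml.validation.np_iterators
from lightautoml.automl.presets.tabular_presets import TabularAutoML
from lightautoml.tasks import Task

clf = TabularAutoML(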
    task=Task("binary", loss="logloss", metric="auc"),
    timeout=60 * 60 * 3,
    memory_limit=90,
    cpu_limit=16,
    reader_params=dict(
        n_jobs=1,
        cv=lightautoml.validation.np_iterators.TimeSeriesIterator(
            datetime_col=df.loc[train_idx, "date"],
            n_splits=3,
        ),
    ),
)

oof_preds = clf.fit_predict(df.loc[train_idx], roles={"target": "target"}, verbose=3)

but get

File R:\ProgramData\anaconda3\Lib\site-packages\lightautoml\transformers\categorical.py:445, in TargetEncoder.fit_transform(self, dataset)
    442 f_sum = np.zeros(n_folds, dtype=np.float64)
    443 f_count = np.zeros(n_folds, dtype=np.float64)
--> 445 np.add.at(f_sum, folds, target)
    446 np.add.at(f_count, folds, 1)
    448 folds_prior = (f_sum.sum() - f_sum) / (f_count.sum() - f_count)

ValueError: array is not broadcastable to correct shape

What's the correct way of using TimeSeriesIterator?

alexmryzhkov commented 1 month ago

Hi @fingoldo,

We have a demo example that shows how to use it correctly.

Alex

fingoldo commented 1 month ago

Thanks a lot, Alexander! Passing it as the cv_iter parameter to fit_predict did the trick. But the reader object also has a cv parameter. How do the reader's cv and the cv_iter of TabularAutoML itself interact? It's not clear from the docs. Maybe the documentation of TimeSeriesIterator could be improved by at least referencing the demo example script? It's really hard to find.

alexmryzhkov commented 1 month ago

@fingoldo to figure out what cv and any other parameter mean during TabularAutoML preset creation, please check our fully commented YAML config, which helps LightAutoML figure out what to do.

To be clear, cv param means the number of folds for cross-validation.

And yes, you are right that our documentation is not 100% clear; we are working on it.

Alex

fingoldo commented 1 month ago

Mm, I'm still confused.

"To be clear, cv param means the number of folds for cross-validation."

But what happens when I specify both cv=3 for the reader and pass cv_iter=TimeSeriesIterator(datetime_col=df.loc[train_idx, "date"], n_splits=5) to fit_predict?

Which of the splitters will be used: K-fold with 3 splits or TimeSeries with 5 splits?

alexmryzhkov commented 1 month ago

@fingoldo cv_iter (if explicitly specified) should override the cv param.
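
For readers landing here, the resolution of this thread can be sketched as follows. This is a minimal sketch based only on what the thread reports: the reader's cv is a plain fold count, and the TimeSeriesIterator is passed as cv_iter to fit_predict. The wrapper function `fit_with_time_series_cv` and its argument names are hypothetical, and the sketch assumes a pandas DataFrame whose date column is in a form TimeSeriesIterator accepts.

```python
def fit_with_time_series_cv(train_df, date_col="date", target_col="target", n_splits=5):
    """Fit TabularAutoML using time-ordered folds instead of the default split.

    train_df is assumed to be a pandas DataFrame that contains
    both date_col and target_col.
    """
    # Imported inside the function so this sketch can be loaded
    # even where LightAutoML is not installed.
    from lightautoml.automl.presets.tabular_presets import TabularAutoML
    from lightautoml.tasks import Task
    from lightautoml.validation.np_iterators import TimeSeriesIterator

    automl = TabularAutoML(
        task=Task("binary", loss="logloss", metric="auc"),
        timeout=60 * 60 * 3,
        # Per the thread, the reader's cv is a number of folds,
        # not an iterator object.
        reader_params={"n_jobs": 1, "cv": 3},
    )

    # The custom splitter is built from the same rows that are passed
    # to fit_predict, so the dates line up with the training data.
    cv_iter = TimeSeriesIterator(datetime_col=train_df[date_col], n_splits=n_splits)

    # Per the maintainer's answer, an explicit cv_iter should take
    # precedence over the reader's cv setting.
    oof_preds = automl.fit_predict(
        train_df,
        roles={"target": target_col},
        cv_iter=cv_iter,
        verbose=3,
    )
    return automl, oof_preds
```

With the variables from the original question, a call would look like `fit_with_time_series_cv(df.loc[train_idx])`.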