Closed Tint0ri closed 2 years ago
I believe the root cause is the same as #412. I suggest the same workaround until the issue is fixed. @int-chaos
I had the #412 issue before, but I can bypass it by changing period from 12 to a smaller number, such as 10 as above. My code finishes if time_budget is a small number, but if I set time_budget greater than 2000, the kernel crashes. I assume #424 is a memory management problem, such as memory not being released in time, because the Colab log shows a tcmalloc allocation failure.
To address this memory issue, we will set a limit on the upper bound of `lags`. What exactly the limit should be is still to be determined; if you have any suggestions, please let me know. Below is an explanation of what is happening and why this error occurs.
Currently our upper bound for the `lags` hyperparameter depends on data_size and period: the upper bound equals data_size - period. It seems like you have a really large training set, with a data size of 69990 timestamps, and you're forecasting a period of 10. With a split of 5, this means the upper bounds of `lags` are 69970, 69960, 69950, 69940, and 69930 for your five validation sets. When you increase the time_budget to be greater than 2000, the search explores `lags` values that approach the upper limit, and it eventually reaches a point where the number of lag features is too great for the memory.
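To make the numbers above concrete, here is a small sketch in plain Python (illustrative only, not FLAML's actual code) of how the per-fold upper bounds come out, plus a rough estimate of why memory blows up near that limit:

```python
# Illustrative sketch based on the explanation above:
# each of the 5 validation folds holds out one more forecast horizon,
# and the lags upper bound for a fold is its training size minus period.
data_size = 69990   # rows passed to fit() (70000 minus the 10-step horizon)
period = 10         # forecast horizon
n_splits = 5        # cross-validation folds

upper_bounds = []
for k in range(1, n_splits + 1):
    train_size = data_size - period * k   # fold k trains on all but k horizons
    upper_bounds.append(train_size - period)

print(upper_bounds)  # [69970, 69960, 69950, 69940, 69930]

# Rough, assumed estimate of the memory cost near the upper bound:
# a dense float64 matrix with ~69930 lag columns over ~69930 rows.
rows = cols = upper_bounds[-1]
approx_gib = rows * cols * 8 / 2**30
print(round(approx_gib))  # roughly 36 GiB, far beyond 12-16 GB of RAM
```

This is why the crash appears only once the search has had enough time (time_budget > 2000) to reach large `lags` values.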
And that's why the same workaround in https://github.com/microsoft/FLAML/issues/412#issuecomment-1014920806 would avoid this issue.
Yes, that would, since those learners ['prophet', 'arima', 'sarimax'] do not handle lags the same way the sklearn regressors do; they tune model orders such as the d parameter rather than a `lags` hyperparameter.

Got it, thanks for clarifying. For the upper bound of lags, how about setting it to the square root of data_size?
Thank you for the suggestion. I believe it is a reasonable limit as well, as it covers both cases of the KPSS test for finding the optimal lag value, and the properties of the square-root function allow it to work well as a limit.
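A minimal sketch of the proposed cap (the function name and signature here are assumptions for illustration, not FLAML's actual API):

```python
import math

def lags_upper_bound(data_size: int, period: int) -> int:
    """Hypothetical helper: cap the lags search space at sqrt(data_size)."""
    # keep the existing data_size - period bound for tiny datasets,
    # and the square-root cap otherwise
    return min(data_size - period, int(math.sqrt(data_size)))

print(lags_upper_bound(69990, 10))  # 264 instead of 69980
```

With the dataset from this issue, the bound drops from 69980 to 264 lag features, which comfortably fits in memory.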
For me, something to keep in mind: the problem might still exist with large datasets, a short horizon, and cross-validation splits, since the data_size passed into the search space is the size of X_train, not of the validation set.
I tried running simple ts_forecast code on Google Colab (12 GB RAM) and Amazon SageMaker (16 GB RAM). In both environments the Jupyter kernel crashed and restarted after a while.
Sample code:

```python
import numpy as np
import pandas as pd
from flaml import AutoML

X_train = pd.date_range('2017-01-01', periods=70000, freq='T')
y_train = pd.DataFrame(np.random.randint(6500, 7500, 70000))
automl = AutoML()
automl.fit(
    X_train=X_train[:-10].values,  # a single column of timestamps
    y_train=y_train[:-10].values,  # value for each timestamp
    period=10,  # time horizon to forecast, e.g., 12 months
    task='ts_forecast',
    time_budget=8000,  # time budget in seconds
)
```
There was no error output, but the kernel restarted and the training process stopped. It may be caused by a memory management issue.