KoustavDS commented 1 year ago

Description:

I am trying to run a global model (NBEATS/TFT) on 5000 time series with 365 timestamps each. The code below creates the data. I am running this on 2 different GCP instances.

Instance 1 configuration : 16 vCPUs, 60 GB RAM, NVIDIA T4 (1 GPU).

I am getting an "out of memory" error (copied below). As suggested in the error message, I tried setting max_split_size_mb to 128, 256, etc., but it did not resolve the problem.

OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 14.58 GiB total capacity; 14.36 GiB already allocated; 1.31 MiB free; 14.44 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
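For readers hitting the same message, here is a minimal sketch of how the allocator option is usually set and how to read the numbers in the error. It assumes the variable is set before the first CUDA allocation (setting it before importing torch is the safe choice); the helper name `fragmentation_gap_gib` is made up for illustration.

```python
import os

# Assumption: this runs before torch makes its first CUDA allocation,
# for example at the very top of the script.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"


def fragmentation_gap_gib(reserved_gib, allocated_gib):
    """Memory PyTorch has reserved but is not actively using (hypothetical helper)."""
    return reserved_gib - allocated_gib


# Numbers copied from the error message above.
gap = fragmentation_gap_gib(reserved_gib=14.44, allocated_gib=14.36)
print(f"reserved - allocated = {gap:.2f} GiB")
```

The hint in the error applies when reserved memory is much larger than allocated memory. Here the two are almost equal, which suggests the GPU is genuinely full rather than fragmented, so tuning max_split_size_mb is unlikely to help.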

Instance 2 configuration : 16 vCPUs, 60 GB RAM, NVIDIA V100 (4 GPU).

It does not throw an "out of memory" error, but gives an error related to PyTorch Lightning GPU allocation (copied below).

RuntimeError: Lightning can't create new processes if CUDA is already initialized. Did you manually call `torch.cuda.` functions, have moved the model to the device, or allocated memory on the GPU any other way? Please remove any such calls, or change the selected strategy. You will have to restart the Python kernel.

Code to reproduce the data :

import pandas as pd
import numpy as np
from darts import TimeSeries
from darts.dataprocessing import Pipeline
from darts.dataprocessing.transformers import MissingValuesFiller, Scaler
from darts.metrics import mape, smape, rmse
from darts.utils.statistics import check_seasonality, plot_acf, plot_residuals_analysis
from darts.utils.timeseries_generation import linear_timeseries
from darts.datasets import MonthlyMilkDataset, MonthlyMilkIncompleteDataset
from darts.models import NBEATSModel

# from statsmodels.tools.eval_measures import rmse

from sklearn.preprocessing import MaxAbsScaler

# 365 daily rows x 5000 columns of random integers
new_arry = np.array(np.random.randint(1000, size=(365, 5000)))
new_data = pd.DataFrame(new_arry)
new_data.columns = ['col' + str(i) for i in new_data.columns]
new_data['col_dt'] = pd.date_range(start='1/1/2022', periods=len(new_data), freq='D')

# one multivariate target series with 5000 components
pvt_samp2 = TimeSeries.from_dataframe(new_data, 'col_dt', new_data.columns[:5000].tolist())

filler = MissingValuesFiller(fill='auto')
pvt_samp2 = filler.transform(pvt_samp2)

transformer = Scaler(scaler=MaxAbsScaler())
pvt_samp2 = transformer.fit_transform(pvt_samp2)

# calendar covariates (day of month, month)
new_cov = new_data.copy()
new_cov['day'] = new_cov.col_dt.dt.day
new_cov['month'] = new_cov.col_dt.dt.month

new_cov = new_cov[['col_dt', 'day', 'month']]
new_cov = TimeSeries.from_dataframe(new_cov, 'col_dt', new_cov.columns[1:].tolist())
scaler_dt_cov = Scaler()
final_cov = scaler_dt_cov.fit_transform(new_cov)

train, val = pvt_samp2.split_after(pd.Timestamp("20221221"))
train_cov, val_cov = final_cov.split_after(pd.Timestamp("20221221"))

from darts.models import TFTModel
import torch

my_model = TFTModel(
    input_chunk_length=90,
    output_chunk_length=10,
    hidden_size=32,
    lstm_layers=1,
    num_attention_heads=3,
    dropout=0.2,
    batch_size=300,
    n_epochs=4,
    add_relative_index=False,
    add_encoders=None,
    likelihood=None,
    # likelihood=QuantileRegression(
    #     quantiles=quantiles
    # ),  # QuantileRegression is set per default
    loss_fn=torch.nn.MSELoss(),
    random_state=42,
)

my_model.fit(train, future_covariates=final_cov, verbose=True)

Expected behavior

How can I resolve these errors? Am I passing too much data for 1 GPU to process? When I add more GPUs instead, I get the PyTorch Lightning error. How can I solve this? Are there any suggestions for scaling from 5000 to 10000 time series in one model?

System (please complete the following information):

Python version: 3.10
Darts version: 0.25.0

KoustavDS commented 1 year ago

Hi Unit8 team, I would appreciate your help in solving the above issue.

dennisbader commented 1 year ago

Hi @KoustavDS and sorry for the late response. It looks like you created a multivariate target series (the series you want to make forecasts for) with 5000 components (columns).

This will end up creating a multivariate TFTModel with 5000 output dimensions - a huge model which is likely why you end up running into memory issues.

I believe what you want to do is create a univariate TFTModel (one output dimension) and train it on 5000 univariate (one column) target series. For this you just have to create a list of time series with one column each, and then feed this list to model.fit() and predict().

You can read more about the difference between multivariate and multiple series in this guide. And here is an example of forecasting with multiple series.
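As a rough sketch of the list-based approach described above: the code below splits a wide DataFrame (one column per series, plus a time column, as in the issue) into a list of univariate series and fits one model on the list. The helper names `value_columns`, `to_univariate_list` and `fit_global_model` are made up for illustration, and the hyperparameters are copied loosely from the issue rather than tuned.

```python
def value_columns(columns, time_col):
    """Every column name except the time column."""
    return [c for c in columns if c != time_col]


def to_univariate_list(df, time_col):
    """One univariate TimeSeries per value column of a wide DataFrame."""
    from darts import TimeSeries  # imported lazily; requires Darts

    return [
        TimeSeries.from_dataframe(df, time_col, col)
        for col in value_columns(df.columns, time_col)
    ]


def fit_global_model(series_list, covariates_list):
    """Fit a single univariate TFTModel on many series.

    covariates_list must hold one future-covariate series per target
    series (the same length as series_list).
    """
    from darts.models import TFTModel

    model = TFTModel(
        input_chunk_length=90,
        output_chunk_length=10,
        hidden_size=32,
        num_attention_heads=4,
        batch_size=300,
        n_epochs=4,
        random_state=42,
    )
    model.fit(series_list, future_covariates=covariates_list, verbose=True)
    return model
```

With the data from the issue this would be called as `to_univariate_list(new_data, "col_dt")`. Forecasts for all series can then be requested by passing the same lists to `model.predict(n=10, series=series_list, future_covariates=covariates_list)`, provided the covariates extend far enough into the future.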

KoustavDS commented 1 year ago

I will try this approach and get back to you. Please let me know: if I want to add an exogenous feature (covariate) such as weekday, do I need to create one covariate series for each time series and pass them as a list? In that case I would need to pass 5000 covariate series.

dennisbader commented 1 year ago

Yes, that's correct. One covariate series per target series. The covariates themselves can be multivariate (e.g. each of the 5000 covariate series can have multiple columns/features such as weekday, month, …).
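A minimal sketch of that pairing, assuming all target series share the same daily time index so that one calendar covariate series can be reused for every target. The helper names `calendar_covariates` and `one_covariate_per_target` are hypothetical.

```python
def calendar_covariates(series):
    """Multivariate covariate series (weekday, month) on the time axis of `series`."""
    from darts.utils.timeseries_generation import datetime_attribute_timeseries

    weekday = datetime_attribute_timeseries(series, attribute="weekday")
    month = datetime_attribute_timeseries(series, attribute="month")
    # stack() joins the two single-column series into one two-column series
    return weekday.stack(month)


def one_covariate_per_target(targets, shared_covariate):
    """Repeat one covariate series so the list lines up with the targets."""
    return [shared_covariate for _ in targets]
```

If the targets had different time ranges, `calendar_covariates` would instead be called once per target. Either way the two lists passed to `fit()` need the same length and order.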

KoustavDS commented 1 year ago

Thank you. It would also be great if you could explain the difference between running a set of 5K series in one dataframe (multivariate) and running 5K individual series in NBEATS or TFT.

What I understood is that the model breaks the data up according to input_chunk_length and output_chunk_length, creates multiple samples out of each single series, and then generalizes across all 5K. In both cases we get 5K outputs as predictions. So how do these 2 ways differ in terms of the algorithm and of memory usage?

I would also appreciate help with the second error I mentioned above, which occurs when using multiple GPUs.

dennisbader commented 1 year ago

Let's just look at a simplified example, and ignore the time dimension of the Darts model. Let's say an input batch to a model is a tensor with shape (batch size, number of features).

Let's say in the multivariate model our batch size is 100, and our multivariate target series has 5000 columns/features. Our input batch will have shape (100, 5000). All layers in the model that depend on the number of input features will also get bigger -> the model size increases.
In the univariate model our batch size is 100 and we have 5000 univariate series with 1 column/feature. Our input batch will have shape (100, 1). The model size stays small because we only have one feature. The only thing that increases is the number of batches that we have to go through -> no memory issues.
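To put rough numbers on this, here is a toy calculation. It assumes a deliberately simplified model with just two fully connected layers (features to hidden, hidden to features) and a hidden size of 32; this is not the actual TFT or N-BEATS architecture, only an illustration of how layer sizes scale with the number of features.

```python
def linear_params(in_features: int, out_features: int) -> int:
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features


def toy_model_params(n_features: int, hidden: int = 32) -> int:
    """Input projection (features -> hidden) plus output projection (hidden -> features)."""
    return linear_params(n_features, hidden) + linear_params(hidden, n_features)


multivariate = toy_model_params(n_features=5000)  # one series with 5000 columns
univariate = toy_model_params(n_features=1)       # 5000 series with 1 column each

print(multivariate)  # 325032
print(univariate)    # 97
```

Even in this toy setting the multivariate variant has several thousand times more parameters, while the univariate variant only sees more batches.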

For the multi GPU question:

Try following this guide
Some users face another issue with multi-GPU support, see this issue

KoustavDS commented 1 year ago

I have tried passing the data as a list and it is working. Thank you @dennisbader.

dennisbader commented 1 year ago

No worries @KoustavDS, did it also work in the Multi-GPU scenario?

It would be helpful to confirm that the issue I mentioned above is not affecting all users.

KoustavDS commented 1 year ago

Hi @dennisbader, yes, I was able to solve the multi-GPU scenario as well. It has one limitation: it only works when the code is run from a terminal as a .py script. It does not run from a notebook.
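A sketch of what such a script might look like, under the assumption that GPU selection is passed to PyTorch Lightning through the model's `pl_trainer_kwargs` argument. The helper `gpu_trainer_kwargs`, the `main` signature and the file name `train.py` are hypothetical, and the exact setup used in this thread is not shown here.

```python
def gpu_trainer_kwargs(devices=-1):
    """Keyword arguments forwarded to the PyTorch Lightning Trainer.

    devices=-1 asks Lightning for every visible GPU; a list such as
    [0, 1] selects specific ones.
    """
    return {"accelerator": "gpu", "devices": devices}


def main(train_series, future_covariates):
    # Darts is imported here to keep the module import lightweight.
    from darts.models import TFTModel

    model = TFTModel(
        input_chunk_length=90,
        output_chunk_length=10,
        batch_size=300,
        n_epochs=4,
        random_state=42,
        pl_trainer_kwargs=gpu_trainer_kwargs(),
    )
    model.fit(train_series, future_covariates=future_covariates, verbose=True)
    return model


# In train.py (hypothetical file name), finish with the usual guard and
# run `python train.py` from a terminal:
#
# if __name__ == "__main__":
#     main(train_series, future_covariates)
```

Keeping the training code behind a `main()` function and a `__main__` guard is the commonly recommended layout for multi-process training, since each worker process may re-import the script.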

I was also trying to add holiday and weekday as covariates. I am able to add weekday but not holiday. Here are the code and the error. If you know about this error, please let me know. Thanks.

CODE:
cov_month = datetime_attribute_timeseries(new_series1, attribute="month")
cov_day = datetime_attribute_timeseries(new_series1, attribute="day")
cov_holiday = holidays_timeseries(new_series1, country_code='US')

ERROR:
    573 time_index = _extend_time_index_until(time_index, until, add_length)
--> 574 scope = range(time_index[0].year, (time_index[-1] + pd.Timedelta(days=1)).year)
    575 country_holidays = holidays.country_holidays(
    576     country_code, prov=prov, state=state, years=scope
    577 )
    578 index_series = pd.Series(time_index, index=time_index)

AttributeError: 'TimeSeries' object has no attribute 'year'
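One reading of the traceback is that, in this Darts version, `holidays_timeseries` indexes its first argument directly (`time_index[0].year`), which works for a pandas DatetimeIndex but not for a TimeSeries. A possible workaround, offered as an untested assumption rather than a confirmed fix, is to pass the series' `time_index` instead. The helper names below are made up for illustration.

```python
def as_time_index(series_or_index):
    """Return the .time_index of a TimeSeries-like object, or the input unchanged."""
    return getattr(series_or_index, "time_index", series_or_index)


def holiday_covariate(series_or_index, country_code="US"):
    """Binary holiday series built from a plain time index (assumed workaround)."""
    from darts.utils.timeseries_generation import holidays_timeseries

    return holidays_timeseries(as_time_index(series_or_index), country_code=country_code)
```

With the variables from the snippet above this would be called as `holiday_covariate(new_series1, country_code="US")`.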

dennisbader commented 1 year ago

Good to hear that multi-GPU worked, @KoustavDS.

Can you open a new issue for the holidays, so I can close this one?

Also, try to give a minimal reproducible example for new_series1, thanks!

KoustavDS commented 1 year ago

Sure...I created a new issue. https://github.com/unit8co/darts/issues/2022#issue-1931192569

unit8co / darts

[BUG] #2010
