jdb78 / pytorch-forecasting

Time series forecasting with PyTorch
https://pytorch-forecasting.readthedocs.io/
MIT License
3.77k stars · 600 forks

Reduction to practice of N-BEATS #162

Closed · lihub closed this 3 years ago

lihub commented 3 years ago

First of all, amazing package! Looks like tons of work. Thanks so much.

I have spent a few days now getting to know it better, reading the documentation, and also running in debug to better understand what is going on.

I am trying to run N-BEATS on new data, and have a few questions regarding it:

1.a. Once I have a pre-trained model, what is the simplest way to predict on new data?
1.b. What is the minimal length of the data required?
1.c. What would be the behavior if the data is longer? Will the model "cut" just the required last samples and use them to predict?

2. I need to train my model on several series of different lengths. Building on top of the N-BEATS tutorial, I wrote a modified data generator in which I replaced the end of `generate_ar_data`:

```python
# insert into dataframe
data = (
    pd.DataFrame(series)
    .stack()
    .reset_index()
    .rename(columns={"level_0": "series", "level_1": "time_idx", 0: "value"})
)
```

with the following:

```python
# convert to dataframe, where the various series have different lengths
data = pd.DataFrame()
for k in range(series.shape[0]):
    truncate = np.random.randint(0, 20)
    if truncate > 0:
        truncated_data = series[k, :-truncate]
    else:
        truncated_data = series[k, :]
    new_df = pd.DataFrame({"series": k, "time_idx": np.arange(len(truncated_data)), "value": truncated_data})
    data = pd.concat([data, new_df], axis=0)
data.reset_index(drop=True, inplace=True)
return data
```

I executed `synthetic_data_tutorial` and got the following exception: `ValueError: Min encoder length and/or min prediction length is too large for 8 series/group`.

After some digging, I found that it crashed in: `validation = TimeSeriesDataSet.from_dataset(training, data, min_prediction_idx=training_cutoff + 1)`

Digging deeper, I found that when the series (i.e. the different group IDs) have different lengths, the longest one determines the parameters that influence the sequence length, start, and end (in the `df_index`).

Consequently, after the "filter too short sequences" section, some group IDs are filtered out even though they are long enough for prediction.

I managed to circumvent this issue by making the following changes. In the `__init__()` of `timeseries.py`, I replaced:

```python
if min_prediction_idx is not None:
    # before my fix, this was the only line in the clause
    data = data[lambda x: data[self.time_idx] >= self.min_prediction_idx - self.max_encoder_length]
```

with:

```python
delta_per_group = (
    data.groupby(self.group_ids)["time_idx"].max().max()
    - data.groupby('series')["time_idx"].max()
)
inds_to_keep = np.zeros(shape=(data.shape[0],)).astype(bool)
for k in delta_per_group.index:
    inds_to_keep = np.logical_or(
        inds_to_keep,
        np.logical_and(
            data[self.time_idx] >= self.min_prediction_idx - self.max_encoder_length - delta_per_group[k],
            np.squeeze(data[self.group_ids] == k),
        ),
    )
data = data[inds_to_keep]
```

and in `_construct_index()`, in the "# filter too short sequences" section, I replaced:

```python
(x["sequence_length"] + x["time"] >= self.min_prediction_idx + self.min_prediction_length)
```

with:

```python
(x["sequence_length"] + x["time"] >= self.min_prediction_length + self.min_prediction_idx - (df_index["time_last"].max() - df_index["time_last"]))
```

`df_index` now looks exactly as I thought it should, and indeed it passed:

```python
train_dataloader = training.to_dataloader(train=True, batch_size=batch_size, num_workers=0)
val_dataloader = validation.to_dataloader(train=False, batch_size=batch_size, num_workers=0)
```

but it crashed on the next line:

```
Traceback (most recent call last):
  File "D:\Users\Lihu\Dropbox\Projects\Maytronics\pytorch_prediction\venv\lib\site-packages\IPython\core\interactiveshell.py", line 2731, in safe_execfile
    self.compile if shell_futures else None)
  File "D:\pytorch_prediction\venv\lib\site-packages\IPython\utils\py3compat.py", line 168, in execfile
    exec(compiler(f.read(), fname, 'exec'), glob, loc)
  File "D:\pytorch_prediction\synthetic_data_tutorial.py", line 63, in <module>
    actuals = torch.cat([y for x, y in iter(val_dataloader)])
  File "D:\pytorch_prediction\synthetic_data_tutorial.py", line 63, in <listcomp>
    actuals = torch.cat([y for x, y in iter(val_dataloader)])
  File "D:\pytorch_prediction\venv\lib\site-packages\torch\utils\data\dataloader.py", line 363, in __next__
    data = self._next_data()
  File "D:\pytorch_prediction\venv\lib\site-packages\torch\utils\data\dataloader.py", line 403, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "D:\pytorch_prediction\venv\lib\site-packages\torch\utils\data\_utils\fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "D:\pytorch_prediction\venv\lib\site-packages\torch\utils\data\_utils\fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "D:\pytorch_prediction\venv\lib\site-packages\pytorch_forecasting\data\timeseries.py", line 932, in __getitem__
    ), "Decoder length should be at least minimum prediction length"
AssertionError: Decoder length should be at least minimum prediction length
```

I'd definitely appreciate some help from someone who knows the code much better than me :) Thanks, Lihu

jdb78 commented 3 years ago

> 1.a. Once I have a pre-trained model, what is the simplest way to predict on new data?

You can use the model's predict() method and pass a dataloader with the new data.
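
For instance (a minimal sketch; `best_model_path` and `new_data` are placeholders, and `training` is the `TimeSeriesDataSet` the model was fitted on):

```python
# sketch only: load a trained N-BEATS checkpoint and predict on new data
from pytorch_forecasting import NBeats, TimeSeriesDataSet

best_model = NBeats.load_from_checkpoint(best_model_path)

# reuse the training dataset's configuration (incl. fitted scalers) on the new data
test_dataset = TimeSeriesDataSet.from_dataset(
    training, new_data, predict=True, stop_randomization=True
)
test_dataloader = test_dataset.to_dataloader(train=False, batch_size=64, num_workers=0)

predictions = best_model.predict(test_dataloader)
```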

> 1.b. What is the minimal length of the data required?

This depends on the data, but for N-BEATS the technical minimum is two observations (min_encoder_length=1, min_prediction_length=1). In practice, you probably want more than 1/10 of the data in the encoder and half of that for decoding. What works particularly well with N-BEATS is creating an ensemble of models with different encoder lengths. You would have to do this manually, as it is not directly implemented in PyTorch Forecasting (see the sketch below).
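
Something along these lines (a rough sketch, not library functionality; `trained_models` is assumed to be a list of NBeats models fitted with different encoder lengths):

```python
import torch

# average the point forecasts of several independently trained N-BEATS models;
# `trained_models` and `val_dataloader` are assumed to exist already
predictions = [model.predict(val_dataloader) for model in trained_models]
ensemble_prediction = torch.stack(predictions).mean(dim=0)
```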

> 1.c. What would be the behavior if the data is longer? Will the model "cut" just the required last samples and use them to predict?

Yes. The TimeSeriesDataSet will cut the data to max_encoder_length + max_prediction_length. Do not set these to unreasonably long values, as performance on the test set will suffer if the lengths differ from those used in training.

> I need to train my model based on several series of different lengths. Building on top of the N-BEATS tutorial, I wrote a modified data generator in which I replaced the end of generate_ar_data:
>
> `truncate = np.random.randint(0, 20)`

This is pretty short for a sequence with seasonality. I wonder if NBEATS will pick it up.

> AssertionError: Decoder length should be at least minimum prediction length

The sequence length has to be at least min_encoder_length + min_prediction_length. The min_prediction_idx should not be the issue, as by default it is set to the minimum time index in the entire dataset. You can also pass it manually to the TimeSeriesDataSet constructor and set it to a number of your choosing.
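
For reference, the relevant constructor arguments look roughly like this (a sketch along the lines of the N-BEATS tutorial; the length values are illustrative only):

```python
# sketch: lengths and min_prediction_idx set explicitly on the training dataset
training = TimeSeriesDataSet(
    data[lambda x: x.time_idx <= training_cutoff],
    time_idx="time_idx",
    target="value",
    group_ids=["series"],
    min_encoder_length=1,                 # the technical minimum discussed above
    max_encoder_length=context_length,
    min_prediction_length=1,
    max_prediction_length=prediction_length,
    time_varying_unknown_reals=["value"],
    min_prediction_idx=0,                 # optional: override the default manually
)
```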

I hope this helps. Let me know if these pointers solve your problem.

lihub commented 3 years ago

Hi Jan, Thanks for getting back! A few follow up questions:

> 1.a. Once I have a pre-trained model, what is the simplest way to predict on new data?
>
> You can use the model's predict() method and pass a dataloader with the new data.

Could you share a short snippet showing how exactly a "test" dataloader (not related to or extracted from a "training" dataset) should be created?

> 1.c. truncate = np.random.randint(0, 20)
>
> This is pretty short for a sequence with seasonality. I wonder if NBEATS will pick it up.

The truncation I introduced just removes the last few samples from the generated series, to artificially make them differ in length (my actual data contain such series). Every series is still hundreds of samples long.

> The min_prediction_idx should not be the issue, as by default it is set to the minimum time index in the entire dataset. You can also pass it manually to the TimeSeriesDataSet constructor and set it to a number of your choosing.

The N-BEATS tutorial contains:

`validation = TimeSeriesDataSet.from_dataset(training, data, min_prediction_idx=training_cutoff + 1)`

where `training_cutoff = data["time_idx"].max() - max_prediction_length`. What am I missing here?

Thanks again.

jdb78 commented 3 years ago

> 1.a. Once I have a pre-trained model, what is the simplest way to predict on new data?
> You can use the model's predict() method and pass a dataloader with the new data.
> Could you share a short snippet showing how exactly a "test" dataloader (not related to or extracted from a "training" dataset) should be created?

You should always use the scalers that were created with the training dataset, because the model will only work if the new data is scaled the same way as the old data. HOWEVER, if your concern is the size of the training dataset, there is a convenient solution: you can fetch the training dataset's parameters with params = training.get_parameters() and initialize your test dataset from them with TimeSeriesDataSet.from_parameters(params, new_data, **kwargs). In kwargs you can define parameters that you want to override in the newly created dataset. Keep in mind that changing some parameters will make your dataset incompatible with the model (e.g. max_prediction_length has to stay the same for NBeats, as it defines the network architecture).
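
In code, that could look like this (a sketch; `new_data` is a placeholder for your test DataFrame):

```python
# sketch of the parameter-based construction described above
params = training.get_parameters()
test_dataset = TimeSeriesDataSet.from_parameters(
    params, new_data, predict=True, stop_randomization=True
)
test_dataloader = test_dataset.to_dataloader(train=False, batch_size=64)
```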

> 1.c. truncate = np.random.randint(0, 20)
> This is pretty short for a sequence with seasonality. I wonder if NBEATS will pick it up.
> The truncation I introduced just removes the last few samples from the generated series, to artificially make them differ in length (my actual data contain such series). Every series is still hundreds of samples long.

Sorry, I did not read your post carefully enough. This should be fine.

> The min_prediction_idx should not be the issue, as by default it is set to the minimum time index in the entire dataset. You can also pass it manually to the TimeSeriesDataSet constructor and set it to a number of your choosing.
> The N-BEATS tutorial contains: validation = TimeSeriesDataSet.from_dataset(training, data, min_prediction_idx=training_cutoff + 1) where training_cutoff = data["time_idx"].max() - max_prediction_length. What am I missing here?

min_prediction_idx is just a convenient way of making a cut through time to separate training and validation, which is what you often want. By no means do you have to use it to define your validation dataset. A more manual but completely legitimate approach is to filter the data yourself: for each time series, include only the time points that are not in the training dataset, plus the encoder length of history each series needs.
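
A sketch of that manual approach (my wording, not library functionality; it keeps, per series, the last max_encoder_length + max_prediction_length points, so each series contributes its own validation window regardless of its absolute time indices):

```python
# per-series validation cut: take the tail of each series so that even
# shorter series keep enough history for the encoder plus a prediction window
val_data = data.groupby("series", group_keys=False).apply(
    lambda s: s.iloc[-(max_encoder_length + max_prediction_length):]
)
validation = TimeSeriesDataSet.from_dataset(
    training, val_data, predict=True, stop_randomization=True
)
```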

lihub commented 3 years ago

Great! This was what I was missing. Thanks :)

dvirginz commented 2 years ago

> You would have to do this manually, as it is not directly implemented in PyTorch Forecasting.

Thank you very much for the thorough reply. You mention ensembles. Do you have a reference to a snippet for creating ensembles of N-BEATS? Thanks.