How to use group_ids correctly? (N-Beats)

LukeOliv commented 3 years ago

I've been studying this marvellous-looking library for a while and think it holds great promise. Thanks for doing this!

Even though pytorch-forecasting makes things a lot easier, I'm still in the dark on some aspects. The biggest one currently is group_ids (separating time series) and how to use them properly with unforeseen (real-world) data.

I first started testing with everything lumped into a single group_id, and it seemed quite easy to make real-world predictions by just making a new dataframe with dummy rows at the end for predictions and using the model's predict method.

However the data I use has irregular gaps.

1) I assume you are meant to group each continuous series of the same data?

I made a script to automatically create groupings for the data. Using this grouped data to train a model and create validation data for prediction tests proved to be more complex; see points 2 and 3 below.

I also ran into the thing in issue #121, so some shorter groups will get discarded.

2) When using unforeseen data, how should you group it?

If you use an unforeseen group number you will get an "Unknown category" error. Adding add_nan=True to the NaNLabelEncoder fixes that but doesn't help with the group being discarded as with issue #121 unless you make this new group big enough (would that even give predictions with so much dummy data?).

3) Why do you get a different kind of prediction tensor when using multiple group_ids?

When I use a single group id for all data I get a neat prediction with the expected number of predictions. However, when using groups I seem to get prediction data for all the groups present in the data.

I've copied and pasted the relevant parts of my code below and can share a full notebook and/or data if anyone wants it. I've tried to keep things as simple as possible to start out with.

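from datetime import datetime

import pandas as pd
import pytorch_lightning as pl
from pytorch_lightning.callbacks import EarlyStopping
from pytorch_forecasting import NBeats, TimeSeriesDataSet
from pytorch_forecasting.data import NaNLabelEncoder

max_encoder_length = 8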
max_prediction_length = 2

context_length = max_encoder_length
prediction_length = max_prediction_length

datafile_path = "localfile.csv"
headers = ['index','series','time_idx', 'date', 'value']
dtypes = {'index': 'int64', 'series': 'int64', 'time_idx': 'int64', 'date': 'str', 'value': 'float'}
parse_dates = ['date']
f = lambda s: datetime.strptime(s,'%Y-%m-%d %H:%M:%S')

sourcedata = pd.read_csv(datafile_path,sep='\t',header=None,names=headers,index_col=0,parse_dates=parse_dates, dtype=dtypes, date_parser=f)

trainingdata = sourcedata

n = 178     #How many rows to cut off for validation

allpredictiondata = trainingdata.tail(n)     #The data we would like to do validation test with

trainingdata.drop(trainingdata.tail(n).index, inplace = True)

intAppendRows = prediction_length #How many dummy rows to add

training_cutoff = trainingdata["time_idx"].max() - max_prediction_length

context_length = max_encoder_length
prediction_length = max_prediction_length

training = TimeSeriesDataSet(
    trainingdata[lambda x: x.time_idx <= training_cutoff],
    time_idx="time_idx",
    target="value",
    categorical_encoders={"series": NaNLabelEncoder().fit(trainingdata.series)},
    group_ids=["series"],
    # only unknown variable is "value" - and N-Beats can also not take any additional variables
    time_varying_unknown_reals=["value"],
    min_prediction_idx=0,
    max_encoder_length=context_length,
    max_prediction_length=prediction_length
)

validation = TimeSeriesDataSet.from_dataset(training, trainingdata, min_prediction_idx=training_cutoff+1)
batch_size = 72
train_dataloader = training.to_dataloader(train=True, batch_size=batch_size, num_workers=0)
val_dataloader = validation.to_dataloader(train=False, batch_size=batch_size, num_workers=0)

pl.seed_everything(42)
trainer = pl.Trainer(gpus=1, gradient_clip_val=0.1)
net = NBeats.from_dataset(training, learning_rate=3e-2, weight_decay=1e-2, widths=[32, 512], backcast_loss_ratio=1.0)

#find optimal learning rate
res = trainer.tuner.lr_find(net, train_dataloader=train_dataloader, val_dataloaders=val_dataloader, min_lr=1e-5)
print(f"suggested learning rate: {res.suggestion()}")
net.hparams.learning_rate = res.suggestion()

early_stop_callback = EarlyStopping(monitor="val_loss", min_delta=1e-4, patience=10, verbose=False, mode="min")

#I use a GPU for training
trainer = pl.Trainer(
    max_epochs=100,
    gpus=1,
    weights_summary="top",
    gradient_clip_val=0.1,
    callbacks=[early_stop_callback],
    limit_train_batches=30,
)

net = NBeats.from_dataset(
    training,
    learning_rate=4e-3,
    log_interval=10,
    log_val_interval=1,
    weight_decay=1e-2,
    widths=[32, 512],
    backcast_loss_ratio=1.0,
)

trainer.fit(net, train_dataloader=train_dataloader, val_dataloaders=val_dataloader)

best_model_path = trainer.checkpoint_callback.best_model_path

best_model = NBeats.load_from_checkpoint(best_model_path)

predictiondata = allpredictiondata.head(intAppendRows)

#cut piece of data for simulation purposes, in real life this and other things would be rolled forward for bigger tests
padrows = predictiondata.tail(intAppendRows)
predictiondata.drop(predictiondata.tail(intAppendRows).index,
        inplace = True)

#Reset the dummy rows (just to mimic a real situation)
padrows['value'] = 0.0
padrows['series'] -= 1 #So you don't use unforeseen group

#append dummy rows back to set
predictiondata = predictiondata.append(padrows)

#Append the validation data to the end of training data to avoid "filters should not remove entries" error
predictiondata = trainingdata.append(predictiondata)

#The 'real world' prediction
abc = TimeSeriesDataSet.from_dataset(training, predictiondata, predict=True, stop_randomization=True)
testing_sample = abc.to_dataloader(train=False)
raw_predictions = best_model.predict(testing_sample)

jdb78 commented 3 years ago

I assume you are meant to group each continuous series of the same data?

Yes, exactly. Together, the group_ids identify a time series.
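A minimal sketch of one way to derive such groups with plain pandas, assuming an integer time_idx that increases by exactly 1 between consecutive observations of the same series. The helper name assign_contiguous_groups and the toy data are made up for illustration and are not part of pytorch-forecasting.

```python
import pandas as pd


def assign_contiguous_groups(df, time_col="time_idx", group_col="series"):
    """Label each gap-free run of time_col with its own integer group id."""
    out = df.sort_values(time_col).reset_index(drop=True)
    # A new run starts wherever the step from the previous row is not exactly 1.
    # The first row has a NaN difference, so it also starts a run.
    new_run = out[time_col].diff().ne(1)
    out[group_col] = new_run.cumsum() - 1  # group ids start at 0
    return out


# Toy data (made up): there are gaps after time_idx 2 and after time_idx 6.
demo = pd.DataFrame({
    "time_idx": [0, 1, 2, 5, 6, 9],
    "value": [1.0, 1.5, 1.2, 2.0, 2.1, 3.0],
})
grouped = assign_contiguous_groups(demo)
print(grouped)
```

Runs that end up shorter than max_encoder_length + max_prediction_length would still be too short for the dataset, so the length of each resulting group is worth checking separately.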

I also ran into the thing in issue #121, so some shorter groups will get discarded.

There is no trivial workaround. NBeats requires a certain encoder_length. You could use another model that is more flexible in how many encoder time steps are required, or train multiple models with various encoder lengths.

When using unforeseen data, how should you group it?

If you use an unforeseen group number you will get an "Unknown category" error. Adding add_nan=True to the NaNLabelEncoder fixes that but doesn't help with the group being discarded as with issue #121 unless you make this new group big enough (would that even give predictions with so much dummy data?).

The issue should be fixed in the latest version. There is no need to pass add_nan=True to the NaNLabelEncoder.

Why do you get a different kind of prediction tensor when using multiple group_ids?

When I use a single group id for all data I get a neat prediction with the expected number of predictions. However, when using groups I seem to get prediction data for all the groups present in the data.

I am not sure exactly what you mean. Let me try to point to potential issues in your code.

allpredictiondata = trainingdata.tail(n) #The data we would like to do validation test with

Here, you will select arbitrary rows unless trainingdata is sorted by time_idx and contains only one time series. It is no surprise that your time series suddenly become very short when you try to make predictions. Your prediction dataframe should contain at least max_encoder_length + max_prediction_length time steps for each group. The same is true for all other .head() and .tail() operations. Maybe .sort_values("time_idx").groupby().tail() can do the trick. I suggest running the following checks before creating a dataset for inference, in addition to checking a couple of examples by hand.

assert (data.groupby(group_ids).size() >= max_encoder_length + max_prediction_length).all(), "time series are too short"
assert (data.groupby(group_ids + ["time_idx"]).size() ==1, "time index should only occur once per time series"
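The .sort_values("time_idx").groupby().tail() idea can be sketched with plain pandas as follows. This is only an illustration under assumptions: the helper name split_last_n_per_group and the toy data are hypothetical, and the dataframe is assumed to have a unique index so that rows can be dropped by label.

```python
import pandas as pd


def split_last_n_per_group(data, group_ids, n, time_col="time_idx"):
    """Hold out the last n time steps of every group instead of the last n rows overall."""
    # Sort within each group by time so that tail() really picks the latest steps.
    ordered = data.sort_values(group_ids + [time_col])
    holdout = ordered.groupby(group_ids).tail(n)
    # Assumes a unique index: the held-out rows are removed by index label.
    train = ordered.drop(holdout.index)
    return train, holdout


# Toy data (made up), deliberately unsorted and with two series.
demo = pd.DataFrame({
    "series": [1, 0, 0, 1, 0, 1, 0],
    "time_idx": [12, 3, 0, 10, 1, 11, 2],
    "value": [9.0, 4.0, 1.0, 7.0, 2.0, 8.0, 3.0],
})
train, holdout = split_last_n_per_group(demo, ["series"], n=2)
print(holdout)
```

For a dataframe that is passed to the model for prediction, n would need to be at least max_encoder_length + max_prediction_length so that every group keeps enough history.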

categorical_encoders={"series": NaNLabelEncoder().fit(trainingdata.series)},

This will in fact do nothing in the time series dataset because you do not use the series as a feature. You can safely get rid of the line.

padrows['series'] -= 1 #So you don't use unforeseen group

Not sure why this is necessary. With 0.7.1, the issue should be fixed.
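As a rough sketch of how dummy future rows could be appended while keeping each row in its own group, the following uses pandas only. The helper name append_placeholder_rows and the toy data are hypothetical; the column names follow the question, and the 0.0 target is just a placeholder as in the original code. Any other columns (for example date) would be NaN in the new rows and would need to be filled separately.

```python
import pandas as pd


def append_placeholder_rows(history, group_col, horizon, time_col="time_idx", target="value"):
    """Append `horizon` placeholder rows after the last time step of every group."""
    pieces = [history]
    for group, frame in history.groupby(group_col):
        last_idx = int(frame[time_col].max())
        future = pd.DataFrame({
            group_col: group,  # keep the group's own id
            time_col: list(range(last_idx + 1, last_idx + 1 + horizon)),
            target: 0.0,  # placeholder value for the steps to be predicted
        })
        pieces.append(future)
    combined = pd.concat(pieces, ignore_index=True)
    return combined.sort_values([group_col, time_col], ignore_index=True)


# Toy data (made up): two series that end at different time steps.
history = pd.DataFrame({
    "series": [0, 0, 0, 1, 1],
    "time_idx": [0, 1, 2, 7, 8],
    "value": [1.0, 1.1, 1.2, 5.0, 5.1],
})
extended = append_placeholder_rows(history, "series", horizon=2)
print(extended)
```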

For comparing the predictions with actual values, see #224.

I hope this is helpful!

LukeOliv commented 3 years ago

It's indeed very helpful! Thank you very much for the reply. I can continue my experiments with the new info.

ruuttt commented 2 years ago

assert (data.groupby(group_ids).size() >= max_encoder_length + max_prediction_length).all(), "time series are too short"
assert (data.groupby(group_ids + ["time_idx"]).size() ==1, "time index should only occur once per time series"

Fix for a small typo in the code above (the second assert was missing the closing parenthesis and .all()):
assert (data.groupby(group_ids + ["time_idx"]).size() ==1 ).all(), "time index should only occur once per time series"
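Until something like this exists in the library, the two checks could be wrapped in a small helper. This is a sketch, not part of pytorch-forecasting: the function name check_inference_frame is made up, and it raises ValueError instead of using assert so that it also reports which groups are too short.

```python
import pandas as pd


def check_inference_frame(data, group_ids, min_length, time_col="time_idx"):
    """Raise ValueError if any series is too short or repeats a time index."""
    sizes = data.groupby(group_ids).size()
    too_short = sizes[sizes < min_length]
    if not too_short.empty:
        raise ValueError(f"time series are too short: {too_short.to_dict()}")
    # keep=False flags every row that shares a (group, time_idx) pair with another row.
    repeated = data.duplicated(subset=group_ids + [time_col], keep=False)
    if repeated.any():
        raise ValueError("time index should only occur once per time series")


max_encoder_length, max_prediction_length = 8, 2  # values from the question

# Toy data (made up): two series with 10 time steps each, so both checks pass.
demo = pd.DataFrame({"series": [0] * 10 + [1] * 10, "time_idx": list(range(10)) * 2})
check_inference_frame(demo, ["series"], max_encoder_length + max_prediction_length)
```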

@jdb78, it would be great if these two checks were incorporated into the PyTorch Forecasting source code.

jdb78 / pytorch-forecasting

How to use group_ids correctly? (N-Beats) #222