Open andre-marcos-perez opened 1 year ago
I suspect the training TimeSeriesDataSet GroupNormalizer might be the problem. It seems it looks to the fitted normalisation params to normalize the data (transform), but the entity key won't be there since it's brand new (unseen). Actually, even if you set target_normalizer to None, behind the scenes the Torch.transform("identity") is assigned, and it also loads the entity keys when fitting the transformation.
What is the best approach to make predictions at inference time as stateless as possible? If I have to build the TimeSeriesDataSet using both training and inference data, I won't be able to run this model's predictions in live mode, right?
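To make the suspicion above concrete, here is a minimal stdlib-only sketch of any transformation that stores one set of fitted parameters per group key. `ToyGroupNormalizer` is a hypothetical stand-in written for this illustration; it is not the library's GroupNormalizer, and whether the library class itself raises or falls back for unseen groups is not established in this thread.

```python
from statistics import mean, pstdev


class ToyGroupNormalizer:
    """Toy per-group scaler (hypothetical); illustrates fitted per-key state only."""

    def fit(self, groups, targets):
        by_group = {}
        for g, y in zip(groups, targets):
            by_group.setdefault(g, []).append(y)
        # One (center, scale) pair per group key seen during training.
        self.params_ = {
            g: (mean(ys), pstdev(ys) or 1.0) for g, ys in by_group.items()
        }
        return self

    def transform(self, groups, targets):
        out = []
        for g, y in zip(groups, targets):
            center, scale = self.params_[g]  # KeyError for a key never seen in fit
            out.append((y - center) / scale)
        return out


normalizer = ToyGroupNormalizer().fit(
    groups=["a", "a", "b", "b"],
    targets=[1.0, 3.0, 10.0, 30.0],
)

# A key seen at training time transforms fine.
seen = normalizer.transform(["a"], [3.0])

# A brand-new key has no fitted parameters to look up.
try:
    normalizer.transform(["c"], [5.0])
    unseen_failed = False
except KeyError:
    unseen_failed = True
```

The point is only that any per-key fitted state makes inference depend on what was seen at training time.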
Some updates: I noticed that the TimeSeriesDataSet adds the group_ids columns to the categorical_encoders here. It happens inside the _preprocess_data method, which is called when the object is created. This seems to be the source of the problem, since the groups are learned at training time and newer groups at inference time won't go through. This effectively explains the error, since the default encoder appended to the group ids columns is sklearn's LabelEncoder.
KeyError: "Unknown category '<unseen-entity-key>' encountered. Set `add_nan=True` to allow unknown categories"
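The behaviour behind that error message can be sketched with a small stdlib-only encoder. `ToyLabelEncoder` and the `store_*` keys are hypothetical names made up for this illustration; this is not the library's encoder, only a toy that mimics the `add_nan` switch named in the error.

```python
class ToyLabelEncoder:
    """Toy categorical encoder (hypothetical); mimics an add_nan-style switch."""

    def __init__(self, add_nan=False):
        self.add_nan = add_nan
        self.classes_ = {}

    def fit(self, values):
        # Index 0 is reserved for unknown categories when add_nan=True.
        offset = 1 if self.add_nan else 0
        self.classes_ = {
            v: i + offset for i, v in enumerate(sorted(set(values)))
        }
        return self

    def transform(self, values):
        encoded = []
        for v in values:
            if v in self.classes_:
                encoded.append(self.classes_[v])
            elif self.add_nan:
                encoded.append(0)  # unseen key mapped to the reserved index
            else:
                raise KeyError(
                    f"Unknown category '{v}' encountered. "
                    "Set `add_nan=True` to allow unknown categories"
                )
        return encoded


strict = ToyLabelEncoder().fit(["store_a", "store_b"])
lenient = ToyLabelEncoder(add_nan=True).fit(["store_a", "store_b"])

# The strict encoder rejects a key it did not see during fit.
try:
    strict.transform(["store_c"])
    strict_error = None
except KeyError as err:
    strict_error = str(err)

# The lenient encoder maps the unseen key to the reserved index instead.
lenient_codes = lenient.transform(["store_a", "store_c"])
```

With `add_nan=True` the unseen key still gets encoded, but every unseen entity shares the same index, so the model cannot tell new entities apart through that feature.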
Ok, I found the root cause. The error in the quoted comment only happens when we set a GroupNormalizer as the TimeSeriesDataSet.target_normalizer param.
I can think of two solutions to decouple training and inference in this context. Both require target data to be normalised elsewhere:
1. Set the TimeSeriesDataSet.target_normalizer param to None. This approach might require training and inference target data to be normalised elsewhere. You can always drop target normalisation altogether.
# training stuff
from pytorch_forecasting import TimeSeriesDataSet

training_ts_dataset: TimeSeriesDataSet = TimeSeriesDataSet(
    data=training_df,  # target normalised elsewhere
    ...
    target_normalizer=None,
    ...
)
# inference stuff: reuse the training dataset parameters on the new data
inference_ts_dataset = TimeSeriesDataSet.from_parameters(
    data=inference_df,  # target normalised elsewhere
    parameters=training_ts_dataset.get_parameters(),
    predict=True,
)
2. Keep the TimeSeriesDataSet.target_normalizer, explicitly add the group id columns to the categorical_encoders, but allow unknown categories. This approach requires inference target data to be normalised elsewhere.
# training stuff
from pytorch_forecasting import TimeSeriesDataSet
from pytorch_forecasting.data import GroupNormalizer, NaNLabelEncoder

training_ts_dataset: TimeSeriesDataSet = TimeSeriesDataSet(
    ...
    target_normalizer=GroupNormalizer(groups=['<entity-key>'], transformation="softplus"),
    ...
    # add_nan=True lets unseen entity keys through the encoder
    categorical_encoders={'<entity-key>': NaNLabelEncoder(add_nan=True)},
    ...
)
# inference stuff: reuse the training dataset parameters on the new data
inference_ts_dataset = TimeSeriesDataSet.from_parameters(
    data=inference_df,  # target normalised elsewhere
    parameters=training_ts_dataset.get_parameters(),
    predict=True,
)
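Both options above rely on the target being "normalised elsewhere". One way that could look, as a rough stdlib-only sketch: fit a single global scaler on the training targets, keep its parameters as a plain dict, and apply it to any entity regardless of its key. The helper names and the choice of a log1p transform are assumptions made for this example, not something from the thread or the library.

```python
import math
from statistics import mean, pstdev


def fit_global_scaler(train_targets):
    """Fit one (center, scale) pair on log1p-transformed training targets."""
    logged = [math.log1p(y) for y in train_targets]
    return {"center": mean(logged), "scale": pstdev(logged) or 1.0}


def normalise(y, params):
    """Map a raw target to the normalised space using the fitted params."""
    return (math.log1p(y) - params["center"]) / params["scale"]


def denormalise(z, params):
    """Map a normalised value (e.g. a model prediction) back to raw units."""
    return math.expm1(z * params["scale"] + params["center"])


# Made-up training targets; the params hold no entity keys at all.
params = fit_global_scaler([0.0, 9.0, 99.0])

# Any entity, seen or unseen, goes through the same two numbers.
z = normalise(50.0, params)
round_trip = denormalise(z, params)
```

Because `params` is just two floats, it can be stored next to the model checkpoint and loaded at inference time without touching the training data.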
Why would you want to use GroupNormalizer? The default is EncoderNormalizer if the context length is above a certain threshold and, based on my simulations, it performs better than GroupNormalizer and normalizing per time series.
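For readers unfamiliar with the idea, normalising on the encoder window can be sketched roughly as follows: the scaling statistics come only from the history passed in with each sample, so nothing keyed by entity has to be stored. This is a stdlib-only illustration with hypothetical helper names, not the library's EncoderNormalizer implementation.

```python
from statistics import mean, pstdev


def scale_by_encoder_window(encoder_values, eps=1e-8):
    """Standardise a window using only its own mean and standard deviation."""
    center = mean(encoder_values)
    scale = pstdev(encoder_values) + eps  # eps guards against constant windows
    normalised = [(v - center) / scale for v in encoder_values]
    return normalised, center, scale


def rescale_prediction(predictions, center, scale):
    """Bring normalised predictions back to the window's original units."""
    return [p * scale + center for p in predictions]


# Made-up history for an entity the model has never seen before.
window = [10.0, 12.0, 14.0]
normalised, center, scale = scale_by_encoder_window(window)

# A normalised prediction of 0.0 corresponds to the window mean.
restored = rescale_prediction([0.0], center, scale)
```

Since the statistics travel with each window, this style of normalisation does not care whether the entity key appeared at training time.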
Actually in our experiments GroupNormalizer tends to perform better than the EncoderNormalizer. Unfortunately, this means that we cannot do inference for the unseen groups.
One thing to consider though is that in the linked issue it seems that the GroupNormalizer introduces leakage, which probably means that the evaluation produces an optimistic result for the GroupNormalizer.
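The linked issue is not reproduced in this thread, so the following is only an assumption about what such leakage could look like: if per-group statistics are fitted over the whole series, including the period later used for evaluation, the normaliser has already "seen" the values it is evaluated on. A tiny stdlib-only example with made-up numbers:

```python
from statistics import mean

# Made-up series for one group; the last point falls in the validation period.
series = [1.0, 2.0, 3.0, 4.0, 100.0]
cutoff = 4  # number of points available at training time

# Center fitted on everything, validation point included.
center_full = mean(series)

# Center fitted only on what was known at training time.
center_train_only = mean(series[:cutoff])
```

Here `center_full` is 22.0 while `center_train_only` is 2.5, so statistics fitted on the full series carry information about the validation jump into the normalised training data.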
Expected behaviour
I get predictions at inference time on unseen entity keys.
Actual behaviour
I get an error saying that the entity key is an unknown category.
Code to reproduce the problem
Hi, I am struggling to understand how to get predictions at inference time when entity keys are not present in the training time series datasets. The following pseudo-code:
Throws this error stating that the unseen entity key is an unknown category (except that it is a group id). I actually only know that we are talking about the entity key because I know the payload.