Open ian-grover opened 1 year ago
I kind of figured out a solution.
My problem is that when I set up the dataset, the categorical data type assesses all the data in that column, determines which categories are available, and attaches them to the column. When we pass this to the TimeSeriesDataSet, it also looks at all the categories present in the dataset and uses them to train the embeddings.
In my case, months 3-7 are in the training dataset and month 8 is in the test dataset. Without a NaN label encoder, it complains that month 8 is listed as a category but is not present in the data. With the NaN label encoder, it knows to set up a NaN encoding, but it never sees any data that maps to it, because no NaN category is ever created.
The solution is to set the categorical data type of my dataframe column so that it lists only the categories that enter the training dataset. Month 8 is then set to NaN, and when my testing dataset is used, it uses the NaN encoding instead.
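A minimal sketch of that workaround in pandas, assuming a toy dataframe with a string `month` column and a hypothetical `training_cutoff` on `time_idx` (both names are illustrative, not taken from the original setup). It only shows the dtype step; how the resulting NaN values are encoded downstream depends on the encoder configuration.

```python
import pandas as pd

# Hypothetical toy data: months 3-7 fall in the training window, month 8 does not.
df = pd.DataFrame(
    {
        "time_idx": [0, 1, 2, 3, 4, 5],
        "month": ["3", "4", "5", "6", "7", "8"],
    }
)
training_cutoff = 4  # assumption: rows with time_idx <= 4 are used for training

# Collect only the categories that actually enter the training data.
train_months = sorted(df.loc[df["time_idx"] <= training_cutoff, "month"].unique())

# Restrict the categorical dtype to those categories.
# Any value outside the list (here "8") is converted to NaN.
df["month"] = pd.Categorical(df["month"], categories=train_months)

print(df["month"].cat.categories.tolist())
print(df["month"].isna().tolist())
```

Restricting the categories before building the dataset means the unseen month is already NaN by the time any encoder looks at the column.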
It feels like there should be a better solution. For instance, if it knows month 8 is present in the dataset, it could create an embedding for it, much like the NaN label embedding. An alternative is a better error message which signals that a category was not present in training and needs to be set to NaN.
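Until something like that exists in the library, a check along these lines could be run by hand before predicting. This is only a sketch: `find_unseen_categories` is a hypothetical helper written for illustration, not part of pytorch-forecasting, and the two series are made-up stand-ins for the training and prediction columns.

```python
import pandas as pd


def find_unseen_categories(train: pd.Series, test: pd.Series) -> list:
    """Return values that occur in `test` but never in `train` (hypothetical helper)."""
    seen = set(train.dropna().unique())
    candidates = set(test.dropna().unique())
    return sorted(candidates - seen)


# Made-up example columns: training covers months 3-7, prediction data includes month 8.
train_months = pd.Series(["3", "4", "5", "6", "7"])
test_months = pd.Series(["7", "8"])

unseen = find_unseen_categories(train_months, test_months)
if unseen:
    # A clearer message than an embedding index error further down the line.
    print(f"Categories not seen in training: {unseen}. Set them to NaN before predicting.")
```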
Expected behavior
I have trained a TFT using TimeSeriesDataSet and I now wish to reduce the size of my training data, limit my validation data, and run predictions on a dataset which has not been seen at all during training.
I have followed the stallion example, and I decided to reduce my training cutoff by a larger number than just the historical data length + prediction length. I have included the month of my data as a categorical input, and I am using the NaNLabelEncoder for unseen data. When I run a prediction on unseen data, I get an embedding error. After investigating the source code, I realise it is because my training only saw months categorised as 0 to 5, but my prediction data has a month categorised as 6.
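The mismatch can be illustrated without the library at all. The sketch below uses a plain Python list as a stand-in for an embedding weight table with one row per category seen in training. It is not the actual TFT embedding layer, and the sizes are assumptions chosen to match the 0 to 5 range described above.

```python
# Stand-in for an embedding weight table: one row per category seen in training.
num_trained_categories = 6  # training saw encoded months 0..5
embedding_dim = 3           # arbitrary size for illustration
weights = [[0.0] * embedding_dim for _ in range(num_trained_categories)]


def lookup(index: int) -> list:
    """Return the embedding row for an encoded category index."""
    return weights[index]


lookup(5)  # fine: index 5 was seen in training

try:
    lookup(6)  # encoded month 6 only appears in the prediction data
    lookup_failed = False
except IndexError:
    # The table has no row for an index that training never produced.
    lookup_failed = True

print(lookup_failed)
```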
Actual behavior
I thought the NaNLabelEncoder should handle this, but my embedding tensor looks like
but my weight tensor only has length 6 (ie index 0 to 5) so I get the error
I thought the NaNLabelEncoder should be handling this issue, but it seems like it might not be helping when running predictions (rather than training)?
Could someone advise if I am missing something in my setup or if I should be masking unseen categorical data in a test dataset in some way?
Code to reproduce the problem
My timeseries initialisation looks like