You have to explicitly allow unknown categories (see dropout_categoricals in https://pytorch-forecasting.readthedocs.io/en/latest/api/pytorch_forecasting.data.timeseries.TimeSeriesDataSet.html#pytorch_forecasting.data.timeseries.TimeSeriesDataSet). This would be a great PR to improve the documentation. Essentially, this sets the embedding to a zero-vector. However, if you have a lot of categories that are not in the training set, it is questionable whether you want to include the variable in training at all.
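For illustration, a minimal sketch of allowing unknown categories (assuming the dropout_categoricals argument linked above; all column names and lengths are placeholders, not from this issue):

```python
# Minimal sketch, assuming the dropout_categoricals argument documented in
# the linked TimeSeriesDataSet API reference. Column names and lengths are
# illustrative placeholders.
from pytorch_forecasting import TimeSeriesDataSet

training = TimeSeriesDataSet(
    data,                                   # pandas DataFrame with the columns below
    time_idx="time_idx",
    target="sales",                         # hypothetical target column
    group_ids=["item_id", "store_id"],
    max_encoder_length=60,
    max_prediction_length=28,
    static_categoricals=["item_id", "store_id"],
    # ids unseen during training are allowed; their embedding is a zero-vector
    dropout_categoricals=["item_id"],
)
```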
I can think of three improvements to the current approach. Feel invited to submit a PR.
Thanks for pointing me to the docs! :+1: I'll investigate further and come back with a documentation PR if everything works well, or with an issue describing a possible bug. For now, I still can't get stable behavior on my dataset with a simple tail-cut cross-validation procedure.
@jdb78 I'm still experiencing problems. I'm not creating a new issue, since my question seems closely related to the current topic. As I wrote before, I have a grocery sales dataset where goods are coded by item_id and store_id. Since this is real data, some ids appear and others drop out literally every week, so I explicitly set dropout_categoricals to item_id. Here is what happens next. Upon creation of the first, cut-off dataset with this code:
```python
training_cutoff = data["time_idx"].max() - max_prediction_length  # max_prediction_length = 28
training = TimeSeriesDataSet(
    data[lambda x: x.time_idx <= training_cutoff], ...
```
everything goes fine, with no errors or warnings. All item_id and store_id values get encoded into the [1...len(column)] range internally.
Next I do:

```python
vs = TimeSeriesDataSet.from_dataset(ts, data, predict=True, stop_randomization=True)
```
and what I get is:

```
File "... venv/lib/python3.8/site-packages/pytorch_forecasting/data/timeseries.py", line 759, in _construct_index
    assert (
AssertionError: Time difference between steps has been idenfied as larger than 1 - set allow_missings=True
```
Remarks:
I guess this is expected behavior - we discussed it here before.
Next, looking into the root of the error, I checked df_index in _construct_index (timeseries.py, around line 730). What causes this dataset sparsity are newly 0-re-encoded ids that entered the dataset after the first pass. All of these are new item_ids re-encoded to 0 (I checked). I can't dig deeper, but it seems the bug is somewhere in the re-encoding process, which creates sparsity that doesn't actually exist.
Yes, it indeed looks like a bug - good catch! For the moment, I guess new ids are not really supported if they are in the group ids. A solution could look like the following: group ids are separately encoded by the TimeSeriesDataSet._preprocess() method, so a new label encoder would be needed that adds new ids rather than imputing zeros (see the sketch below).
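A minimal sketch of such an encoder (my own illustration, not the library's implementation; the class name ExtendableLabelEncoder is hypothetical):

```python
# Sketch only, not the library's actual encoder: a label encoder that adds
# previously unseen ids as new classes at transform time instead of
# imputing zeros for them. The class name is hypothetical.
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class ExtendableLabelEncoder(BaseEstimator, TransformerMixin):
    def fit(self, y):
        # remember every id seen during fit
        self.classes_ = {val: idx for idx, val in enumerate(np.unique(y))}
        return self

    def transform(self, y):
        # extend the mapping with new ids rather than mapping them to 0
        for val in np.unique(y):
            if val not in self.classes_:
                self.classes_[val] = len(self.classes_)
        return np.asarray([self.classes_[v] for v in y])
```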
@jdb78 The problem has just transformed into another form. On data where crashes used to occur, I now see this:
```
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
Number of parameters in network: 628.7k
Epoch 0:  15%|█▌        | 15/99 [00:11<01:04, 1.31it/s, loss=1.410, v_num=12, train_loss_step=1.29]
Traceback (most recent call last):
  File "/home/seva/PycharmProjects/ECOM_demand/classes/model_tf_transformer.py", line 216, in
```
I can't trace it down definitively, because it happens at a random step within an epoch (note that I have 99 steps here), but always somewhere in the very first epoch. Datasets that worked smoothly before still run well now, and those that crashed continue to crash, only in a different way.
If you need a reproducible example or some traces - just poke me here :)
@jdb78
Looks like I've found a bug/unexpected behavior. I'm making a prediction on a dataset with time-based features marked as categorical, namely month alongside day and year. The dataset starts at 2020-01-01 and ends at 2020-08-30. The date is parsed into 'year', 'month', and 'day' columns for each row. Depending on the date of the last dataset record (if I cut the dataset for some reason), pytorch-forecasting throws an error that looks like:
```
Traceback: File "XXX/venv/lib/python3.8/site-packages/pytorch_forecasting/data/encoders.py", line 105, in
    encoded = [self.classes_[v] for v in y]
KeyError: '8'
```
I've run some experiments/stack traces, and this always happens when, for instance, you have this month (8, August) in the full set but not in the training set - either because your max_prediction_length is bigger than 31 (days), or because of a combination of the last date and max_prediction_length such as 2020-08-10 and 20: the last date of the training set then becomes ~2020-07-20, so it contains no month '8'. In this case, going back to the code line from the traceback, this value (8) is in np.unique(y), BUT not in self.classes_.
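A small reproduction sketch of that cutoff arithmetic (dates and lengths taken from the example above):

```python
# Reproduction sketch of the cutoff arithmetic described above: with a daily
# index ending 2020-08-10 and max_prediction_length = 20, the training slice
# contains no August rows, so month 8 never reaches the encoder's fit().
import pandas as pd

dates = pd.date_range("2020-01-01", "2020-08-10", freq="D")
max_prediction_length = 20
training_dates = dates[: len(dates) - max_prediction_length]

print(training_dates.max())            # 2020-07-21 -> no August in training
print(8 in set(training_dates.month))  # False -> KeyError: '8' at predict time
```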
It seems self.classes_ is created based on the training set only, so when you invoke TimeSeriesDataSet.from_dataset(trainingset, fullset, .....) you get this error for any additional categorical values that appear in the full dataset.
This logic makes the library practically hard to use on any dataset with categorically encoded date/time features.
Shouldn't any previously unseen categorical value be put into a special 'average' bin and treated as the average of all known categories? As far as I remember, LightGBM exhibits this behavior for new categorical values.
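In the meantime, a possible workaround sketch, assuming TimeSeriesDataSet's categorical_encoders argument and the NaNLabelEncoder shipped with pytorch_forecasting (column names and lengths are illustrative, not from this issue):

```python
# Workaround sketch, assuming TimeSeriesDataSet's categorical_encoders
# argument and pytorch_forecasting's NaNLabelEncoder. With add_nan=True,
# values unseen during fit map to a dedicated "unknown" class instead of
# raising a KeyError. Column names and lengths are illustrative.
from pytorch_forecasting import TimeSeriesDataSet
from pytorch_forecasting.data.encoders import NaNLabelEncoder

training = TimeSeriesDataSet(
    data[lambda x: x.time_idx <= training_cutoff],
    time_idx="time_idx",
    target="sales",                     # hypothetical target column
    group_ids=["item_id", "store_id"],
    max_encoder_length=60,
    max_prediction_length=20,
    time_varying_known_categoricals=["year", "month", "day"],
    categorical_encoders={
        "year": NaNLabelEncoder(add_nan=True),
        "month": NaNLabelEncoder(add_nan=True),
        "day": NaNLabelEncoder(add_nan=True),
    },
)
```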