You can pass a pre-fitted LabelEncoder to the TimeSeriesDataSet with categorical_encoders={<variable_name>: NaNLabelEncoder()}. Hope this helps.
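A minimal sketch of what this could look like, assuming a toy frame with an illustrative static categorical column "store_type" (the column names, data, and sequence lengths here are made up, not from this thread):

```python
import pandas as pd
from pytorch_forecasting import TimeSeriesDataSet
from pytorch_forecasting.data.encoders import NaNLabelEncoder

# Toy data: two short series sharing one static categorical feature.
data = pd.DataFrame(
    {
        "time_idx": [0, 1, 2, 3] * 2,
        "series_id": ["a"] * 4 + ["b"] * 4,
        "store_type": ["small"] * 4 + ["large"] * 4,
        "value": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    }
)

# Fit the encoder up front on the full set of categories, so the
# dataset does not re-fit it on whatever subset it happens to see.
encoder = NaNLabelEncoder().fit(data["store_type"])

dataset = TimeSeriesDataSet(
    data,
    time_idx="time_idx",
    target="value",
    group_ids=["series_id"],
    static_categoricals=["store_type"],
    categorical_encoders={"store_type": encoder},
    max_encoder_length=2,
    max_prediction_length=1,
)
```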
Thanks for the answer @jdb78.
So there is no way to pass already-encoded categorical data directly to the TimeSeriesDataSet as categorical features (even though they are encoded as integers), have the TimeSeriesDataSet apply no transformation to them, and only create the embeddings later during training?
No, not really. But if you know the number of labels n, you can simply do this: categorical_encoders={<variable_name>: NaNLabelEncoder().fit(np.arange(n))}
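Spelled out as a sketch, assuming the pre-encoded values are the integers 0..n-1 and that n is known ahead of time (the feature name and n = 20 below are illustrative):

```python
import numpy as np
from pytorch_forecasting.data.encoders import NaNLabelEncoder

n = 20  # illustrative: 20 unique values across all pickle files

# Fitting on the full integer range makes every class known to the
# encoder, even when a given file contains only a few of them.
encoder = NaNLabelEncoder().fit(np.arange(n))

# Pass the same fitted encoder to every per-file TimeSeriesDataSet,
# e.g. categorical_encoders={"my_feature": encoder}, so the model
# always sees n classes and builds n embedding vectors.
```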
Okay - thanks @jdb78. Your feedback is much appreciated!
Hi @jdb78 and congrats on this amazing library!
I have some static categorical features (the same across all the time series) which I have already label encoded. Unfortunately, the dataset is split across thousands of pickle files which I will be feeding to the model one at a time. The problem is that each pickle file does not contain all the unique values of each static categorical variable, so I cannot pass them as strings for the model to take care of the embeddings.
For example, one categorical feature might have 20 unique values across all the pickle files but only 1 or 2 unique values in any single pickle file that is fed to the network. Thus, the model will only see 2 values for this categorical feature and will produce 2 embedding vectors (when there should be 20). A short sketch of the mismatch follows below.
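A minimal sketch of that mismatch, assuming the encoder is (naively) fitted per file and that NaNLabelEncoder exposes its fitted classes via classes_, as sklearn-style encoders do (the "brand_*" values are made up):

```python
import pandas as pd
from pytorch_forecasting.data.encoders import NaNLabelEncoder

# One pickle file contains only 2 of the 20 categories.
file_chunk = pd.Series(["brand_3", "brand_7"])

per_file_encoder = NaNLabelEncoder().fit(file_chunk)

# Only the 2 labels seen in this file are known, so a model built
# from this dataset would allocate 2 embedding rows instead of 20.
print(len(per_file_encoder.classes_))  # -> 2
```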
Is there a way to bypass the label-encoding step in the Temporal Fusion Transformer model while still handling these values as categorical (since they have already been encoded), so that the embeddings are produced straight away?
Thanks in advance.