Closed xuzhang5788 closed 3 years ago
The problem seems to be here: self.X.iloc[selection, :], self.y[selection]. Probably X or y is not as expected. Try passing y as a Numpy array. As for sizes and categorical_levels, you need them to build your DNN correctly: sizes provides the dimensionality of the numeric array produced by the TabularTransformer (you need it for the input layer for numerical variables), and categorical_levels is necessary for sizing the embedding layers of the DNN correctly, since embeddings need that information or they won't work properly.
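To illustrate how categorical_levels feeds into embedding sizing, here is a minimal sketch. The helper name and the dimension heuristic (a common rule of thumb, half the level count capped at 50) are assumptions for illustration, not necessarily what the repository uses:

```python
# Sketch: pairing each categorical feature's level count with an
# embedding dimension. Hypothetical helper; the actual API may differ.

def embedding_specs(categorical_levels, max_dim=50):
    """For each categorical column, return (num_levels, embedding_dim),
    where embedding_dim follows a common half-the-levels heuristic."""
    return {col: (levels, min(max_dim, (levels + 1) // 2))
            for col, levels in categorical_levels.items()}

categorical_levels = {"column1": 21, "column2": 5}
specs = embedding_specs(categorical_levels)
print(specs)  # {'column1': (21, 11), 'column2': (5, 3)}
```

The key point is that the embedding matrix for a column must have exactly num_levels rows, which is why the count has to be known before the network is built.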
Just let me know if passing y as a Numpy array works for you. After confirmation I will work out a more robust pipeline for it.
Thank you. It works now.
Could you please explain why categorical_levels are different for each fold in your example, but they are the same in my dataset?
In addition, for example, my categorical column1 has 19 unique values, why is categorical_levels=19+2=21? Thank you!
Perfect, I've already pushed some changes to the repository to deal with the target variable in case it is passed as a list or a pandas Series instead of a Numpy array.
As for categorical_levels, during cross-validation they are encoded on the spot. Therefore, for sampling reasons, some classes may be missing from your training fold, and you may get a different level count compared to other folds where they are not missing.
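The effect is easy to reproduce with a toy example. This is a minimal sketch (the data and helper are invented for illustration): when a rare class falls entirely into the held-out fold, the training fold sees one level fewer.

```python
# Sketch: fitting the categorical levels on each training fold only.
# A rare class ("d") can be absent from a fold's training split, so
# the resulting level count differs between folds.

values = ["a", "b", "c", "a", "b", "c", "a", "b", "d"]  # "d" is rare

def levels_in_fold(values, holdout_idx):
    """Return the sorted unique levels seen in the training part only."""
    train = [v for i, v in enumerate(values) if i not in holdout_idx]
    return sorted(set(train))

fold1 = levels_in_fold(values, {0, 1, 2})  # "d" stays in train
fold2 = levels_in_fold(values, {6, 7, 8})  # "d" is held out
print(len(fold1), len(fold2))  # 4 3
```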
Cross-validation is usually meant, in real-world applications, for testing purposes. Therefore I do not encode on the full data I have available, but only on the data I use for training. That better simulates the real-world testing the model will have to undergo later in production.
For the same reason, you find two extra classes that are reserved for 1. missing data and 2. unknown data. It may well happen that you don't have missing data in train, but missing values are present in the test data. Moreover, you can frequently find new levels in test, so you also have to take that into account with a special encoding.
Basically these are just placeholders: if you don't have missing values in train and you don't have anything unknown (by definition, you cannot have anything unknown in train), those placeholders will simply keep their random initializations and no meaningful weights will be learned for them during training. Yet their presence allows the model to still perform and not break down because of unexpected input.
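The "+2" in your count (19 unique values becoming categorical_levels = 21) can be sketched like this. The reserved indices 0 and 1 and the function names are assumptions for illustration, not the repository's exact implementation:

```python
# Sketch: why categorical_levels = unique train values + 2.
# Index 0 is reserved for missing data and index 1 for levels never
# seen in train (reserved-index choice is an assumption here).

def fit_levels(train_values):
    """Map each training level to an integer code, starting at 2."""
    return {v: i + 2 for i, v in enumerate(sorted(set(train_values)))}

def encode(value, mapping):
    if value is None:
        return 0                   # missing-data placeholder
    return mapping.get(value, 1)   # unknown-level placeholder

mapping = fit_levels(["red", "green", "blue"])
categorical_levels = len(mapping) + 2
print(categorical_levels)         # 5  (3 unique values + 2 placeholders)
print(encode(None, mapping))      # 0  (missing at test time)
print(encode("violet", mapping))  # 1  (level never seen in train)
```

With 19 unique values in your column1, the same arithmetic gives 19 + 2 = 21 embedding rows.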
I used my own data to run your code. My model is a regression model. I followed your code and it works for CatBoost, but for the deep learning part I got the following error messages:
KeyError Traceback (most recent call last)