missing values and unknown categories in SSL models

manujosephv / pytorch_tabular

A standard framework for modelling Deep Learning Models for tabular data

https://pytorch-tabular.readthedocs.io/

MIT License


missing values and unknown categories in SSL models #460

Open sorenmacbeth opened 1 month ago

sorenmacbeth commented 1 month ago

Hello,

Handling of missing values and/or unknown categories is not allowed for SSL models. Could you help me understand why this is the case? With real-world data this causes hard-to-understand error messages, which then require out-of-band preprocessing of the data to resolve.

Could we allow for these options to be available in SSL models? Is there a fundamental reason that I am not understanding for this restriction?
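For readers unfamiliar with what the out-of-band preprocessing might look like, here is a minimal sketch in plain Python. It is not code from pytorch_tabular or from the fork mentioned in this thread; the sentinel names and both helper functions are hypothetical, and a real pipeline would more likely use pandas.

```python
# Hypothetical sentinel tokens; the names are assumptions for illustration only.
MISSING = "__missing__"
UNKNOWN = "__unknown__"


def fit_vocab(rows, column):
    """Collect the categories seen in training rows, mapping None to MISSING."""
    return {MISSING if row.get(column) is None else row[column] for row in rows}


def clean_column(rows, column, vocab):
    """Return copies of rows with None replaced by MISSING and unseen values by UNKNOWN."""
    cleaned = []
    for row in rows:
        value = row.get(column)
        if value is None:
            value = MISSING
        elif value not in vocab:
            value = UNKNOWN
        cleaned.append({**row, column: value})
    return cleaned


# Usage: fit on training rows, then apply the same vocabulary to new rows.
train_rows = [{"color": "red"}, {"color": None}]
vocab = fit_vocab(train_rows, "color")
new_rows = clean_column([{"color": "blue"}, {"color": None}, {"color": "red"}], "color", vocab)
```

Doing this by hand for every categorical column is the extra step the built-in options would otherwise cover.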

sorenmacbeth commented 1 month ago

reference to the section of the code in question: https://github.com/manujosephv/pytorch_tabular/blob/728578765b705cef5867f49289cf1cf203f1898f/src/pytorch_tabular/tabular_model.py#L234-L240

sorenmacbeth commented 1 month ago

Last bit of color: in a fork I removed this validation block and was able to train an SSL model using both missing value handling and unknown category handling.

manujosephv commented 2 weeks ago

In the SSL model (right now it's the Denoising Autoencoder), we train the model to predict the input data back. With this learning objective, predicting missing values as a separate token didn't make sense to me. This is why the option was disabled: to force the user to treat the missing values the right way.

Unlike a prediction task, where it's beneficial to learn what to do when a new category value shows up, does it make sense in SSL?

sorenmacbeth commented 2 weeks ago

> In the SSL model (right now it's the Denoising Autoencoder), we train the model to predict the input data back. With this learning objective, predicting missing values as a separate token didn't make sense to me. This is why the option was disabled: to force the user to treat the missing values the right way.
>
> Unlike a prediction task, where it's beneficial to learn what to do when a new category value shows up, does it make sense in SSL?

If missing values are expected to be present in the data at prediction time, allowing for them in the training data makes sense to me. As a practical matter, I would prefer the user be allowed to decide for themselves if they want this behaviour or not. Perhaps a warning in the logs or in the documentation instead of explicitly disabling the ability to choose might be a better option?
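The "warn instead of disable" idea could look roughly like the sketch below. This is not the code from the linked validation block or from any PR; `SketchConfig`, its field names, and `check_ssl_options` are hypothetical stand-ins chosen for illustration, and only the standard-library `warnings` module is used.

```python
import warnings
from dataclasses import dataclass


@dataclass
class SketchConfig:
    # Hypothetical stand-in for a library config; all field names are assumptions.
    model_type: str = "ssl"
    handle_missing_values: bool = False
    handle_unknown_categories: bool = False


def check_ssl_options(config):
    """Warn, rather than raise, when an SSL model enables missing/unknown handling.

    Returns the list of enabled option names (empty for non-SSL models).
    """
    if config.model_type != "ssl":
        return []
    enabled = [
        name
        for name in ("handle_missing_values", "handle_unknown_categories")
        if getattr(config, name)
    ]
    if enabled:
        warnings.warn(
            "SSL model configured with " + ", ".join(enabled) + "; "
            "the model will learn to reconstruct these placeholder tokens as well.",
            UserWarning,
            stacklevel=2,
        )
    return enabled
```

With this shape the user keeps the choice, and the caveat about the reconstruction objective still surfaces in the logs.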

manujosephv commented 2 weeks ago

Hmmm... Yeah, I agree. But we will also have to thoroughly test the inclusion for corner cases.

Would you be willing to raise a PR for it?

sorenmacbeth commented 2 weeks ago

Sure thing:

https://github.com/manujosephv/pytorch_tabular/pull/470

I've been running this for a good month or so without issue but I'm happy to add / update documentation or test cases if you can describe to me what needs to be done.