jrzaurin / pytorch-widedeep

A flexible package for multimodal-deep-learning to combine tabular data with text and images using Wide and Deep models in Pytorch
Apache License 2.0

Handle "multicategorical" columns #227

Open davidfstein opened 3 weeks ago

davidfstein commented 3 weeks ago

The pytorch_frame library natively handles categorical variables where the variable may take on multiple categories simultaneously, e.g. row1 = [1, .5, ['a', 'b', 'c']], row2 = [2, .3, ['a']] ...

It would be a nice quality of life enhancement to have this sort of functionality added to the widedeep library.

I believe, though I need to look more carefully, that they do something along the lines of: 1) label-encode the categories, and 2) convert to tensors such that a multicategorical feature is replaced with an index matrix of shape (n rows × max categories in a single row). Rows with fewer than the maximum number of categories take -1 in the "missing" positions. I imagine there are other options for handling this as well.

jrzaurin commented 3 weeks ago

Hey @davidfstein

I can look into this, but you could just treat the column that can take multiple categorical values as text and use this library as it is(?) Or turn the multicategorical columns into multiple columns, if that is possible, and proceed as usual?
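One way to read the "multiple columns" suggestion (a sketch with made-up column names, not the library's API) is to expand the multicategorical column into one binary indicator column per category, so every row carries the full set of possible categories:

```python
import pandas as pd

# toy frame matching the example in the issue: two ordinary columns
# plus one multicategorical column holding lists of categories
df = pd.DataFrame(
    {"col1": [1, 2], "col2": [0.5, 0.3], "cats": [["a", "b", "c"], ["a"]]}
)

# one 0/1 column per category observed anywhere in "cats"
multi_hot = (
    pd.get_dummies(df["cats"].explode()).groupby(level=0).max().astype(int)
)
df = df.drop(columns="cats").join(multi_hot)
print(df)
#    col1  col2  a  b  c
# 0     1   0.5  1  1  1
# 1     2   0.3  1  0  0
```

The resulting indicator columns could then be passed to widedeep like any other tabular columns, at the cost of one column per distinct category.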

But I will look into this :)

davidfstein commented 2 weeks ago

Thanks @jrzaurin! Right now I am actually following your first suggestion and processing them as text. My only concern was that this might become inefficient with many features if a separate RNN needs to be trained for each one. As for splitting into multiple columns, I was thinking you might lose information if each column doesn't contain the full complement of possible categories, though I'm not sure whether that would lead to a substantive performance decrease.

jrzaurin commented 2 weeks ago

Let's see if I can put together some functioning code tomorrow :)

davidfstein commented 2 weeks ago

That would be awesome! Thanks for the great library!