google / madi

Multivariate Anomaly Detection with Interpretability (MADI) published in ICML 2020
https://proceedings.mlr.press/v119/sipple20a.html
Apache License 2.0
73 stars 23 forks

[Question] Categorical features #7

Open candalfigomoro opened 3 years ago

candalfigomoro commented 3 years ago

Hi and thank you for your work.

For the negative sampling, how would you handle a mixed-type dataset with both numerical and categorical features (including boolean features)?

Clearly, I can't extend the values range for categorical features (e.g. for a boolean feature, it doesn't make sense to extend 0/1 values to -0.05/1.05). How would you handle categorical features? Would you just randomly pick one category?

Thanks.

vorhersager commented 3 years ago

Hi, and thanks for your interest in MADI.

If your categorical features have few classes, it's pretty easy.

For example, suppose you have a feature with three possible answers on a scale: High, Medium, Low. You can assign values to each on the scale (Low = 0, Medium = 1, High = 3), choosing values that are reasonably small to avoid gradient collapse. Then, when negative sampling on that categorical variable, randomly choose from the three integers.

Alternatively, if you have a feature that's not on a scale, like Wendy's, McDonalds, or Chipotle, it's generally best to create binary indicator variables, like is_Wendy's, etc., and choose a negative sample from the two discrete values 0 and 1.

If you have a feature with many categorical values, like postal codes, you may need to aggregate them into summary categoricals.
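To make the idea concrete, here is a minimal sketch of mixed-type negative sampling along those lines: numeric features drawn uniformly from an extended range, categorical features drawn uniformly from their allowed codes. All names here (`sample_negatives`, `delta`, the feature names) are illustrative, not from the MADI codebase.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_negatives(n, numeric_ranges, categorical_values, delta=0.05):
    """Draw n negative samples.

    Numeric features: uniform over [lo - delta*span, hi + delta*span].
    Categorical features: uniform over the list of allowed codes.
    """
    cols = {}
    for name, (lo, hi) in numeric_ranges.items():
        span = hi - lo
        cols[name] = rng.uniform(lo - delta * span, hi + delta * span, size=n)
    for name, codes in categorical_values.items():
        cols[name] = rng.choice(codes, size=n)
    return cols

neg = sample_negatives(
    4,
    numeric_ranges={"temperature": (15.0, 30.0)},
    categorical_values={"level": [0, 1, 3], "is_wendys": [0, 1]},
)
```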

Hope that helps!

Cheers


candalfigomoro commented 3 years ago

Thank you for your reply.

Suppose I have a categorical feature with the following categories: "wendys", "mcdonalds", "chipotle".

If I one-hot encode it, I would get 3 binary columns: is_wendys, is_mcdonalds, is_chipotle

The problem is that, if I randomly pick a 0/1 value for these columns independently, I could generate a sample that is both wendys and mcdonalds at the same time (when I pick 1 for both columns), which is clearly impossible.

So I'd say that it could be better to randomly pick a category before one-hot-encoding the categorical feature, so I just pick "wendys", "mcdonalds" or "chipotle". The minor problem with this second approach is that I can never pick an unknown category (while the combination is_wendys=0, is_mcdonalds=0, is_chipotle=0 was possible with the first approach). So, maybe, we could add a special "UNKNOWN" category alongside "wendys", "mcdonalds" and "chipotle" when randomly picking one of them.

What do you think about this?
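The pick-a-category-then-encode approach above could be sketched as follows (names like `sample_one_hot` and the `UNKNOWN` token are illustrative, not from the MADI code). Sampling one category per row and then one-hot encoding guarantees that at most one indicator is set, and `UNKNOWN` rows become all-zero:

```python
import numpy as np

rng = np.random.default_rng(0)

CATEGORIES = ["wendys", "mcdonalds", "chipotle"]

def sample_one_hot(n, categories, unknown_token="UNKNOWN"):
    """Pick one category per negative sample (including UNKNOWN),
    then one-hot encode, so impossible multi-hot rows cannot occur."""
    choices = rng.choice(categories + [unknown_token], size=n)
    one_hot = {f"is_{c}": (choices == c).astype(int) for c in categories}
    return choices, one_hot

choices, one_hot = sample_one_hot(5, CATEGORIES)
```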

mgf123-tpm commented 3 years ago

Wouldn't you simply use a distinct value for each category, such as: 0 = Unknown, 1 = Wendy's, 3 = McDonalds, 5 = Chipotle?

Would that not work?
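That suggestion amounts to a single integer-coded column with an explicit Unknown value, for example (a sketch, using the specific codes proposed above):

```python
import random

# One shared integer code per category, including an explicit Unknown;
# the particular values 0/1/3/5 follow the suggestion above.
CODES = {"Unknown": 0, "Wendy's": 1, "McDonalds": 3, "Chipotle": 5}

def sample_code():
    """Uniformly pick one categorical code for a negative sample."""
    return random.choice(list(CODES.values()))
```

One caveat: a single integer column implies an ordering and distances between brands that don't really exist, which the detector may exploit; that is why the earlier reply suggested binary indicator variables for features that are not on a scale.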