How to deal with the discrete variables that are not binary?

ronikobrosly / causal-curve

A Python package with tools to perform causal inference using observational data when the treatment of interest is continuous.

MIT License

271 stars 18 forks

How to deal with the discrete variables that are not binary? #38

Closed v6l4188 closed 3 years ago

v6l4188 commented 3 years ago

Hello :D I have a question. In your end-to-end demonstration, some features are discrete, but they are all binary (0 or 1). In my data, the discrete features are not binary; for example, one feature can be any integer between 0 and 30. How should I handle this kind of feature? If I use one-hot encoding, will the dimensionality become too high and the data too sparse? Should I use binary encoding instead? Or is it better to leave the feature as it is? I would appreciate any help :D

ronikobrosly commented 3 years ago

Hello @v6l4188 ! That’s a great question. To answer that, could you tell me how many observations you have in your data frame (what’s the N)? If you were to one-hot encode all possible discrete features, how many feature columns would you want to use?

v6l4188 commented 3 years ago

Thank you for your quick reply! I have 354,218 observations and 54 features. Of those, 18 are discrete features that would need one-hot encoding, which would bring the total number of features to about 200.

ronikobrosly commented 3 years ago

@v6l4188 ahh ok. I was trying to get a quick sense of the ratio of observations to parameters/features in your case. If that feature is nominal (unordered discrete, e.g. it encodes 31 countries), then it should be one-hot encoded into 30 binary features, with the remaining category serving as the reference level. If it is naturally ordered discrete (e.g. the 31 categories represent income levels from low to high), then it’s up to you whether to one-hot encode it or leave it as a single feature with 31 possible integer values. For an ordered discrete feature, I would probably leave it as one ordered integer feature, just to keep things simple. Does this help?
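
The advice above can be sketched with pandas. This is only an illustration on made-up toy data: the column names `country` and `income_level` and the values are hypothetical, not taken from the poster's data set, and this preprocessing happens before anything is passed to causal-curve. The nominal column is one-hot encoded with one level dropped as the reference category (so k categories become k-1 binary columns), while the ordered column is left as a single integer feature.

```python
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
df = pd.DataFrame({
    "country": ["US", "FR", "JP", "FR", "US"],  # nominal (unordered) feature
    "income_level": [0, 12, 30, 7, 12],         # ordered feature, integers 0..30
})

# Nominal feature: one-hot encode and drop the first level as the reference
# category, so 3 countries become 2 binary columns (31 would become 30).
encoded = pd.get_dummies(df, columns=["country"], drop_first=True, dtype=int)

# Ordered feature: left untouched as one integer column.
print(encoded.columns.tolist())
```

For scale, 354,218 observations spread over roughly 200 encoded columns is about 1,771 observations per feature.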

v6l4188 commented 3 years ago

Yes, it helps. :D thank you again!