mckinsey / causalnex

A Python library that helps data scientists to infer causation rather than observing correlation.
http://causalnex.readthedocs.io/

Is categorical feature currently supported by causalnex with label encoding? #170

Open tonyabracadabra opened 2 years ago

tonyabracadabra commented 2 years ago

I know that label encoding categorical variables lets the algorithm run on them, but is it mathematically valid to learn causal relationships from variables once label encoding has been applied?
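To make the concern concrete, here is a minimal sketch in plain Python (the data, the two mappings, and the helper names are made up for illustration, and none of this is causalnex code). It shows that the linear association between a label-encoded nominal variable and an outcome depends entirely on which integer each category happens to receive:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(sxx * syy)

def label_encode(values, mapping):
    """Replace each category with the integer assigned by `mapping`."""
    return [float(mapping[v]) for v in values]

# Hypothetical toy data: a nominal variable and a numeric outcome.
countries = ["FR", "DE", "JP", "FR", "JP", "DE", "JP", "FR"]
outcome = [1.0, 5.0, 9.0, 1.2, 8.8, 5.1, 9.1, 0.9]

# Two equally arbitrary label encodings of the same categories.
encoding_a = {"FR": 0, "DE": 1, "JP": 2}
encoding_b = {"DE": 0, "JP": 1, "FR": 2}

corr_a = pearson(label_encode(countries, encoding_a), outcome)
corr_b = pearson(label_encode(countries, encoding_b), outcome)

# corr_a is close to +1, corr_b is negative: same data, different codes.
print(f"encoding A: {corr_a:.3f}")
print(f"encoding B: {corr_b:.3f}")
```

The categories and the outcome are identical in both runs; only the integer codes differ. Any method that relies on linear relationships between numeric columns would see two very different pictures, which is why an arbitrary encoding of a nominal variable is hard to justify.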

tonyabracadabra commented 1 year ago

Hey folks, are there any updates on this question? @oentaryorj @GabrielAzevedoFerreiraQB Any insights would be helpful. I think we might need to handle the independence test for categorical variables separately, and I am not sure whether that is implemented in the system now.

GabrielAzevedoFerreiraQB commented 1 year ago

Hey Tony,

Hope you are well! Thanks for the great question!

You're absolutely right.

  • For NOTEARS, we do need continuous variables, as you correctly mentioned.
  • A simple label encoding doesn't always make sense. For example, encoding a "countries" variable directly (i.e. with arbitrary integers) would not give NOTEARS any signal from which to learn a relationship.
  • However, in certain situations such an encoding is still possible:

    • when the variables are binary
    • when the values have a natural order, say days of the week (to a certain extent)

One thing to note, though, is that NOTEARS is not "scale invariant": if we multiply a variable by a constant, the NOTEARS results change. There are discussions on the best way to handle this, but I'd (personally!) recommend thinking carefully about how to normalize the variables when dealing with encoded discrete variables.

tonyabracadabra commented 1 year ago

Thanks Gabriel for answering my question!

I saw that the release notes say "Added categorical distributed data support for pytorch NOTEARS." What does that mean?

Are there any plans to support causal discovery on mixed-type data, along the lines of recently published papers?

jinowork commented 1 year ago

In that case, can I use one-hot encoding for categorical variables?
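For reference, this is roughly what one-hot encoding followed by rescaling would look like. It is a sketch in plain Python with made-up data and hypothetical helper names (`one_hot`, `standardize`), not causalnex API, and it only shows the mechanics; whether NOTEARS gives meaningful results on such columns is not settled in this thread. The rescaling step is included because NOTEARS was described above as not scale invariant.

```python
from statistics import mean, pstdev

def one_hot(values):
    """Return {category: 0/1 indicator column}, one column per category."""
    categories = sorted(set(values))
    return {c: [1.0 if v == c else 0.0 for v in values] for c in categories}

def standardize(column):
    """Rescale a column to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(column), pstdev(column)
    if sigma == 0:
        raise ValueError("a constant column cannot be standardized")
    return [(x - mu) / sigma for x in column]

# Hypothetical nominal variable with no natural order.
countries = ["FR", "DE", "JP", "FR", "JP", "DE"]

# One indicator column per category, so no artificial ordering is introduced.
indicators = one_hot(countries)

# Put every indicator on a common scale before structure learning.
scaled = {f"country_{c}": standardize(col) for c, col in indicators.items()}

for name, col in scaled.items():
    print(name, [round(x, 3) for x in col])
```

Two caveats with this approach: each indicator column would be treated as a separate variable in the learned graph, and the indicators are dependent on each other by construction, since exactly one of them is 1 in every row.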