py-why / dowhy

DoWhy is a Python library for causal inference that supports explicit modeling and testing of causal assumptions. DoWhy is based on a unified language for causal inference, combining causal graphical models and potential outcomes frameworks.
https://www.pywhy.org/dowhy
MIT License

Replace all occurrences of Pandas' get_dummies() with scikit-learn OneHotEncoder #1135

Closed drawlinson closed 3 months ago

drawlinson commented 5 months ago

An earlier issue, #1111, reported inconsistent behaviour from RegressionEstimator subclasses: when the data passed to the do() method contained different rows from the originally fitted data, categorical variables were encoded inconsistently. This can happen because the do() operator allows unseen data to be processed with an existing, already-fitted Estimator.

The inconsistency occurs because categorical encoding used Pandas' get_dummies(), which derives its columns from whatever data it is given and cannot encode additional data with an existing encoding. The alternative, scikit-learn's OneHotEncoder, is an encoder object that is fitted once and can then be used to encode additional data consistently. scikit-learn is already a DoWhy dependency, so OneHotEncoder is preferred over get_dummies().
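To illustrate the difference, here is a minimal standalone sketch (not the DoWhy code itself). The `colour` column and both dataframes are made up for the example.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: the "new" frame contains only one of the fitted categories.
fit_df = pd.DataFrame({"colour": ["red", "green", "blue"]})
new_df = pd.DataFrame({"colour": ["green", "green"]})

# get_dummies() builds its columns from the categories present in each call,
# so the two frames end up with different columns.
fit_columns = list(pd.get_dummies(fit_df, columns=["colour"]).columns)
new_columns = list(pd.get_dummies(new_df, columns=["colour"]).columns)

# OneHotEncoder learns the categories once and reuses them for later data.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(fit_df[["colour"]])
encoded_new = encoder.transform(new_df[["colour"]]).toarray()

print(fit_columns)        # columns seen at fit time
print(new_columns)        # columns derived from the new data only
print(encoded_new.shape)  # same width as the fitted encoding
```

The fitted encoder keeps the column layout learned from `fit_df`, which is the property the do() operator needs when it is given unseen data.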

This change goes further and replaces all occurrences of get_dummies() with OneHotEncoder, so that if functionality to process additional data is added to other classes in future (e.g. via the do operator), the consistency bug won't happen again.

After the swap, these changes are also heavily covered by existing tests, which exercise the encoding each time an Estimator is created and fitted, or when an effect is estimated.

drawlinson commented 4 months ago

hi @amit-sharma are you able to take a look at this one? Thanks!

drawlinson commented 3 months ago

@amit-sharma I added some tests which aim to verify that encoding is consistent when the row order of the data is permuted. It was a bit tricky working within the interfaces of the Estimator classes, so I focused on estimate_effect() and do(x). With Regression estimators the effects of common causes are additive, so the ATE is almost unchanged despite changes in these variables! To check that these variables are encoded consistently, I used the do() operator, the result of which is affected by common causes.
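A rough sketch of why the ATE alone is a weak check, using a plain scikit-learn linear model rather than the DoWhy estimators. The simulated data, the `do` helper and the shift applied to `w` are all assumptions made for the illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 200
w = rng.normal(size=n)            # a single common cause (simulated)
t = rng.integers(0, 2, size=n)    # binary treatment (simulated)
y = 2.0 * t + 3.0 * w + rng.normal(scale=0.1, size=n)

model = LinearRegression().fit(np.column_stack([t, w]), y)

def do(treatment_value, w_values):
    """Hypothetical helper: mean predicted outcome with the treatment fixed."""
    X = np.column_stack([np.full(len(w_values), treatment_value), w_values])
    return model.predict(X).mean()

# Stand-in for common causes whose values changed, e.g. through a different encoding.
w_shifted = w + 5.0

# The additive common-cause term cancels in the difference, so the ATE barely moves.
ate_original = do(1, w) - do(0, w)
ate_shifted = do(1, w_shifted) - do(0, w_shifted)

# The interventional mean itself does depend on the common causes.
do_original = do(1, w)
do_shifted = do(1, w_shifted)
```

In this sketch the two ATE values agree up to floating-point error, while the two do() values differ substantially, so a test on do() can detect a changed encoding that a test on the ATE would miss.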

In the process I discovered that the RegressionEstimator implementation of do() has a seemingly long-standing bug where the order of the arguments is reversed:

CausalEstimator base class (treatment_value, dataframe): def _do(self, x, data_df=None):

RegressionEstimator (dataframe, treatment_value): def _do(self, data_df: pd.DataFrame, treatment_val):

I've fixed RegressionEstimator to match the base class interface. I searched for all instances of _do( and only needed to fix the implementation of estimate_effect in Regression.

Changed from: effect_estimate = self._do(data, treatment_value) - self._do(data, control_value)

to: effect_estimate = self._do(treatment_value, data) - self._do(control_value, data)
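The mismatch can be reproduced with stand-in classes. This is a sketch only: the class names and the toy outcome formula are hypothetical, and only the two `_do` signatures are taken from the comment above.

```python
import pandas as pd

class BaseSketch:
    # Same argument order as the base class quoted above: treatment value first.
    def _do(self, x, data_df=None):
        raise NotImplementedError

class ReversedSketch(BaseSketch):
    # Old order: dataframe first, treatment value second.
    def _do(self, data_df, treatment_val):
        return float((2.0 * treatment_val + data_df["w"]).mean())

class FixedSketch(BaseSketch):
    # Matches the base-class interface.
    def _do(self, x, data_df=None):
        return float((2.0 * x + data_df["w"]).mean())

data = pd.DataFrame({"w": [1.0, 2.0, 3.0]})

# With the fixed signature, the call follows the base-class convention.
fixed = FixedSketch()
effect = fixed._do(1, data) - fixed._do(0, data)

# Calling the reversed version with the base-class convention binds the
# treatment value to data_df, which fails when it is indexed like a dataframe.
try:
    ReversedSketch()._do(1, data)
    misbound = False
except TypeError:
    misbound = True
```

Any caller that follows the base-class order hits the error in the reversed version, which is why the call in estimate_effect had to change together with the signature.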

I'm sorry this has turned into a big PR but hopefully it's worth it!

drawlinson commented 3 months ago

The "Build docs" job appears to be failing due to a lack of disk space in the worker environment.