py-why / dowhy

DoWhy is a Python library for causal inference that supports explicit modeling and testing of causal assumptions. DoWhy is based on a unified language for causal inference, combining causal graphical models and potential outcomes frameworks.
https://www.pywhy.org/dowhy
MIT License

Error in Conditional Effect Estimation in backdoor.linear_regression_estimator #401

Open kenleejr opened 2 years ago

kenleejr commented 2 years ago

Hi all,

I tried to use the linear_regression estimator from the tutorial. I used the exact notebook but replaced the estimation step with this:

lr_estimate = model.estimate_effect(identified_estimand,
                                    control_value=False,
                                    treatment_value=True,
                                    method_name='backdoor.linear_regression',
                                    test_significance=False)

but got this stack trace:

/usr/local/lib/python3.7/dist-packages/statsmodels/tools/_testing.py:19: FutureWarning: pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead.
  import pandas.util.testing as tm
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
[<ipython-input-10-e35f61c84e4d>](https://localhost:8080/#) in <module>()
     29                                     treatment_value=True,
     30                                     method_name=Estimator.BACKDOOR_LINEAR_REGRESSION.value,
---> 31                                     test_significance=False)
     32 
     33 

11 frames
[/usr/local/lib/python3.7/dist-packages/dowhy/causal_model.py](https://localhost:8080/#) in estimate_effect(self, identified_estimand, method_name, control_value, treatment_value, test_significance, evaluate_effect_strength, confidence_intervals, target_units, effect_modifiers, method_params)
    229                 params=method_params
    230             )
--> 231             estimate = causal_estimator.estimate_effect()
    232             # Store parameters inside estimate object for refutation methods
    233             # TODO: This add_params needs to move to the estimator class

[/usr/local/lib/python3.7/dist-packages/dowhy/causal_estimator.py](https://localhost:8080/#) in estimate_effect(self)
    168         :returns: A CausalEstimate instance that contains point estimates of average and conditional effects. Based on the parameters provided, it optionally includes confidence intervals, standard errors,statistical significance and other statistical parameters.
    169         """
--> 170         est = self._estimate_effect()
    171         est.add_estimator(self)
    172 

[/usr/local/lib/python3.7/dist-packages/dowhy/causal_estimators/regression_estimator.py](https://localhost:8080/#) in _estimate_effect(self, data_df, need_conditional_estimates)
     49             conditional_effect_estimates = self._estimate_conditional_effects(
     50                     self._estimate_effect_fn,
---> 51                     effect_modifier_names=self._effect_modifier_names)
     52         intercept_parameter = self.model.params[0]
     53         estimate = CausalEstimate(estimate=effect_estimate,

[/usr/local/lib/python3.7/dist-packages/dowhy/causal_estimator.py](https://localhost:8080/#) in _estimate_conditional_effects(self, estimate_effect_fn, effect_modifier_names, num_quantiles)
    234         by_effect_mods = self._data.groupby(effect_modifier_names)
    235         cond_est_fn = lambda x: self._do(self._treatment_value, x) -self._do(self._control_value, x)
--> 236         conditional_estimates = by_effect_mods.apply(estimate_effect_fn)
    237         # Deleting the temporary categorical columns
    238         for em in effect_modifier_names:

[/usr/local/lib/python3.7/dist-packages/pandas/core/groupby/groupby.py](https://localhost:8080/#) in apply(self, func, *args, **kwargs)
   1273         with option_context("mode.chained_assignment", None):
   1274             try:
-> 1275                 result = self._python_apply_general(f, self._selected_obj)
   1276             except TypeError:
   1277                 # gh-20949

[/usr/local/lib/python3.7/dist-packages/pandas/core/groupby/groupby.py](https://localhost:8080/#) in _python_apply_general(self, f, data)
   1307             data after applying f
   1308         """
-> 1309         keys, values, mutated = self.grouper.apply(f, data, self.axis)
   1310 
   1311         return self._wrap_applied_output(

[/usr/local/lib/python3.7/dist-packages/pandas/core/groupby/ops.py](https://localhost:8080/#) in apply(self, f, data, axis)
    850             # group might be modified
    851             group_axes = group.axes
--> 852             res = f(group)
    853             if not _is_indexed_like(res, group_axes, axis):
    854                 mutated = True

[/usr/local/lib/python3.7/dist-packages/dowhy/causal_estimators/regression_estimator.py](https://localhost:8080/#) in _estimate_effect_fn(self, data_df)
     61 
     62     def _estimate_effect_fn(self, data_df):
---> 63         est = self._estimate_effect(data_df, need_conditional_estimates=False)
     64         return est.value
     65 

[/usr/local/lib/python3.7/dist-packages/dowhy/causal_estimators/regression_estimator.py](https://localhost:8080/#) in _estimate_effect(self, data_df, need_conditional_estimates)
     44             self.logger.debug(self.model.summary())
     45         # All treatments are set to the same constant value
---> 46         effect_estimate = self._do(self._treatment_value, data_df) - self._do(self._control_value, data_df)
     47         conditional_effect_estimates = None
     48         if need_conditional_estimates:

[/usr/local/lib/python3.7/dist-packages/dowhy/causal_estimators/regression_estimator.py](https://localhost:8080/#) in _do(self, treatment_val, data_df)
    115         new_features = self._build_features(treatment_values=interventional_treatment_2d,
    116                 data_df=data_df)
--> 117         interventional_outcomes = self.model.predict(new_features)
    118         return interventional_outcomes.mean()
    119 

[/usr/local/lib/python3.7/dist-packages/statsmodels/base/model.py](https://localhost:8080/#) in predict(self, exog, transform, *args, **kwargs)
   1036 
   1037         predict_results = self.model.predict(self.params, exog, *args,
-> 1038                                              **kwargs)
   1039 
   1040         if exog_index is not None and not hasattr(predict_results,

[/usr/local/lib/python3.7/dist-packages/statsmodels/regression/linear_model.py](https://localhost:8080/#) in predict(self, params, exog)
    362             exog = self.exog
    363 
--> 364         return np.dot(exog, params)
    365 
    366     def get_distribution(self, params, scale, exog=None, dist_class=None):

<__array_function__ internals> in dot(*args, **kwargs)

ValueError: shapes (44906,161) and (201,) not aligned: 161 (dim 1) != 201 (dim 0)
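For reference, the rest of my notebook follows the tutorial's standard flow; roughly the sketch below (the dataset parameters and variable names here are placeholders, not the exact values from my notebook):

```python
import dowhy.datasets
from dowhy import CausalModel

# Simulated data with a binary treatment plus categorical common causes and
# effect modifiers (placeholder parameters, not the tutorial's exact ones)
data = dowhy.datasets.linear_dataset(
    beta=10,
    num_common_causes=4,
    num_discrete_common_causes=2,
    num_effect_modifiers=2,
    num_discrete_effect_modifiers=1,
    num_samples=10000,
    treatment_is_binary=True,
)

model = CausalModel(
    data=data["df"],
    treatment=data["treatment_name"],
    outcome=data["outcome_name"],
    graph=data["gml_graph"],
)

identified_estimand = model.identify_effect(proceed_when_unidentifiable=True)
```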

I did some digging, and I'm pretty sure the mismatch has to do with how the estimator featurizes the sub-dataframes which result from grouping by effect modifiers.

If your dataframe has categorical variables (as is the case here), pd.get_dummies in the estimator's _build_features() method can produce a different number of columns depending on which categorical levels are represented in the effect-modifier-stratified sub-dataframes. The estimator initially fits the linear regression model on the complete dataset, where all categorical levels are present, but when it estimates treatment effects within the effect-modifier strata, some levels may be missing, so pd.get_dummies produces fewer columns than the regression model expects.
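Here is a self-contained illustration of that failure mode outside of DoWhy (the column names are made up):

```python
import pandas as pd

# Toy data: a categorical common cause W0 with three levels, and a
# binary effect modifier X0 that the estimator groups by
df = pd.DataFrame({
    "W0": ["a", "b", "c", "a", "a", "b"],
    "X0": [0,   0,   0,   1,   1,   1],
})

# Featurizing the full data yields one dummy column per level of W0
print(pd.get_dummies(df["W0"]).shape[1])      # 3

# The stratum X0 == 1 never observes level "c", so the same call
# produces fewer columns than the fitted regression model expects,
# and model.predict() fails with the "shapes not aligned" ValueError
sub_df = df[df["X0"] == 1]
print(pd.get_dummies(sub_df["W0"]).shape[1])  # 2
```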

Another note: if you have numerical effect modifiers, these are converted into new quantized categorical features whose names are prefixed with "categorical". These are appended to the existing dataframe, but the original numeric columns are still present when _do and then model.predict are called, which will also result in a dimension mismatch.
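As a user-side mitigation for the first problem (just a sketch, not a fix inside DoWhy), declaring the full set of category levels on the original dataframe should make the dummy encoding stable across strata, assuming the estimator's pd.get_dummies call sees the declared categories:

```python
import pandas as pd

df = pd.DataFrame({
    "W0": ["a", "b", "c", "a", "a", "b"],
    "X0": [0, 0, 0, 1, 1, 1],
})

# Fix the levels on the full dataset so every subset encodes to the same
# dummy columns, even when a level is absent from that subset
df["W0"] = pd.Categorical(df["W0"], categories=["a", "b", "c"])

sub_df = df[df["X0"] == 1]                             # stratum with no "c"
print(pd.get_dummies(sub_df["W0"]).columns.tolist())   # ['a', 'b', 'c']
```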

amit-sharma commented 2 years ago

Thank you for raising this issue @kenleejr, and for looking deeper to identify the source of this dimension mismatch. I will have a look and raise a PR to fix this.