py-why / dowhy

DoWhy is a Python library for causal inference that supports explicit modeling and testing of causal assumptions. DoWhy is based on a unified language for causal inference, combining causal graphical models and potential outcomes frameworks.
https://www.pywhy.org/dowhy
MIT License
6.99k stars 923 forks source link

Fix to ensure that all CATE estimates are returned #988

Closed amit-sharma closed 1 year ago

amit-sharma commented 1 year ago

Fixes #890 The CATE estimates array structure is like this: image

@kbattocchi Is the estimates array returned by EconML nested like this? Does the structure look good?

kbattocchi commented 1 year ago

@amit-sharma on the EconML side we accept a number of different possibilities, which I'll explain just for completeness:

When fitting, each of T and Y could be a 1-dimensional vector or a 2-dimensional array (with one or more columns), and X could be either a 2-dimensional array of features or None (which will effectively result in the computation of an ATE rather than a CATE).

When calling effect(T0, T1, X), as a convenience we allow using scalars for the initial and target treatments, which we then treat as if they had as many rows as X does; if X is None and the treatments are passed as scalars then we treat it as if there were a single row. The result will have the same dimensionality as Y (if Y was a 1-dimensional vector then so will the result from effect, otherwise it will be 2-dimensional with the same number of columns as there were outputs when fitting).

So in this example, if Y was 2-dimensional but had a single outcome, then the shape of the output looks correct.

However, implementation-wise, I don't think this fix is correct: EconML will expand the treatments internally if necessary, so passing them as scalars is fine and you shouldn't need to perform this expansion yourself; instead I think that the root cause is that somehow X is not being passed through, so because it's None you're only currently only getting one row back. Manually expanding the treatments will give you the right number of rows back, but it's going to just have the ATE repeated that number of times rather than the CATE based on the actual values of X as desired.

amit-sharma commented 1 year ago

thank you @kbattocchi for your detailed response. Agree that just broadcasting T0 or T1 should not be the right solution.

When I looked at the issue #890 again, I realized that there is no bug in the DoWhy code. Rather, the reason is that they had not provided any effect modifiers in their dataset/graph. Once at least one effect modifier is introduced, the CATE estimates are returned. I replied on that issue.

I think we can close this PR.

amit-sharma commented 1 year ago

Thanks @kbattocchi for raising the issue #990 That example was helpful to identify the bug. I have revised the PR.

  1. The error was not in the Econml class, but in the way the effect_modifier parameter was being passed to it. So the relevant changes are in causal model and causal graph class.
  2. I have added test cases based on your issue, so now we can track any related error due to future PRs.