py-why / dowhy

DoWhy is a Python library for causal inference that supports explicit modeling and testing of causal assumptions. DoWhy is based on a unified language for causal inference, combining causal graphical models and potential outcomes frameworks.
https://www.pywhy.org/dowhy
MIT License
7.01k stars, 923 forks

DoWhy Logistic Regression with Stats Api #296

Open cleli94 opened 3 years ago

cleli94 commented 3 years ago

Dear authors,

I am using dowhy for a project, and it is a GREAT tool!

Basically, I was comparing the results of the backdoor method with logistic regression using the statsmodels API, as you suggested, against a method I wrote from scratch using scikit-learn. The results were very different, and mine seemed more plausible. Moreover, if I am not mistaken, the result should be the same as the S-Learner with logistic regression. Mine matched it, while the statsmodels-based result was very different.

I think there could be an issue with the GLM method: when you call .predict on a GLM from statsmodels, you do not obtain the class prediction (i.e., 0 or 1) but the probability, whereas in scikit-learn .predict returns the class prediction directly.

So, is it true that you are using the probabilities returned by .predict? If so, why do you use the probabilities rather than the class predictions to compute the ATE?

Thank you very much in advance!

amit-sharma commented 2 years ago

In most cases, probabilities are the correct output to use for computing the causal effect on a binary outcome. The expression is E[Y|do(T=1)] - E[Y|do(T=0)] = P[Y=1|do(T=1)] - P[Y=1|do(T=0)], so it makes sense to use the probabilities.
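The identity follows because the expectation of a binary outcome is its probability of being 1. Spelled out in the same notation (a short derivation added for clarity, not a quote from the DoWhy documentation):

```latex
\begin{aligned}
\mathbb{E}[Y \mid do(T=t)]
  &= 1 \cdot P(Y=1 \mid do(T=t)) + 0 \cdot P(Y=0 \mid do(T=t))
   = P(Y=1 \mid do(T=t)), \\
\text{ATE}
  &= \mathbb{E}[Y \mid do(T=1)] - \mathbb{E}[Y \mid do(T=0)]
   = P(Y=1 \mid do(T=1)) - P(Y=1 \mid do(T=0)).
\end{aligned}
```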

As an extreme example, suppose that T and Y are both binary and there are no confounders. The true generating equation for Y is y = Bernoulli(sigmoid(t*beta + N(0,0.01))) with beta = 0, so the causal effect of T on Y is zero.
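The extreme case above can be simulated with a minimal standard-library sketch. Several things here are assumptions made for illustration: 0.01 is read as the noise variance (standard deviation 0.1), the function names are hypothetical, and the "fitted model" is replaced by per-group means of Y. That substitution is exact for a logistic regression with an intercept and a single binary regressor, since its maximum-likelihood fit reproduces the observed share of Y=1 in each treatment group.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def simulate(n=20000, beta=0.0, noise_sd=0.1, seed=0):
    """Draw (t, y) pairs from y ~ Bernoulli(sigmoid(t*beta + noise))."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        t = rng.randint(0, 1)  # binary treatment, no confounders
        p = sigmoid(t * beta + rng.gauss(0.0, noise_sd))
        y = 1 if rng.random() < p else 0
        data.append((t, y))
    return data


def fitted_probabilities(data):
    """Fitted P(Y=1 | T=t) for t in {0, 1}, computed as group means of y."""
    probs = {}
    for t in (0, 1):
        ys = [y for (ti, y) in data if ti == t]
        probs[t] = sum(ys) / len(ys)
    return probs


def ate_from_probabilities(probs):
    # Difference of predicted probabilities: P(Y=1|T=1) - P(Y=1|T=0).
    return probs[1] - probs[0]


def ate_from_classes(probs, threshold=0.5):
    # Thresholding first assigns every unit in a group the same 0/1 label.
    label = {t: int(p > threshold) for t, p in probs.items()}
    return label[1] - label[0]


data = simulate()
probs = fitted_probabilities(data)
print("fitted probabilities:  ", probs)
print("ATE from probabilities:", ate_from_probabilities(probs))
print("ATE from class labels: ", ate_from_classes(probs))
```

Both fitted probabilities sit close to 0.5, so their difference stays near the true effect of zero. The class-based estimate can only be -1, 0, or 1, and it jumps to ±1 whenever sampling noise puts the two probabilities on opposite sides of the 0.5 threshold.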

Still, it can be useful to have the option of outputting the class prediction directly, e.g., for comparison with a default logistic metalearner. I've opened PR #386, which adds an argument predict_score to the GLM estimator. It can be specified in method_params of estimate_effect and is True by default, which keeps the current behavior of using the predicted probabilities.
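A minimal sketch of how the option might be passed. The wrapper name estimate_glm_effect is hypothetical, and the flat method_params layout is an assumption based on the description above. The method name "backdoor.generalized_linear_model" and the glm_family key follow DoWhy's documented GLM usage as far as I know, but the expected structure of method_params may differ between DoWhy releases, so check the version you have installed.

```python
def estimate_glm_effect(model, identified_estimand, glm_family, predict_score=True):
    """Estimate a backdoor effect with DoWhy's GLM estimator.

    model: a dowhy.CausalModel (anything exposing estimate_effect works here)
    identified_estimand: the result of model.identify_effect()
    glm_family: e.g. statsmodels.api.families.Binomial() for logistic regression
    predict_score: True uses predicted probabilities (the default);
        False is assumed to switch to 0/1 class predictions
    """
    return model.estimate_effect(
        identified_estimand,
        method_name="backdoor.generalized_linear_model",
        method_params={
            "glm_family": glm_family,
            "predict_score": predict_score,
        },
    )
```

With `import statsmodels.api as sm`, a call such as `estimate_glm_effect(model, identified_estimand, sm.families.Binomial(), predict_score=False)` would then request the class-prediction variant for comparison with a scikit-learn S-learner.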