matheusfacure / python-causality-handbook

Causal Inference for the Brave and True. A light-hearted yet rigorous approach to learning about impact estimation and causality.
https://matheusfacure.github.io/python-causality-handbook/landing-page.html
MIT License

Issue: Chapter 7, "Bad COP" section #261

Open imurray opened 2 years ago

imurray commented 2 years ago

There is an issue in Chapter 07, in the "Bad COP" section:

The issue is in estimating the COP part. It will be biased even under random assignment. ... I sincerely hope this convinces you to avoid COP like the plague. I see too many Data Scientists doing this separate estimation, unaware of the problems that it brings.

I'm afraid I'm not convinced: I can't follow the reasoning, which appears to either be missing some details, or is wrong.

As the chapter says, the two-part decomposition that's described is true, and (unlike what the chapter seems to say) each part can be estimated from the observable data. Just to be explicit, there's python code for a toy example below. As we should expect, we get precisely the same E[Y|T] and ATE values regardless of whether we estimate the expectation directly for each treatment, or break the estimate for each treatment into two parts.

Is the "bias" the chapter describes from assuming the expectation is something it's not? In the relevant examples, the conditioned-on-positive (COP) expectation really can change under treatment, and that change is a valid part of the effect of the treatment. I think the chapter must be warning against some wrong use or interpretation of the COP expectation? Perhaps using it in place of E[Y|T], which would just be wrong? But I can't find where the chapter is specific about what the wrong procedure is. Instead, the whole approach is deemed flawed.

The two-step approach (also known as a hurdle model) is a natural and sensible model for data with a spike at zero. There's no point adding the complication for the toy example below (it's not wrong, it just makes no difference), but where we need to control for other variables, the added structure of the two-step approach often leads to a parametric model that fits real-world data better. I don't think "Data Scientists" are ignorant for using this style of model!

If there really is a reason to avoid this modelling approach "like the plague", I'd love to understand it, but I think the chapter needs to be more specific about its claims.

import numpy as np

N = 1000

# Unobserved property for each individual:
rich = np.random.rand(N) < 0.3

# Random assignment of treatments, independent of "rich":
treated = np.random.rand(N) < 0.2

# Create model params for each example, depending on whether in control or treated.
# Can interact with latent variable "rich".
gparam_1 = 5 + 2*rich + 3*treated + rich*treated
gparam_2 = 50 + 10*rich + 20*treated + 5*rich*treated
hurdle_probs = 0.6 + 0.2*treated
#hurdle_probs = rich | treated # would work fine too

# Simulate data
mask = np.random.rand(N) < hurdle_probs
Y = mask * np.random.gamma(gparam_1, gparam_2, N)

# Straightforward estimate of Average Treatment Effect (ATE)
ATE1 = Y[treated].mean() - Y[~treated].mean()
print(ATE1)

# Two part estimates, only depend on observed data, and known treatment labels:
p0 = (Y[~treated] > 0).mean()
p1 = (Y[treated] > 0).mean()
cop0 = (Y[~treated & (Y > 0)]).mean()
cop1 = (Y[treated & (Y > 0)]).mean()

# Estimate of ATE using the 2-part estimates:
ATE2 = p1*cop1 - p0*cop0
print(ATE2) # the same as before
matheusfacure commented 2 years ago

You are correct. This part needs some rewriting because it is very confusing. What I meant there was that computing p1 * (cop1 - cop0) was the problem. This is more common than you might think, as people expect the effect of the treatment to be the effect on those that converted (the COP difference) times the conversion rate. This will NOT recover the effect on the converted, which is what people think it does.
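A tiny simulation (hypothetical numbers, in the spirit of the snippet above) may make the gap concrete. The identity ATE = p1*(cop1 - cop0) + (p1 - p0)*cop0 shows that the flawed estimator p1*(cop1 - cop0) silently drops the participation term (p1 - p0)*cop0:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000

# Random treatment; treatment raises both the conversion rate and spend-if-converted.
treated = rng.random(N) < 0.5
p = 0.3 + 0.2 * treated              # conversion probability: 0.3 control, 0.5 treated
converted = rng.random(N) < p
Y = converted * (10 + 5 * treated)   # deterministic spend given conversion, for clarity

p0 = (Y[~treated] > 0).mean()
p1 = (Y[treated] > 0).mean()
cop0 = Y[~treated & (Y > 0)].mean()
cop1 = Y[treated & (Y > 0)].mean()

ate = Y[treated].mean() - Y[~treated].mean()  # equals p1*cop1 - p0*cop0 exactly
naive = p1 * (cop1 - cop0)                    # the flawed "COP effect times conversion rate"

print(ate)    # ~ 0.5*15 - 0.3*10 = 4.5
print(naive)  # ~ 0.5*5 = 2.5, missing the (p1 - p0)*cop0 participation term
```

The two-part estimator p1*cop1 - p0*cop0 is an algebraic identity for the difference in means, so it is fine; only the p1*(cop1 - cop0) shortcut is biased for the ATE.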

imurray commented 2 years ago

Thanks for the reply, and confirming what was intended.

I can believe some people forget that the "conversion probability" p changes too. It is strange in the context of marketing though (the example given), where in some businesses a large part of the treatment effect is getting people to buy something at all, rather than pushing existing customers towards a more expensive item. For a company that only had one service at a fixed price, the entire treatment effect of marketing would be in the conversion probability.
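That fixed-price case can be sketched in a few lines (hypothetical numbers): the COP difference is exactly zero, so the p1*(cop1 - cop0) shortcut reports no effect, while the true ATE is entirely driven by conversion:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100_000
PRICE = 30.0  # a single service at a fixed price: spend-if-converted never changes

treated = rng.random(N) < 0.5
p = 0.2 + 0.3 * treated          # marketing only moves the conversion probability
Y = (rng.random(N) < p) * PRICE

cop0 = Y[~treated & (Y > 0)].mean()  # = PRICE
cop1 = Y[treated & (Y > 0)].mean()   # = PRICE, so cop1 - cop0 == 0
ate = Y[treated].mean() - Y[~treated].mean()

print(cop1 - cop0)  # 0.0: the COP comparison sees no effect at all
print(ate)          # ~ 0.3 * 30 = 9: the whole effect lives in the conversion rate
```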

dvanlunen commented 1 year ago

+1. Someone referenced this to me, and I did the same simulation exercise to show there is no bias.

But I should say, I love your book because it's extremely clear!

kane9530 commented 12 months ago

Thanks @matheusfacure for the marvellous book, I have been enjoying the readings!

I am struggling to understand the "Bad COP" part of this chapter, and I still remain rather confused after reading this issue. I am hoping to gain some clarity here.

Specifically, I have trouble digesting the idea that, even if the hurdle-model approach of partitioning the expectation into a participation component and a COP component is mathematically sound, its use in causal estimation nevertheless suffers from a selection bias.

@imurray seems to have shown that there is nothing wrong with the hurdle model in the final part of the Python code, where ATE1 = ATE2, and ATE2 is computed, I believe, as

(screenshot of the formula: ATE2 = P(Y>0|T=1) * E[Y|Y>0, T=1] - P(Y>0|T=0) * E[Y|Y>0, T=0], i.e. p1*cop1 - p0*cop0)

My current understanding is that the problem arises from the wrong use of the expectation/probability decomposition when computing ATE2 in the hurdle model, rather than from an issue with the hurdle model per se. Why is the wrong formula typically used?