py-why / EconML

ALICE (Automated Learning and Intelligence for Causation and Economics) is a Microsoft Research project aimed at applying Artificial Intelligence concepts to economic decision making. One of its goals is to build a toolkit that combines state-of-the-art machine learning techniques with econometrics in order to bring automation to complex causal inference problems. To date, the ALICE Python SDK (econml) implements orthogonal machine learning algorithms such as the double machine learning work of Chernozhukov et al. This toolkit is designed to measure the causal effect of some treatment variable(s) t on an outcome variable y, controlling for a set of features x.
https://www.microsoft.com/en-us/research/project/alice/

Orthogonal/Double ML: interaction of multiple treatments academic reference? #281

Open ghost opened 4 years ago

ghost commented 4 years ago

Hello,

I'd like to know whether there is an academic paper or other source with a proof showing that it is valid to create interaction terms of two different treatment variables in each stage of Double ML.

I noticed the section "What if I have many treatments?" in the EconML Double ML user guide. It says:

"The method is going to assume that each of these treatments enters linearly into the model. So it cannot capture complementarities or substitutabilities of the different treatments. For that you can also create composite treatments that look like the product of two base treatments. Then these product will enter in the model and an effect for that product will be estimated. This effect will be the substitute/complement effect of both treatments being present, i.e. If your treatments are too many, then you can use the SparseLinearDMLCateEstimator. However, this method will essentially impose a regularization that only a small subset of them has any effect."

Could you point me to any academic paper or reference for the above way of handling treatment interactions? Thank you!

vsyrgkanis commented 4 years ago

The theory presented in https://arxiv.org/abs/1608.00060 is general enough to capture this extension too as long as the number of interaction terms is fixed and doesn't grow with the sample size. The idea is just that you can "redefine" what the treatment variable is.

You can define composite treatments of the form: (\tilde{T}_1, \tilde{T}_2, \tilde{T}_3) = (T_1, T_2, T_1 * T_2) (and similarly for more treatments).

Then, as long as the outcome is linear in these composite treatments, i.e. Y = theta*\tilde{T} + g(X) + epsilon with E[epsilon | \tilde{T}, X] = 0, the same theory applies.

Another place where such hard-coded featurizations have been used explicitly is this work on policy learning with continuous actions: https://arxiv.org/pdf/1905.10116.pdf. There the goal was policy learning rather than CATE estimation, though the two tasks are closely related.
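For concreteness, here is a minimal sketch of the "redefine the treatment" idea on synthetic data. It assumes the current EconML API, where the estimator is called `LinearDML` (older releases named it `LinearDMLCateEstimator`); the data-generating process and the first-stage model choices are illustrative only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from econml.dml import LinearDML

rng = np.random.default_rng(0)
n = 5000
W = rng.normal(size=(n, 3))                      # confounders
p1 = 1 / (1 + np.exp(-W[:, 0]))                  # T1 depends on W (confounded)
T1 = rng.binomial(1, p1)
T2 = rng.binomial(1, 0.5, size=n)
# True effects: a = 1.0, b = 2.0, interaction c = -0.5
Y = 1.0 * T1 + 2.0 * T2 - 0.5 * T1 * T2 + W[:, 0] + rng.normal(size=n)

# Redefine the treatment: \tilde{T} = (T1, T2, T1*T2)
T_tilde = np.column_stack([T1, T2, T1 * T2])

est = LinearDML(model_y=RandomForestRegressor(min_samples_leaf=20),
                model_t=RandomForestRegressor(min_samples_leaf=20),
                discrete_treatment=False, random_state=0)
est.fit(Y, T_tilde, X=None, W=W)

# Coefficients on (T1, T2, T1*T2); the third entry estimates the
# complement/substitute effect c.
print(est.const_marginal_effect())
```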

ghost commented 4 years ago

Thanks a lot, @vsyrgkanis. This is very helpful!

Suppose my outcome variable Y is heart rate, and we are interested in 3 treatments: did the person walk around or do more intense exercise in the 10 minutes before the heart rate measurement (T1 = exercise); did the person sleep well the night before (T2 = sleep); and did the person eat more than 200 calories of food within 1 hour of the measurement (T3 = eat). Then we have a few pre-treatment confounders X, such as age, gender, presence of other diseases, BMI, etc.

Now we want to fit Double ML to see whether T1 (exercise), T2 (sleep), and T3 (eat) cause higher or lower heart rate Y. Per your recommended approach, shall we simply create a multivariate vector to represent all relevant treatment combinations? Specifically, which way below is correct, (A) or (B)?

Approach (A): post-featurized treatment vector T = [T1, T2, T3, T1*T2, T2*T3, T1*T3] = [exercise, sleep, eat, exercise*sleep, sleep*eat, exercise*eat]. For each data point, is it okay for more than one position in this length-6 treatment vector to take the value 1? For example, the following data points would end up with the treatment vectors below, correct? (A small featurization sketch follows the table.)

| | T1 exercise | T2 sleep | T3 eat | Featurized T with interactions for Double ML |
| --- | --- | --- | --- | --- |
| Data point 1 | 0 | 1 | 0 | [0, 1, 0, 0, 0, 0] |
| Data point 2 | 1 | 1 | 1 | [1, 1, 1, 1, 1, 1] |
| Data point 3 | 0 | 1 | 1 | [0, 1, 1, 0, 1, 0] |
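The featurization above can be written as a small helper (hypothetical, not an EconML API), reproducing the three rows of the table:

```python
import numpy as np

# Raw binary treatments for the three data points above
# (columns: exercise, sleep, eat).
raw = np.array([[0, 1, 0],
                [1, 1, 1],
                [0, 1, 1]])

def featurize(T):
    """Append the pairwise interactions: [T1, T2, T3, T1*T2, T2*T3, T1*T3]."""
    T1, T2, T3 = T[:, 0], T[:, 1], T[:, 2]
    return np.column_stack([T1, T2, T3, T1 * T2, T2 * T3, T1 * T3])

print(featurize(raw))
# [[0 1 0 0 0 0]
#  [1 1 1 1 1 1]
#  [0 1 1 0 1 0]]
```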

Just want to confirm that we do not need to create the full factorial encoding as illustrated in approach (B) below, correct?

Approach (B): factorial treatment T = [Ta, Tb, Tc, Td, Te, Tf, Tg, Th], where

- Ta = (T1 = 0, T2 = 0, T3 = 0) = (exercise = 0, sleep = 0, eat = 0)
- Tb = (T1 = 0, T2 = 0, T3 = 1) = (exercise = 0, sleep = 0, eat = 1)
- Tc = (T1 = 0, T2 = 1, T3 = 0) = (exercise = 0, sleep = 1, eat = 0)
- Td = (T1 = 1, T2 = 0, T3 = 0) = (exercise = 1, sleep = 0, eat = 0)
- Te = (T1 = 1, T2 = 1, T3 = 0) = (exercise = 1, sleep = 1, eat = 0)
- Tf = (T1 = 1, T2 = 0, T3 = 1) = (exercise = 1, sleep = 0, eat = 1)
- Tg = (T1 = 0, T2 = 1, T3 = 1) = (exercise = 0, sleep = 1, eat = 1)
- Th = (T1 = 1, T2 = 1, T3 = 1) = (exercise = 1, sleep = 1, eat = 1)

An encoding sketch follows the table below.

| | T1 exercise | T2 sleep | T3 eat | Ta | Tb | Tc | Tg | Th |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Data point 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| Data point 2 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 |
| Data point 3 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 |

(Columns Td, Te, Tf are omitted; they are 0 for all three data points.)
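A matching sketch of the full factorial encoding of approach (B), again using the three hypothetical data points:

```python
import numpy as np

raw = np.array([[0, 1, 0],
                [1, 1, 1],
                [0, 1, 1]])  # columns: exercise, sleep, eat

# The 8 configurations, ordered to match Ta..Th above.
configs = [(0, 0, 0), (0, 0, 1), (0, 1, 0), (1, 0, 0),
           (1, 1, 0), (1, 0, 1), (0, 1, 1), (1, 1, 1)]

# One-hot encode: exactly one indicator is 1 per data point.
onehot = np.array([[int(tuple(row) == c) for c in configs] for row in raw])
print(onehot)
# Data point 1 -> Tc, data point 2 -> Th, data point 3 -> Tg
```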

Thanks again! :)

vsyrgkanis commented 4 years ago

@raylinz both approaches are correct, but each has pros and cons.

The first approach makes a structural assumption about how complementarity effects arise, namely the additive model

E[Y|T] = a*T_1 + b*T_2 + c*T_1*T_2

The second approach is a fully non-parametric approach with respect to T, i.e.

E[Y|T] = f(T_1, T_2)

where f can be any function. Equivalently you can write this as a linear function of the configurations in approach B:

E[Y|T] = \sum_{t1, t2\in {0, 1}}  f(t1, t2) * 1\{T1=t1, T2=t2\}

which is exactly your approach B.

So the second approach is still correct but fully non-parametric. This can have huge variance, since essentially you are treating each treatment combination as a separate treatment: e.g., the estimated outcome for T1=1, T2=1 will not use any information from samples that had only T1=1 or only T2=1, but will only use samples that received exactly this treatment combination. So you will need a lot of samples and a lot of exploration/randomness/natural experimentation in your data, i.e. every treatment combination would need a positive probability of assignment conditional on each X, W. The first approach needs far fewer samples and weaker exploration. So in most practical cases I would go with approach (A), especially if you have more than 2 treatments.
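A quick sanity check of that positivity requirement (a hypothetical snippet, not part of the discussion) is to tabulate the empirical frequency of every treatment combination, ideally also within strata of X, W:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical observed binary treatments for n individuals.
n = 10_000
T = pd.DataFrame(rng.binomial(1, [0.5, 0.7, 0.2], size=(n, 3)),
                 columns=["exercise", "sleep", "eat"])

# Empirical frequency of each of the 2^3 combinations; combinations with
# (near-)zero mass signal a positivity problem for approach (B).
print(T.value_counts(normalize=True).sort_index())
```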

ghost commented 4 years ago

@vsyrgkanis Yep, that's what I thought, thanks a lot for confirming this. +1 that approach A seems much more practical.

Chandrak1907 commented 4 years ago

@raylinz This might be helpful -- https://arxiv.org/pdf/2001.06483.pdf (Estimation of Causal Effects of Multiple Treatments in Observational Studies with a Binary Outcome)

ghost commented 4 years ago

@Chandrak1907 Thank you! Will take a look at this paper - looks very helpful