grf-labs / grf

Generalized Random Forests
https://grf-labs.github.io/grf/
GNU General Public License v3.0

causal_forest without orthogonalization #1160

Closed: njawadekar closed this issue 2 years ago

njawadekar commented 2 years ago

I understand that by default the causal_forest function uses orthogonalization; that is, it incorporates both the actual values of the outcome (Y) and treatment (W) and their predicted values (Y.hat and W.hat), both to estimate CATEs and to make splits throughout the forest. However, I have a question related to this:

Q: Is there any way to turn "off" the orthogonalization? That is, for demonstration purposes (not practical purposes), I want to simulate how the performance of causal_forest degrades when it is trained only on the actual values of the outcome (Y) and treatment (W), rather than on the residuals (i.e., actual - predicted).

erikcs commented 2 years ago

You can do this by setting Y.hat = 0 and W.hat = 0
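
For example, a minimal sketch (X, Y, and W here stand in for your own covariates, outcome, and treatment):

library(grf)
# Default: Y.hat and W.hat are estimated internally with regression forests.
cf.orth <- causal_forest(X, Y, W)
# Orthogonalization turned "off": pass zeros so the forest works with raw Y and W
# rather than the residuals Y - Y.hat and W - W.hat.
cf.raw <- causal_forest(X, Y, W, Y.hat = 0, W.hat = 0)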

njawadekar commented 2 years ago

You can do this by setting Y.hat = 0 and W.hat = 0

Thanks for the recommendation, @erikcs. However, when I tried setting Y.hat to 0 and W.hat to 0 and then estimated treatment effects from this causal forest, I got the following warning: "Warning in average_treatment_effect(causalf, target.sample = "all", subset = (causalf$X[, : Estimated treatment propensities go as low as 0 which means that treatment effects for some controls may not be well identified. In this case, using target.sample=treated may be helpful." And when I then view the treatment effect estimates from this causal forest, they are all undefined (NaN).

I also followed the advice in the warning message and changed the option in average_treatment_effect to target.sample = "treated" (this was also discussed in this old thread). However, when I do this, I still get the same NaN values when I try to estimate treatment effects.

Do you have any other ideas on how to turn off orthogonalization? Should I just set Y.hat and W.hat to some specific constant and non-zero value in order to avoid this issue?

erikcs commented 2 years ago

Setting W.hat = 0 is fine for predicting CATEs, but you can't compute an AIPW average_treatment_effect, since that involves division by the propensity score, which you have just set to zero. If you know the treatment assignment probability, you can set W.hat to that instead, e.g. 0.5 if you're in an RCT.
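
For instance, a sketch for a trial with a known 50/50 assignment (X, Y, and W are again placeholders for your own data):

library(grf)
cf <- causal_forest(X, Y, W, Y.hat = 0, W.hat = 0.5) # known randomization probability
average_treatment_effect(cf) # no longer divides by zero, since W.hat = 0.5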

njawadekar commented 2 years ago

Thanks, @erikcs !

I've just attempted what you said and ran an honest causal forest (in an observational setting) while setting Y.hat = mean(Y) and W.hat = mean(W), where Y is the binary outcome and W is the binary treatment. I am doing this for comparison purposes, to see how the algorithm performs when not using orthogonalization (i.e., just setting Y.hat and W.hat to the means as above) vs. when using orthogonalization (i.e., using all the covariates to predict Y and W in the regression forests).

However, I noticed that although estimation of the CATEs is more biased when not using orthogonalization (which is expected), this model is able to identify the true effect modifiers just as well (via variable_importance) as when using orthogonalization. Do you know why this might be? Does orthogonalization (Y.hat and W.hat) not factor into the Expected Mean Squared Error criterion used for generating splits?

erikcs commented 2 years ago

If you want to quantify the value of orthogonalization, you can look at Table 1 in https://arxiv.org/pdf/1610.01271.pdf for an example: the last two columns show how orthogonalization reduces MSE when confounding is present. Under no confounding (first rows), there's not much difference. Causal forest is a forest-weighted version of the R-learner; there are more numerical examples and discussion in Section 6 of https://arxiv.org/pdf/1712.04912.pdf showing when there's a benefit to this approach.

njawadekar commented 2 years ago

Thanks, @erikcs; yes, I understand your team has already demonstrated the reduction in CATE bias from orthogonalization vs. no orthogonalization in that paper. I have also seen similar findings in my own simulations - so, nothing new there.

However, what I'm further interested in looking into is whether orthogonalization also affects the extent to which certain covariates are selected by the algorithm for making splits. For example, if we control for confounding via orthogonalization, then I would expect this model to select the true effect modifiers for splits more often than otherwise, right? However, when I investigated this, I didn't see much difference in the variable_importance lists between orthogonalization vs. not, and so I was curious if you might have any insights into why this is.

erikcs commented 2 years ago

However, when I investigated this, I didn't see much difference in the variable_importance lists between orthogonalization vs. not, and so I was curious if you might have any insights into why this is.

It is hard to make a general statement on this; for example, if the main (baseline) effect is not correlated with the treatment effect, then centering/not centering Y will likely give very similar results. The stronger empirical claim (and why orthogonalization is on by default) is that centering is not expected to decrease performance. It might very well not make much difference in many settings, but when it does, it should be an improvement.

You might find some of these lectures interesting: https://www.gsb.stanford.edu/faculty-research/centers-initiatives/sil/research/methods/ai-machine-learning/short-course. The last portions give some intuition on the R-learner and how to think about loss functions for CATE estimation.

njawadekar commented 2 years ago

Thanks for the clarification. However, even when I tried simulating extreme scenarios (e.g., strong confounding with large heterogeneities between CATEs and the overall ATE), I'm still getting very similar performance in terms of effect modifier identification via variable importance between the orthogonalized and non-orthogonalized causal forests. This makes me wonder whether I am coding something incorrectly. Can you please check below to see if this makes sense as my "non-orthogonalized" model?

X = model.matrix(~ ., data = trialdata[, 3:22]) # Specify all the covariates in dataset (excludes treatment and outcome)
Y = trialdata$Y # Specify the binary outcome
W = trialdata$A # Specify the binary treatment

causalf <- causal_forest(
  X = X,
  Y = Y,
  W = W,
  Y.hat = mean(trialdata$Y), # This signifies NO orthogonalization for expectation of outcome
  W.hat = mean(trialdata$A), # This signifies NO orthogonalization for propensity of treatment
  num.trees = nTrees,
  mtry = min(ceiling(sqrt(ncol(X) - 1) + 20), ncol(X) - 1),
  honesty = TRUE,
  honesty.fraction = 0.5,
  honesty.prune.leaves = TRUE,
  seed = 112,
  min.node.size = 5
)

erikcs commented 2 years ago

By "no orthogonalization" we mean Y.hat = 0 and W.hat = 0. You could post how you're generating X, Y, W, and how you're measuring performance here, for more information.

njawadekar commented 2 years ago

I have some reproducible code attached here, if you wouldn't mind checking it - does that work? I basically simulate a scenario with confounding and effect modification, and then output some performance results.

After you run this code, you can compare the ranked variable importance (VIF) lists in summarize_1.csv (orthogonal) and summarize_0.csv (non-orthogonal). You will notice that, even despite the confounding and strong heterogeneities across subgroups, the performance in identifying the effect modifiers "B" and "N," as assessed by ranked VIF, is very similar. Let me know if you see anything wrong with my code.

Simulation_Code_HCF_Orthogonalization.zip

njawadekar commented 2 years ago

Hi @erikcs: following up on my previous comment, I also wanted to say that, based on some different simulations I've tried for my "non-orthogonal" causal forest, it seems that whatever I put in Y.hat and W.hat in causal_forest has no influence on the variable importance lists that I export in my code. For example, even when I set them to vectors of randomly generated values (e.g., Y.hat = runif(nIndividuals, 0.3, 0.51), W.hat = runif(nIndividuals, 0.6, 0.71)), the non-orthogonal forest was able to identify the effect modifiers just as well as the standard/default doubly orthogonalized one. This makes it apparent to me that the orthogonalization only affects CATE estimation, but it doesn't tangibly influence the splits generated by the honest causal forest via the Expected Mean Squared Error criterion. Can you confirm?

erikcs commented 2 years ago

Ok, I think I got what you are asking about now. I didn't run/read the entire script, but here is a short example where the answer is yes: X2 is split on the most in both cases, but the split values are obviously wrong in the 2nd case:

library(grf)
set.seed(42)

n = 8000
p = 5
X = matrix(rnorm(n * p), n, p)
L = 1 / (0.1 + exp(abs(sin(X[, 1])))) # treatment propensity depends on X1
W = rbinom(n, 1, L)
Y = pmax(X[, 2], 0) * W - 2 * L + rnorm(n) # effect heterogeneous in X2; L enters Y, so X1 is a confounder
mean(pmax(rnorm(10000), 0)) # true ATE
# [1] 0.4066009

cf = causal_forest(X, Y, W)
average_treatment_effect(cf)
# estimate    std.err 
# 0.41875265 0.02498128 
mean(predict(cf)$predictions)
# [1] 0.4179574
variable_importance(cf)
# [,1]
# [1,] 0.07166115
# [2,] 0.72717591 <---- X2
# [3,] 0.06247482
# [4,] 0.07161082
# [5,] 0.06707731

cf.nocentering = causal_forest(X, Y, W, Y.hat = 0, W.hat = 0)
mean(predict(cf.nocentering)$predictions) # biased
# [1] 0.2186129
variable_importance(cf.nocentering)
# [,1]
# [1,] 0.09987895
# [2,] 0.73826803 <----- X2 
# [3,] 0.05286542
# [4,] 0.04958270
# [5,] 0.05940490

I noticed you had a subgroup exercise in there. The latest grf includes a feature you can use instead of that exercise; it's basically an AUC measure for CATEs (an introductory vignette is here). If you want to walk through more questions, you can email me to set up a quick Zoom call.
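
As a rough sketch (assuming the feature referred to is rank_average_treatment_effect(), available in recent grf releases), one way to use it with the simulated data above is to train on one half and evaluate CATE-based priorities on the other half:

train <- sample(nrow(X), floor(nrow(X) / 2))
cf.train <- causal_forest(X[train, ], Y[train], W[train])
cf.eval <- causal_forest(X[-train, ], Y[-train], W[-train])
priorities <- predict(cf.train, X[-train, ])$predictions # CATE-based targeting rule
rank_average_treatment_effect(cf.eval, priorities) # AUTOC estimate with std err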

njawadekar commented 2 years ago

Ok, I think I got what you are asking about now. I didn't run/read the entire script, but here is a short example where the answer is yes: X2 is split on the most in both cases, but the split values are obviously wrong in the 2nd case:

I noticed you had a subgroup exercise in there. The latest grf includes a feature you can use instead of that exercise; it's basically an AUC measure for CATEs (an introductory vignette is here). If you want to walk through more questions, you can email me to set up a quick Zoom call.

Thanks for your reply @erikcs ! What exactly do you mean by the "split values" -- do you mean the Variable Importance Factor value? If so, both algorithms in your code above clearly identify X2 as the variable that is split on the most; even if the values of 0.72717591 and 0.73826803 are different, the difference is so small that I'm not sure it matters in a practical sense, especially if our goal is effect modifier discovery and we are trying to see which variables have the greatest impact on the causal forest (i.e., the greatest split value). In my simulations, I've also found that such values differ very little between centering and non-centering. So should we conclude that while orthogonalization typically reduces bias (i.e., in CATE and ATE estimation), it won't necessarily have a tangible impact on the variable importance values (i.e., split values), and/or the ranking of these values across the covariates?

Also, it appears your code above is in a randomized setting, so I'm not sure it's the best example. I wanted to see whether, in an observational setting, the algorithm does a less efficient job of identifying the effect modifier (e.g., variable X2) with non-centering vs. centering. You can run the code I attached in a previous comment if you want to see the results from my observational simulations in the "summarize_1.csv" (centering) and "summarize_0.csv" (no centering) tables. The weird part to me was that, even despite the confounding baked into my simulated data, my non-centered model could identify the simulated effect modifiers just as well as the centered model.

erikcs commented 2 years ago

What exactly do you mean by the "split values"

By split variable I mean X2 (i.e., the ~0.7 importance means X2 is used as a split variable the most). By split value I mean the cutpoint: the value of X2 the GRF algorithm decides to split on.
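
For a concrete look, here is a minimal sketch using grf's get_tree() on the cf forest fitted above, which exposes each node's split variable and split value:

tree <- get_tree(cf, 1)        # pull out the first tree in the forest
tree$nodes[[1]]$split_variable # column index of the root node's split variable
tree$nodes[[1]]$split_value    # the cutpoint (split value) used at the root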

Also, it appears your code above is in a randomized setting

The probability of being treated depends on covariates, so I am not sure I would call this a randomized setting, as in an RCT, if that is what you mean.

njawadekar commented 2 years ago

What exactly do you mean by the "split values"

By split variable I mean X2 (i.e., the ~0.7 importance means X2 is used as a split variable the most). By split value I mean the cutpoint: the value of X2 the GRF algorithm decides to split on.

Thank you for the clarification. Yes, it is evident that the splits, or split values, were indeed different between the centered and non-centered models. But what I was asking was whether we should expect the orthogonalization approach to have any impact on the ordering of the variables from the variable_importance metric output. In your simulation, X2 was found to be the most important split variable in both models, even though I would expect X2 to be the most important split variable (i.e., the variable most frequently used to split) in the centered approach but not necessarily in the non-centered approach, particularly in a setting with confounding. Any insights on this would be helpful - thanks.

erikcs commented 2 years ago

But what I was asking was whether we should expect the orthogonalization approach to have any impact on the ordering of the variables from the variable_importance metric output

Yes, as I tried to say above

It is hard to make a general statement on this; for example, if the main (baseline) effect is not correlated with the treatment effect, then centering/not centering Y will likely give very similar results.

Here's an example where there is a difference, because the treatment effect is small compared to the baseline effect:

library(grf)
set.seed(42)

n <- 500
p <- 5
X <- matrix(rnorm(n * p), n, p)
W <- rbinom(n, 1, 0.5) # randomized treatment: constant propensity 0.5
Y <- pmax(X[, 2], 0) * W + 5 * X[, 2] + 5 * pmin(X[, 3], 0) + rnorm(n) # small treatment effect, large baseline effect
cf <- causal_forest(X, Y, W)
cf2 <- causal_forest(X, Y, W, Y.hat = 0, W.hat = 0) # no centering

which.max(variable_importance(cf))
# [1] 2
which.max(variable_importance(cf2))
# [1] 5

erikcs commented 2 years ago

Hi @njawadekar, closing this, assuming it's clear now. Also leaving this recent overview paper here, which might be of help (re: the discussion on centering): https://arxiv.org/abs/2206.10323