grf-labs / grf

Generalized Random Forests
https://grf-labs.github.io/grf/
GNU General Public License v3.0

QUESTION: handling NAs in Y arising from panel attrition. #1373

Closed: go-bayes closed this issue 10 months ago

go-bayes commented 10 months ago

Causal Random Forests

Dear grf developers,

Thank you for this wonderful tool.

I am using the grf package for a causal forest analysis of longitudinal data and am facing the challenge of missing outcomes arising from panel attrition/non-response.

A further complication is survey weights, which we hope to use to generalise our estimates to the target population.

My concerns are: (1) misspecification of the censoring model (can we do better?); and (2) whether multiplying the censoring and sampling weights and passing the product to grf invalidates inference.

My apologies if I missed your recommended approach in this setting.

Thank you for your assistance and thank you once again for developing this really great software!

Joe

# 3-wave panel design
# gist: Y_time2 ~ tau * W1_time1 + beta * X_time0 (see VanderWeele: https://doi.org/10.1214/19-STS728)
# problem: loss-to-follow-up leads to missing responses in Y_time2

# using "grf 2.3.1"
library(grf)

# simulate data for the three-wave panel
set.seed(123)
n <- 1000 # n of observations
p <- 5   # n of covariates

# wave 0 (baseline) data
X0 <- matrix(rnorm(n * p), n, p) # covariates
W0 <- rbinom(n, 1, 0.5)          # treatment indicator at baseline 
Y0 <- rnorm(n)                   # outcome at baseline

# wave 1 data (post-baseline) exposure
# assume W is the treatment and is measured again without missing values
W1 <- rbinom(n, 1, 0.5) # the exposure (controlling for baseline exposure)

# combine covariates matrix (X)
# here, we assume X, W, and Y observed at baseline without missing values
# generally, however, we would have missing values and this would complicate
# the censoring weights model. 

# but for simplicity:
X <- cbind(X0, W0, Y0) # no missingness

# wave 2 data (baseline + 2), the outcome wave
# simulate attrition leading to missing Y values, arbitrarily assume 20% loss

Y2 <- rnorm(n)  # outcome at baseline + 2
missing_indices <- sample(1:n, n * 0.2) # indices of missing data
Y2[missing_indices] <- NA  # NA values to simulate missing data

# fit the causal forest model
# as expected, this call returns an error because the outcome cannot contain missing values.
cf <- grf::causal_forest(X, Y2, W1)

# a possible workaround (to be verified)...

#  1:  fill out the example by simulating survey weights
survey_weights <- runif(nrow(X), min = 0.5, max = 2)

#  2: create binary indicator for missing Y2 values
Y2_missing <- as.integer(is.na(Y2))

#  3: fit a censoring model: ** this seems like a hack **
censoring_model <- glm(
  Y2_missing ~ ., data = as.data.frame(cbind(X, W1)), family = "binomial"
  )

#  4: compute inverse probability of censoring weights
# note: the glm above models P(Y2 missing | X, W1), so the probability of
# remaining uncensored is 1 minus the fitted value, and the IPC weight is its inverse
censored_probs <- predict(censoring_model, type = "response")
uncensored_probs <- 1 - censored_probs
censoring_weights <- 1 / uncensored_probs

#  5: consider trimming extreme weights (depending on context/question etc.)
weights_cap <- quantile(censoring_weights, 0.99)
censoring_weights_capped  <- pmin(censoring_weights, weights_cap)

#  6: combine censoring weights with survey weights (use "censoring_weights" if preferable)
combined_weights <- censoring_weights_capped * survey_weights

#  7: fit the causal forest model with combined survey & censoring weights. 
# concern: are sample.weights working as we hope? 

cf_model <- causal_forest(
  X = X[!is.na(Y2), ],
  Y = Y2[!is.na(Y2)],
  W = W1[!is.na(Y2)],
  sample.weights = combined_weights[!is.na(Y2)]
)
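
#  8: (sketch) one possible sanity check on the weighted fit: in this simulation
# Y2 was drawn independently of W1, so the true average effect is zero and the
# weighted estimate should be close to 0 with a confidence interval covering it
ate <- average_treatment_effect(cf_model, target.sample = "all")
ate["estimate"] + c(-1.96, 1.96) * ate["std.err"]
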
erikcs commented 10 months ago

Hi @go-bayes,

If you are worried about model mis-specification for the censoring process, you could estimate it non-parametrically with a survival_forest (PS: these can handle missingness in the covariates). This is the approach we take here, where we describe an IPC-weighted causal forest in Section 3.2 (with a doubly robust correction in the rest of that paper, available in the function causal_survival_forest).
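
For a setting like your example, where dropout happens only at the single follow-up wave, a rough sketch of the non-parametric version (using probability_forest on the dropout indicator as a simple binary stand-in for the time-to-event machinery, with variable names taken from your simulation) could look like:

observed <- factor(as.integer(!is.na(Y2)))     # 1 = Y2 observed at the outcome wave
pf <- probability_forest(cbind(X, W1), observed)
p_observed <- predict(pf)$predictions[, "1"]   # OOB estimates of P(observed | X, W1)
ipc_weights <- 1 / p_observed                  # non-parametric IPC weights

cf_ipcw <- causal_forest(
  X = X[!is.na(Y2), ],
  Y = Y2[!is.na(Y2)],
  W = W1[!is.na(Y2)],
  sample.weights = (ipc_weights * survey_weights)[!is.na(Y2)]
)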

Using sample weights that are a product of other weights comes up in various settings; in this handbook chapter (Example 13) there is an example of an estimator (an IPC-weighted R-learner) that can be fit using weights that are a product of IPC weights and centered propensity scores.
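
To sketch how that could look for the simulation above (this is my reading of that type of construction, not the handbook's code): the R-learner objective can be rewritten as a weighted regression of the Robinson-transformed outcome (Y - m.hat(X)) / (W - e.hat(X)) on X with weights (W - e.hat(X))^2, so multiplying those weights by the IPC weights gives an IPC-weighted R-learner that any weighted regression method can fit:

# nuisance estimates on the complete cases (here: IPC-weighted regression forests)
obs <- !is.na(Y2)
m_forest <- regression_forest(X[obs, ], Y2[obs], sample.weights = ipc_weights[obs])
m_hat <- predict(m_forest)$predictions     # OOB estimate of E[Y2 | X]
e_forest <- regression_forest(X[obs, ], W1[obs], sample.weights = ipc_weights[obs])
e_hat <- predict(e_forest)$predictions     # OOB estimate of P(W1 = 1 | X)

# Robinson-transformed outcome and product weights: IPC weight * (W1 - e.hat)^2
pseudo_outcome <- (Y2[obs] - m_hat) / (W1[obs] - e_hat)
product_weights <- ipc_weights[obs] * (W1[obs] - e_hat)^2

# any weighted CATE learner works from here; a regression forest gives tau.hat(x)
tau_forest <- regression_forest(X[obs, ], pseudo_outcome, sample.weights = product_weights)
tau_hat <- predict(tau_forest)$predictions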

go-bayes commented 10 months ago

Dear @erikcs,

Thank you very much for your clear reply. This is an excellent recommendation. I'm looking forward to giving it a try!

-- Joe