insongkim / PanelMatch

113 stars 34 forks source link

NA's in post-treatment dependent variable #60

Closed jyf29 closed 4 years ago

jyf29 commented 4 years ago

Hello!

I'm using this package on a dataset where some of the dependent variables are NA in certain time periods post-treatment. How does PanelEstimate handle the NAs? I had thought they were just dropped, but when stepping through the code, I found this line in the panel_estimate function ("data[, dependent][is.na(data[, dependent])] <- 0", line 310 in PanelEstimate.R), which is setting missing dependent variables to 0.

If the function is indeed setting missing dependent variables to 0, is there a way to drop those observations instead? In my context, there is no treatment reversal, so it's safe to assume that in all lead periods, the treatment is on. I had tried dropping rows with NA outcome variables for the input to PanelEstimate, leading to an unbalanced panel, but it looks like panel_estimates then creates a balanced data set anyways and fills in the NAs with 0.

Thank you!

Jessica

adamrauh commented 4 years ago

Hi @jhyu1, thanks for raising this!

Having reviewed the code, your initial assumption that units with NAs in the lead window of the dependent variable are dropped is correct (or at least, this is how things were designed to operate). This should be happening in the clean_leads function in perform_refinement if you are interested in checking out the source code.

The line you point out basically helps make the computation easier. Those rows/units won't factor into the calculation, so you could actually assign any non-NA/NULL value there and the results wouldn't change.

Hopefully that helps to clear things up. Let us know if you have other questions.

jyf29 commented 4 years ago

Hi @adamrauh

Thank you for looking into this! I am a bit confused them for why I'm seeing the following in my estimates. When I fill in the NAs with the average of the outcome variable for the respective treatment group and time period, I get drastically different results.

I also created a simulated data set (R code below) where treatments are assigned given some covariate z that also affects the outcome variable. Estimating the treatment effect using this package gives me the correct ATT, but when I randomly set some outcomes in the treatment group in the treatment periods to NA, I get a much lower estimated ATT (which is consistent with the NA's being filled with 0s).

Do you know why this could be happening?

`rm(list = ls()) library(PanelMatch) library(data.table)

ATE = 5 timeperiods = 10 N = 100

set.seed(123) dt = data.table(unit = rep(1:N, each = timeperiods), time = rep(1:timeperiods, N))

units = data.table(unit = 1:N) units[, z := rnorm(nrow(units))] units[, eps :=rnorm(nrow(units)) ] units[, treated := as.numeric((z + eps) > 0.75)]

dt = merge(dt, units, by = 'unit') dt[, eps2 := rnorm(nrow(dt))] dt[time <= 5, treated := 0] dt[, y := ifelse(treated == 1, (ATE * treated) + z + eps2, z + eps2)] dt[, z2 := rnorm(nrow(dt))]

setnames(dt, 'z', 'z_covariate')

r = lm(y ~ treated, data = dt); summary(r)

this_matching <- PanelMatch(lag = 1, time.id = "time", unit.id = "unit", treatment = "treated", refinement.method = 'CBPS.match', data = as.data.frame(dt), match.missing = T, covs.formula = ~ I(z) + I(z2), size.match = 1, qoi = "att", outcome.var = "eps", lead = -5:4)

results <- this_matching units = (gsub("\..*","",names(results$att))) matchedsets = data.table() for (i in 1:length(units)) { set = data.table(treat = as.numeric(units[i]), control = as.numeric(names(which(attr(results$att[[i]], 'weights') > 0)))) matchedsets = rbind(matchedsets, set) } attr(this_matching, 'outcome.var') <- 'y' PE.results <- PanelEstimate(sets = this_matching, data = as.data.frame(dt)) plot(PE.results) abline(h = ATE)

set.seed(2) dt[, eps3 := rnorm(nrow(dt))] dt[ eps3 > 0.5 & treated == 1, y := NA] attr(this_matching, 'outcome.var') <- 'y' attr(this_matching, 'data') <- as.data.frame(dt) PE.results2 <- PanelEstimate(sets = this_matching, data = as.data.frame(dt)) plot(PE.results2) abline(h = ATE) `

adamrauh commented 4 years ago

Hi @jhyu1 ,

Sorry for the delay. I really appreciate this helpful reproducible example!

I'm not sure if these will ultimately clear up the problem, but there are a few immediate things we should adjust to see what happens. First, could you try installing the code in the current master branch? There are a handful of updates that aren't on CRAN yet.

  1. What was the intent behind setting lead = -5:4? Negative lead values aren't really supported at the moment -- you should see a new error about this in the latest version of the code. A negative lead value will create some undefined behavior. This was something that's missed in the current CRAN version -- the code runs, but it really shouldn't allow users to do so.

  2. The code assumes that the dataset you pass to PanelMatch and PanelEstimate are identical. It looks like the configuration above involves running PanelMatch, then changing the data set, then running PanelEstimate. I could actually potentially see this explaining the problem, if I understand correctly, but we'll see. I should probably update the documentation to make this assumption more clear.

Could you try adjusting those in the example? Thanks again.

jyf29 commented 4 years ago

Hi @adamrauh,

Thanks! I updated the package, and modified the code with your suggestions, and the same results occurred.

I included a negative lead to check that there were no pre-trends. Can you explain the intuition on why you thought the datasets for PanelMatch and PanelEstimate need to be identical? If only the post-treatment outcomes are NA, that shouldn't affect the matching, right?

Thank you!

Updated code below:

`rm(list = ls()) library(PanelMatch) library(data.table)

ATE = 5 timeperiods = 10 N = 100

set.seed(123) dt = data.table(unit = rep(1:N, each = timeperiods), time = rep(1:timeperiods, N))

units = data.table(unit = 1:N) units[, z := rnorm(nrow(units))] units[, eps :=rnorm(nrow(units)) ] units[, treated := as.numeric((z + eps) > 0.75)]

dt = merge(dt, units, by = 'unit') dt[, eps2 := rnorm(nrow(dt))] dt[time <= 5, treated := 0] dt[, y := ifelse(treated == 1, (ATE * treated) + z + eps2, z + eps2)] dt[, z2 := rnorm(nrow(dt))]

r = lm(y ~ treated, data = dt); summary(r)

this_matching <- PanelMatch(lag = 1, time.id = "time", unit.id = "unit", treatment = "treated", refinement.method = 'CBPS.match', data = as.data.frame(dt), match.missing = T, covs.formula = ~ I(z) + I(z2), size.match = 1, qoi = "att", outcome.var = "eps", lead = 0:4)

results <- this_matching units = (gsub("\..*","",names(results$att))) matchedsets = data.table() for (i in 1:length(units)) { set = data.table(treat = as.numeric(units[i]), control = as.numeric(names(which(attr(results$att[[i]], 'weights') > 0)))) matchedsets = rbind(matchedsets, set) } attr(this_matching, 'outcome.var') <- 'y' PE.results <- PanelEstimate(sets = this_matching, data = as.data.frame(dt)) plot(PE.results) abline(h = ATE)

set.seed(2) dt2 <- dt dt2[, eps3 := rnorm(nrow(dt))] dt2[ eps3 > 0.5 & treated == 1, y := NA] this_matching2 <- PanelMatch(lag = 1, time.id = "time", unit.id = "unit", treatment = "treated", refinement.method = 'CBPS.match', data = as.data.frame(dt2), match.missing = T, covs.formula = ~ I(z) + I(z2), size.match = 1, qoi = "att", outcome.var = "eps", lead = c(0:4)) attr(this_matching2, 'outcome.var') <- 'y' PE.results2 <- PanelEstimate(sets = this_matching2, data = as.data.frame(dt2)) plot(PE.results2) abline(h = ATE)`

adamrauh commented 4 years ago

Hi @jhyu1 thanks for bearing with me here! I'll try to explain some of what's happening under the hood. PanelMatch checks a bunch of expectations/assumptions/restrictions about the data, including some restrictions that need to be met for the PanelEstimate step. This helps to avoid situations where we would need to re-calculate the matched sets. So, some things might not affect the matching procedures in theory, but actually do have an impact the way the code is written.

As an example -- having missing dependent variable data in the lead window would disqualify a unit from being useful at the estimation stage. So, this is checked within PanelMatch and units that have missing data are discarded. This is why the lead window and outcome variable are required arguments. Otherwise, we could have a situation where such units are included, but then at the estimation stage we discover they cant be used -- then we would have to essentially repeat the refinement stage, discarding the units we now know we cant use.

So, if the data changes between PanelMatch and PanelEstimate, PanelEstimate might think a unit/matched set is valid, but it might no longer be the case.

Does that make sense?

All that said, I'm a little confused by the example...sorry. What's the reasoning behind changing the outcome variable from "eps" to "y" after calling PanelMatch?

jyf29 commented 4 years ago

Thank you for this response! This makes sense. In the example, I had set the outcome variable to "eps" in the matching step because "eps" had no missing data, and I had noticed that missing observations in the lead windows caused a different set of matched units, and confused to why post-treatment outcomes would affect the matching procedure. Therefore, by setting the outcome variable to "eps", I had ensured that the matches would be kept the same, regardless of post-treatment. I now see that the code isn't built to be used in that way. Setting the outcome variable in the matching step to "y" did solve the problem, and now the estimated treatment effects are correct in the simulation. Thanks for your time!

adamrauh commented 4 years ago

Great! Yeah, your intuition was correct about the matching — it’s just something that makes sense from an implementation standpoint. We’ll improve the documentation to make this clearer, I think!