setzler / eventStudy

R package and guide for performing event studies with heterogeneous dynamic effects.
MIT License
71 stars 18 forks source link

ES function has side effects #26

Open Asquidy opened 4 years ago

Asquidy commented 4 years ago

When calling the ES function, the underlying data gets modified. Specifically, the calendar time variable with NAs get remapped.

results <- ES(long_data=sim_data2, outcomevar="outcome_plus_dynamic", unit_var="individual", cal_time_var="year", onset_time_var="treatment_year", cluster_vars="individual", never_treat_action="only", linearize_pretrends = TRUE, homogeneous_ATT = TRUE, omitted_event_time = -1)

changes the 'year' column in sim_data2

davidnov commented 4 years ago

Hi @Asquidy, can you post a reproducible example that illustrates the issue?

Asquidy commented 4 years ago

Sure thing! Here it is:

library(data.table) library(ggplot2) library(eventStudy) num_periods <- 10 treatment_time <- 5

sim_data <- ES_simulate_data(units = 1000, min_cal_time = 1, max_cal_time = num_periods)[["observed"]] sim_data[individual %% 2 == 0, treatment_year := treatment_time] sim_data[individual %% 2 == 1, treatment_year := NA] sim_data[, ever_treated := as.numeric(treatment_year == 5 & !is.na(treatment_year))] sim_data[, is_treated := ifelse(!is.na(treatment_year), as.numeric(year >= treatment_year), 0)] mean_outcome <- mean(sim_data$outcome)

unit_trend_avg <- -.1*mean_outcome error_sd_trend <- .5

sim_data[, unit_trend := unit_trend_avg + rnorm(mean = 0, sd = abs(error_sd_trend*unit_trend_avg), n = .N)]

sim_data[, outcome_plus_trend := unit_trend year ever_treated + outcome]

sim_data2 <- copy(sim_data) results <- ES(long_data=sim_data2, outcomevar="outcome_plus_trend", unit_var="individual", cal_time_var="year", onset_time_var="treatment_year", cluster_vars="individual", never_treat_action="only", linearize_pretrends = TRUE, homogeneous_ATT = TRUE, omitted_event_time = -1)

all.equal(sim_data2, sim_data) " "Column 'treatment_year': 'is.NA' value mismatch: 5000 in current 0 in target""

davidnov commented 4 years ago

Hi @Asquidy, you are correct. This has to do with how ES deals with the cases with no onset_time_var -- it essentially assumes that they are treated, but beyond the extent of the calendar times + anticipation period in the panel, and so assigns them a number.

Short answer, if you are running things in a sequence using the same data set (say, ES with several specifications, etc.) and are not memory restricted, I would advise explicitly copying the data within ES, e.g.:

ES(long_data=copy(sim_data2), outcomevar="outcome_plus_trend", unit_var="individual", cal_time_var="year", onset_time_var="treatment_year", cluster_vars="individual", never_treat_action="only", linearize_pretrends = TRUE, homogeneous_ATT = TRUE, omitted_event_time = -1)

Asquidy commented 4 years ago

Thanks a lot for sharing this code! I did indeed copy the data structure. I'd suggest avoiding side-effects in the package if you continue working on it. It was an unexpected behavior for me and would be for other users as well.