insongkim / PanelMatch


Why such different results with manually-set lead DV? #50

Open marcgrinberg opened 4 years ago

marcgrinberg commented 4 years ago

1) Run PanelMatch code from README

> library(PanelMatch)
> PM.results <- PanelMatch(lag = 4, time.id = "year", unit.id = "wbcode2", 
+                          treatment = "dem", refinement.method = "mahalanobis", 
+                          data = dem, match.missing = T, 
+                          covs.formula = ~ I(lag(tradewb, 1:4)) + I(lag(y, 1:4)), 
+                          size.match = 5, qoi = "att" ,outcome.var = "y",
+                          lead = 0:4, forbid.treatment.reversal = FALSE)
> PE.results <- PanelEstimate(inference = "bootstrap", sets = PM.results, 
+                             data = dem)
> PE.results$coefficients
       t+0        t+1        t+2        t+3        t+4 
-0.8913640 -0.4709856  0.4803681  1.3447573  1.0782767

2) Create a lead version of y

> library(data.table)
> test <- as.data.table(dem)
> setorder(test, wbcode2, year)
> test[,y1:=shift(y,1,type="lead"), by=c("wbcode2")]
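A minimal sketch of a sanity check for the lead variable, on a made-up toy panel rather than `dem`. It uses base R's `ave()` in place of `data.table::shift()` so that it runs without extra packages; the names `toy`, `lookup`, and `checked` are hypothetical. The check joins on (unit, year + 1) rather than on row position, so it would also flag a unit whose years have gaps, where a position-based shift pairs non-adjacent years.

```r
# Toy balanced panel: 2 units observed over 4 years (made-up numbers).
toy <- data.frame(
  unit = rep(c("A", "B"), each = 4),
  year = rep(2001:2004, times = 2),
  y    = c(10, 11, 12, 13, 20, 21, 22, 23)
)
toy <- toy[order(toy$unit, toy$year), ]

# Lead of y by one row within each unit (position-based, like shift(type = "lead")).
toy$y1 <- ave(toy$y, toy$unit, FUN = function(v) c(v[-1], NA))

# Independent check: look up y at year + 1 for the same unit by key, not by position.
lookup  <- data.frame(unit = toy$unit, year = toy$year - 1, y_next = toy$y)
checked <- merge(toy, lookup, by = c("unit", "year"), all.x = TRUE)

# y1 must agree with the key-based lookup, including where it is missing.
stopifnot(identical(is.na(checked$y1), is.na(checked$y_next)))
stopifnot(all(checked$y1 == checked$y_next, na.rm = TRUE))

# The lead adds one NA per unit (its final year) in this toy panel.
extra_na <- sum(is.na(toy$y1)) - sum(is.na(toy$y))
print(extra_na)
```

In the toy panel the final year of each unit has no lead value, which is why a lead column carries more missing values than the original.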

3) Run the same PanelMatch code from the README, but with outcome.var = "y1" instead of "y" and data = test instead of dem

> PM.results_shift <- PanelMatch(lag = 4, time.id = "year", unit.id = "wbcode2", 
+                           treatment = "dem", refinement.method = "mahalanobis", 
+                           data = as.data.frame(test), match.missing = T, 
+                           covs.formula = ~ I(lag(tradewb, 1:4)) + I(lag(y, 1:4)), 
+                           size.match = 5, qoi = "att" ,outcome.var = "y1",
+                           lead = 0:4, forbid.treatment.reversal = FALSE)
> PE.results_shift <- PanelEstimate(inference = "bootstrap", sets = PM.results_shift, 
+                              data = as.data.frame(test))
> PE.results_shift$coefficients
      t+0       t+1       t+2       t+3       t+4 
0.5864504 1.6472727 2.5874706 2.5000804 2.7015711

DISCUSSION: I expected the coefficient estimates from PE.results_shift to be (essentially) the estimates from PE.results shifted by one year, so that PE.results at t+1, t+2, ... would equal PE.results_shift at t+0, t+1, ... Instead the estimates are dramatically different, and the first pair even differs in sign: PE.results is -0.47 at t+1, while PE.results_shift is 0.59 at t+0.
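To make the expected alignment concrete, here is a small sketch that lines up the two coefficient vectors by hand. The numbers are copied from the output above; the object names (`orig`, `shifted`, `comparison`) are only illustrative.

```r
# Coefficients copied from the two PanelEstimate runs above.
orig <- c("t+0" = -0.8913640, "t+1" = -0.4709856, "t+2" = 0.4803681,
          "t+3" = 1.3447573, "t+4" = 1.0782767)
shifted <- c("t+0" = 0.5864504, "t+1" = 1.6472727, "t+2" = 2.5874706,
             "t+3" = 2.5000804, "t+4" = 2.7015711)

# Expectation: the y estimate at t+k should match the y1 estimate at t+(k-1).
comparison <- data.frame(
  y_lead      = names(orig)[2:5],
  y_estimate  = unname(orig[2:5]),
  y1_lead     = names(shifted)[1:4],
  y1_estimate = unname(shifted[1:4])
)
comparison$gap <- comparison$y1_estimate - comparison$y_estimate
print(comparison)
```

Every aligned pair has a positive gap, from roughly 1.06 to 1.42, so the shifted-outcome estimates are consistently higher rather than just noisier.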

On a quick visual scan, the matched sets are the same in PM.results and PM.results_shift.
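A visual scan could be replaced with a programmatic check. The sketch below is an assumption-heavy illustration, not PanelMatch API: it treats matched sets as plain named lists of control-unit ids (one element per treated observation) and runs on made-up toy lists. The helper `same_sets` is hypothetical, and whether it can be applied directly to `PM.results$att` and `PM.results_shift$att` depends on the package version, so I have not relied on that here.

```r
# Hypothetical helper: TRUE if two named lists hold the same sets of ids
# under the same names, ignoring order and any attributes on the elements.
same_sets <- function(a, b) {
  if (!setequal(names(a), names(b))) return(FALSE)
  all(vapply(
    names(a),
    function(k) setequal(as.vector(a[[k]]), as.vector(b[[k]])),
    logical(1)
  ))
}

# Toy stand-ins for matched sets (made-up names and ids).
sets_a <- list("4.1992" = c(3, 7, 9), "6.1985" = c(2, 5))
sets_b <- list("6.1985" = c(5, 2), "4.1992" = c(9, 3, 7))  # same content, reordered
sets_c <- list("4.1992" = c(3, 7),    "6.1985" = c(2, 5))  # one control dropped

print(same_sets(sets_a, sets_b))
print(same_sets(sets_a, sets_c))
```

This compares set membership only. It says nothing about refinement weights, which would need a separate comparison.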

y and y1 also have similar distributions:

> summary(test[,.(y,y1)])
       y                y1        
 Min.   : 405.7   Min.   : 405.7  
 1st Qu.: 620.9   1st Qu.: 621.4  
 Median : 740.7   Median : 741.6  
 Mean   : 748.3   Mean   : 748.9  
 3rd Qu.: 867.0   3rd Qu.: 867.4  
 Max.   :1094.0   Max.   :1094.0  
 NA's   :2241     NA's   :2334

Any idea why the coefficients are so different?