sgaure / lfe

Source code repository for the R package lfe on CRAN.
53 stars 18 forks source link

Question about p-values with clustered standard errors in LFE #27

Open Sarastro1804 opened 4 years ago

Sarastro1804 commented 4 years ago

(Note: I asked this on stack overflow before, but did not get any answer to it, https://stackoverflow.com/questions/61250494/question-about-p-values-with-clustered-standard-errors-in-lfe-package-in-r)

I am estimating a model with fixed effects and clustered standard errors using the lfe-package.

As it turns out, I have a huge t-value (23.317) but only a comparatively small p-value (0.0273). This seems to have something to do with me using the projecting out of fixed effects. When I estimate the fixed effects manually as control variables, my p-value is too small to be reported <2e-16 .

Consider the following working example (I am sorry if it's more complicated than strictly necessary, I am trying to be close to my application):

I am simply estimating a pooled panel estimator of 10 time series over 50 periods. And I assume that there are two clusters in the time series.

library(data.table)
library(lfe)

x <- rnorm(50, mean = 1, sd = 1)
common_shock <- rnorm(50, mean = 0, sd = 1)

y1 = 0.5 + 5*x + rnorm(50, mean = 0, sd = 2) + common_shock
y2 = 0.5 + 5*x + rnorm(50, mean = 0, sd = 2) + common_shock
y3 = 0.5 + 5*x + rnorm(50, mean = 0, sd = 2) + common_shock
y4 = 0.5+ 5*x + rnorm(50, mean = 0, sd = 2) + common_shock
y5 = 0.5+ 5*x + rnorm(50, mean = 0, sd = 2) + common_shock
y6 = x + rnorm(50, mean = 0, sd = 2)
y7 = x + rnorm(50, mean = 0, sd = 2)
y8 = x + rnorm(50, mean = 0, sd = 2)
y9 = x + rnorm(50, mean = 0, sd = 2)

y10 = x + rnorm(50, mean = 0, sd = 2)

DT <- data.table(periods = 1:50, y1, y2, y3, y4, y5, y6, y7, y8, y9, y10)
Controls <- data.table(periods = 1:50, x)
indicators <- data.table(y_label = paste0("y", 1:10),
                         indicator = c(1, 1, 1, 1, 1, 0, 0, 0, 0, 0))

DT <- melt(DT, id.vars= c("periods"))

DT <- merge(DT, Controls, by="periods", all = TRUE)
DT <- merge(DT, indicators, by.x="variable", by.y="y_label", all = TRUE)

results <- felm(as.formula("value ~ -1 + indicator + x:indicator  | periods | 0 | periods + indicator"), data = DT)
results2 <- felm(as.formula("value ~ -1 + indicator + x:indicator + as.factor(periods) | 0 | 0 | periods + indicator"), data = DT)
summary(results)
summary(results2)

The first results gives me

indicator:x 3.8625 0.1657 23.317 0.0273 *

The second results2 gives me

indicator:x 3.86252 0.20133 19.185 < 2e-16 ***

So it must be related to the projecting out of fixed effects, but this difference is so huge, that I would like to know a bit more about it. Does someone know what the underlying issue is here?

Sarastro1804 commented 4 years ago

I was using the most recent version 2.8.5 of lfe for the results above. I just learned that this discrepancy between the p-values does not occur for version 2.8.3 . And the issue seems to be tightly linked to the two-way clustering.

karldw commented 4 years ago

Hey @Sarastro1804, I'm not sure of the answer, but it definitely seems like a degrees of freedom question. One way you might think about it is to estimate the p-value by simulation: set up the data so the expected value of the x:indicator coefficient is zero, run the regression, record the resulting coefficient, and repeat 10000 times. (Set nostats=TRUE in the felm call to speed things up a little.) Then find where your actual x:indicator coefficient falls in the distribution of 10000.