lrberge / fixest

Fixed-effects estimations
https://lrberge.github.io/fixest/

Use instrumental variable with fepois() #176

Closed etiennebacher closed 3 years ago

etiennebacher commented 3 years ago

Hello, thanks for this package! Although I haven't used it a lot so far, it seems very efficient and full of nice functionalities.

I would like to regress bilateral migration flows on various explanatory variables, and therefore I would like to use the Poisson pseudo-maximum likelihood (PPML) estimator, which is common in the literature. In Stata, I would use ppmlhdfe. If I am correct, here I have to use fepois, which is the function you compared to ppmlhdfe in the benchmark. First of all, is this correct?

I suspect that one of my explanatory variables is endogenous, and therefore I would like to instrument it. However, this does not seem to be possible with fepois:

Error in fepois(flow ~ csw0(gdp_ppp, pop, dist, comlang_off, colony, diasp_2001) |  : 
  The argument 'fml' cannot contain more than two parts separated by a pipe ('|'). IVs are available only for 'feols'.
The syntax is: DEP VAR ~ EXPL VARS | FIXED EFFECTS. (No IVs allowed.)

I guess this should be possible, based on what Santos Silva & Tenreyro (2006) say in the paper where they recommend the use of PPML (p. 646, footnote 16):

It is worth noting that the PPML estimator can be easily adapted to deal with endogenous regressors (Windmeijer & Santos Silva, 1997) and panel data (Wooldridge, 1999).

How can I both estimate a PPML model and instrument my endogenous variable with fepois?

Note that I am not sure whether it is possible to do so with ppmlhdfe.

lrberge commented 3 years ago

Hi!

For the first question: yes, you are correct: fepois == ppmlhdfe.

Regarding IV-Poisson, well, it can be done with a first-stage OLS, but it's simply not currently implemented.

The big problem is that, to the best of my knowledge (and I'm not an econometrician, so said knowledge is fairly limited), there is no theoretically valid analytical covariance matrix for non-linear IV models. This means that the VCOV has to rely on bootstrapping. And since bootstrapping isn't implemented in the package yet, IV-glm will have to wait until then.

However, given the recurrence of this question, I'll give a hacky way to get the IV-Poisson estimates:

# Function to get the estimates
ivpois_est = function(fml, data){
    # 1st + 2nd stage OLS
    est_ols = feols(fml, data)

    # Getting the residuals
    resid_1st_stage = resid(summary(est_ols, stage = 1))

    # 2nd stage poisson, we add the 1st stage residuals as regressors
    y = model.matrix(est_ols, type = "lhs")
    RHS = model.matrix(est_ols, type = c("iv.endo", "iv.exo"))
    RHS_control_fun = cbind(RHS, resid_1st_stage)

    # FEs, if present (note that variables with varying slopes won't work)
    FE = if(!is.null(est_ols$fixef_vars)) model.matrix(est_ols, type = "fixef") else NULL

    # 2nd stage
    feglm.fit(y, RHS_control_fun, FE, family = "poisson")
}

# Function to get the VCOV
ivpois_vcov = function(fml, data, nrep){
    # This is just to give an example, there are many BS schemes possible

    n = nrow(data)
    all_coef = vector("list", nrep)

    for(i in 1:nrep){
        all_coef[[i]] = coef(ivpois_est(fml, data[sample(n, n/2), ]))
    }

    all_coef_mat = do.call("rbind", all_coef)

    var(all_coef_mat)
}

# Function to get both the VCOV and the estimates
ivpois = function(fml, data, nrep = 200){

    res = ivpois_est(fml, data)

    vcov = ivpois_vcov(fml, data, nrep)

    summary(res, .vcov = vcov)
}

# Example
base = iris
names(base) = c("y", "x1", "endo", "inst", "fe")

est_noIV = fepois(y ~ endo + x1 | fe, base)
est_IV = ivpois(y ~ x1 | fe | endo ~ inst, base)
est_IV_noboot = ivpois_est(y ~ x1 | fe | endo ~ inst, base)

etable(est_noIV, est_IV, est_IV_noboot)
#>                           est_noIV             est_IV     est_IV_noboot
#> Dependent Var.:                  y                  y                 y
#>                                                                        
#> endo            0.1221*** (0.0150)   0.0701. (0.0409) 0.0701** (0.0272)
#> x1                0.0778* (0.0339) 0.1033*** (0.0251) 0.1033** (0.0314)
#> resid_1st_stage                       0.0599 (0.0406) 0.0599** (0.0222)
#> Fixed-Effects:  ------------------ ------------------ -----------------
#> fe                             Yes                Yes               Yes
#> _______________ __________________ __________________ _________________
#> S.E. type                   by: fe             Custom            by: fe
#> Observations                   150                150               150
#> Squared Cor.               0.86314            0.86715           0.86715
#> Pseudo R2                  0.02676            0.02687           0.02687
#> BIC                         570.76             575.71            575.71

The previous code chunk contains three functions: the first gets the estimates, the second gets the VCOV via bootstrapping, and the last one calls the two previously defined functions to obtain both the estimates and the VCOV.

I must say that this solution is excruciatingly slow. The use case of the package is "one estimation on a very large data set with many, many fixed-effects", not "multiple small estimations". This means that the user-level functions (feols, fepois, etc.) suck at bootstrapping because of the overheads they incur (typically a few ms due to numerous user-friendly consistency checks and non-standard evaluation). That said, as the data sets grow in size, the overheads become less and less problematic runtime-wise. Furthermore, this solution computes a lot of stuff that isn't needed and can be redundant, and it does not take advantage of numerical tricks to speed up bootstrapping (like QR updating). But it works: for any number of endogenous regressors, with or without fixed-effects.

Of course you have to adapt many things but I think it can be used as a good starting point.

I hope you'll enjoy the package and that the previous code will help!

etiennebacher commented 3 years ago

Thanks a lot for the detailed answer and the workaround! Just in case someone else wants to use it: the code above doesn't work with multiple regressions, i.e. when using csw(), sw(), etc., in the formula. This can probably be adapted though.

etiennebacher commented 3 years ago

I have a question about the output of your custom IV poisson functions. Why is there resid_1st_stage in the regressors for the second step for IV poisson but not in the regressors for "basic" IV?

base = iris
names(base) = c("y", "x1", "endo", "inst", "fe")

# Basic IV
est_iv = feols(y ~ x1 | fe | endo ~ inst, base)
est_iv
etable(est_iv)

#                            est_iv
# Dependent Var.:                 y
# 
# endo            0.4487** (0.0288)
# x1               0.5887* (0.0600)
# Fixed-Effects:  -----------------
# fe                            Yes
# _______________ _________________
# S.E.: Clustered            by: fe
# Observations                  150
# R2                        0.83890
# Within R2                 0.57748

# Poisson IV
est_iv_pois = ivpois(y ~ x1 | fe | endo ~ inst, base)
est_iv_pois
etable(est_iv_pois)

#                        est_iv_pois
# Dependent Var.:                  y
# 
# endo              0.0701. (0.0411)
# x1              0.1033*** (0.0246)
# resid_1st_stage    0.0599 (0.0409)
# Fixed-Effects:  ------------------
# fe                             Yes
# _______________ __________________
# S.E. type                   Custom
# Observations                   150
# Squared Cor.               0.86715
# Pseudo R2                  0.02687
# BIC                         575.71

What does the value of resid_1st_stage mean? Sorry if this is a really basic question, I am just surprised it appears only with Poisson IV.

leucotheas commented 3 years ago

@etiennebacher I think you should check pages 7 and 8 of this document by Wooldridge: http://econ.msu.edu/faculty/wooldridge/docs/qmle_endog_r3.pdf . The short answer to your question is: there is no reason to call etable(est_iv); you only want the residuals of that regression, to plug into the second-stage regression.

The idea is that in a first stage the endogenous variable (endo) is regressed (by OLS) on the instruments (inst) and the other explanatory variables (x1, fe), and the residuals of this regression are kept (this is done in the line resid_1st_stage = resid(summary(est_ols, stage = 1))).

These residuals are then added to the second-stage regression, where they control for the endogeneity of the variable endo. So in the second-stage regression the explanatory variables are endo, x1, resid_1st_stage, fe.
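For concreteness, the same two steps can be sketched in base R (using lm()/glm() instead of fixest, with fe entering as plain factor dummies, which is numerically equivalent here; same renamed iris data as in the thread):

```r
# Control function approach, base-R sketch (renamed iris data from the thread)
base = iris
names(base) = c("y", "x1", "endo", "inst", "fe")

# 1st stage (OLS): endogenous variable on the instrument + exogenous variables
stage1 = lm(endo ~ x1 + inst + fe, data = base)
base$resid_1st_stage = resid(stage1)

# 2nd stage (Poisson): include BOTH endo and the 1st-stage residuals
stage2 = glm(y ~ endo + x1 + resid_1st_stage + fe,
             data = base, family = poisson)
coef(stage2)
```

If these steps mirror ivpois_est() correctly, the coefficient on endo should match the output above (about 0.0701).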

The current code fails with multiple endogenous variables because the residuals matrix has no column names. I would suggest the following change:

replace:

resid_1st_stage = resid(summary(est_ols, stage = 1))

by:

resid_1st_stage = as.data.frame(resid(summary(est_ols, stage = 1)))
colnames(resid_1st_stage) <- paste('resid', est_ols$iv_endonames, sep='')

etiennebacher commented 3 years ago

@leucotheas Thank you for the link and for the explanation, but all is not clear to me.

Then these residuals would then be added to the second stage regression, and would control for the endogeneity of the variable endo. So in the second stage regression you have as explanatory endo, x1, resid_1st_stage, fe.

Why isn't the second stage made of only x1, resid_1st_stage and fe? endo shouldn't be in the second stage, since this variable is the reason we needed 2SLS in the first place. For instance, in this example:

base = iris
names(base) = c("y", "x1", "endo", "inst", "fe")

# Basic IV
est_iv = feols(y ~ x1 | fe | endo ~ inst, base)

summary() shows that there are two explanatory variables in the second stage: fit_endo (the first-stage fitted values, if I'm correct) and x1.

> summary(est_iv, stage = 2)
TSLS estimation, Dep. Var.: y, Endo.: endo, Instr.: inst
Second stage: Dep. Var.: y
Observations: 150 
Fixed-effects: fe: 3
Standard-errors: Clustered (fe) 
         Estimate Std. Error t value Pr(>|t|))    
fit_endo 0.448687   0.028795 15.5820  0.004093 ** 
x1       0.588746   0.060040  9.8059  0.010240 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 0.331258     Adj. R2: 0.834451
                 Within R2: 0.57748 
F-test (1st stage), endo: stat = 24.1   , p = 2.397e-6, on 1 and 147 DoF.
              Wu-Hausman: stat =  4.3448, p = 0.038888, on 1 and 144 DoF.

The reason I'm surprised with ivpois() output is that it gives 3 variables: endo, x1 and resid_1st_stage:

est_iv_pois = ivpois(y ~ x1 | fe | endo ~ inst, base)

> summary(est_iv_pois)
GLM estimation, family = poisson, Dep. Var.: y
Observations: 150 
Fixed-effects: fe: 3
Standard-errors: Custom 
                Estimate Std. Error t value Pr(>|t|))    
endo            0.070064   0.043244  1.6202  0.105189    
x1              0.103268   0.025451  4.0574  0.000050 ***
resid_1st_stage 0.059865   0.042962  1.3934  0.163487    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Log-Likelihood: -272.8   Adj. Pseudo R2: 0.009035
           BIC:  575.7     Squared Cor.: 0.867152

Shouldn't endo be removed from the list of explanatory variables in the second stage and replaced by resid_1st_stage, as it is the case in "basic IV"?

leucotheas commented 3 years ago

@etiennebacher Check equation 2.8 in page 8 http://econ.msu.edu/faculty/wooldridge/docs/qmle_endog_r3.pdf . The endogenous variable is not fitted as in 2SLS, instead both the endogenous variable and the residuals are included. The residuals would control for the endogenous component of the endogenous variable.

lrberge commented 3 years ago

@etiennebacher: these are simply different methods. As mentioned, it is the control function approach, not two-stage least squares. You cannot copy-paste the IV-OLS strategy to non-linear models (actually you can, but it would give you wrong results).

In OLS, either a) using the first-stage fitted value as the second-stage regressor, or b) adding the first-stage residual as an extra regressor, leads to the same point estimate for endo, as shown in this example (note that in both cases the SEs are not correct, however):

base = iris
names(base) = c("y", "x1", "endo", "inst", "fe")

# OLS => equivalence

# Stage 1:
est_stage_1 = feols(endo ~ x1 + inst, base)

base$fit_endo = predict(est_stage_1)
base$resid_1  = resid(est_stage_1)

# Stages 2:
est_stage_2    = feols(y ~ x1 + fit_endo, base)
est_stage_2_cf = feols(y ~ x1 + endo + resid_1, base)

etable(est_stage_2, est_stage_2_cf)
#>                        est_stage_2     est_stage_2_cf
#> Dependent Var.:                  y                  y
#>                                                      
#> (Intercept)      2.439*** (0.3414)  2.439*** (0.2380)
#> x1              0.5592*** (0.0946) 0.5592*** (0.0660)
#> endo            0.4510*** (0.0242) 0.4510*** (0.0168)
#> resid_1                            0.2582*** (0.0592)
#> _______________ __________________ __________________
#> S.E. type                      IID                IID
#> Observations                   150                150
#> R2                         0.70724            0.85861
#> Adj. R2                    0.70325            0.85571

This equivalence does not hold anymore with non-linear models:

# poisson => equivalence not holding

# Stage 1: same as OLS

# Stages 2:
est_pois_stage_2    = fepois(y ~ x1 + fit_endo, base)
est_pois_stage_2_cf = fepois(y ~ x1 + endo + resid_1, base)

etable(est_pois_stage_2, est_pois_stage_2_cf)
#>                   est_pois_stage_2 est_pois_stage_2_cf
#> Dependent Var.:                  y                  y
#>
#> (Intercept)      1.199*** (0.3149)  1.190*** (0.3165)
#> x1                 0.0891 (0.0877)    0.0904 (0.0879)
#> endo            0.0764*** (0.0221) 0.0773*** (0.0224)
#> resid_1                               0.0394 (0.0752)
#> _______________ __________________ __________________
#> S.E. type                      IID                IID
#> Observations                   150                150
#> Squared Cor.               0.69341            0.86080
#> Pseudo R2                  0.02194            0.02669
#> BIC                         563.44             565.79

Econometric theory suggests using residual inclusion (the control function approach) for non-linear models.
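As a small sanity check of that point, here is a simulation sketch in base R; the data-generating process and coefficient values are made up for illustration. The naive Poisson regression is biased because the first-stage error v also enters the mean, while residual inclusion recovers the true coefficient:

```r
set.seed(1)
n = 5000

z = rnorm(n)                          # instrument
v = rnorm(n)                          # 1st-stage error
x = z + v                             # endogenous regressor (correlated with v)
y = rpois(n, exp(0.3 * x + 0.5 * v))  # true coefficient on x is 0.3

naive = glm(y ~ x, family = poisson)        # ignores endogeneity
vhat  = resid(lm(x ~ z))                    # 1st-stage residuals
cf    = glm(y ~ x + vhat, family = poisson) # control function

c(naive = coef(naive)[["x"]], cf = coef(cf)[["x"]])
```

With these numbers the naive estimate should land well above 0.3 (it picks up part of the 0.5 coefficient on v), while the control function estimate should be close to the true 0.3.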

It's an econometrics question, so for more details I suggest posting a question on CrossValidated.

@leucotheas: Actually it was a bug in feols.fit, now corrected, so the code does not need to be updated (although I agree that more explicit variable names wouldn't hurt!).

etiennebacher commented 3 years ago

@lrberge @leucotheas Got it, thank you for your answers

leucotheas commented 3 years ago

@lrberge @etiennebacher I just wanted to clarify: in the document I mentioned earlier (http://econ.msu.edu/faculty/wooldridge/docs/qmle_endog_r3.pdf ), it is actually equation 4.17 that corresponds to the Poisson estimation, which, if I understand correctly, is what @lrberge is doing in the suggested code.