Closed etiennebacher closed 3 years ago
Hi!
For the first question: yes, you are correct fepois == ppmlhdfe
.
Regarding IV-Poisson, well, it can be done with first stage OLS, but it's simply not currently implemented.
The big problem is that, to the best of my knowledge (and I'm not an econometrician so said knowledge is fairly limited), there is no theoretically valid analytical covariance for non-linear IV models. This means that the VCOV has to rely on bootstrapping. And since bootstrapping isn't yet implemented in the package, IV-glm will wait until then.
However, given the recurrence of this question, I'll give a hacky way to get the IV-Poisson estimates:
# Function to get the estimates
ivpois_est = function(fml, data){
# 1st + 2nd stage OLS
est_ols = feols(fml, data)
# Getting the residuals
resid_1st_stage = resid(summary(est_ols, stage = 1))
# 2nd stage poisson, we add the 1st stage residuals as regressors
y = model.matrix(est_ols, type = "lhs")
RHS = model.matrix(est_ols, type = c("iv.endo", "iv.exo"))
RHS_control_fun = cbind(RHS, resid_1st_stage)
# FEs, if there // note that vars with varying slopes won't work
FE = if(!is.null(est_ols$fixef_vars)) model.matrix(est_ols, type = "fixef") else NULL
# 2nd stage
feglm.fit(y, RHS_control_fun, FE, family = "poisson")
}
# Function to get the VCOV
ivpois_vcov = function(fml, data, nrep){
# This is just to give an example, there are many BS schemes possible
n = nrow(data)
all_coef = vector("list", nrep)
for(i in 1:nrep){
all_coef[[i]] = coef(ivpois_est(fml, data[sample(n, n/2), ]))
}
all_coef_mat = do.call("rbind", all_coef)
var(all_coef_mat)
}
# Function to get both the VCOV and the estimates
ivpois = function(fml, data, nrep = 200){
res = ivpois_est(fml, data)
vcov = ivpois_vcov(fml, data, nrep)
summary(res, .vcov = vcov)
}
# Example
base = iris
names(base) = c("y", "x1", "endo", "inst", "fe")
est_noIV = fepois(y ~ endo + x1 | fe, base)
est_IV = ivpois(y ~ x1 | fe | endo ~ inst, base)
est_IV_noboot = ivpois_est(y ~ x1 | fe | endo ~ inst, base)
etable(est_noIV, est_IV, est_IV_noboot)
#> est_noIV est_IV est_IV_noboot
#> Dependent Var.: y y y
#>
#> endo 0.1221*** (0.0150) 0.0701. (0.0409) 0.0701** (0.0272)
#> x1 0.0778* (0.0339) 0.1033*** (0.0251) 0.1033** (0.0314)
#> resid_1st_stage 0.0599 (0.0406) 0.0599** (0.0222)
#> Fixed-Effects: ------------------ ------------------ -----------------
#> fe Yes Yes Yes
#> _______________ __________________ __________________ _________________
#> S.E. type by: fe Custom by: fe
#> Observations 150 150 150
#> Squared Cor. 0.86314 0.86715 0.86715
#> Pseudo R2 0.02676 0.02687 0.02687
#> BIC 570.76 575.71 575.71
In the previous code chunk are three functions: the first to get the estimates, the second to get the VCOV via bootstrapping and finally the last one which calls the two previously defined functions to obtain both the estimates and the VCOV.
I must say that this solution is excruciatingly slow. The use case of the package is "one estimation on a very large data set with many many fixed-effects" and not "multiple small estimations". This means that the user-level functions (feols
, fepois
, etc) suck at bootstrapping because of the overheads they incur (typically a few ms due to numerous user-friendly consistency checks + non-standard evaluations). This said, of course as the data sets grow in size, the overheads become less and less problematic runtime-wise.
Furthermore this solution computes a lot of stuff that's not needed and which can be redundant, and finally it does not take advantage of some numerical tricks to hasten bootsrapping (like QR updating).
But it works: for any number of endogenous regressors, with/without fixed-effects.
Of course you have to adapt many things but I think it can be used as a good starting point.
I hope you'll enjoy the package and that the previous code will help!
Thanks a lot for the detailed answer and the workaround! Just in case someone else wants to use it, the code above doesn't work with multiple regressions, i.e when using csw()
, sw()
, etc, in the formula. This can probably be adapted though.
I have a question about the output of your custom IV poisson functions. Why is there resid_1st_stage
in the regressors for the second step for IV poisson but not in the regressors for "basic" IV?
base = iris
names(base) = c("y", "x1", "endo", "inst", "fe")
# Basic IV
est_iv = feols(y ~ x1 | fe | endo ~ inst, base)
est_iv
etable(est_iv)
# est_iv
# Dependent Var.: y
#
# endo 0.4487** (0.0288)
# x1 0.5887* (0.0600)
# Fixed-Effects: -----------------
# fe Yes
# _______________ _________________
# S.E.: Clustered by: fe
# Observations 150
# R2 0.83890
# Within R2 0.57748
# Poisson IV
est_iv_pois = ivpois(y ~ x1 | fe | endo ~ inst, base)
est_iv_pois
etable(est_iv_pois)
# est_iv_pois
# Dependent Var.: y
#
# endo 0.0701. (0.0411)
# x1 0.1033*** (0.0246)
# resid_1st_stage 0.0599 (0.0409)
# Fixed-Effects: ------------------
# fe Yes
# _______________ __________________
# S.E. type Custom
# Observations 150
# Squared Cor. 0.86715
# Pseudo R2 0.02687
# BIC 575.71
What does the value of resid_1st_stage
mean? Sorry if this is a really basic question, I am just surprised it appears only with Poisson IV.
@etiennebacher I think you would like to check pages 7 and 8 of this document of Wooldridge http://econ.msu.edu/faculty/wooldridge/docs/qmle_endog_r3.pdf . The short answer to your question is: there is no reason why you would do etable(est_iv), you only want to get the residuals of that regression and plug them into the second stage regression.
The idea is that in a first stage the endogenous variable (endo) is regressed (by OLS) on the instruments (inst) and other explanatory variables (x1, fe), and obtain the residuals of this regression (this is done in line: resid_1st_stage = resid(summary(est_ols, stage = 1) ).
Then these residuals would then be added to the second stage regression, and would control for the endogeneity of the variable endo. So in the second stage regression you have as explanatory endo, x1, resid_1st_stage, fe.
The current code fails with multiple endogenous variables due to residuals matrix not having column names. I would suggest the following change:
replace: resid_1st_stage = resid(summary(est_ols, stage = 1))
by: resid_1st_stage = as.data.frame(resid(summary(est_ols, stage = 1))) colnames(resid_1st_stage) <- paste('resid', est_ols$iv_endonames, sep='')
@leucotheas Thank you for the link and for the explanation, but all is not clear to me.
Then these residuals would then be added to the second stage regression, and would control for the endogeneity of the variable endo. So in the second stage regression you have as explanatory endo, x1, resid_1st_stage, fe.
Why isn't there only x1, resid_first_stage and fe in the second stage? endo
shouldn't be in the second stage, since this variable is why we needed 2SLS in the first place. For instance, in this example:
base = iris
names(base) = c("y", "x1", "endo", "inst", "fe")
# Basic IV
est_iv = feols(y ~ x1 | fe | endo ~ inst, base)
summary()
shows that there are two explanatory variables in the second stage: fit_endo (first stage residuals if I'm correct), and x1.
> summary(est_iv, stage = 2)
TSLS estimation, Dep. Var.: y, Endo.: endo, Instr.: inst
Second stage: Dep. Var.: y
Observations: 150
Fixed-effects: fe: 3
Standard-errors: Clustered (fe)
Estimate Std. Error t value Pr(>|t|))
fit_endo 0.448687 0.028795 15.5820 0.004093 **
x1 0.588746 0.060040 9.8059 0.010240 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 0.331258 Adj. R2: 0.834451
Within R2: 0.57748
F-test (1st stage), endo: stat = 24.1 , p = 2.397e-6, on 1 and 147 DoF.
Wu-Hausman: stat = 4.3448, p = 0.038888, on 1 and 144 DoF.
The reason I'm surprised with ivpois()
output is that it gives 3 variables: endo, x1 and resid_1st_stage:
est_iv_pois = ivpois(y ~ x1 | fe | endo ~ inst, base)
> summary(est_iv_pois)
GLM estimation, family = poisson, Dep. Var.: y
Observations: 150
Fixed-effects: fe: 3
Standard-errors: Custom
Estimate Std. Error t value Pr(>|t|))
endo 0.070064 0.043244 1.6202 0.105189
x1 0.103268 0.025451 4.0574 0.000050 ***
resid_1st_stage 0.059865 0.042962 1.3934 0.163487
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Log-Likelihood: -272.8 Adj. Pseudo R2: 0.009035
BIC: 575.7 Squared Cor.: 0.867152
Shouldn't endo be removed from the list of explanatory variables in the second stage and replaced by resid_1st_stage, as it is the case in "basic IV"?
@etiennebacher Check equation 2.8 in page 8 http://econ.msu.edu/faculty/wooldridge/docs/qmle_endog_r3.pdf . The endogenous variable is not fitted as in 2SLS, instead both the endogenous variable and the residuals are included. The residuals would control for the endogenous component of the endogenous variable.
@etiennebacher: it's simply different methods. As I mentioned, it is the control function approach, not two-stages least squares. You cannot copy-paste the IV-OLS strategy to non-linear models (actually you can, but it'd give you wrong results).
In OLS, either a) using the 1st stage predicted variable as the 2nd stage predictor, or b) using the 1st stage residual as a regressor, lead to the same point estimates as shown in this example (note that in both cases the SEs are not correct however):
base = iris
names(base) = c("y", "x1", "endo", "inst", "fe")
# OLS => equivalence
# Stage 1:
est_stage_1 = feols(endo ~ x1 + inst, base)
base$fit_endo = predict(est_stage_1)
base$resid_1 = resid(est_stage_1)
# Stages 2:
est_stage_2 = feols(y ~ x1 + fit_endo, base)
est_stage_2_cf = feols(y ~ x1 + endo + resid_1, base)
etable(est_stage_2, est_stage_2_cf)
#> est_stage_2 est_stage_2_cf
#> Dependent Var.: y y
#>
#> (Intercept) 2.439*** (0.3414) 2.439*** (0.2380)
#> x1 0.5592*** (0.0946) 0.5592*** (0.0660)
#> endo 0.4510*** (0.0242) 0.4510*** (0.0168)
#> resid_1 0.2582*** (0.0592)
#> _______________ __________________ __________________
#> S.E. type IID IID
#> Observations 150 150
#> R2 0.70724 0.85861
#> Adj. R2 0.70325 0.85571
This equivalence does not hold anymore with non-linear models:
# poisson => equivalence not holding
# Stage 1: same as OLS
# Stages 2:
est_pois_stage_2 = fepois(y ~ x1 + fit_endo, base)
est_pois_stage_2_cf = fepois(y ~ x1 + endo + resid_1, base)
etable(est_pois_stage_2, est_pois_stage_2_cf)
#> est_stage_2 est_stage_2_cf
#> Dependent Var.: y y
#>
#> (Intercept) 1.199*** (0.3149) 1.190*** (0.3165)
#> x1 0.0891 (0.0877) 0.0904 (0.0879)
#> endo 0.0764*** (0.0221) 0.0773*** (0.0224)
#> resid_1 0.0394 (0.0752)
#> _______________ __________________ __________________
#> S.E. type IID IID
#> Observations 150 150
#> Squared Cor. 0.69341 0.86080
#> Pseudo R2 0.02194 0.02669
#> BIC 563.44 565.79
Econometric theory suggests to use the residual inclusion for non-linear models.
It's an econometrics question, so for more details I suggest you to post a question on CrossValidated.
@leucotheas: Actually it was a bug in feols.fit
, now corrected so that the code needs not to be updated (although I agree that having more explicit variable names wouldn't hurt!).
@lrberge @leucotheas Got it, thank you for your answers
@lrberge @etiennebacher I just wanted to clarify, in the previous document that I mentioned (http://econ.msu.edu/faculty/wooldridge/docs/qmle_endog_r3.pdf ) it is actually equation 4.17 that corresponds to Poisson estimation, which if I understand well is what @lrberge is doing in the code suggested
Hello, thanks for this package! Although I haven't used it a lot so far, it seems very efficient and full of nice functionalities.
I would like to regress bilateral migration flows on various explanatory variables, and therefore I would like to use the Poisson Pseudo Maximum Likelihood (PPML), that is common in the literature. In Stata, I would use
ppmlhdfe
. If I am correct, here I have to usefepois
, which is the function you compared toppmlhdfe
in the benchmark. First of all, is this correct?I suspect that one of my explanatory variable is endogenous, and therefore I would like to instrument it. However, it is not possible to do with
fepois
:I guess this should be possible, if I rely on what Santos Silva & Tenreyro (2006) said in the paper where they recommended the use of PPML (p.646, footnote 16):
How can I both estimate a PPML model and instrument my endogenous variable with
fepois
?Note that I am not sure whether it is possible to do so with
ppmlhdfe
.