py-econometrics / pyfixest

Fast High-Dimensional Fixed Effects Regression in Python following fixest-syntax
https://py-econometrics.github.io/pyfixest/
MIT License
172 stars 35 forks source link

Differences in IV estimation relative to linearmodels #197

Closed aeturrell closed 1 year ago

aeturrell commented 1 year ago

I'm working on an IV example that was previously working in linearmodels but doesn't solve in pyfixest. It's possible that I've just misunderstood the syntax though!

reprex:

In this case, the model will be

$$ \text{Price}_i = \hat{\pi_0} + \hat{\pi_1} \text{SalesTax}_i + v_i $$

in the first stage regression and

$$ \text{Packs}_i = \hat{\beta_0} + \hat{\beta_2}\widehat{\text{Price}_i} + \hat{\beta_1} \text{RealIncome}_i + u_i $$

in the second stage.

Data:

import pandas as pd
from linearmodels.iv import IV2SLS

dfiv = pd.read_csv(
    "https://vincentarelbundock.github.io/Rdatasets/csv/AER/CigarettesSW.csv",
    dtype={"state": "category", "year": "category"},
).assign(
    rprice=lambda x: x["price"] / x["cpi"],
    rincome=lambda x: x["income"] / x["population"] / x["cpi"],
)
dfiv.head()

linearmodels runs okay:

results_iv2sls = IV2SLS.from_formula(
    "np.log(packs) ~ 1 + np.log(rincome) + C(year) + C(state) + [np.log(rprice) ~ taxs]",
    df,
).fit(cov_type="clustered", clusters=df["year"])
print(results_iv2sls.summary)
                          IV-2SLS Estimation Summary                          
==============================================================================
Dep. Variable:          np.log(packs)   R-squared:                      0.9659
Estimator:                    IV-2SLS   Adj. R-squared:                 0.9279
No. Observations:                  96   F-statistic:                -1.296e+17
Date:                Thu, Oct 26 2023   P-value (F-stat)                1.0000
Time:                        09:31:50   Distribution:                 chi2(50)
Cov. Estimator:             clustered                                         

                                Parameter Estimates                                
===================================================================================
                 Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
-----------------------------------------------------------------------------------
Intercept           9.4924     0.0263     360.24     0.0000      9.4407      9.5440
np.log(rincome)     0.4434                                                         
C(year)[T.1995]    -0.0328                                                         
C(state)[T.AR]      0.1770     0.0531     3.3338     0.0009      0.0729      0.2810
C(state)[T.AZ]     -0.0899     0.0132    -6.8132     0.0000     -0.1158     -0.0640
C(state)[T.CA]     -0.2781     0.0214    -12.996     0.0000     -0.3200     -0.2361
C(state)[T.CO]     -0.2479     0.0090    -27.625     0.0000     -0.2655     -0.2303
C(state)[T.CT]     -0.0171     0.0196    -0.8720     0.3832     -0.0556      0.0213
C(state)[T.DE]      0.1110     0.0291     3.8105     0.0001      0.0539      0.1682
C(state)[T.FL]      0.0762     0.0142     5.3596     0.0000      0.0483      0.1041
C(state)[T.GA]     -0.0695     0.0251    -2.7706     0.0056     -0.1186     -0.0203
C(state)[T.IA]      0.0120     0.0739     0.1629     0.8706     -0.1328      0.1569
C(state)[T.ID]     -0.1272     0.0077    -16.597     0.0000     -0.1423     -0.1122
C(state)[T.IL]     -0.0339     0.0081    -4.1912     0.0000     -0.0497     -0.0180
C(state)[T.IN]      0.1198     0.0611     1.9609     0.0499   5.573e-05      0.2395
C(state)[T.KS]     -0.0910     0.0305    -2.9884     0.0028     -0.1507     -0.0313
C(state)[T.KY]      0.3525     0.0631     5.5906     0.0000      0.2289      0.4761
C(state)[T.LA]      0.1315     0.0104     12.664     0.0000      0.1112      0.1519
C(state)[T.MA]     -0.0403     0.0069    -5.8826     0.0000     -0.0538     -0.0269
C(state)[T.MD]     -0.2322     0.0239    -9.7376     0.0000     -0.2790     -0.1855
C(state)[T.ME]      0.2008     0.0574     3.5011     0.0005      0.0884      0.3133
C(state)[T.MI]      0.1268     0.0745     1.7009     0.0890     -0.0193      0.2728
C(state)[T.MN]      0.0568     0.0490     1.1595     0.2463     -0.0392      0.1529
C(state)[T.MO]      0.0640     0.0476     1.3454     0.1785     -0.0292      0.1572
C(state)[T.MS]      0.1501     0.0272     5.5267     0.0000      0.0969      0.2034
C(state)[T.MT]     -0.1522     0.0054    -28.250     0.0000     -0.1627     -0.1416
C(state)[T.NC]      0.0396     0.0191     2.0655     0.0389      0.0020      0.0771
C(state)[T.ND]     -0.0311     0.0399    -0.7787     0.4361     -0.1092      0.0471
C(state)[T.NE]     -0.0741     0.0375    -1.9765     0.0481     -0.1476     -0.0006
C(state)[T.NH]      0.3504     0.0315     11.114     0.0000      0.2886      0.4122
C(state)[T.NJ]     -0.0873  6.107e-05    -1429.3     0.0000     -0.0874     -0.0872
C(state)[T.NM]     -0.2858     0.0040    -71.049     0.0000     -0.2937     -0.2779
C(state)[T.NV]      0.1789     0.0259     6.9075     0.0000      0.1281      0.2296
C(state)[T.NY]     -0.0719     0.0032    -22.256     0.0000     -0.0782     -0.0655
C(state)[T.OH]      0.0325     0.0402     0.8088     0.4186     -0.0463      0.1114
C(state)[T.OK]      0.0946     0.0538     1.7572     0.0789     -0.0109      0.2000
C(state)[T.OR]     -0.0153     0.0673    -0.2269     0.8205     -0.1471      0.1166
C(state)[T.PA]     -0.0031     0.0006    -4.8401     0.0000     -0.0044     -0.0019
C(state)[T.RI]      0.1394     0.0921     1.5136     0.1301     -0.0411      0.3200
C(state)[T.SC]     -0.0212     0.0334    -0.6345     0.5257     -0.0866      0.0442
C(state)[T.SD]     -0.0675     0.0711    -0.9488     0.3427     -0.2069      0.0719
C(state)[T.TN]      0.1473     0.0470     3.1340     0.0017      0.0552      0.2394
C(state)[T.TX]     -0.0579     0.0136    -4.2560     0.0000     -0.0845     -0.0312
C(state)[T.UT]     -0.4899     0.0276    -17.776     0.0000     -0.5440     -0.4359
C(state)[T.VA]     -0.0559     0.0471    -1.1875     0.2350     -0.1482      0.0364
C(state)[T.VT]      0.2209     0.0467     4.7267     0.0000      0.1293      0.3125
C(state)[T.WA]      0.0064     0.0011     6.0151     0.0000      0.0043      0.0085
C(state)[T.WI]      0.0741     0.0590     1.2569     0.2088     -0.0415      0.1897
C(state)[T.WV]      0.1576     0.0582     2.7097     0.0067      0.0436      0.2716
C(state)[T.WY]     -0.0169     0.0590    -0.2858     0.7750     -0.1325      0.0988
np.log(rprice)     -1.2793                                                         
===================================================================================

Endogenous: np.log(rprice)
Instruments: taxs
Clustered Covariance (One-Way)
Debiased: False
Num Clusters: 2

pyfixest produces an "UnderDeterminedIVError". Code:

results_iv = feols("np.log(packs) ~ 1 + np.log(rincome) | C(year) + C(state) | np.log(rprice) ~ taxs ", data=dfiv, vcov={"CRV1": "year"})
results_iv.summary()
---------------------------------------------------------------------------
UnderDeterminedIVError                    Traceback (most recent call last)
[/Users/aet/Documents/git_projects/coding-for-economists/econmt-regression.ipynb](https://file+.vscode-resource.vscode-cdn.net/Users/aet/Documents/git_projects/coding-for-economists/econmt-regression.ipynb) Cell 67 line 1
----> [1](vscode-notebook-cell:/Users/aet/Documents/git_projects/coding-for-economists/econmt-regression.ipynb#Y340sZmlsZQ%3D%3D?line=0) results_iv = feols("np.log(packs) ~ 1 + np.log(rincome) + C(year) + C(state) | np.log(rprice) ~ taxs ", data=dfiv, vcov={"CRV1": "year"})
      [2](vscode-notebook-cell:/Users/aet/Documents/git_projects/coding-for-economists/econmt-regression.ipynb#Y340sZmlsZQ%3D%3D?line=1) results_iv.summary()

File [~/mambaforge/envs/codeforecon/lib/python3.10/site-packages/pyfixest/estimation.py:129](https://file+.vscode-resource.vscode-cdn.net/Users/aet/Documents/git_projects/coding-for-economists/~/mambaforge/envs/codeforecon/lib/python3.10/site-packages/pyfixest/estimation.py:129), in feols(fml, data, vcov, ssc, fixef_rm, collin_tol)
    126 _estimation_input_checks(fml, data, vcov, ssc, fixef_rm, collin_tol)
    128 fixest = FixestMulti(data=data)
--> 129 fixest._prepare_estimation("feols", fml, vcov, ssc, fixef_rm)
    131 # demean all models: based on fixed effects x split x missing value combinations
    132 fixest._estimate_all_models(vcov, fixest._fixef_keys, collin_tol=collin_tol)

File [~/mambaforge/envs/codeforecon/lib/python3.10/site-packages/pyfixest/FixestMulti.py:85](https://file+.vscode-resource.vscode-cdn.net/Users/aet/Documents/git_projects/coding-for-economists/~/mambaforge/envs/codeforecon/lib/python3.10/site-packages/pyfixest/FixestMulti.py:85), in FixestMulti._prepare_estimation(self, estimation, fml, vcov, ssc, fixef_rm)
     82 self._fixef_keys = None
     83 self._is_multiple_estimation = None
---> 85 fxst_fml = FixestFormulaParser(fml)
     86 fxst_fml.get_fml_dict()  # fxst_fml._fml_dict might look like this: {'0': {'Y': ['Y~X1'], 'Y2': ['Y2~X1']}}. Hence {FE: {DEPVAR: [FMLS]}}
     87 if fxst_fml._is_iv:

File [~/mambaforge/envs/codeforecon/lib/python3.10/site-packages/pyfixest/FormulaParser.py:100](https://file+.vscode-resource.vscode-cdn.net/Users/aet/Documents/git_projects/coding-for-economists/~/mambaforge/envs/codeforecon/lib/python3.10/site-packages/pyfixest/FormulaParser.py:100), in FixestFormulaParser.__init__(self, fml)
     98 if endogvars is not None:
     99     if len(endogvars) > len(instruments):
--> 100         raise UnderDeterminedIVError(
    101             "The IV system is underdetermined. Only fully determined systems are allowed. Please provide as many instruments as endogenous variables."
    102         )
    103     else:
    104         pass

UnderDeterminedIVError: The IV system is underdetermined. Only fully determined systems are allowed.

Grateful for any pointers!

s3alfisc commented 1 year ago

This is of course a bug and likely related to having the plus 1 in the first part of the formula, if you drop that, things should run through (I believe). I'll try to fix it asap =) btw I am really looking forward to all your feedback, there's likely a lot of edge cases that I have not thought about testing - just like this one.

aeturrell commented 1 year ago

Thanks Alex. Had a go at dropping the plus 1 (as below) but still got the same error in this case.

results_iv = feols("np.log(packs) ~ np.log(rincome) | C(year) + C(state) | np.log(rprice) ~ taxs ", data=dfiv, vcov={"CRV1": "year"})
results_iv.summary()
s3alfisc commented 1 year ago

Just took a look - it's a minor bug - in your example, the code accidentally compares the length of two strings vs two lists, which then triggers the error you find. I'll release a new version in a bit, as I found another small bug with the summary() function for IV objects.

Btw, it's currently not supported to wrap the fixed effects inside C(.) as ... | C(year) + C(state) | ... as I process the fixed effects outside of formulaic. The error you'd get is "KeyError: "None of [Index(['C(year)', 'C(state)'], dtype='object')] are in the [columns]"". Something I should allow for convenience.

s3alfisc commented 1 year ago

It's fixed now:

results_iv = feols("np.log(packs) ~ np.log(rincome) |year + state| np.log(rprice) ~ taxs ", data=dfiv, vcov={"CRV1": "year"})
results_iv.summary()

# ###
# 
# Estimation:  IV
# Dep. var.: np.log(packs), Fixed effects: year+state
# Inference:  CRV1
# Observations:  96

# | Coefficient     |   Estimate |   Std. Error |                t value |   Pr(>|t|) |   2.5 % |   97.5 % |
# |:----------------|-----------:|-------------:|-----------------------:|-----------:|--------:|---------:|
# | np.log(rprice)  |     -1.279 |        0.000 | -38033610883142432.000 |      0.000 |  -1.279 |   -1.279 |
# | np.log(rincome) |      0.443 |        0.000 |   4193485804221483.000 |      0.000 |   0.443 |    0.443 |

Point estimates are identical to linearmodels and CIs are off (linearmodels does not even report them, maybe I should do so as well if t-stats get too large?).

I will merge #198 after the CI tests pass!

s3alfisc commented 1 year ago

It's merged into master. I'll release the new version to pypi later today =)

s3alfisc commented 1 year ago

And release to PyPi!

aeturrell commented 1 year ago

Crikey, amazing! Thanks so much for fixing so quickly. So, on the C(...) syntax, my default has always been to use it (explicit is better than implicit and all that). But having had a quick look at patsy, R formulas, and formulaic, I'm not sure how standard it is? My instinct would be if it's standard, have it in there, but if it's not, maybe I should change what's written in Coding for Economists to reflect this (ie remove C(...) notation), and put more emphasis on ensuring the data type going in to regressions as a fixed effect is categorical.

Re: t-tests, there's probably some value in switching to scientific notation at some point, but think this example is an edge case and it's not a big issue!