py-econometrics / pyfixest

Fast High-Dimensional Fixed Effects Regression in Python following fixest-syntax
https://py-econometrics.github.io/pyfixest/
MIT License
175 stars 35 forks source link

Bug: Incorrect prediction with WLS anf fixed effects and newdata not None #678

Closed s3alfisc closed 3 weeks ago

s3alfisc commented 3 weeks ago

For WLS estimation with newdata argument provided, the output of predict is incorrect.

For all other cases, it matched fixest.

See examples below:

No newdata, no weights

import pyfixest as pf
import pandas as pd
pd.options.display.precision = 7

df_castle = pd.read_stata("https://github.com/scunning1975/mixtape/raw/master/castle.dta")
df_castle = df_castle.astype({"post": bool})

data = df_castle.loc[df_castle["post"] == 0, :]
vcov = "hetero"

fit = pf.feols(
  fml = "l_homicide ~ 1 | state + year",
  data = data, 
)

fit.predict()[0:5]
# array([1.96511559, 1.98852372, 1.96735659, 2.01274515, 2.00770564])
library(fixest)
library(reticulate)

fit = feols(
  fml = l_homicide ~ 1 | state + year,
  data = py$data, 
)

predict(fit)[0:5]
# [1] 1.965116 1.988524 1.967357 2.012745 2.007706

No newdata, weights

fit = pf.feols(
  fml = "l_homicide ~ 1 | state + year",
  data = data, 
  weights = "popwt"
)

fit.predict()[0:5]
#array([1.98343013, 2.0046152 , 1.99374426, 2.01362497, 1.98860927])
fit = feols(
  fml = l_homicide ~ 1 | state + year,
  data = py$data, 
  weights = ~popwt
)

predict(fit)[0:5]
# [1] 1.983430 2.004615 1.993744 2.013625 1.988609

newdata, no weights

fit = pf.feols(
  fml = "l_homicide ~ 1 | state + year",
  data = data, 
)

fit.predict(data.iloc[0:100])[0:5]
# array([1.98873732, 2.01639826, 1.94689312, 1.99400977, 2.0091239 ])
fit = feols(
  fml = l_homicide ~ 1 | state + year,
  data = py$data, 
)

predict(fit, newdata = py$data[1:100,])[0:5]
# [1] 1.988737 2.016398 1.946893 1.994010 2.009124

newdata, weights

fit = pf.feols(
  fml = "l_homicide ~ 1 | state + year",
  data = data, 
  weights = "popwt"
)

fit.predict(newdata = data.iloc[0:100])[0:5]
# array([1.98028402, 1.99191327, 1.9958799 , 2.02384829, 1.99859166])
fit = feols(
  fml = l_homicide ~ 1 | state + year,
  data = py$data, 
  weights = ~popwt
)

predict(fit, newdata = py$data[1:100,])[0:5]
# [1] 1.983430 2.004615 1.993744 2.013625 1.988609

@all-contributors please add @marcandre259 for bug

s3alfisc commented 3 weeks ago

@all-contributors please add @marcandre259 for bug

allcontributors[bot] commented 3 weeks ago

@s3alfisc

I've put up a pull request to add @marcandre259! :tada:

s3alfisc commented 3 weeks ago

fixest produces different values for sumFE than PyFixest:

fit = pf.feols(
  fml = "l_homicide ~ cdl | state + year",
  data = data, 
  weights = "popwt"
)

fit.fixef()
fit._sumFE[0:5]
# array([1.96134687, 1.984755  , 1.96358787, 2.00897643, 2.00393692])
fit = feols(
  fml = l_homicide ~ cdl | state + year,
  data = py$data, 
  weights = ~popwt
)

fit$sumFE[1:5]
# [1] 1.979555 2.000740 1.989869 2.009749 1.984734

In all likelihood, the bug hence stems from the fixef() method, which does not seem to take weights into account at all.

In particular, the least squares problem solved via lsqr does not deal with weights: https://github.com/py-econometrics/pyfixest/blob/5250fc32d5a3fdff10eca2980ab1890d7e537310/pyfixest/estimation/feols_.py#L1535

s3alfisc commented 3 weeks ago

Hi @marcandre259 & also @leostimpfle , I found the bug:

For a regression without weights, we are fitting a regression model of the following form:

$$ [ Y = \begin{bmatrix} X & D \end{bmatrix} \begin{bmatrix} \beta \ \alpha \end{bmatrix}^{T} + \epsilon ] $$

After fitting $\hat{\beta}$ via FWL, we have $\hat{\beta}$.

We can then solve for $\hat{\alpha}$ via the a sparse solver in lsqr.

$$ [ D'D\hat{\alpha} = D'(Y - X\hat{\beta}). ] $$

If we have weights, we need to multiply D, X and Y with the standard $\sqrt(weights)$, which we simply did not do!

All of this (minus the part about weights) is nicely explained in the lfe vignette: link.

I'll prepare the PR for this now + add some tests.

@leostimpfle, this is also what we have to use for retrieving the alphas in the Poisson class (with the correct weights from the Poisson fit), as the strategy to fit GLMs in PyFixest is to implement an iterated weighted least squares algo (and therefore, the link function does not matter). For the same reason, Fepois can inherit the vcov class from Feols.