rvlenth / emmeans

Estimated marginal means
https://rvlenth.github.io/emmeans/

[Feature request] support for fixest models #441

Closed albertostefanelli closed 8 months ago

albertostefanelli commented 10 months ago

As per the title, it would be nice if the package supported fixest models. fixest supports multiple fixed effects for OLS and GLMs. It is implemented in parallelized C++ code and thus scales especially well to large data sets.

rvlenth commented 10 months ago

I wonder if it is already largely supported via qdrg:

> res = feols(Sepal.Length ~ Sepal.Width + Petal.Length, iris)
> (rg = qdrg(object = res, data = iris, at = list(Sepal.Width = 2:4, Petal.Length = 3:4)))
'emmGrid' object with variables:
    Sepal.Width = 2, 3, 4
    Petal.Length = 3, 4

> emmeans(rg, "Sepal.Width")
 Sepal.Width emmean     SE  df asymp.LCL asymp.UCL
           2   5.09 0.0801 Inf      4.93      5.25
           3   5.69 0.0281 Inf      5.63      5.74
           4   6.28 0.0692 Inf      6.15      6.42

Results are averaged over the levels of: Petal.Length 
Confidence level used: 0.95 
rvlenth commented 10 months ago

Just so you know, I have not forgotten this; hope to follow through soon.

rvlenth commented 9 months ago

I am really at a loss as to what is needed here, and whether to go through with it. I do have something that apparently works. But please look at these examples:

ores = feols(Sepal.Length ~ Sepal.Width + Petal.Length | Species, iris)
ores1 = feols(Sepal.Length ~ Sepal.Width + Petal.Length + Species, iris)

These are two models, with and without what fixest calls "fixed effects".

> emmeans(ores, ~ Sepal.Width*Petal.Length, at = list(Sepal.Width=3:4, Petal.Length=3:4))
 Sepal.Width Petal.Length emmean    SE  df asymp.LCL asymp.UCL
           3            3   3.62 0.358 Inf      2.92      4.32
           4            3   4.06 0.476 Inf      3.12      4.99
           3            4   4.40 0.396 Inf      3.62      5.17
           4            4   4.83 0.477 Inf      3.90      5.77

Confidence level used: 0.95 

> emmeans(ores1, ~ Sepal.Width*Petal.Length*Species, at = list(Sepal.Width=3:4, Petal.Length=3:4))
 Sepal.Width Petal.Length Species    emmean     SE  df asymp.LCL asymp.UCL
           3            3 setosa       6.01 0.1245 Inf      5.77      6.26
           4            3 setosa       6.45 0.1019 Inf      6.25      6.65
           3            4 setosa       6.79 0.1844 Inf      6.43      7.15
           4            4 setosa       7.22 0.1579 Inf      6.91      7.53
           3            3 versicolor   5.06 0.0999 Inf      4.86      5.25
           4            3 versicolor   5.49 0.1569 Inf      5.18      5.80
           3            4 versicolor   5.83 0.0528 Inf      5.73      5.94
           4            4 versicolor   6.27 0.1162 Inf      6.04      6.49
           3            3 virginica    4.62 0.1705 Inf      4.29      4.95
           4            3 virginica    5.05 0.2148 Inf      4.63      5.47
           3            4 virginica    5.40 0.1097 Inf      5.18      5.61
           4            4 virginica    5.83 0.1585 Inf      5.52      6.14

Confidence level used: 0.95 

The two models have the same predictions:

> grd = .Last.value@grid
> predict(ores, newdata=grd) - predict(ores1, newdata=grd)
 [1] 2.815526e-13 2.371436e-13 3.019807e-13 2.584599e-13 2.158274e-13 1.714184e-13 2.362555e-13
 [8] 1.927347e-13 8.970602e-14 4.529710e-14 1.101341e-13 6.661338e-14

Apparently, the "fixed effects" are really intercepts in the ores model, in that predictions from ores are obtained by simply adding the fixed effects to the estimates we see in the first emmeans() output. As far as I can tell, there is no way to obtain the covariances between those fixed effects and the regression coefficients. (I verified this by trying -- and failing -- to get results from predict(ores, newdata = grd, se = TRUE), whereas this works for ores1 and yields the same SEs as in the emmeans() output.) Instead, the uncertainty shows up in the SEs: for ores, the SEs are between 0.358 and 0.477, while those for ores1 are much lower. I guess the idea is that the SEs we have here cover us for any possible value of the fixed effects?

Anyway, in this ores example, would you want Species included in the grid, or excluded, as shown? I'd be extremely uncomfortable including Species since I don't know how to estimate the SEs; and on the other hand, it appears that the results are extremely conservative. I'm just not familiar with what people would want and whether EMMs even make sense for those types of models.
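The "fixed effects are really intercepts" observation can be checked directly. Here is a minimal sketch (assuming fixest's documented coef(), fixef(), and predict() methods) showing that the feols predictions decompose into a slope part plus a per-species intercept recovered with fixef():

```r
library(fixest)

ores <- feols(Sepal.Length ~ Sepal.Width + Petal.Length | Species, iris)

# slope part: X %*% beta, using only the reported coefficients
b <- coef(ores)                      # Sepal.Width, Petal.Length
X <- as.matrix(iris[, names(b)])
slope_part <- as.vector(X %*% b)

# "fixed effect" part: one intercept per species, recovered with fixef()
fe <- fixef(ores)$Species            # named vector indexed by species level
intercept_part <- fe[as.character(iris$Species)]

# predictions should decompose exactly (up to rounding)
all.equal(unname(slope_part + intercept_part), unname(predict(ores)))
```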

rvlenth commented 9 months ago

I think I'm homing in on this. I've set it up so that any arguments you want can be passed to vcov(), since the vignettes make it clear this is important to do. Also, I've set up a static offset equal to the sum of the means of the fixed effects; this makes the EMMs equal to those we obtain from lm() with the fixed factors added as ordinary predictors.

Following is the code for the required methods, followed by some examples of means and slopes:

### recover_data and emm_basis methods ================================

recover_data.fixest = function (object, ...) {
    fcall = object$call
    dat = recover_data(fcall, delete.response(terms(object)), object$na.action, 
                       pwts = weights(object), ...)
    if(!is.character(dat)) { # i.e., we actually did recover data
        # if any fixed effects, make them an offset
        fe = try(fixef(object), silent = TRUE)
        if(!inherits(fe, "try-error")) {
            dat$.static.offset. = sum(sapply(fe, mean))
            attr(dat, "predictors") = c(attr(dat, "predictors"), ".static.offset.")
        }
    }
    dat
}

emm_basis.fixest = function(object, trms, xlev, grid, ...) {
    # coef() works right for lm but coef.aov tosses out NAs
    bhat = coef(object)
    nm = if(is.null(names(bhat))) row.names(bhat) else names(bhat)
    m = suppressWarnings(model.frame(trms, grid, na.action = na.pass, xlev = xlev))
    X = model.matrix(trms, m, contrasts.arg = object$contrasts)
    assign = attr(X, "assign")
    X = X[, nm, drop = FALSE]
    bhat = as.numeric(bhat) 
    V = vcov(object, ...) # we're using fixest's method as it already allows vcov option

    # fixest handles rank deficiencies by tossing columns. Plus it seems hard to 
    # recover the portion of X after the fixed effects are orthogonalized out.
    # So at this point we are just ignoring non-estimability issues (and generally
    # bhat will NOT be aligned with linfct, creating mysterious errors downstream)
    # if (sum(is.na(bhat)) > 0)
    #     nbasis = estimability::nonest.basis(object$qr)
    # else
    nbasis = estimability::all.estble

    misc = list()
    if (!is.null(fam <- object$family)) {
        if(is.character(fam)) 
            fam = list(family = fam, link = "identity")
        misc = .std.link.labels(fam, misc)
        dfargs = list(df = Inf)
    }
    else
        dfargs = list(df = object$nobs - object$nparams)
    dffun = function(k, dfargs) dfargs$df
    list(X=X, bhat=bhat, nbasis=nbasis, V=V, dffun=dffun, dfargs=dfargs, misc=misc,
         model.matrix = .cmpMM(object$qr, assign = assign))
}

# multi methods ------------------------------------------------
recover_data.fixest_multi = function(object, which, ...) {
    if(missing(which))
        return("For 'fixest_multi' objects, you must use 'which' to specify which response to use.")
    recover_data.fixest(object[[which]], ...)
}

emm_basis.fixest_multi = function(object, trms, xlev, grid, which, ...)
    emm_basis.fixest(object[[which]], trms, xlev, grid, ...)

# Example ============================================
library(fixest)
library(emmeans)

data(Fatalities, package = "AER")
Fatalities$frate <- with(Fatalities, fatal/pop * 10000)
mod = feols(frate ~ breath * jail * beertax | state + year, data = Fatalities)
## NOTE: 1 observation removed because of NA values (RHS: 1).

emmeans(mod, ~ breath*jail)
## NOTE: Results may be misleading due to involvement in interactions
##  breath jail emmean    SE  df lower.CL upper.CL
##  no     no     2.01 0.142 274     1.74     2.29
##  yes    no     2.02 0.137 274     1.75     2.29
##  no     yes    2.13 0.136 274     1.87     2.40
##  yes    yes    2.16 0.133 274     1.90     2.42
## 
## Confidence level used: 0.95
emmeans(mod, ~ breath*jail, vcov = "iid")
## NOTE: Results may be misleading due to involvement in interactions
##  breath jail emmean    SE  df lower.CL upper.CL
##  no     no     2.01 0.103 274     1.81     2.22
##  yes    no     2.02 0.118 274     1.79     2.25
##  no     yes    2.13 0.119 274     1.90     2.37
##  yes    yes    2.16 0.147 274     1.87     2.45
## 
## Confidence level used: 0.95
emmeans(mod, ~ breath*jail, cluster = ~ state + year)
## NOTE: Results may be misleading due to involvement in interactions
##  breath jail emmean    SE  df lower.CL upper.CL
##  no     no     2.01 0.128 274     1.76     2.27
##  yes    no     2.02 0.130 274     1.77     2.28
##  no     yes    2.13 0.123 274     1.89     2.38
##  yes    yes    2.16 0.110 274     1.94     2.38
## 
## Confidence level used: 0.95

emtrends(mod, ~ breath * jail, var = "beertax")
##  breath jail beertax.trend    SE  df lower.CL upper.CL
##  no     no          -0.707 0.276 274  -1.2490  -0.1642
##  yes    no          -0.608 0.315 274  -1.2283   0.0124
##  no     yes         -0.346 0.348 274  -1.0322   0.3394
##  yes    yes          0.705 0.393 274  -0.0682   1.4774
## 
## Confidence level used: 0.95
emtrends(mod, ~ breath * jail, var = "beertax", vcov = "twoway")
##  breath jail beertax.trend    SE  df lower.CL upper.CL
##  no     no          -0.707 0.249 274  -1.1973  -0.2159
##  yes    no          -0.608 0.267 274  -1.1331  -0.0828
##  no     yes         -0.346 0.316 274  -0.9677   0.2750
##  yes    yes          0.705 0.335 274   0.0453   1.3639
## 
## Confidence level used: 0.95

Created on 2023-10-10 with reprex v2.0.2

I'm thinking of submitting this to the fixest site as a pull request. I think it's often better to turn this over to the package developers, as they may want tweaks or find cases that don't quite work. And as noted in code comments, we really are not handling rank-deficient models correctly.

Any comments?

rvlenth commented 9 months ago

I also note the surprising result that we are predicting higher fatality rates in places where there are jail sentences for drunk drivers, and that there is a significant positive trend in fatalities as beer tax increases when we have both breath tests and jail sentences. We may have confounding going on; for example, the beer-tax effect may be mediated by other factors.

albertostefanelli commented 9 months ago

This is wonderful. Thanks for all the work.

A few notes:

  1. The SEs for ores1 should be the same as those of an OLS model.
  2. I am also surprised that the SEs differ between ores1 and ores. The only difference mentioned in the documentation is that, to speed up computation, the SEs of the slope estimates are not estimated when there is an interaction between a predictor and the FE (i.e., random-effect models).
  3. Out of curiosity, what exactly do you mean by "hard to recover the portion of X after the fixed effects are orthogonalized out"?
rvlenth commented 9 months ago
  1. Yes, that's true. That's because we didn't specify anything as a "fixed effect". I just don't get how something that's called a "fixed effect" can inflate the variance so much. *All* the terms in ores1 are what I call fixed effects, and a lot of linear-models theory backs me up on that. This is an entirely different use of the term. Obviously, I don't really understand what they are talking about.
  2. See 1.
  3. My intuition says that we are talking about a partitioned model matrix [F | X], where F is the model matrix for the "fixed effects" and X is the rest of the model matrix. By linear-models theory, to get the coefficients for the X part of this system, you first adjust the response variable, as well as all the columns of X, for F. So X gets replaced by the residuals of each of its columns after regressing them on F. Another name for that is orthogonalization.
  4. I have not progressed further, as I looked at additional examples involving instrumental variables. For those, we get coefficients in the summary, but those variables don't show up in the terms component of the model object, and so far I don't see a straightforward way of dealing with them. This refers back to (1) -- I really don't understand what I'm dealing with, and when I try to read about it I get all balled up in econometrics jargon.
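The orthogonalization in (3) is the Frisch-Waugh-Lovell result; a minimal base-R sketch (no fixest needed) shows that residualizing on the factor reproduces the feols slopes:

```r
# Frisch-Waugh-Lovell sketch: regress the factor F out of y and out of each
# column of X, then run OLS on the residuals. The resulting slopes equal the
# coefficients feols() reports for Sepal.Width and Petal.Length.
F <- model.matrix(~ Species, iris)
X <- as.matrix(iris[, c("Sepal.Width", "Petal.Length")])
y <- iris$Sepal.Length

ry <- lm.fit(F, y)$residuals
rX <- apply(X, 2, function(col) lm.fit(F, col)$residuals)

lm.fit(rX, ry)$coefficients
# should match coef(feols(Sepal.Length ~ Sepal.Width + Petal.Length | Species, iris))
```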
rvlenth commented 9 months ago

I deleted a previous comment. I looked at this again and somehow I was just creating confusion. I do obtain a fixed-effects analysis:

lm.iris = lm(Sepal.Length ~ Sepal.Width + Petal.Length + Species, iris)
summary(lm.iris)$coefficients |> head()
##                     Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)        2.3903891 0.26226815  9.114294 5.942826e-16
## Sepal.Width        0.4322172 0.08138982  5.310458 4.025982e-07
## Petal.Length       0.7756295 0.06424566 12.072869 1.151112e-23
## Speciesversicolor -0.9558123 0.21519853 -4.441537 1.759999e-05
## Speciesvirginica  -1.3940979 0.28566053 -4.880261 2.759618e-06

library(fixest)
fe.iris = feols(Sepal.Length ~ Sepal.Width + Petal.Length | Species, iris)
summary(fe.iris, vcov = "iid")
## OLS estimation, Dep. Var.: Sepal.Length
## Observations: 150 
## Fixed-effects: Species: 3
## Standard-errors: IID 
##              Estimate Std. Error  t value  Pr(>|t|)    
## Sepal.Width  0.432217   0.081390  5.31046 4.026e-07 ***
## Petal.Length 0.775629   0.064246 12.07287 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## RMSE: 0.305129     Adj. R2: 0.859538
##                  Within R2: 0.641507

Created on 2023-10-28 with reprex v2.0.2

Note that the lm and feols analyses yield identical regression coefficients and SEs for Sepal.Width and Petal.Length when the latter is summarized with vcov = "iid".

rvlenth commented 9 months ago

OK, here's something to be careful about: given the emmeans support code I provided earlier, if you run emmeans(), the estimates you obtain will be OK, but the standard errors will be wrong. However, trends and contrasts will be correct for both estimates and standard errors. See the illustration below.

The reason is that the intercept is included in the fixed effects, not among the regular model terms. We have set things up so the intercept is included in the estimates by computing a static offset equal to the average of the fixed effects; however, we have no information about the variance of that intercept, nor its covariances with the model terms, so all those quantities are implicitly taken as zero. When computing trends or contrasts, the intercept receives no weight, and so we do get the correct SEs as well as estimates.

PS -- This is really what is behind my confusion in my Oct 7 comment, where the SEs looked way out of whack.

library(AER, quietly = TRUE)
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
data(Fatalities)
Fatalities$frate <- with(Fatalities, fatal/pop * 10000)

# Ordinary least-squares
lm.fit <- lm(frate ~ beertax + dry + youngdrivers + state + year, 
             data = Fatalities)
summary(lm.fit)$coefficients |> head()
##                 Estimate Std. Error     t value     Pr(>|t|)
## (Intercept)   2.48035486 0.54570384  4.54523991 8.182976e-06
## beertax      -0.64042382 0.19568991 -3.27264610 1.199576e-03
## dry           0.01990630 0.01538264  1.29407530 1.967096e-01
## youngdrivers  2.36102662 1.00980249  2.33810733 2.008836e-02
## stateaz      -0.01242474 0.45417151 -0.02735694 9.781946e-01
## statear      -0.83811177 0.31489322 -2.66157452 8.228571e-03

# "fixed-effects" model
library(fixest, quietly = TRUE)
fe.fit <- feols(frate ~ beertax + dry + youngdrivers | state + year, 
                data = Fatalities)
summary(fe.fit, vcov = "iid")
## OLS estimation, Dep. Var.: frate
## Observations: 336 
## Fixed-effects: state: 48,  year: 7
## Standard-errors: IID 
##               Estimate Std. Error  t value  Pr(>|t|)    
## beertax      -0.640424   0.195690 -3.27265 0.0011996 ** 
## dry           0.019906   0.015383  1.29408 0.1967096    
## youngdrivers  2.361027   1.009802  2.33811 0.0200884 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## RMSE: 0.169672     Adj. R2: 0.893362
##                  Within R2: 0.060002

source("h:/progs/R/devel/EMMEANS/fixest-support.R") # emmeans methods
library(emmeans)

emmeans(lm.fit, consec ~ dry, at = list(dry = 4:6))
## $emmeans
##  dry emmean     SE  df lower.CL upper.CL
##    4   2.04 0.0110 279     2.01     2.06
##    5   2.06 0.0152 279     2.03     2.08
##    6   2.07 0.0285 279     2.02     2.13
## 
## Results are averaged over the levels of: state, year 
## Confidence level used: 0.95 
## 
## $contrasts
##  contrast    estimate     SE  df t.ratio p.value
##  dry5 - dry4   0.0199 0.0154 279   1.294  0.1967
##  dry6 - dry5   0.0199 0.0154 279   1.294  0.1967
## 
## Results are averaged over the levels of: state, year 
## P value adjustment: mvt method for 2 tests
emmeans(fe.fit, consec ~ dry, at = list(dry = 4:6), vcov = "iid")
## $emmeans
##  dry emmean    SE  df lower.CL upper.CL
##    4   2.04 0.222 279     1.60     2.47
##    5   2.06 0.227 279     1.61     2.50
##    6   2.07 0.233 279     1.62     2.53
## 
## Confidence level used: 0.95 
## 
## $contrasts
##  contrast    estimate     SE  df t.ratio p.value
##  dry5 - dry4   0.0199 0.0154 279   1.294  0.1967
##  dry6 - dry5   0.0199 0.0154 279   1.294  0.1967
## 
## P value adjustment: mvt method for 2 tests

emtrends(lm.fit, "dry", var = "beertax")
##   dry beertax.trend    SE  df lower.CL upper.CL
##  4.27         -0.64 0.196 279    -1.03   -0.255
## 
## Results are averaged over the levels of: state, year 
## Confidence level used: 0.95
emtrends(fe.fit, "dry", var = "beertax", vcov = "iid")
##   dry beertax.trend    SE  df lower.CL upper.CL
##  4.27         -0.64 0.196 279    -1.03   -0.255
## 
## Confidence level used: 0.95

Created on 2023-10-29 with reprex v2.0.2

Observe that the dry EMMs match for the two models, but their SEs do not. The dry differences and their SEs match for the two models and equal those of the dry coefficient in the model summaries. Similarly, the beertax trends and SEs equal the beertax effects in the model summaries.
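The cancellation of the intercept can be seen in the linear functions involved. A toy sketch (hypothetical two-column layout: an offset column plus the dry predictor; not the actual emmGrid internals):

```r
# Each EMM row puts weight 1 on the unknown offset, so its (unknown) variance
# leaks into the EMM's SE; a difference of two rows puts weight 0 on it.
L4 <- c(offset = 1, dry = 4)   # linear function for the EMM at dry = 4
L5 <- c(offset = 1, dry = 5)   # linear function for the EMM at dry = 5
L5 - L4                        # offset weight cancels to 0; only dry remains
```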

rvlenth commented 8 months ago

I let this sit on the back burner for a while, but FWIW, I have created a pull request for this to be added to the fixest package.

rvlenth commented 8 months ago

Closing as it's resolved in its own way.

ocolluphid commented 3 months ago

Not to reopen, but just a comment re the SEs: fixest automatically clusters the SEs on the first fixed effect if you don't specify otherwise -- that would explain the larger-than-expected SEs.

Is there a way to output the emmeans contrasts using the clustered SEs?

rvlenth commented 3 months ago

Thanks. I did figure that out eventually.

And yes: in the pull-request code I submitted, the user can specify arguments that get passed to vcov.fixest().
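A sketch of how that looks (assuming the pull-request methods are loaded and `mod` is the Fatalities model from the earlier comment): the vcov/cluster arguments are baked into the reference grid, so downstream contrasts inherit the clustered SEs.

```r
# vcov arguments passed through emmeans() reach vcov.fixest(), and the
# resulting covariance matrix is stored in the emmGrid object
emm <- emmeans(mod, ~ breath * jail, cluster = ~ state)
contrast(emm, "pairwise")   # pairwise comparisons use the clustered SEs
```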