rvlenth / emmeans

Estimated marginal means
https://rvlenth.github.io/emmeans/
341 stars 30 forks source link

The long, sad saga of offset handling in emmeans #424

Closed rvlenth closed 8 months ago

rvlenth commented 1 year ago

Describe the bug

Until recently, emmeans worked more or less as intended for models with offsets. By "more or less", we mean cases where the offset was specified as an offset() term in the model formula. But then we learned that offsets can also be specified as an offset argument in functions such as glm(). If an offset is specified via the offset argument, it is not detected or used by any version of lsmeans, or by emmeans prior to version 1.8.4.

Starting with emmeans version 1.8.4, we did try to use offsets specified in either or both places. The problem is that we didn't use them correctly, because the "fix" involved replacing the original code with new code based on the get.offset() function. This seemed like it worked, and it certainly retrieved offsets regardless of where they were specified. However, we completely lost sight of the fact that offsets usually involve predictors. We were retrieving the values of the offsets rather than the values of the predictors used for offsets. This oversight created serious errors for models requiring offsets, and also serious gaffes in the offset example in the "sophisticateed" vignette.

To reproduce

Here is a script that can be used to help understand how things went wrong:

ins <- data.frame(
    n = c(500, 1200, 100, 400, 500, 300),
    size = factor(rep(1:3,2), labels = c("S","M","L")),
    age = factor(rep(1:2, each = 3)),
    claims = c(42, 37, 1, 101, 73, 14))
mod <- glm(claims ~ size + age + offset(log(n)/3), 
           offset = 2*log(n)/3,
           data = ins, 
           family = "poisson")

library(emmeans)
message("==========\nemmeans version: ", packageVersion("emmeans"))
emm <- emmeans(mod, "size")
message("\nEMMs:")
print(predict(emm))
message("\nemm@grid:")
print(emm@grid)

To run the test, save this script in a file and source it. The model is equivalent to the model ins.glm in the "sophisticated" vignette. It has offsets of log(n)/3 (in the model formula) and 2*log(n)/3 (in the offset argument); thus the total offset is the sum of these, which is log(n) as in ins.glm.

Here are some results of running this script with various versions of emmeans.

==========
emmeans version: 1.8.0

EMMs:
[1]  0.09481157 -0.59796617 -1.66946940

emm@grid:
  size .offset. .wgt.
1    S 2.071536     2
2    M 2.071536     2
3    L 2.071536     2

==========
emmeans version: 1.8.4.1

EMMs:
[1] 3.993227 3.300450 2.228946

emm@grid:
  size .offset. .wgt.
1    S 5.969952     2
2    M 5.969952     2
3    L 5.969952     2

==========
emmeans version: 1.8.7
> emm <- emmeans(mod, "size")
> message("\nEMMs:")

EMMs:
> print(predict(emm))
[1] 4.074779 3.382002 2.310498
> message("\nemm@grid:")

emm@grid:
> print(emm@grid)
  size .offset. .wgt.
1    S 6.051504     2
2    M 6.051504     2
3    L 6.051504     2

The result for emmeans 1.8.0 illustrates what happened with all versions before 1.8.4. The offset used in predictions is 2.071536. You can verify that this is equal to $\frac13\log500$, and this verifies that the part of the offset in the offset argument was ignored. However, if the offset() term in the model formula had been the only offset, it was handled correctly.

The result for emmeans 1.8.4.1 illustrates what we get for versions 1.8.4 through 1.8.6. The offset value used there is not $\log 500$ but somewhat less. You can verify that what is given is the mean of log(n), rather than the log of the mean of n (thus, we are using the geometric mean of n instead of the arithmetic mean of n). This happened because we combined the offsets actually used in fitting the model, and calculated the mean offset, when in fact we should have calculated the offset at the mean of n. Unfortunately, this really makes hash of the discussion of the example in the "sophisticated" vignette.

The result for emmeans 1.8.7 illustrates what we think is correct. We now treat the two different offsets in different ways. The part in the model formula is treated like a prediction using the current values of the variables involved, and the part in the offset argument is considered to have been computed in advance, and that part corresponds to an additional covariate with its regression coefficient set to 1. So in this example, the offset is (1/3)*log(mean(n)) + (2/3)*mean(log(n)). This seems pretty weird, but this model is pretty weird. In practice, we would probably put the whole offset in one place or another, depending on whether we want the offset to be log(mean(n)) (use offset() term) or mean(log(n)) (offset argument).

In version 1.8.7, we added a section to the "xplanations" vignette to explain in more detail.

Thanks to Tom Loughin for pointing out the geometric-mean discrepancy in the vignette.

rvlenth commented 8 months ago

Closing because people have had time to look at it, and can still find it if they are determined enough