Closed jialianghua closed 11 months ago
Please note that the vignette says
... for a model with all interactions, "cells"-weighted EMMs are the same as the ordinary marginal means of the data.
The proviso that the model include all interactions is key. Under that condition, the predicted values on the reference grid are the cell means $\bar y_{ijk\cdots}$. If we average over the subscript $i$ with weights $n_{ijk\cdots}$, we obtain $$\frac{\sum_i n_{ijk\cdots}\cdot\bar y_{ijk\cdots}}{\sum_i n_{ijk\cdots}} = \frac{y_{+jk\cdots}}{n_{+jk\cdots}}$$ where a $+$ in a subscript indicates that we have summed over that subscript. This is the same as the raw marginal mean for the subscript combination $jk\cdots$. This argument easily generalizes to averaging over other subscripts, or over more than one subscript.
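If a numerical illustration helps, here is a minimal sketch of that identity in base R (the dataset and factors are chosen arbitrarily, just for illustration): a cell-count-weighted average of cell means reproduces the raw marginal mean.

```r
# Illustrative check of the identity above (mtcars chosen arbitrarily):
# averaging the cell means with the cell counts as weights gives the
# raw marginal means of the data.
dat <- transform(mtcars, cyl = factor(cyl), am = factor(am))

cell_means  <- with(dat, tapply(mpg, list(cyl, am), mean))    # ybar_ij
cell_counts <- with(dat, tapply(mpg, list(cyl, am), length))  # n_ij

# Average over am (columns) with weights n_ij ...
rowSums(cell_means * cell_counts) / rowSums(cell_counts)

# ... equals the ordinary marginal means of mpg by cyl
with(dat, tapply(mpg, cyl, mean))
```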
OK, thank you! I understand now that we are discussing a linear model that includes all interactions. While I am aware that this inquiry extends beyond the scope of your package, I am wondering if you could provide a mathematical proof or derivation demonstrating that the predicted values on the reference grid from a linear model with all interactions are equal to the cell means.
The model with all interactions allows a separate fitted value for each cell, determined independently of any other cell, whenever there are data in that cell. Suppose that the observations in a particular cell are $y_1,y_2,\ldots,y_n$ and let the fitted value for that cell be $a$. Then the error sum of squares for that cell is $$\sum(y_i-a)^2 = \sum[(y_i-\bar y) + (\bar y - a)]^2 = \sum(y_i-\bar y)^2 + 2(\bar y -a)\sum(y_i - \bar y) + n(\bar y - a)^2$$ The second term is zero because $\sum y_i = n\bar y$, and the third is non-negative. To minimize the sum of squares (i.e., least-squares estimation), we require that the third term also be zero, and hence that $a = \bar y$.
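A quick empirical confirmation of this fact, again as a sketch with an arbitrary dataset:

```r
# In a model with all interactions, the fitted value in each nonempty cell
# equals that cell's sample mean (dataset chosen only for illustration).
dat <- transform(mtcars, cyl = factor(cyl), am = factor(am))
fit <- lm(mpg ~ cyl * am, data = dat)

cell_means <- with(dat, tapply(mpg,         list(cyl, am), mean))
cell_fits  <- with(dat, tapply(fitted(fit), list(cyl, am), mean))  # fitted() is constant within a cell

all.equal(cell_means, cell_fits)   # TRUE (up to numerical tolerance)
```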
OK, before you ask, suppose there is a covariate (numeric predictor) $x$ in the model that interacts with all the factors. Here we require that the default reference grid is used, i.e., that there is one reference value of $x$, namely $\bar x$. The model fits a separate regression line in each cell. Refer to a standard regression text, where it is shown that each fitted regression line passes through the point of means $(\bar x, \bar y)$. For more than one covariate, extend this argument one covariate at a time.
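For illustration only (variables chosen arbitrarily), the point-of-means property can be verified cell by cell when the covariate interacts with the factor:

```r
# With a factor-by-covariate interaction, the fitted line in each cell
# passes through that cell's point of means (xbar, ybar).
dat <- transform(mtcars, cyl = factor(cyl))
fit <- lm(mpg ~ cyl * wt, data = dat)    # a separate line for each cyl level

cell_means <- aggregate(cbind(wt, mpg) ~ cyl, data = dat, FUN = mean)
cell_means$pred_at_xbar <- predict(fit, newdata = cell_means[c("cyl", "wt")])
cell_means   # pred_at_xbar matches the mpg column (each cell's ybar)
```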
Thank you Russell! So if I am understanding right, can I summarize it this way?:
Perhaps, though I hesitate to create something that could be taken as new terminology. I think I'm satisfied with the explanations in the documentation.
I think this is resolved, so am closing.
It's resolved. Thank you so much Russ!
Hi Russell,
I hope you are doing well.
I have read the documentation and vignettes of emmeans, and I found that the documents state that applying weights = "cells" in a linear model essentially reproduces the raw marginal means of the data.
To delve deeper into this concept, I conducted some empirical tests using various datasets in R, which indeed supported that assertion. The "cells"-weighted EMMs consistently matched the ordinary marginal means across all of the datasets I tested. While the empirical evidence is compelling, I would like to understand the theoretical foundation of this equivalence. I think I lack a clear intuition here: when we apply weights = "cells" in a linear model, we are averaging the linear model's predictions, whereas calculating a raw marginal mean simply involves averaging the data. I don't have a good intuition for why these two are equal. Could you provide a mathematical proof or an explanation of how the "cells" weighting of the EMMs in a linear model leads to this outcome? A theoretical perspective would greatly enhance my comprehension of the underlying principles.
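For reference, the kind of check described above looks roughly like this (a sketch only; the dataset and variables here are just an example):

```r
# Example of the comparison described above: "cells"-weighted EMMs from a
# full-interaction model versus the raw marginal means (dataset is arbitrary).
library(emmeans)

dat <- transform(mtcars, cyl = factor(cyl), am = factor(am))
fit <- lm(mpg ~ cyl * am, data = dat)            # all interactions included

emmeans(fit, "cyl", weights = "cells")           # averages the model predictions
with(dat, tapply(mpg, cyl, mean))                # averages the raw data
```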
Thank you so much for your contributions to the field and for your assistance with this query. I am a Biostatistical Data Analyst in NYC, and your package has really helped my work a lot. I look forward to your insightful response.
All the Best, JH.