rvlenth / emmeans

Estimated marginal means
https://rvlenth.github.io/emmeans/
340 stars 30 forks source link

as.data.frame collapses groups and recalculates values #480

Closed gevro closed 2 months ago

gevro commented 2 months ago

Hi, I ran a pairs() comparison with simple=factor1 and cross.adjust="holm". This gives me the results I want.

However, this separates each level of factor1 into a separate section, but now I want all these results in one simple data frame.

I tried as.data.frame on the result, but this not only collapsed into into one data frame, which is what I wanted, but it also has now recalculated all the p-values and confidence intervals as if I had used the combine=TRUE parameter, which is an entirely different analysis.

Is there a way to collapse an analysis that was performed with simple, or with by, into a single simple data frame without recalculating values as if it was a non-grouped analysis? as.data.frame doesn't seem to work.

Note: the reason I want the results from all the group levels collapsed into one simple data frame (without recalculating any values) is for ease of use for downstream plotting.

Thanks!

gevro commented 2 months ago

Update: I think I figured it out (with dplyr syntax): pairs.result %>% unlist %>% set_names(NULL) %>% as.data.frame

This removes the names of the first object of the pairs.result list so that it doesn't get prefixed to the output columns of the final data frame. That way the final columns have the same names as the original pairs.result object.

Does this make sense?

gevro commented 2 months ago

Update 2: I found that a better more robust way is pairs.result %>% summary %>% as_tibble. Does that sound right?

rvlenth commented 2 months ago

I am traveling. But I suggest trying 'rbind(pairs.result)'

gevro commented 2 months ago

That doesn't seem to work. That is also causing emmeans to recalculate as if combine=TRUE was performed.

rvlenth commented 2 months ago

You're right, that messes up the adjustments.

I'm allergic to tibbles. So I'd do this:

> pigs.lm <- lm(inverse(conc) ~ source + factor(percent), data = pigs)
> pairs(ref_grid(pigs.lm), simple = "source", cross.adjust = "holm")
percent =  9:
 contrast    estimate      SE df t.ratio p.value
 fish - soy   0.00803 0.00134 23   6.009  <.0001
 fish - skim  0.01083 0.00137 23   7.922  <.0001
 soy - skim   0.00280 0.00134 23   2.092  0.4542

percent = 12:
 contrast    estimate      SE df t.ratio p.value
 fish - soy   0.00803 0.00134 23   6.009  <.0001
 fish - skim  0.01083 0.00137 23   7.922  <.0001
 soy - skim   0.00280 0.00134 23   2.092  0.4542

percent = 15:
 contrast    estimate      SE df t.ratio p.value
 fish - soy   0.00803 0.00134 23   6.009  <.0001
 fish - skim  0.01083 0.00137 23   7.922  <.0001
 soy - skim   0.00280 0.00134 23   2.092  0.4542

percent = 18:
 contrast    estimate      SE df t.ratio p.value
 fish - soy   0.00803 0.00134 23   6.009  <.0001
 fish - skim  0.01083 0.00137 23   7.922  <.0001
 soy - skim   0.00280 0.00134 23   2.092  0.4542

Note: contrasts are still on the inverse scale 
P value adjustment: tukey method for comparing a family of 3 estimates 
Cross-group P-value adjustment: holm 
> pairs(ref_grid(pigs.lm), simple = "source", cross.adjust = "holm") |> data.frame()
      contrast percent    estimate          SE df  t.ratio      p.value
1   fish - soy       9 0.008030286 0.001336358 23 6.009082 4.606262e-05
2  fish - skim       9 0.010832220 0.001367279 23 7.922467 5.948017e-07
3   soy - skim       9 0.002801935 0.001339092 23 2.092414 4.542344e-01
4   fish - soy      12 0.008030286 0.001336358 23 6.009082 4.606262e-05
5  fish - skim      12 0.010832220 0.001367279 23 7.922467 5.948017e-07
6   soy - skim      12 0.002801935 0.001339092 23 2.092414 4.542344e-01
7   fish - soy      15 0.008030286 0.001336358 23 6.009082 4.606262e-05
8  fish - skim      15 0.010832220 0.001367279 23 7.922467 5.948017e-07
9   soy - skim      15 0.002801935 0.001339092 23 2.092414 4.542344e-01
10  fish - soy      18 0.008030286 0.001336358 23 6.009082 4.606262e-05
11 fish - skim      18 0.010832220 0.001367279 23 7.922467 5.948017e-07
12  soy - skim      18 0.002801935 0.001339092 23 2.092414 4.542344e-01
rvlenth commented 2 months ago

PS Keep in mind that a summary_emm object is already a data.frame:

> pairs(ref_grid(pigs.lm), simple = "source", cross.adjust = "holm") |> summary() -> smry
> class(smry)
[1] "summary_emm" "data.frame" 

I really want users to see the annotations so that they know what they have. So I designed the as.data.frame method to show more digits but still show the annotations.

gevro commented 2 months ago

Got it thank you!!

rvlenth commented 2 months ago

OK, but my hazy memory was that there were some additional comments about results that came out as a list, and I don't see those parts so you must have deleted them. For instance, with my same example,

pairs(ref_grid(pigs.lm), simple = "each", cross.adjust = "holm") -> comps

names(comps)
## [1] "simple contrasts for source"  "simple contrasts for percent"

Then as you described, rbind(comps) destroys the adjusted P values, and do.call("rbind", summary(comps)) gives us mismatched column names. Also, rbind(summary(comps)) still gives a list of two summaries, which is not what was intended. So there's no easy way to do what you describe.

I'm considering what might be done with this. Thinking out loud, seems like maybe adjust = "original" or something might be possible, to keep the P values from being re-adjusted.

gevro commented 2 months ago

I think the tibble solution works but to avoid tibbles, maybe a solution would be to add two options for the 'combine' parameter- one that recalculates and one that doesn't? Or some new function that does this to avoid breaking people's prior code.

mirzaghaderi commented 2 months ago

Hi, I use emmeans for compairing pairs of means and need the confidence interval (CI) for each difference. I see that CI is calculated for each individual mean but how can I get CI for the pair differences? Thanks.

rvlenth commented 2 months ago

@mirzaghaderi Use confint(<result>)

mirzaghaderi commented 2 months ago

Many thanks. Does it also work for repeated measure data? similar to paired t.test?

rvlenth commented 2 months ago

@mirzaghaderi I works for any analyses you do in the emmeans package. This is kind of off-topic relative to this issue of collapsing data frames. So if you have more questions, please email me or create a separate issue.

rvlenth commented 2 months ago

@gevro OK, I have a clearer head about this now.

  1. An important fact is that an emmGrid object has a way of storing and passing-on a desired adjustment method. However, the actual P-value adjustments are not computed until summary() is called.

  2. The other side of that coin is that once summary() is called, the resulting summary_emm object has whatever adjusted P values are in it, and those cannot be changed because we no longer have the needed information to compute adjusted P values.

  3. I had provided no rbind() method for summary_emm objects or a list thereof.

Considering (2) and (3), the solution to your question is to implement an rbind() method for summary_eml objects (a list of summary_emm objects), and due to (22), whatever P values are there will be preserved.

So I have done that. Here is the unavoidable consequence of point (1):

> pairs(ref_grid(pigs.lm), simple = "each", cross.adjust = "holm") -> comps

> comps |> rbind()
 percent source contrast              estimate      SE df t.ratio p.value
 9       .      fish - soy            0.008030 0.00134 23   6.009  0.0001
 9       .      fish - skim           0.010832 0.00137 23   7.922  <.0001
 9       .      soy - skim            0.002802 0.00134 23   2.092  1.0000
 12      .      fish - soy            0.008030 0.00134 23   6.009  0.0001
 12      .      fish - skim           0.010832 0.00137 23   7.922  <.0001
 12      .      soy - skim            0.002802 0.00134 23   2.092  1.0000
 15      .      fish - soy            0.008030 0.00134 23   6.009  0.0001
 15      .      fish - skim           0.010832 0.00137 23   7.922  <.0001
 15      .      soy - skim            0.002802 0.00134 23   2.092  1.0000
 18      .      fish - soy            0.008030 0.00134 23   6.009  0.0001
 18      .      fish - skim           0.010832 0.00137 23   7.922  <.0001
 18      .      soy - skim            0.002802 0.00134 23   2.092  1.0000
 .       fish   percent9 - percent12  0.005244 0.00142 23   3.704  0.0351
 .       fish   percent9 - percent15  0.005971 0.00151 23   3.960  0.0186
 .       fish   percent9 - percent18  0.008169 0.00171 23   4.785  0.0024
 .       fish   percent12 - percent15 0.000727 0.00147 23   0.495  1.0000
 .       fish   percent12 - percent18 0.002925 0.00165 23   1.772  1.0000
 .       fish   percent15 - percent18 0.002198 0.00174 23   1.261  1.0000
 .       soy    percent9 - percent12  0.005244 0.00142 23   3.704  0.0351
 .       soy    percent9 - percent15  0.005971 0.00151 23   3.960  0.0186
 .       soy    percent9 - percent18  0.008169 0.00171 23   4.785  0.0024
 .       soy    percent12 - percent15 0.000727 0.00147 23   0.495  1.0000
 .       soy    percent12 - percent18 0.002925 0.00165 23   1.772  1.0000
 .       soy    percent15 - percent18 0.002198 0.00174 23   1.261  1.0000
 .       skim   percent9 - percent12  0.005244 0.00142 23   3.704  0.0351
 .       skim   percent9 - percent15  0.005971 0.00151 23   3.960  0.0186
 .       skim   percent9 - percent18  0.008169 0.00171 23   4.785  0.0024
 .       skim   percent12 - percent15 0.000727 0.00147 23   0.495  1.0000
 .       skim   percent12 - percent18 0.002925 0.00165 23   1.772  1.0000
 .       skim   percent15 - percent18 0.002198 0.00174 23   1.261  1.0000

Note: contrasts are still on the inverse scale 
P value adjustment: bonferroni method for 30 tests 

We have combined the two emmGrid objects and the default adjustment method is Bonferroni. Above we see the summary of that result (because the print method shows the summary).

But if we summarize first, then we get (via a newly coded rbind method which I'll push to GitHub soon):

> comps |> summary() |> rbind()
 percent source contrast              estimate      SE df t.ratio p.value
 9       .      fish - soy            0.008030 0.00134 23   6.009  <.0001
 9       .      fish - skim           0.010832 0.00137 23   7.922  <.0001
 9       .      soy - skim            0.002802 0.00134 23   2.092  0.4542
 12      .      fish - soy            0.008030 0.00134 23   6.009  <.0001
 12      .      fish - skim           0.010832 0.00137 23   7.922  <.0001
 12      .      soy - skim            0.002802 0.00134 23   2.092  0.4542
 15      .      fish - soy            0.008030 0.00134 23   6.009  <.0001
 15      .      fish - skim           0.010832 0.00137 23   7.922  <.0001
 15      .      soy - skim            0.002802 0.00134 23   2.092  0.4542
 18      .      fish - soy            0.008030 0.00134 23   6.009  <.0001
 18      .      fish - skim           0.010832 0.00137 23   7.922  <.0001
 18      .      soy - skim            0.002802 0.00134 23   2.092  0.4542
 .       fish   percent9 - percent12  0.005244 0.00142 23   3.704  0.0179
 .       fish   percent9 - percent15  0.005971 0.00151 23   3.960  0.0097
 .       fish   percent9 - percent18  0.008169 0.00171 23   4.785  0.0013
 .       fish   percent12 - percent15 0.000727 0.00147 23   0.495  1.0000
 .       fish   percent12 - percent18 0.002925 0.00165 23   1.772  0.9359
 .       fish   percent15 - percent18 0.002198 0.00174 23   1.261  1.0000
 .       soy    percent9 - percent12  0.005244 0.00142 23   3.704  0.0179
 .       soy    percent9 - percent15  0.005971 0.00151 23   3.960  0.0097
 .       soy    percent9 - percent18  0.008169 0.00171 23   4.785  0.0013
 .       soy    percent12 - percent15 0.000727 0.00147 23   0.495  1.0000
 .       soy    percent12 - percent18 0.002925 0.00165 23   1.772  0.9359
 .       soy    percent15 - percent18 0.002198 0.00174 23   1.261  1.0000
 .       skim   percent9 - percent12  0.005244 0.00142 23   3.704  0.0179
 .       skim   percent9 - percent15  0.005971 0.00151 23   3.960  0.0097
 .       skim   percent9 - percent18  0.008169 0.00171 23   4.785  0.0013
 .       skim   percent12 - percent15 0.000727 0.00147 23   0.495  1.0000
 .       skim   percent12 - percent18 0.002925 0.00165 23   1.772  0.9359
 .       skim   percent15 - percent18 0.002198 0.00174 23   1.261  1.0000

Note: contrasts are still on the inverse scale 
Cross-group P-value adjustment: holm 

I still have it produce an summary_emm object, which inherits from data.frame. The messages shown are the common messages from the two summaries. I'm considering what's right for that aspect.

And then there's the matter of as.data.frame(). Currently the emm_list method combines them using rbind, but the summary_eml method does not. I think I probably should change that. After I decide on those things, and document the methods, then they'll get pushed up.

gevro commented 2 months ago

Got it - thanks! So basically the solution that users will end up using is basically first doing summary and then passing that to rbind?

I'm assuming that summary -> as_tibble will also work? (even though you don't like tibbles :-)

rvlenth commented 2 months ago

Regarding as_tibble(). I don't think that will work to combine a list of summaries:

> scomps |> tibble::as_tibble()
Error in `recycle_columns()`:
! Tibble columns must have compatible sizes.
• Size 12: Column `simple contrasts for source`.
• Size 18: Column `simple contrasts for percent`.
ℹ Only values of size one are recycled.
Run `rlang::last_trace()` to see where the error occurred.

The message also seems to imply that it wants to combine them side-by-side. Anyway, if you really want a tibble, I think you still need my new rbind() method (or as.data.frame() now runs rbind():

> scomps |> rbind() |> tibble::as_tibble()
# A tibble: 30 × 8
   percent source contrast    estimate      SE    df t.ratio     p.value
   <chr>   <chr>  <fct>          <dbl>   <dbl> <dbl>   <dbl>       <dbl>
 1 9       .      fish - soy   0.00803 0.00134    23    6.01 0.0000461  
 2 9       .      fish - skim  0.0108  0.00137    23    7.92 0.000000595
 3 9       .      soy - skim   0.00280 0.00134    23    2.09 0.454      
 4 12      .      fish - soy   0.00803 0.00134    23    6.01 0.0000461  
 5 12      .      fish - skim  0.0108  0.00137    23    7.92 0.000000595
 6 12      .      soy - skim   0.00280 0.00134    23    2.09 0.454      
 7 15      .      fish - soy   0.00803 0.00134    23    6.01 0.0000461  
 8 15      .      fish - skim  0.0108  0.00137    23    7.92 0.000000595
 9 15      .      soy - skim   0.00280 0.00134    23    2.09 0.454      
10 18      .      fish - soy   0.00803 0.00134    23    6.01 0.0000461  
# ℹ 20 more rows
# ℹ Use `print(n = ...)` to see more rows
gevro commented 2 months ago

Got it! Let me know once the new version is online. Thanks!

rvlenth commented 2 months ago

You may install it now from GitHub - see how under "Code"

gevro commented 2 months ago

Awesome. I'm assuming it will update on CRAN too at some point soon.