In extract_random_effects(), if the effect is a factor, no label is passed

xfim commented 3 years ago

When I try to get the random effects of a model using bam(), the effect is not fully labelled, leaving necessary factor labels not passed.

For instance, using the example in [https://m-clark.github.io/posts/2019-10-20-big-mixed-models/]:

library(lme4)
library(mgcv)
library(dplyr)
library(mixedup)

ss <- sleepstudy %>%
  mutate(Period = as.factor(ifelse(Days < 4, "Before", "After")))

ga_model = gam(
    Reaction ~  Period + s(Subject, bs = 're') + s(Period, Subject, bs = 're'),
    data = ss,
    method = 'REML'
)

extract_random_effects(ga_model) %>%
  data.frame()

I would expect that in the column with the effect the "Period" would instead read as "PeriodBefore" and "PeriodAfter", instead of only "Period", as this does not allow to recognize whether it is one or the other.

Thank you very much for the package.

m-clark commented 3 years ago

Thanks for noting this @xfim. This is really more of an mgcv issue, because bam/gam doesn't store the factor labels for the coefficients in the first place, and extract_random_effects only works with what's provided. To demonstrate:

> coef(ga_model)
         (Intercept)         PeriodBefore         s(Subject).1         s(Subject).2         s(Subject).3         s(Subject).4         s(Subject).5 
         319.2629074          -51.8875394           28.7228264          -59.4375205          -48.6033931            5.7081877            9.8441102 
        s(Subject).6         s(Subject).7         s(Subject).8         s(Subject).9        s(Subject).10        s(Subject).11        s(Subject).12 
           5.5448207           13.2677564           -2.7385305          -32.2469930           54.6258130          -17.1704188            8.5988755 
       s(Subject).13        s(Subject).14        s(Subject).15        s(Subject).16        s(Subject).17        s(Subject).18  s(Period,Subject).1 
          -4.8073688           28.0961886            5.0391359           -7.2675004           -1.5965285           14.4205392           31.5049932 
 s(Period,Subject).2  s(Period,Subject).3  s(Period,Subject).4  s(Period,Subject).5  s(Period,Subject).6  s(Period,Subject).7  s(Period,Subject).8 
         -17.2344852          -30.1831305            0.6524847          -21.7357192           -2.4121519          -13.5209369           16.3569649 
 s(Period,Subject).9 s(Period,Subject).10 s(Period,Subject).11 s(Period,Subject).12 s(Period,Subject).13 s(Period,Subject).14 s(Period,Subject).15 
          -9.3206750           14.2115743            7.9437793           -5.1889178            2.0457063            4.5461806            1.7642156 
s(Period,Subject).16 s(Period,Subject).17 s(Period,Subject).18 s(Period,Subject).19 s(Period,Subject).20 s(Period,Subject).21 s(Period,Subject).22 
          -3.1248136          -32.6665167           16.6450791           31.1379207           -3.9979001           -1.8034280           -6.7274385 
s(Period,Subject).23 s(Period,Subject).24 s(Period,Subject).25 s(Period,Subject).26 s(Period,Subject).27 s(Period,Subject).28 s(Period,Subject).29 
          19.6033070          -15.3310840          -10.8151421            8.4266726           12.0150820            1.9440903            4.8922833 
s(Period,Subject).30 s(Period,Subject).31 s(Period,Subject).32 s(Period,Subject).33 s(Period,Subject).34 s(Period,Subject).35 s(Period,Subject).36 
          -2.3886638           13.1895036          -16.8002528           -7.2459180            6.4527066            3.1946754            3.9699545

When it is a standard intercept/slope setting, matching the names is straight forward, but here there is nothing to note that there are multiple levels of Period. I could maybe do some sort of check on whether it's a factor, and if so, do something accordingly, but I'm not sure how well it would generalize. Honestly it would be better if mgcv provided the names in the first place to avoid confusion (including for Subject), and having to assume a specific order has been maintained, so I'm inclined to leave it on their end, at least for now.

There is a silver lining here though, a path to get exactly what you want. When you're doing categorical random effects like this, you can just create a single 'interaction' random effect to get the same result (I have a post here with more detail). It probably doesn't matter as much with mgcv, but at least with lme4, specifying the model this way is actually more likely to converge (and more quickly), so it's a good approach to have in your toolbox in general. The basic idea is just to create a separate single factor of the Period/Subject combination, and use it for a random effect.

library(lme4)
#> Loading required package: Matrix
library(mgcv)
#> Loading required package: nlme
#> 
#> Attaching package: 'nlme'
#> The following object is masked from 'package:lme4':
#> 
#>     lmList
#> This is mgcv 1.8-33. For overview type 'help("mgcv-package")'.
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following object is masked from 'package:nlme':
#> 
#>     collapse
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(mixedup)

ss <- sleepstudy %>%
  mutate(Period = as.factor(ifelse(Days < 4, "Before", "After")))

ga_model = gam(
  Reaction ~  Period + s(Subject, bs = 're') + s(Period, Subject, bs = 're'),
  data = ss,
  method = 'REML'
)

# extract_random_effects(ga_model) %>%
#   data.frame()

ss = ss %>%
  mutate(
    Period  = as.factor(ifelse(Days < 4, "Before", "After")),
    Period2 = interaction(Subject, Period)
    )

ga_model2 = gam(
  Reaction ~  Period + s(Subject, bs = 're') + s(Period2, bs = 're'),  # note that subject is not needed again in the second effect
  data = ss,
  method = 'REML'
)

period_effects1 = extract_random_effects(ga_model) %>% 
  filter(effect == 'Period')

period_effects2 = extract_random_effects(ga_model2) %>% 
  filter(grepl(group, pattern = 'Before|After')) %>% 
  arrange(group)

bind_cols(period_effects1, period_effects2) %>% 
  select(matches('group|effect|value'))
#> New names:
#> * group_var -> group_var...1
#> * effect -> effect...2
#> * group -> group...3
#> * value -> value...4
#> * se -> se...5
#> * ...
# A tibble: 36 x 8
   group_var...1 effect...2 group...3 value...4 group_var...8 effect...9 group...10 value...11
   <chr>         <chr>      <chr>         <dbl> <chr>         <chr>      <chr>           <dbl>
 1 Subject       Period     308          31.5   Period2       Intercept  308.After      31.5  
 2 Subject       Period     308         -17.2   Period2       Intercept  308.Before    -17.2  
 3 Subject       Period     309         -30.2   Period2       Intercept  309.After     -30.2  
 4 Subject       Period     309           0.652 Period2       Intercept  309.Before      0.652
 5 Subject       Period     310         -21.7   Period2       Intercept  310.After     -21.7  
 6 Subject       Period     310          -2.41  Period2       Intercept  310.Before     -2.41 
 7 Subject       Period     330         -13.5   Period2       Intercept  330.After     -13.5  
 8 Subject       Period     330          16.4   Period2       Intercept  330.Before     16.4  
 9 Subject       Period     331          -9.32  Period2       Intercept  331.After      -9.32 
10 Subject       Period     331          14.2   Period2       Intercept  331.Before     14.2  
# … with 26 more rows

As you can see, the results are the same, and the new label is clear. Model summaries likewise are the same.

xfim commented 3 years ago

This does the trick. Thank you @m-clark .

m-clark / mixedup

In extract_random_effects(), if the effect is a factor, no label is passed #28