strengejacke / sjstats

Effect size measures and significance tests
https://strengejacke.github.io/sjstats
189 stars 21 forks source link

rms models with nonlinear and interaction terms and omega_sq() #22

Closed Deleetdk closed 4 years ago

Deleetdk commented 6 years ago

Per https://github.com/strengejacke/sjstats/issues/21, there is now support for rms models. Hurray! Example with linear model:

library(pacman)
p_load(rms, sjstats)

ols(Sepal.Width ~ Petal.Width, data = iris) %>%
  anova() %>% 
  omega_sq(ci.lvl = .95)

# A tibble: 2 x 4
  term        omega_sq conf.low conf.high
  <chr>          <dbl>    <dbl>     <dbl>
1 Petal.Width    0.112   0.0475     0.235
2 TOTAL          0.112   0.0475     0.235

However, the output can be odd when there are nonlinear because rms inserts these as extra rows:

ols(Sepal.Width ~ rcs(Petal.Width) + rcs(Petal.Length), data = iris) %>%
  anova() %>% 
  omega_sq(ci.lvl = .95)

# A tibble: 6 x 4
  term             omega_sq conf.low conf.high
  <chr>               <dbl>    <dbl>     <dbl>
1 Petal.Width     -0.000931   NA        0.0597
2 " Nonlinear"    -0.00894    NA       NA     
3 Petal.Length     0.00776    NA        0.0887
4 " Nonlinear"     0.0108     NA        0.0951
5 TOTAL NONLINEAR  0.156       0.127    0.342 
6 TOTAL            0.314       0.263    0.480 

Note the presence of missing values. And interactions:

ols(Sepal.Width ~ Petal.Width * Petal.Length, data = iris) %>%
  anova() %>% 
  omega_sq(ci.lvl = .95)

# A tibble: 6 x 4
  term                                                      omega_sq conf.low conf.high
  <chr>                                                        <dbl>    <dbl>     <dbl>
1 Petal.Width  (Factor+Higher Order Factors)                  0.106     0.156     0.375
2 " All Interactions"                                         0.0936    0.137     0.352
3 Petal.Length  (Factor+Higher Order Factors)                 0.129     0.193     0.413
4 " All Interactions"                                         0.0936    0.137     0.352
5 Petal.Width * Petal.Length  (Factor+Higher Order Factors)   0.0936    0.137     0.352
6 TOTAL                                                       0.190     0.279     0.494

It gets very messy with both:

ols(Sepal.Width ~ rcs(Petal.Width) * rcs(Petal.Length), data = iris, penalty = 1) %>%
  anova() %>% 
  omega_sq(ci.lvl = .95)

# A tibble: 15 x 4
   term                                                      omega_sq conf.low conf.high
   <chr>                                                        <dbl>    <dbl>     <dbl>
 1 Petal.Width  (Factor+Higher Order Factors)                 -0.0337  NA        NA     
 2 " All Interactions"                                        -0.0419  NA        NA     
 3 " Nonlinear (Factor+Higher Order Factors)"                 -0.0397  NA        NA     
 4 Petal.Length  (Factor+Higher Order Factors)                -0.0231  NA         0.0130
 5 " All Interactions"                                        -0.0419  NA        NA     
 6 " Nonlinear (Factor+Higher Order Factors)"                 -0.0304  NA        NA     
 7 Petal.Width * Petal.Length  (Factor+Higher Order Factors)  -0.0419  NA        NA     
 8 " Nonlinear"                                               -0.0391  NA        NA     
 9 " Nonlinear Interaction : f(A,B) vs. AB"                   -0.0391  NA        NA     
10 " f(A,B) vs. Af(B) + Bg(A)"                                -0.0253  NA        NA     
11 " Nonlinear Interaction in Petal.Width vs. Af(B)"          -0.0332  NA        NA     
12 " Nonlinear Interaction in Petal.Length vs. Bg(A)"         -0.0321  NA        NA     
13 TOTAL NONLINEAR                                            -0.0226  NA         0.0152
14 TOTAL NONLINEAR + INTERACTION                               0.0925   0.0234    0.146 
15 TOTAL                                                       0.237    0.100     0.256 

The output here seems impossible. Is it because omega_sq interprets every row as being a predictor thus causing double counting and impossible values? A solution here is to exclude rows based on regex matching of the extra lines, though in rare cases with oddly named terms (one could use a term called Nonlinear, notice the initial space), this could result in errors, but it seems unlikely to be a practical problem.

Total omega² vs. adj. R² vs. R²

I notice that the numbers do not match:

Model 1 (above). R2 = .132, adj. R2 = .128, omega² = .112. There are no other predictors, so the variance should be allocated to the only predictor there is. Note however that the output from anova() call has an odd-looking row, which looks like it is just the intercept:

ols(Sepal.Width ~ Petal.Width, data = iris) %>%
  anova()

                Analysis of Variance          Response: Sepal.Width 

 Factor      d.f. Partial SS MS        F     P     
 Petal.Width   1   3.794493  3.7944934 22.91 <.0001
 REGRESSION    1   3.794493  3.7944934 22.91 <.0001
 ERROR       148  24.512440  0.1656246             

Second model: R² = .440, adj. R² = .408, omega² = .314. Third model: R² = .416, adj. R² = .404, omega² = .190. Fourth model: R² = .431, adj. R² = .406, omega² = .237.

I also note that eta_sq() output does not match either R² or R² adj. either.

Deleetdk commented 6 years ago

Here's a comparison with car::Anova() results, which as far as I can tell from Harell's reply, should be the same for the simple cases without interaction. I updated the package to latest github before running this (https://github.com/strengejacke/sjstats/commit/c90f49ca725a026c79042a82ebf9b185aa85273a).

> library(pacman)
> p_load(rms, sjstats)
> 
> ols(Sepal.Width ~ Petal.Width, data = iris) %>%
+   anova() %>% 
+   omega_sq(ci.lvl = .95)
# A tibble: 2 x 4
  term        omegasq conf.low conf.high
  <chr>         <dbl>    <dbl>     <dbl>
1 Petal.Width   0.112   0.0475     0.235
2 TOTAL         0.112   0.0475     0.235
> 
> lm(Sepal.Width ~ Petal.Width, data = iris) %>%
+   car::Anova() %>% 
+   omega_sq(ci.lvl = .95)
# A tibble: 1 x 4
  term        omegasq conf.low conf.high
  <chr>         <dbl>    <dbl>     <dbl>
1 Petal.Width   0.127   0.0478     0.236
> 
> #two predictors
> ols(Sepal.Width ~ Petal.Width + Petal.Length, data = iris) %>%
+   anova() %>% 
+   omega_sq(ci.lvl = .95)
# A tibble: 3 x 4
  term         omegasq conf.low conf.high
  <chr>          <dbl>    <dbl>     <dbl>
1 Petal.Width   0.0217 0.000514     0.110
2 Petal.Length  0.0662 0.0215       0.184
3 TOTAL         0.182  0.100        0.313
> 
> lm(Sepal.Width ~ Petal.Width + Petal.Length, data = iris) %>%
+   car::Anova() %>% 
+   omega_sq(ci.lvl = .95)
# A tibble: 2 x 4
  term         omegasq conf.low conf.high
  <chr>          <dbl>    <dbl>     <dbl>
1 Petal.Width   0.0269 0.000521     0.111
2 Petal.Length  0.0818 0.0218       0.186
> 
> #two predictors w interaction
> ols(Sepal.Width ~ Petal.Width * Petal.Length, data = iris) %>%
+   anova() %>% 
+   omega_sq(ci.lvl = .95)
# A tibble: 6 x 4
  term                                                      omegasq conf.low conf.high
  <chr>                                                       <dbl>    <dbl>     <dbl>
1 Petal.Width  (Factor+Higher Order Factors)                 0.106     0.156     0.375
2 " All Interactions"                                        0.0936    0.137     0.352
3 Petal.Length  (Factor+Higher Order Factors)                0.129     0.193     0.413
4 " All Interactions"                                        0.0936    0.137     0.352
5 Petal.Width * Petal.Length  (Factor+Higher Order Factors)  0.0936    0.137     0.352
6 TOTAL                                                      0.190     0.279     0.494
> 
> lm(Sepal.Width ~ Petal.Width * Petal.Length, data = iris) %>%
+   car::Anova() %>% 
+   omega_sq(ci.lvl = .95)
# A tibble: 3 x 4
  term                     omegasq conf.low conf.high
  <chr>                      <dbl>    <dbl>     <dbl>
1 Petal.Width               0.0284  0.00347     0.128
2 Petal.Length              0.0834  0.0374      0.218
3 Petal.Width:Petal.Length  0.221   0.143       0.363

As can be seen, the car::Anova() results don't match exactly even for simple 1 predictor model.

strengejacke commented 6 years ago

The functions in sjstats rely on the tidy output of the related anova() functions, so the differences should not be due to computations in sjstats, but rather on the results from the anova()-call from those packages...

I'll look into this issue the next days...

Deleetdk commented 6 years ago

@strengejacke Any news on this? My study is stuck until this can be resolved. Perhaps I will need to switch to linear terms (no rcs for the purpose of computing the relative variable metrics) which would be a little annoying.

I would be happy to give a bounty for the resolvement of the issue, say, 100 USD.

harrelfe commented 6 years ago

You'll need to do the matrix operations on the anova() output to create a suitable object.

strengejacke commented 6 years ago

Any suggestions on how to implement this or even a PR are welcome! :-)

harrelfe commented 6 years ago

The matrix operations are very easy. I routinely add and drop rows from anova output. I think there is an example you can see when you type ?anova.rms. Putting this into the tidy workflow is up to users such as yourself, since I don't use tidy.

strengejacke commented 6 years ago

@Deleetdk Does this help you? I have not much experience with Anova methods, as I usually don't use them. I would leave it to users to propose a fix for this issue. At the moment I'm not exactly sure how to address this issue and I don't know when I will find the time to investigate further.

harrelfe commented 6 years ago

Please don't say "fix" as nothing is broken about this in the rms package. It is an issue of developing the right interface between rms and the tidyverse, for anyone motivated to do so. I am not motivated, because your approach leaves out important information from rms::anova.

strengejacke commented 6 years ago

Sorry if this was confusing, by "fix" I meant this issue, i.e. revise my code so that it also works with more complex functions like rms.anova. I think this is not specifically linked to the tidyverse, it's that my function returns a data frame that is tidy in the sense that its structure is always the same for all input.

strengejacke commented 6 years ago

And, of course, I was not specifically asking you to do it.

harrelfe commented 6 years ago

Thanks very much for that.

Deleetdk commented 6 years ago

@strengejacke Any progress on this? My bounty from before still stands, and I'm waiting for this issue. I'll post a public tweet and see if someone will submit a PR.

strengejacke commented 6 years ago

No, sorry. This currently has lower priority for me as I'm not so familiar with ANOVA and how to properly process the more complex objects returned by rms.

strengejacke commented 4 years ago

closing in favor of https://github.com/easystats/effectsize/issues/67