tidymodels / multilevelmod

Parsnip wrappers for mixed-level and hierarchical models
https://multilevelmod.tidymodels.org/
Other
74 stars 3 forks source link

Adding GEE models? #5

Closed shah-in-boots closed 3 years ago

shah-in-boots commented 3 years ago

Hi. I've started working on making GEE model specifications using the gee package to be compatible with parsnip. I'm wondering if its better than, instead of keeping them just in my own loading functions, to integrate it with pre-existing packages, but not sure if:

  1. GEEs have already been added
  2. If this is the right repository/package for that topic.

GLMs and GEE are similar to me, but have different interpretations. I would like to be able to add population-based interpretations (instead of cluster-specific subject interpretations).

topepo commented 3 years ago

Overall, I'm all for adding gee (and this is the right place).

I'd need to figure out a good api since it doesn't specify the clustering variable via a formula.

shah-in-boots commented 3 years ago

I tried it out in an independent parsnip model to see what issues would arise...

  1. The core model parameters may be similar to that of the linear_reg and logistic_reg functions, but I haven't explored whats the best way to integrate them
  2. The ability to pass the clustering variable seems to be the problem at the level of model.frame (which is likely the issue that would be resolved by a formula that allowed for specification of clustering
f <- mpg ~ wt
gee_reg(cluster = cyl, correlation = "independence") %>%
    set_engine("geepack") %>%
    fit(f, data = mtcars)
# Error in model.frame.default(formula = mpg ~ wt, data = data, id = ~cyl) : 
# invalid type (language) for variable '(id)'
# Timing stopped at: 0 0 0

If you have any thoughts, would like to be able to add these into the modeling options in the future. It seems like a formula could be used, and could capture the clustering variable at the time of model fit.

topepo commented 3 years ago

I worked on this today. One possible solution is to use the formula method and specify the id there:

# example from ?gee
linear_reg() %>% 
   set_engine("gee", corstr = "exchangeable") %>% 
   fit(breaks ~ tension + id_var(wool), data = warpbreaks)
topepo commented 3 years ago

I've got it working for linear regression. If you want to test it out:

remotes::install_github("tidymodels/multilevelmod@gee-models")

To recap: gee::gee() specifies the id/cluster variable using an argument id that requires a vector. parsnip doesn't work that way so we enable this model to be fit using an artificial function called id_var() to be used in the formula. So, in the original package, the call would look like:

gee(breaks ~ tension, id = wool, data = warpbreaks, corstr = "exchangeable")

With parsnip, you can use the formula breaks ~ tension + id_var(wool).

Here's an example or two:

library(tidymodels)
#> ── Attaching packages ────────────────────────────────────── tidymodels 0.1.2 ──
#> ✓ broom     0.7.2          ✓ recipes   0.1.15    
#> ✓ dials     0.0.9.9000     ✓ rsample   0.0.8     
#> ✓ dplyr     1.0.2          ✓ tibble    3.0.4     
#> ✓ ggplot2   3.3.2          ✓ tidyr     1.1.2     
#> ✓ infer     0.5.3          ✓ tune      0.1.2     
#> ✓ modeldata 0.1.0          ✓ workflows 0.2.1     
#> ✓ parsnip   0.1.4.9000     ✓ yardstick 0.0.7     
#> ✓ purrr     0.3.4
#> ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
#> x purrr::discard() masks scales::discard()
#> x dplyr::filter()  masks stats::filter()
#> x dplyr::lag()     masks stats::lag()
#> x recipes::step()  masks stats::step()
library(multilevelmod)

data(sleepstudy, package = "lme4")

gee_spec <- 
  linear_reg() %>% 
  set_engine("gee") 

gee_fit <- 
  gee_spec %>% 
  fit(Reaction ~ Days + id_var(Subject), data = sleepstudy)
#> Beginning Cgee S-function, @(#) geeformula.q 4.13 98/01/27
#> running glm to get initial regression estimate
#> (Intercept)        Days 
#>   251.40510    10.46729
gee_fit
#> parsnip model object
#> 
#> Fit time:  11ms 
#> 
#>  GEE:  GENERALIZED LINEAR MODELS FOR DEPENDENT DATA
#>  gee S-function, version 4.13 modified 98/01/27 (1998) 
#> 
#> Model:
#>  Link:                      Identity 
#>  Variance to Mean Relation: Gaussian 
#>  Correlation Structure:     Independent 
#> 
#> Call:
#> gee::gee(formula = Reaction ~ Days, id = data$Subject, data = data, 
#>     family = gaussian)
#> 
#> Number of observations :  180 
#> 
#> Maximum cluster size   :  10 
#> 
#> 
#> Coefficients:
#> (Intercept)        Days 
#>   251.40510    10.46729 
#> 
#> Estimated Scale Parameter:  2276.694
#> Number of Iterations:  1
#> 
#> Working Correlation[1:4,1:4]
#>      [,1] [,2] [,3] [,4]
#> [1,]    1    0    0    0
#> [2,]    0    1    0    0
#> [3,]    0    0    1    0
#> [4,]    0    0    0    1
#> 
#> 
#> Returned Error Value:
#> [1] 0

# change correlation stucture: 

exch_spec <- 
  linear_reg() %>% 
  set_engine("gee", corstr = "exchangeable") 

exch_fit <- 
  exch_spec %>% 
  fit(Reaction ~ Days + id_var(Subject), data = sleepstudy)
#> Beginning Cgee S-function, @(#) geeformula.q 4.13 98/01/27
#> running glm to get initial regression estimate
#> (Intercept)        Days 
#>   251.40510    10.46729
exch_fit
#> parsnip model object
#> 
#> Fit time:  4ms 
#> 
#>  GEE:  GENERALIZED LINEAR MODELS FOR DEPENDENT DATA
#>  gee S-function, version 4.13 modified 98/01/27 (1998) 
#> 
#> Model:
#>  Link:                      Identity 
#>  Variance to Mean Relation: Gaussian 
#>  Correlation Structure:     Exchangeable 
#> 
#> Call:
#> gee::gee(formula = Reaction ~ Days, id = data$Subject, data = data, 
#>     family = gaussian, corstr = "exchangeable")
#> 
#> Number of observations :  180 
#> 
#> Maximum cluster size   :  10 
#> 
#> 
#> Coefficients:
#> (Intercept)        Days 
#>   251.40510    10.46729 
#> 
#> Estimated Scale Parameter:  2276.694
#> Number of Iterations:  1
#> 
#> Working Correlation[1:4,1:4]
#>           [,1]      [,2]      [,3]      [,4]
#> [1,] 1.0000000 0.5710385 0.5710385 0.5710385
#> [2,] 0.5710385 1.0000000 0.5710385 0.5710385
#> [3,] 0.5710385 0.5710385 1.0000000 0.5710385
#> [4,] 0.5710385 0.5710385 0.5710385 1.0000000
#> 
#> 
#> Returned Error Value:
#> [1] 0

Created on 2020-12-01 by the reprex package (v0.3.0)

shah-in-boots commented 3 years ago

It seems to be working for me. I'm happy to do a pull request and see if I can use your gee_fit and gee_formula to help make this work for the logistic regression component as well, or whatever else is needed (e.g. adding geepack as well as gee to the regression engines, as they use slightly different approaches for parameter estimations.

I like in particular your fit approach - there are several other regression types that use this clustering feature that is difficult to pass information through in advance. I think all this needs is a good set of tidiers and it should work.

library(multilevelmod)
#> Loading required package: parsnip
library(magrittr)
linear_reg() %>%
    set_engine("gee", corstr = "independence") %>%
    fit(mpg ~ wt + id_var(cyl), data = mtcars)
#> Beginning Cgee S-function, @(#) geeformula.q 4.13 98/01/27
#> running glm to get initial regression estimate
#> (Intercept)          wt 
#>   37.285126   -5.344472
#> parsnip model object
#> 
#> Fit time:  10ms 
#> 
#>  GEE:  GENERALIZED LINEAR MODELS FOR DEPENDENT DATA
#>  gee S-function, version 4.13 modified 98/01/27 (1998) 
#> 
#> Model:
#>  Link:                      Identity 
#>  Variance to Mean Relation: Gaussian 
#>  Correlation Structure:     Independent 
#> 
#> Call:
#> gee::gee(formula = mpg ~ wt, id = data$cyl, data = data, family = gaussian, 
#>     corstr = "independence")
#> 
#> Number of observations :  32 
#> 
#> Maximum cluster size   :  6 
#> 
#> 
#> Coefficients:
#> (Intercept)          wt 
#>   37.285126   -5.344472 
#> 
#> Estimated Scale Parameter:  9.277398
#> Number of Iterations:  1
#> 
#> Working Correlation[1:4,1:4]
#>      [,1] [,2] [,3] [,4]
#> [1,]    1    1    1    0
#> [2,]    0    0    0    0
#> [3,]    0    0    0    0
#> [4,]    0    0    0    0
#> 
#> 
#> Returned Error Value:
#> [1] 0
topepo commented 3 years ago

You are welcome to take what's in my branch and run with it. I haven't done anything with logistic or Poisson regression in multilevelmod yet but that is the plan. Those should be very similar to what is in multilevelmod (and you can also consult what we did in parsnip for those models).

I'm happy to help if anything is unclear (my documentation can always be better) or if you get stuck.

gee objects are sub-classed as glm so those tiders should kick in automatically. Unfortunately:

> class(exch_fit$fit)
[1] "gee" "glm"
> tidy(exch_fit$fit)
Error: No tidy method for objects of class gee

Perhaps @alexpghayes knows why the method dispatch does not forward to tidy.glm(). It would probably fail anyway since BDR does not use column names that are compatible with glm objects:

> summary(exch_fit$fit)$coef
             Estimate Naive S.E.  Naive z Robust S.E. Robust z
(Intercept) 251.40510  9.5378078 26.35879    6.632277  37.9063
Days         10.46729  0.8109579 12.90731    1.502237   6.9678

(╯°□°)╯︵ ┻━┻

alexpghayes commented 3 years ago

Perhaps @alexpghayes knows why the method dispatch does not forward to tidy.glm().

Yeah this is because people subclass glm all the time without actually behaving like a glm object so we error out purposefully when we detect this is the case. Previous behaviour will vary with your version of broom. We updated the behaviour for lm objects recently and should do the same for glm objects (see https://github.com/tidymodels/broom/blob/master/R/stats-lm-tidiers.R#L83). cc @simonpcouch

simonpcouch commented 3 years ago

The method dispatch doesn't forward to tidy.glm because the gee tidiers are explicitly overwritten with the *.default methods to error out. As @topepo predicted, though, they do error out if the *.glm tidiers are called on the gee object.

The PR linked above proposes an approach to error more informatively here. :-)

shah-in-boots commented 3 years ago

7 seems to be up and running. The last issue is the predict() function, which I still will have to write from scratch since its not built in to the original dispatch for gee::gee().

topepo commented 3 years ago

I think that you can follow the existing code for glm models. Since it inherits that method, there would not be a specific predict() method.

topepo commented 3 years ago

I think that you can follow the existing code for glm models. Since it inherits that method, there would not be a specific predict() method.

Well that was uncharacteristically naive of me.

PR was merged.

github-actions[bot] commented 3 years ago

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.