Closed shah-in-boots closed 3 years ago
Overall, I'm all for adding gee
(and this is the right place).
I'd need to figure out a good api since it doesn't specify the clustering variable via a formula.
I tried it out in an independent parsnip
model to see what issues would arise...
linear_reg
and logistic_reg
functions, but I haven't explored whats the best way to integrate themf <- mpg ~ wt
gee_reg(cluster = cyl, correlation = "independence") %>%
set_engine("geepack") %>%
fit(f, data = mtcars)
# Error in model.frame.default(formula = mpg ~ wt, data = data, id = ~cyl) :
# invalid type (language) for variable '(id)'
# Timing stopped at: 0 0 0
If you have any thoughts, would like to be able to add these into the modeling options in the future. It seems like a formula could be used, and could capture the clustering variable at the time of model fit.
I worked on this today. One possible solution is to use the formula method and specify the id there:
# example from ?gee
linear_reg() %>%
set_engine("gee", corstr = "exchangeable") %>%
fit(breaks ~ tension + id_var(wool), data = warpbreaks)
I've got it working for linear regression. If you want to test it out:
remotes::install_github("tidymodels/multilevelmod@gee-models")
To recap: gee::gee()
specifies the id/cluster variable using an argument id
that requires a vector. parsnip
doesn't work that way so we enable this model to be fit using an artificial function called id_var()
to be used in the formula. So, in the original package, the call would look like:
gee(breaks ~ tension, id = wool, data = warpbreaks, corstr = "exchangeable")
With parsnip
, you can use the formula breaks ~ tension + id_var(wool)
.
Here's an example or two:
library(tidymodels)
#> ── Attaching packages ────────────────────────────────────── tidymodels 0.1.2 ──
#> ✓ broom 0.7.2 ✓ recipes 0.1.15
#> ✓ dials 0.0.9.9000 ✓ rsample 0.0.8
#> ✓ dplyr 1.0.2 ✓ tibble 3.0.4
#> ✓ ggplot2 3.3.2 ✓ tidyr 1.1.2
#> ✓ infer 0.5.3 ✓ tune 0.1.2
#> ✓ modeldata 0.1.0 ✓ workflows 0.2.1
#> ✓ parsnip 0.1.4.9000 ✓ yardstick 0.0.7
#> ✓ purrr 0.3.4
#> ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
#> x purrr::discard() masks scales::discard()
#> x dplyr::filter() masks stats::filter()
#> x dplyr::lag() masks stats::lag()
#> x recipes::step() masks stats::step()
library(multilevelmod)
data(sleepstudy, package = "lme4")
gee_spec <-
linear_reg() %>%
set_engine("gee")
gee_fit <-
gee_spec %>%
fit(Reaction ~ Days + id_var(Subject), data = sleepstudy)
#> Beginning Cgee S-function, @(#) geeformula.q 4.13 98/01/27
#> running glm to get initial regression estimate
#> (Intercept) Days
#> 251.40510 10.46729
gee_fit
#> parsnip model object
#>
#> Fit time: 11ms
#>
#> GEE: GENERALIZED LINEAR MODELS FOR DEPENDENT DATA
#> gee S-function, version 4.13 modified 98/01/27 (1998)
#>
#> Model:
#> Link: Identity
#> Variance to Mean Relation: Gaussian
#> Correlation Structure: Independent
#>
#> Call:
#> gee::gee(formula = Reaction ~ Days, id = data$Subject, data = data,
#> family = gaussian)
#>
#> Number of observations : 180
#>
#> Maximum cluster size : 10
#>
#>
#> Coefficients:
#> (Intercept) Days
#> 251.40510 10.46729
#>
#> Estimated Scale Parameter: 2276.694
#> Number of Iterations: 1
#>
#> Working Correlation[1:4,1:4]
#> [,1] [,2] [,3] [,4]
#> [1,] 1 0 0 0
#> [2,] 0 1 0 0
#> [3,] 0 0 1 0
#> [4,] 0 0 0 1
#>
#>
#> Returned Error Value:
#> [1] 0
# change correlation stucture:
exch_spec <-
linear_reg() %>%
set_engine("gee", corstr = "exchangeable")
exch_fit <-
exch_spec %>%
fit(Reaction ~ Days + id_var(Subject), data = sleepstudy)
#> Beginning Cgee S-function, @(#) geeformula.q 4.13 98/01/27
#> running glm to get initial regression estimate
#> (Intercept) Days
#> 251.40510 10.46729
exch_fit
#> parsnip model object
#>
#> Fit time: 4ms
#>
#> GEE: GENERALIZED LINEAR MODELS FOR DEPENDENT DATA
#> gee S-function, version 4.13 modified 98/01/27 (1998)
#>
#> Model:
#> Link: Identity
#> Variance to Mean Relation: Gaussian
#> Correlation Structure: Exchangeable
#>
#> Call:
#> gee::gee(formula = Reaction ~ Days, id = data$Subject, data = data,
#> family = gaussian, corstr = "exchangeable")
#>
#> Number of observations : 180
#>
#> Maximum cluster size : 10
#>
#>
#> Coefficients:
#> (Intercept) Days
#> 251.40510 10.46729
#>
#> Estimated Scale Parameter: 2276.694
#> Number of Iterations: 1
#>
#> Working Correlation[1:4,1:4]
#> [,1] [,2] [,3] [,4]
#> [1,] 1.0000000 0.5710385 0.5710385 0.5710385
#> [2,] 0.5710385 1.0000000 0.5710385 0.5710385
#> [3,] 0.5710385 0.5710385 1.0000000 0.5710385
#> [4,] 0.5710385 0.5710385 0.5710385 1.0000000
#>
#>
#> Returned Error Value:
#> [1] 0
Created on 2020-12-01 by the reprex package (v0.3.0)
It seems to be working for me. I'm happy to do a pull request and see if I can use your gee_fit and gee_formula to help make this work for the logistic regression component as well, or whatever else is needed (e.g. adding geepack
as well as gee
to the regression engines, as they use slightly different approaches for parameter estimations.
I like in particular your fit approach - there are several other regression types that use this clustering feature that is difficult to pass information through in advance. I think all this needs is a good set of tidiers and it should work.
library(multilevelmod)
#> Loading required package: parsnip
library(magrittr)
linear_reg() %>%
set_engine("gee", corstr = "independence") %>%
fit(mpg ~ wt + id_var(cyl), data = mtcars)
#> Beginning Cgee S-function, @(#) geeformula.q 4.13 98/01/27
#> running glm to get initial regression estimate
#> (Intercept) wt
#> 37.285126 -5.344472
#> parsnip model object
#>
#> Fit time: 10ms
#>
#> GEE: GENERALIZED LINEAR MODELS FOR DEPENDENT DATA
#> gee S-function, version 4.13 modified 98/01/27 (1998)
#>
#> Model:
#> Link: Identity
#> Variance to Mean Relation: Gaussian
#> Correlation Structure: Independent
#>
#> Call:
#> gee::gee(formula = mpg ~ wt, id = data$cyl, data = data, family = gaussian,
#> corstr = "independence")
#>
#> Number of observations : 32
#>
#> Maximum cluster size : 6
#>
#>
#> Coefficients:
#> (Intercept) wt
#> 37.285126 -5.344472
#>
#> Estimated Scale Parameter: 9.277398
#> Number of Iterations: 1
#>
#> Working Correlation[1:4,1:4]
#> [,1] [,2] [,3] [,4]
#> [1,] 1 1 1 0
#> [2,] 0 0 0 0
#> [3,] 0 0 0 0
#> [4,] 0 0 0 0
#>
#>
#> Returned Error Value:
#> [1] 0
You are welcome to take what's in my branch and run with it. I haven't done anything with logistic or Poisson regression in multilevelmod
yet but that is the plan. Those should be very similar to what is in multilevelmod
(and you can also consult what we did in parsnip
for those models).
I'm happy to help if anything is unclear (my documentation can always be better) or if you get stuck.
gee
objects are sub-classed as glm
so those tiders should kick in automatically. Unfortunately:
> class(exch_fit$fit)
[1] "gee" "glm"
> tidy(exch_fit$fit)
Error: No tidy method for objects of class gee
Perhaps @alexpghayes knows why the method dispatch does not forward to tidy.glm()
. It would probably fail anyway since BDR does not use column names that are compatible with glm
objects:
> summary(exch_fit$fit)$coef
Estimate Naive S.E. Naive z Robust S.E. Robust z
(Intercept) 251.40510 9.5378078 26.35879 6.632277 37.9063
Days 10.46729 0.8109579 12.90731 1.502237 6.9678
(╯°□°)╯︵ ┻━┻
Perhaps @alexpghayes knows why the method dispatch does not forward to tidy.glm().
Yeah this is because people subclass glm
all the time without actually behaving like a glm
object so we error out purposefully when we detect this is the case. Previous behaviour will vary with your version of broom
. We updated the behaviour for lm
objects recently and should do the same for glm
objects (see https://github.com/tidymodels/broom/blob/master/R/stats-lm-tidiers.R#L83). cc @simonpcouch
The method dispatch doesn't forward to tidy.glm
because the gee
tidiers are explicitly overwritten with the *.default
methods to error out. As @topepo predicted, though, they do error out if the *.glm
tidiers are called on the gee
object.
The PR linked above proposes an approach to error more informatively here. :-)
predict()
function, which I still will have to write from scratch since its not built in to the original dispatch for gee::gee()
.I think that you can follow the existing code for glm
models. Since it inherits that method, there would not be a specific predict()
method.
I think that you can follow the existing code for
glm
models. Since it inherits that method, there would not be a specificpredict()
method.
Well that was uncharacteristically naive of me.
PR was merged.
This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.
Hi. I've started working on making GEE model specifications using the
gee
package to be compatible withparsnip
. I'm wondering if its better than, instead of keeping them just in my own loading functions, to integrate it with pre-existing packages, but not sure if:GLMs and GEE are similar to me, but have different interpretations. I would like to be able to add population-based interpretations (instead of cluster-specific subject interpretations).