lme4 / lme4

Mixed-effects models in R using S4 classes and methods with RcppEigen
Other
621 stars 146 forks source link

Calling MixedModels.lmm from R using JuliaCall #460

Open dmbates opened 6 years ago

dmbates commented 6 years ago

Would it be worthwhile my writing up a notebook on how to call the MixedModels package functions lmm and glmm from within R? There is an R package JuliaCall and I could try, perhaps with help from @Non-Contradiction, to work out some examples of fitting models using either lmm or glmm.

I know how to call R from Julia, having been one of the authors of the RCall package, but I haven't had much experience calling Julia from R.

One question would be what to do with the fitted model. The simplest thing is to pass the converged values of θ parameters back and use them in a lmer or glmer structure but there may be more sophisticated approaches.

dmbates commented 6 years ago

I am able to fit a model using MixedModels through JuliaCall. Right now I am having some difficulties with the initial call to julia_setup() hanging sometimes and I haven't tracked that down but the results are encouraging. I'll probably check in with @Non-Contradiction and perhaps with @randy3k (the RCall expert) and exchange ideas about how best to accomplish this. @bbolker and @mmaechler - do you want to be part of the discussion?

Non-Contradiction commented 6 years ago

What is the problem in julia_setup? Maybe open an issue for JuliaCall?

dmbates commented 6 years ago

Cross-ref to https://github.com/Non-Contradiction/JuliaCall/issues/34

dmbates commented 6 years ago

With help from @Non-Contradiction I have some parts working.

> library(JuliaCall)
> library(lme4)
Loading required package: Matrix
> j <- julia_setup()  # this can take a while
Julia version 0.6.2 at location /usr/local/src/julia-d386e40c17/bin will be used.
Julia initiation...
Finish Julia initiation.
Loading setup script for JuliaCall...
Finish loading setup script for JuliaCall.
> j$library("MixedModels")
WARNING: Method definition model_response(StatsModels.ModelFrame) in module StatsModels at /home/bates/.julia/v0.6/StatsModels/src/modelframe.jl:179 overwritten in module MixedModels at /home/bates/.julia/v0.6/MixedModels/src/pls.jl:61.
WARNING: Method definition model_response(StatsModels.ModelFrame) in module StatsModels at /home/bates/.julia/v0.6/StatsModels/src/modelframe.jl:179 overwritten in module MixedModels at /home/bates/.julia/v0.6/MixedModels/src/pls.jl:61.
> j$assign("insteval", InstEval)  # copy data.frame to Julia
> j$eval("fm1 = fit!(lmm(@formula(y ~ 1 + service + (1|s) + (1|d) + (1|dept)), insteval))")
Julia Object of type MixedModels.LinearMixedModel{Float64}.
Linear mixed model fit by maximum likelihood
 Formula: y ~ 1 + service + (1 | s) + (1 | d) + (1 | dept)
     logLik        -2 logLik          AIC             BIC       
 -1.18860884×10⁵  2.37721769×10⁵  2.37733769×10⁵  2.37788993×10⁵

Variance components:
              Column     Variance    Std.Dev.  
 s        (Intercept)  0.1059726808 0.32553445
 d        (Intercept)  0.2652041233 0.51497973
 dept     (Intercept)  0.0061677025 0.07853472
 Residual              1.3864886097 1.17749251
 Number of obs: 73421; levels of grouping factors: 2972, 1128, 14

  Fixed-effects parameters:
               Estimate Std.Error  z value P(>|z|)
(Intercept)     3.28258 0.0284114  115.537  <1e-99
service: 1   -0.0925886 0.0133832 -6.91828  <1e-11
> # refit for timing.  First Julia fit is slower due to Just-In-Time (JIT) compilation
> j$command("@time fit!(lmm(@formula(y ~ 1 + service + (1|s) + (1|d) + (1|dept)), insteval));")
  2.698193 seconds (350.31 k allocations: 203.566 MiB, 4.21% gc time)
> system.time(fm1 <- lmer(y ~ 1 + service + (1|s) + (1|d) + (1|dept), InstEval, REML=FALSE))
   user  system elapsed 
 20.877   0.034  20.914 
> summary(fm1)
Linear mixed model fit by maximum likelihood  ['lmerMod']
Formula: y ~ 1 + service + (1 | s) + (1 | d) + (1 | dept)
   Data: InstEval

      AIC       BIC    logLik  deviance  df.resid 
 237733.8  237789.0 -118860.9  237721.8     73415 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.0595 -0.7479  0.0405  0.7723  3.1987 

Random effects:
 Groups   Name        Variance Std.Dev.
 s        (Intercept) 0.105973 0.32553 
 d        (Intercept) 0.265204 0.51498 
 dept     (Intercept) 0.006168 0.07854 
 Residual             1.386489 1.17749 
Number of obs: 73421, groups:  s, 2972; d, 1128; dept, 14

Fixed effects:
            Estimate Std. Error t value
(Intercept)  3.28258    0.02841  115.54
service1    -0.09259    0.01338   -6.92

Correlation of Fixed Effects:
         (Intr)
service1 -0.156

@randy3k is taking a look at direct translation of R formulas to Julia formulas which will make things cleaner. In Julia a formula is either created explicitly (i.e. Formula(:y, :(1+service+(1|s)+(1|d)+(1|dept))) or by calling the formula macro as above. If the RCall package provides rcopy methods for an R formula then the formula and data objects can be passed from R without explicitly copying them to Julia.

palday commented 5 years ago

I've made some progress on this, which I will hopefully wrap up into a very rudimentary package and throw it up in GitHub in the next few days.

Review from previous efforts

But using some of the slightly changed syntax/names.

> library("JuliaCall")
> library("lme4")
Loading required package: Matrix
> 
> julia_setup()  # this can take a while
Julia version 1.1.0 at location /usr/local/bin will be used.
Loading setup script for JuliaCall...
Finish loading setup script for JuliaCall.
> julia_library("MixedModels")
> julia_assign("insteval", InstEval)  # copy data.frame to Julia
> julia_command("@time fm1 = fit!(LinearMixedModel(@formula(y ~ 1 + service + (1|s) + (1|d) + (1|dept)), insteval),REML=false)")
 21.746567 seconds (34.87 M allocations: 1.900 GiB, 3.43% gc time)
Linear mixed model fit by maximum likelihood
 Formula: y ~ 1 + service + (1 | s) + (1 | d) + (1 | dept)
     logLik        -2 logLik          AIC             BIC       
 -1.18860884×10⁵  2.37721769×10⁵  2.37733769×10⁵  2.37788993×10⁵

Variance components:
              Column     Variance     Std.Dev.  
 s        (Intercept)  0.1059726260 0.325534370
 d        (Intercept)  0.2652046180 0.514980211
 dept     (Intercept)  0.0061677886 0.078535270
 Residual              1.3864885909 1.177492501
 Number of obs: 73421; levels of grouping factors: 2972, 1128, 14

  Fixed-effects parameters:
               Estimate Std.Error  z value P(>|z|)
(Intercept)     3.28258 0.0284115  115.537  <1e-99
service: 1   -0.0925886 0.0133832 -6.91828  <1e-11

> # refit for timing.  First Julia fit is slower due to Just-In-Time (JIT) compilation
> system.time(julia_command("@time fm1 = fit!(LinearMixedModel(@formula(y ~ 1 + service + (1|s) + (1|d) + (1|dept)), insteval));"))
  2.928610 seconds (468.10 k allocations: 210.866 MiB, 3.06% gc time)
   user  system elapsed 
 13.696   6.153   2.934 
> system.time(fm1 <- lmer(y ~ 1 + service + (1|s) + (1|d) + (1|dept), InstEval, REML=FALSE))
   user  system elapsed 
 18.918   0.074  19.124 
> summary(fm1)
Linear mixed model fit by maximum likelihood  ['lmerMod']
Formula: y ~ 1 + service + (1 | s) + (1 | d) + (1 | dept)
   Data: InstEval

      AIC       BIC    logLik  deviance  df.resid 
 237733.8  237789.0 -118860.9  237721.8     73415 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.0595 -0.7479  0.0405  0.7723  3.1987 

Random effects:
 Groups   Name        Variance Std.Dev.
 s        (Intercept) 0.105968 0.32553 
 d        (Intercept) 0.265207 0.51498 
 dept     (Intercept) 0.006165 0.07852 
 Residual             1.386490 1.17749 
Number of obs: 73421, groups:  s, 2972; d, 1128; dept, 14

Fixed effects:
            Estimate Std. Error t value
(Intercept)  3.28258    0.02841 115.550
service1    -0.09259    0.01338  -6.918

Correlation of Fixed Effects:
         (Intr)
service1 -0.156

Take advantage of lme4's modular structure

Even though it means that the same computations are done twice, once in R and once in Julia, we can take advantage of lme4's modular structure to generate the bits of a merMod object that aren't dependent on the completed optimization. For non trivial models, this doesn't slow us down that much:

> # Cheating by depending on lme4's modular infrastructure doesn't hurt us that much.
> # Here's the breakdown of how long each step takes:
> 
> system.time(parsedFormula <- lFormula(formula = y ~ 1 + service + (1|s) + (1|d) + (1|dept),
+                                       data = InstEval,
+                                       REML=FALSE))
   user  system elapsed 
  0.810   0.012   0.823 
> system.time(devianceFunction <- do.call(mkLmerDevfun, parsedFormula))
   user  system elapsed 
  0.511   0.001   0.511 
> system.time(optimizerOutput <- optimizeLmer(devianceFunction))
   user  system elapsed 
 18.165   0.382  18.602 
> system.time(mkMerMod( rho = environment(devianceFunction),
+                       opt = optimizerOutput,
+                       reTrms = parsedFormula$reTrms,
+                       fr = parsedFormula$fr))
   user  system elapsed 
  0.002   0.000   0.003 

So here's a wrapper function that takes advantage of all that:

jmer <- function(formula, data, REML=TRUE){
    julia_library("MixedModels")
    # how to handle implicit intercept?
    #
    # how are contrasts set via contrasts() handled?
    #
    # how are weights and offsets handled?
    #
    # we also otherwise assume that we're using the subset of the R formula
    # syntax supported by Julia should probably add in some additional checks
    # here.
    jf <- deparse(formula)
    jreml = ifelse(REML, "true", "false")

    julia_assign("jlmerdat",data)
    jcall <- sprintf("jm = fit!(LinearMixedModel(@formula(%s),jlmerdat),REML=%s);",jf,jreml)

    jout <- julia_command(jcall)

    optimizerOutput <- list(par=julia_eval("jm.optsum.final"),
                         fval=julia_eval("jm.optsum.fmin"),
                         feval=julia_eval("jm.optsum.feval"),
                         # MixedModels.jl doesn't yet implement a lot of the
                         # post-fit convergence checks that lme4 does but a very
                         # crude one is provided by checking whether we reached
                         # iteration stop. Julia has a few good packages for
                         # Automatic Differentiation, maybe it's worthwhile to
                         # use those for the gradient and Hessian checks to
                         # really speed things up?
                         conv=julia_eval("jm.optsum.maxfeval") == julia_eval("jm.optsum.feval"),
                         message=julia_eval("jm.optsum.returnvalue"),
                         optimizer=julia_eval("jm.optsum.optimizer"))

    # we could extract this from the julia object, but the parsing is quite fast
    # in lme4 and then we don't need to worry about converting formats and
    # datatypes

    parsedFormula <- lFormula(formula=formula,data=data, REML=REML)
    # this bit should probably be reworked to extract the julia fields
    devianceFunction <- do.call(mkLmerDevfun, parsedFormula)
    rho <- environment(devianceFunction)
    # especially rho$resp seems relevant

    mkMerMod(rho = rho,
            opt = optimizerOutput,
            reTrms = parsedFormula$reTrms,
            fr = parsedFormula$fr)
}

Note that this assumes that the model is "nice" and belongs to the subset where lme4 and MixedModels.jl line up perfectly, but that's good enough for now.

There is some overhead shuttling things across the bridge

But it's still faster than lme4:

> system.time(jm1 <- jmer(y ~ 1 + service + (1|s) + (1|d) + (1|dept), InstEval, REML=FALSE))
   user  system elapsed 
 17.689   7.176   6.537 

Something is still missing though

There do appear to be few things going wrong though. Note that the Julia fit above was identical to the lme4 fit, but that's not the case here:

> summary(jm1)
Linear mixed model fit by maximum likelihood  ['lmerMod']

      AIC       BIC    logLik  deviance  df.resid 
 237733.8  237789.0 -118860.9  237721.8     73415 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.1698 -0.7467  0.0412  0.7671  3.2899 

Random effects:
 Groups   Name        Variance Std.Dev.
 s        (Intercept) 0.17954  0.4237  
 d        (Intercept) 0.32511  0.5702  
 dept     (Intercept) 0.01434  0.1198  
 Residual             1.36613  1.1688  
Number of obs: 73421, groups:  s, 2972; d, 1128; dept, 14

Fixed effects:
            Estimate Std. Error t value
(Intercept)  3.28629    0.03850  85.352
service1    -0.08963    0.01355  -6.613

Correlation of Fixed Effects:
         (Intr)
service1 -0.117
> summary(fm1)
Linear mixed model fit by maximum likelihood  ['lmerMod']
Formula: y ~ 1 + service + (1 | s) + (1 | d) + (1 | dept)
   Data: InstEval

      AIC       BIC    logLik  deviance  df.resid 
 237733.8  237789.0 -118860.9  237721.8     73415 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.0595 -0.7479  0.0405  0.7723  3.1987 

Random effects:
 Groups   Name        Variance Std.Dev.
 s        (Intercept) 0.105968 0.32553 
 d        (Intercept) 0.265207 0.51498 
 dept     (Intercept) 0.006165 0.07852 
 Residual             1.386490 1.17749 
Number of obs: 73421, groups:  s, 2972; d, 1128; dept, 14

Fixed effects:
            Estimate Std. Error t value
(Intercept)  3.28258    0.02841 115.550
service1    -0.09259    0.01338  -6.918

Correlation of Fixed Effects:
         (Intr)
service1 -0.156

Since the fixed-effects estimates look okay and the random-effects are off by a relatively small amount, I'm guessing that I've forgotten to scale something somewhere when moving things back to R. Any pointers on this front would be much appreciated. Am I missing something really obvious @dmbates or @bbolker ?

Beyond that, there's a fair amount of error-checking, etc. that could be added as this is wrapped up into a package as well as larger differences for more-interesting models (e.g. offsets, weights, families, contrast condings, which side of the bridge to use for dealing with missing data, how to handle factors where not all levels are present in the data, etc.)

bbolker commented 5 years ago

Was just getting geared up to try this out. Currently segfaults on my machine from a clean session. Any ideas? Should I post this on the JuliaCall issues list ? (julia --version returns 1.1.1)

> sessionInfo()
R Under development (unstable) (2019-06-28 r76752)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.6 LTS

Matrix products: default
BLAS:   /usr/local/lib/R/lib/libRblas.so
LAPACK: /usr/local/lib/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_CA.UTF8       LC_NUMERIC=C             
 [3] LC_TIME=en_CA.UTF8        LC_COLLATE=en_CA.UTF8    
 [5] LC_MONETARY=en_CA.UTF8    LC_MESSAGES=en_CA.UTF8   
 [7] LC_PAPER=en_CA.UTF8       LC_NAME=C                
 [9] LC_ADDRESS=C              LC_TELEPHONE=C           
[11] LC_MEASUREMENT=en_CA.UTF8 LC_IDENTIFICATION=C      

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] compiler_3.7.0
> library(JuliaCall)
> julia_setup()
Julia version 1.1.1 at location /usr/local/lib/julia-1.1.1/bin will be used.
Error in dyn.load(.julia$dll_file) : 
  unable to load shared object '/usr/local/lib/julia-1.1.1/bin/../lib/libjulia.so.1':
  /usr/lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.22' not found (required by /usr/local/lib/julia-1.1.1/bin/../lib/julia/libLLVM-6.0.so)

 *** caught segfault ***
address 0x38, cause 'memory not mapped'

Traceback:
 1: juliacall_initialize(.julia$dll_file)
 2: julia_setup()

Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace
palday commented 5 years ago

Hmmm, that might be a documentation issue in JuliaCall, but it looks like you're missing the right g++ version. Are you able to start Julia directly from the command line / not from within R? If not and you installed this via a PPA, then it's a packaging issue for Julia itself.

Non-Contradiction commented 5 years ago

Yes, it seems that the correct version of libstdc++.so.6 is not found by R. Maybe the system version of the library is a little old. The good thing is that Julia binary brings the correct version of libstdc++ itself under the folder '/usr/local/lib/julia-1.1.1/bin/../lib/' So the problem could be typically solved by exporting the folder into R_LD_LIBRARY_PATH or LD_LIBRARY_PATH so the Julia library could be found by R as in https://github.com/Non-Contradiction/JuliaCall#troubleshooting-and-ways-to-get-help.

bbolker commented 5 years ago

thanks! I've looked at the troubleshooting advice but so far failed to get anywhere. I've opened an issue on the JuliaCall repo enumerating what I've tried ... thanks for the help!

ds814 commented 4 years ago

Did someone ever work out how to get the correct random-effect variances from Julia into R?

palday commented 4 years ago

Not yet in a pretty wrapper, but see this example.

Note that the limit on optimization steps when converting the model to R can give wrong results if you're using fancy contrast coding unless you take steps to get those contrasts in Julia.

bbolker commented 4 years ago

For the record, I haven't yet managed to sort out my problems with JuliaCall ... hopefully I will get back to it soon, this does seem like a great idea.

palday commented 4 years ago

I also have some more extensive work on adding this functionality to MixedModels.jl for use with RCall from the Julia side. When that implementation is done, it will make creating the wrapper around the JuliaCall much easier. All that said, I don't know if that functionality will stay part of MixedModels.jl or will be spun off into a separate package (because requiring RCall is a pretty big dependency and doing the necessary testing on both R and Julia makes the test suite a LOT bigger).

ds814 commented 4 years ago

I believe I have got this working. I did have to hard code the formula when calling lFormula for some reason. I am not sure what I was doing wrong. My formula was:

outcome ~ 0 + outcome_no + bp_treat + statin_use + agemj:outcome_no + (1 + outcome_no | patid)

Both outcome_no and patid are factors

This produced the error:

ERROR: REvalError: Warning in Ops.factor(0, outcome_no) : ‘+’ not meaningful for factors Warning in Ops.factor(1, outcome_no) : ‘+’ not meaningful for factors Warning in Ops.factor(1 + outcome_no, patid) : ‘+’ not meaningful for factors Warning in Ops.factor(outcome_no, (1 + outcome_no + patid)) : ‘+’ not meaningful for factors Error: couldn't evaluate grouping factor patid within model frame: try adding grouping factor to data frame explicitly if possible

Ideally, I would use (0 + outcome_no | patid) rather than (1 + outcome_no | patid) like I do with lmer as it saves fiddling around with an intercept. 0 removes the intercept in Julia but doesn't add the missing term back in. Am I missing syntax here?

Fortunately, I seemed to avoid the difference in the random effects as seen above. I believe I am running into similar issues to https://github.com/JuliaStats/MixedModels.jl/issues/160 Julia uses several times as much memory as lme or lmer use in R for the same model. I currently just about have the memory but the purpose of switching to Julia was to allow expansion to more parameters/factor levels. This will be limited by memory I think.

palday commented 4 years ago

As a rule, MixedModels.jl should use less memory than lme4, both because of the differences in how memory allocation is handled between the languages and because of a few algorithmic improvements to the actual fitting process. If you're running this via JuliaCall though, it will use more memory, because you have: (1) the dataframe in R, (2) the dataframe in Julia, (3) the mixed model in Julia, (4) the mixed model in R (including a few steps of fitting process!) and potentially (5) a copy of the dataframe in R as part of the lmerMod object (this should just be a reference to the original dataframe, but I can imagine a few ways in which this would be a proper copy).

Regarding the formula: In the R code snippet linked above, the formula is copied as a string into Julia, which will work fine for the most common formula bits (+, intercept, *, and the lme4 grouping syntax), but will fail for most everything else. In particular, the Julia syntax for interactions (&) is different than R's (:). There's also some issues related to implicit conversion of characters/strings into categorical types -- in R, this will usually work (at least in core packages), but this is not the case in Julia.

As a sidebar: looking at your formula -- is this a GLMM? The bridge between lme4 and Julia is somewhat harder for GLMMs (which is why I haven't finished those bits yet). There are questions about what we mean when we say 'deviance' (#375) and some subtleties in how to discuss models with dichotomous outcomes (Bernoulli vs. Binomial, alternative syntax for various ways of representing binomial data). These are all things I plan on implementing at some point in the indeterminate middle future.

All that said: have you thought about doing this from within Julia? Using RCall, you can very easily do all the data manipulation up until fitting in R (as in: using R from within Julia), then do the model fitting in Julia. If you do go this route and still have performance difficulties, raise an issue over on the MixedModels.jl. We can of course help more if you're able to share your data (even privately).

ds814 commented 4 years ago

I will experiment further with just having the data in Julia to reduce memory usage. I will need the model converted to lme4 at some point but that could be done on a separate job with more memory if necessary. This should be quick compared with running the models. With the Julia package, is it possible to calculate BLUPs using new levels of the random effects? This is necessary for cross-validation and applying the model in real practice to new patients. I have software that will do it based on lme and lmer.

I am fitting a multivariate model using univariate software. I iteratively run models rescaling each of the outcomes so that they have the same residual error variance. I have confirmed that this works using lme4 comparing to lme with varIdent. I just need more speed.

Unfortunately, I can't share data as it is health data.

palday commented 4 years ago

JuliaCall in R uses RCall from the embedded Julia session, so extending RCall to support lme4/MixedModels covers a lot of the necessary ground for package compatibilty. I have spun my effort to do this off as a draft Julia package: JellyMe4. This should make parts of the sub-skeletal R package jlme easier to write. While I'm self-advertising, you can also use this bridge to take advantage of lmerOut to create nice HTML and LaTeX for model summaries.

That said, I'm not particularly motivated to put forth any effort in jlme, as I increasingly spend less and less time in R and more and more time in Julia.

@bbolker I think this basically closes this issue by passing it on to my little side projects....