Open dmbates opened 6 years ago
I am able to fit a model using MixedModels
through JuliaCall
. Right now I am having some difficulties with the initial call to julia_setup()
hanging sometimes and I haven't tracked that down but the results are encouraging. I'll probably check in with @Non-Contradiction and perhaps with @randy3k (the RCall
expert) and exchange ideas about how best to accomplish this. @bbolker and @mmaechler - do you want to be part of the discussion?
What is the problem in julia_setup
? Maybe open an issue for JuliaCall
?
With help from @Non-Contradiction I have some parts working.
> library(JuliaCall)
> library(lme4)
Loading required package: Matrix
> j <- julia_setup() # this can take a while
Julia version 0.6.2 at location /usr/local/src/julia-d386e40c17/bin will be used.
Julia initiation...
Finish Julia initiation.
Loading setup script for JuliaCall...
Finish loading setup script for JuliaCall.
> j$library("MixedModels")
WARNING: Method definition model_response(StatsModels.ModelFrame) in module StatsModels at /home/bates/.julia/v0.6/StatsModels/src/modelframe.jl:179 overwritten in module MixedModels at /home/bates/.julia/v0.6/MixedModels/src/pls.jl:61.
WARNING: Method definition model_response(StatsModels.ModelFrame) in module StatsModels at /home/bates/.julia/v0.6/StatsModels/src/modelframe.jl:179 overwritten in module MixedModels at /home/bates/.julia/v0.6/MixedModels/src/pls.jl:61.
> j$assign("insteval", InstEval) # copy data.frame to Julia
> j$eval("fm1 = fit!(lmm(@formula(y ~ 1 + service + (1|s) + (1|d) + (1|dept)), insteval))")
Julia Object of type MixedModels.LinearMixedModel{Float64}.
Linear mixed model fit by maximum likelihood
Formula: y ~ 1 + service + (1 | s) + (1 | d) + (1 | dept)
logLik -2 logLik AIC BIC
-1.18860884×10⁵ 2.37721769×10⁵ 2.37733769×10⁵ 2.37788993×10⁵
Variance components:
Column Variance Std.Dev.
s (Intercept) 0.1059726808 0.32553445
d (Intercept) 0.2652041233 0.51497973
dept (Intercept) 0.0061677025 0.07853472
Residual 1.3864886097 1.17749251
Number of obs: 73421; levels of grouping factors: 2972, 1128, 14
Fixed-effects parameters:
Estimate Std.Error z value P(>|z|)
(Intercept) 3.28258 0.0284114 115.537 <1e-99
service: 1 -0.0925886 0.0133832 -6.91828 <1e-11
> # refit for timing. First Julia fit is slower due to Just-In-Time (JIT) compilation
> j$command("@time fit!(lmm(@formula(y ~ 1 + service + (1|s) + (1|d) + (1|dept)), insteval));")
2.698193 seconds (350.31 k allocations: 203.566 MiB, 4.21% gc time)
> system.time(fm1 <- lmer(y ~ 1 + service + (1|s) + (1|d) + (1|dept), InstEval, REML=FALSE))
user system elapsed
20.877 0.034 20.914
> summary(fm1)
Linear mixed model fit by maximum likelihood ['lmerMod']
Formula: y ~ 1 + service + (1 | s) + (1 | d) + (1 | dept)
Data: InstEval
AIC BIC logLik deviance df.resid
237733.8 237789.0 -118860.9 237721.8 73415
Scaled residuals:
Min 1Q Median 3Q Max
-3.0595 -0.7479 0.0405 0.7723 3.1987
Random effects:
Groups Name Variance Std.Dev.
s (Intercept) 0.105973 0.32553
d (Intercept) 0.265204 0.51498
dept (Intercept) 0.006168 0.07854
Residual 1.386489 1.17749
Number of obs: 73421, groups: s, 2972; d, 1128; dept, 14
Fixed effects:
Estimate Std. Error t value
(Intercept) 3.28258 0.02841 115.54
service1 -0.09259 0.01338 -6.92
Correlation of Fixed Effects:
(Intr)
service1 -0.156
@randy3k is taking a look at direct translation of R formulas to Julia formulas which will make things cleaner. In Julia a formula is either created explicitly (i.e. Formula(:y, :(1+service+(1|s)+(1|d)+(1|dept)))
or by calling the formula
macro as above. If the RCall
package provides rcopy
methods for an R formula then the formula and data objects can be passed from R without explicitly copying them to Julia.
I've made some progress on this, which I will hopefully wrap up into a very rudimentary package and throw it up in GitHub in the next few days.
But using some of the slightly changed syntax/names.
> library("JuliaCall")
> library("lme4")
Loading required package: Matrix
>
> julia_setup() # this can take a while
Julia version 1.1.0 at location /usr/local/bin will be used.
Loading setup script for JuliaCall...
Finish loading setup script for JuliaCall.
> julia_library("MixedModels")
> julia_assign("insteval", InstEval) # copy data.frame to Julia
> julia_command("@time fm1 = fit!(LinearMixedModel(@formula(y ~ 1 + service + (1|s) + (1|d) + (1|dept)), insteval),REML=false)")
21.746567 seconds (34.87 M allocations: 1.900 GiB, 3.43% gc time)
Linear mixed model fit by maximum likelihood
Formula: y ~ 1 + service + (1 | s) + (1 | d) + (1 | dept)
logLik -2 logLik AIC BIC
-1.18860884×10⁵ 2.37721769×10⁵ 2.37733769×10⁵ 2.37788993×10⁵
Variance components:
Column Variance Std.Dev.
s (Intercept) 0.1059726260 0.325534370
d (Intercept) 0.2652046180 0.514980211
dept (Intercept) 0.0061677886 0.078535270
Residual 1.3864885909 1.177492501
Number of obs: 73421; levels of grouping factors: 2972, 1128, 14
Fixed-effects parameters:
Estimate Std.Error z value P(>|z|)
(Intercept) 3.28258 0.0284115 115.537 <1e-99
service: 1 -0.0925886 0.0133832 -6.91828 <1e-11
> # refit for timing. First Julia fit is slower due to Just-In-Time (JIT) compilation
> system.time(julia_command("@time fm1 = fit!(LinearMixedModel(@formula(y ~ 1 + service + (1|s) + (1|d) + (1|dept)), insteval));"))
2.928610 seconds (468.10 k allocations: 210.866 MiB, 3.06% gc time)
user system elapsed
13.696 6.153 2.934
> system.time(fm1 <- lmer(y ~ 1 + service + (1|s) + (1|d) + (1|dept), InstEval, REML=FALSE))
user system elapsed
18.918 0.074 19.124
> summary(fm1)
Linear mixed model fit by maximum likelihood ['lmerMod']
Formula: y ~ 1 + service + (1 | s) + (1 | d) + (1 | dept)
Data: InstEval
AIC BIC logLik deviance df.resid
237733.8 237789.0 -118860.9 237721.8 73415
Scaled residuals:
Min 1Q Median 3Q Max
-3.0595 -0.7479 0.0405 0.7723 3.1987
Random effects:
Groups Name Variance Std.Dev.
s (Intercept) 0.105968 0.32553
d (Intercept) 0.265207 0.51498
dept (Intercept) 0.006165 0.07852
Residual 1.386490 1.17749
Number of obs: 73421, groups: s, 2972; d, 1128; dept, 14
Fixed effects:
Estimate Std. Error t value
(Intercept) 3.28258 0.02841 115.550
service1 -0.09259 0.01338 -6.918
Correlation of Fixed Effects:
(Intr)
service1 -0.156
Even though it means that the same computations are done twice, once in R and once in Julia, we can take advantage of lme4's modular structure to generate the bits of a merMod
object that aren't dependent on the completed optimization. For non trivial models, this doesn't slow us down that much:
> # Cheating by depending on lme4's modular infrastructure doesn't hurt us that much.
> # Here's the breakdown of how long each step takes:
>
> system.time(parsedFormula <- lFormula(formula = y ~ 1 + service + (1|s) + (1|d) + (1|dept),
+ data = InstEval,
+ REML=FALSE))
user system elapsed
0.810 0.012 0.823
> system.time(devianceFunction <- do.call(mkLmerDevfun, parsedFormula))
user system elapsed
0.511 0.001 0.511
> system.time(optimizerOutput <- optimizeLmer(devianceFunction))
user system elapsed
18.165 0.382 18.602
> system.time(mkMerMod( rho = environment(devianceFunction),
+ opt = optimizerOutput,
+ reTrms = parsedFormula$reTrms,
+ fr = parsedFormula$fr))
user system elapsed
0.002 0.000 0.003
So here's a wrapper function that takes advantage of all that:
jmer <- function(formula, data, REML=TRUE){
julia_library("MixedModels")
# how to handle implicit intercept?
#
# how are contrasts set via contrasts() handled?
#
# how are weights and offsets handled?
#
# we also otherwise assume that we're using the subset of the R formula
# syntax supported by Julia should probably add in some additional checks
# here.
jf <- deparse(formula)
jreml = ifelse(REML, "true", "false")
julia_assign("jlmerdat",data)
jcall <- sprintf("jm = fit!(LinearMixedModel(@formula(%s),jlmerdat),REML=%s);",jf,jreml)
jout <- julia_command(jcall)
optimizerOutput <- list(par=julia_eval("jm.optsum.final"),
fval=julia_eval("jm.optsum.fmin"),
feval=julia_eval("jm.optsum.feval"),
# MixedModels.jl doesn't yet implement a lot of the
# post-fit convergence checks that lme4 does but a very
# crude one is provided by checking whether we reached
# iteration stop. Julia has a few good packages for
# Automatic Differentiation, maybe it's worthwhile to
# use those for the gradient and Hessian checks to
# really speed things up?
conv=julia_eval("jm.optsum.maxfeval") == julia_eval("jm.optsum.feval"),
message=julia_eval("jm.optsum.returnvalue"),
optimizer=julia_eval("jm.optsum.optimizer"))
# we could extract this from the julia object, but the parsing is quite fast
# in lme4 and then we don't need to worry about converting formats and
# datatypes
parsedFormula <- lFormula(formula=formula,data=data, REML=REML)
# this bit should probably be reworked to extract the julia fields
devianceFunction <- do.call(mkLmerDevfun, parsedFormula)
rho <- environment(devianceFunction)
# especially rho$resp seems relevant
mkMerMod(rho = rho,
opt = optimizerOutput,
reTrms = parsedFormula$reTrms,
fr = parsedFormula$fr)
}
Note that this assumes that the model is "nice" and belongs to the subset where lme4 and MixedModels.jl line up perfectly, but that's good enough for now.
But it's still faster than lme4:
> system.time(jm1 <- jmer(y ~ 1 + service + (1|s) + (1|d) + (1|dept), InstEval, REML=FALSE))
user system elapsed
17.689 7.176 6.537
There do appear to be few things going wrong though. Note that the Julia fit above was identical to the lme4 fit, but that's not the case here:
> summary(jm1)
Linear mixed model fit by maximum likelihood ['lmerMod']
AIC BIC logLik deviance df.resid
237733.8 237789.0 -118860.9 237721.8 73415
Scaled residuals:
Min 1Q Median 3Q Max
-3.1698 -0.7467 0.0412 0.7671 3.2899
Random effects:
Groups Name Variance Std.Dev.
s (Intercept) 0.17954 0.4237
d (Intercept) 0.32511 0.5702
dept (Intercept) 0.01434 0.1198
Residual 1.36613 1.1688
Number of obs: 73421, groups: s, 2972; d, 1128; dept, 14
Fixed effects:
Estimate Std. Error t value
(Intercept) 3.28629 0.03850 85.352
service1 -0.08963 0.01355 -6.613
Correlation of Fixed Effects:
(Intr)
service1 -0.117
> summary(fm1)
Linear mixed model fit by maximum likelihood ['lmerMod']
Formula: y ~ 1 + service + (1 | s) + (1 | d) + (1 | dept)
Data: InstEval
AIC BIC logLik deviance df.resid
237733.8 237789.0 -118860.9 237721.8 73415
Scaled residuals:
Min 1Q Median 3Q Max
-3.0595 -0.7479 0.0405 0.7723 3.1987
Random effects:
Groups Name Variance Std.Dev.
s (Intercept) 0.105968 0.32553
d (Intercept) 0.265207 0.51498
dept (Intercept) 0.006165 0.07852
Residual 1.386490 1.17749
Number of obs: 73421, groups: s, 2972; d, 1128; dept, 14
Fixed effects:
Estimate Std. Error t value
(Intercept) 3.28258 0.02841 115.550
service1 -0.09259 0.01338 -6.918
Correlation of Fixed Effects:
(Intr)
service1 -0.156
Since the fixed-effects estimates look okay and the random-effects are off by a relatively small amount, I'm guessing that I've forgotten to scale something somewhere when moving things back to R. Any pointers on this front would be much appreciated. Am I missing something really obvious @dmbates or @bbolker ?
Beyond that, there's a fair amount of error-checking, etc. that could be added as this is wrapped up into a package as well as larger differences for more-interesting models (e.g. offsets, weights, families, contrast condings, which side of the bridge to use for dealing with missing data, how to handle factors where not all levels are present in the data, etc.)
Was just getting geared up to try this out. Currently segfaults on my machine from a clean session. Any ideas? Should I post this on the JuliaCall issues list ? (julia --version
returns 1.1.1)
> sessionInfo()
R Under development (unstable) (2019-06-28 r76752)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.6 LTS
Matrix products: default
BLAS: /usr/local/lib/R/lib/libRblas.so
LAPACK: /usr/local/lib/R/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_CA.UTF8 LC_NUMERIC=C
[3] LC_TIME=en_CA.UTF8 LC_COLLATE=en_CA.UTF8
[5] LC_MONETARY=en_CA.UTF8 LC_MESSAGES=en_CA.UTF8
[7] LC_PAPER=en_CA.UTF8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_CA.UTF8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] compiler_3.7.0
> library(JuliaCall)
> julia_setup()
Julia version 1.1.1 at location /usr/local/lib/julia-1.1.1/bin will be used.
Error in dyn.load(.julia$dll_file) :
unable to load shared object '/usr/local/lib/julia-1.1.1/bin/../lib/libjulia.so.1':
/usr/lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.22' not found (required by /usr/local/lib/julia-1.1.1/bin/../lib/julia/libLLVM-6.0.so)
*** caught segfault ***
address 0x38, cause 'memory not mapped'
Traceback:
1: juliacall_initialize(.julia$dll_file)
2: julia_setup()
Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace
Hmmm, that might be a documentation issue in JuliaCall
, but it looks like you're missing the right g++ version. Are you able to start Julia directly from the command line / not from within R? If not and you installed this via a PPA, then it's a packaging issue for Julia itself.
Yes, it seems that the correct version of libstdc++.so.6 is not found by R. Maybe the system version of the library is a little old. The good thing is that Julia binary brings the correct version of libstdc++ itself under the folder '/usr/local/lib/julia-1.1.1/bin/../lib/' So the problem could be typically solved by exporting the folder into R_LD_LIBRARY_PATH or LD_LIBRARY_PATH so the Julia library could be found by R as in https://github.com/Non-Contradiction/JuliaCall#troubleshooting-and-ways-to-get-help.
thanks! I've looked at the troubleshooting advice but so far failed to get anywhere. I've opened an issue on the JuliaCall repo enumerating what I've tried ... thanks for the help!
Did someone ever work out how to get the correct random-effect variances from Julia into R?
Not yet in a pretty wrapper, but see this example.
Note that the limit on optimization steps when converting the model to R can give wrong results if you're using fancy contrast coding unless you take steps to get those contrasts in Julia.
For the record, I haven't yet managed to sort out my problems with JuliaCall ... hopefully I will get back to it soon, this does seem like a great idea.
I also have some more extensive work on adding this functionality to MixedModels.jl for use with RCall from the Julia side. When that implementation is done, it will make creating the wrapper around the JuliaCall much easier. All that said, I don't know if that functionality will stay part of MixedModels.jl or will be spun off into a separate package (because requiring RCall is a pretty big dependency and doing the necessary testing on both R and Julia makes the test suite a LOT bigger).
I believe I have got this working. I did have to hard code the formula when calling lFormula for some reason. I am not sure what I was doing wrong. My formula was:
outcome ~ 0 + outcome_no + bp_treat + statin_use + agemj:outcome_no + (1 + outcome_no | patid)
Both outcome_no and patid are factors
This produced the error:
ERROR: REvalError: Warning in Ops.factor(0, outcome_no) : ‘+’ not meaningful for factors Warning in Ops.factor(1, outcome_no) : ‘+’ not meaningful for factors Warning in Ops.factor(1 + outcome_no, patid) : ‘+’ not meaningful for factors Warning in Ops.factor(outcome_no, (1 + outcome_no + patid)) : ‘+’ not meaningful for factors Error: couldn't evaluate grouping factor patid within model frame: try adding grouping factor to data frame explicitly if possible
Ideally, I would use (0 + outcome_no | patid) rather than (1 + outcome_no | patid) like I do with lmer as it saves fiddling around with an intercept. 0 removes the intercept in Julia but doesn't add the missing term back in. Am I missing syntax here?
Fortunately, I seemed to avoid the difference in the random effects as seen above. I believe I am running into similar issues to https://github.com/JuliaStats/MixedModels.jl/issues/160 Julia uses several times as much memory as lme or lmer use in R for the same model. I currently just about have the memory but the purpose of switching to Julia was to allow expansion to more parameters/factor levels. This will be limited by memory I think.
As a rule, MixedModels.jl should use less memory than lme4, both because of the differences in how memory allocation is handled between the languages and because of a few algorithmic improvements to the actual fitting process. If you're running this via JuliaCall though, it will use more memory, because you have: (1) the dataframe in R, (2) the dataframe in Julia, (3) the mixed model in Julia, (4) the mixed model in R (including a few steps of fitting process!) and potentially (5) a copy of the dataframe in R as part of the lmerMod object (this should just be a reference to the original dataframe, but I can imagine a few ways in which this would be a proper copy).
Regarding the formula: In the R code snippet linked above, the formula is copied as a string into Julia, which will work fine for the most common formula bits (+
, intercept, *
, and the lme4 grouping syntax), but will fail for most everything else. In particular, the Julia syntax for interactions (&
) is different than R's (:
). There's also some issues related to implicit conversion of characters/strings into categorical types -- in R, this will usually work (at least in core packages), but this is not the case in Julia.
As a sidebar: looking at your formula -- is this a GLMM? The bridge between lme4 and Julia is somewhat harder for GLMMs (which is why I haven't finished those bits yet). There are questions about what we mean when we say 'deviance' (#375) and some subtleties in how to discuss models with dichotomous outcomes (Bernoulli vs. Binomial, alternative syntax for various ways of representing binomial data). These are all things I plan on implementing at some point in the indeterminate middle future.
All that said: have you thought about doing this from within Julia? Using RCall, you can very easily do all the data manipulation up until fitting in R (as in: using R from within Julia), then do the model fitting in Julia. If you do go this route and still have performance difficulties, raise an issue over on the MixedModels.jl. We can of course help more if you're able to share your data (even privately).
I will experiment further with just having the data in Julia to reduce memory usage. I will need the model converted to lme4 at some point but that could be done on a separate job with more memory if necessary. This should be quick compared with running the models. With the Julia package, is it possible to calculate BLUPs using new levels of the random effects? This is necessary for cross-validation and applying the model in real practice to new patients. I have software that will do it based on lme and lmer.
I am fitting a multivariate model using univariate software. I iteratively run models rescaling each of the outcomes so that they have the same residual error variance. I have confirmed that this works using lme4 comparing to lme with varIdent. I just need more speed.
Unfortunately, I can't share data as it is health data.
JuliaCall in R uses RCall from the embedded Julia session, so extending RCall to support lme4/MixedModels covers a lot of the necessary ground for package compatibilty. I have spun my effort to do this off as a draft Julia package: JellyMe4. This should make parts of the sub-skeletal R package jlme
easier to write. While I'm self-advertising, you can also use this bridge to take advantage of lmerOut
to create nice HTML and LaTeX for model summaries.
That said, I'm not particularly motivated to put forth any effort in jlme
, as I increasingly spend less and less time in R and more and more time in Julia.
@bbolker I think this basically closes this issue by passing it on to my little side projects....
Would it be worthwhile my writing up a notebook on how to call the
MixedModels
package functionslmm
andglmm
from withinR
? There is anR
packageJuliaCall
and I could try, perhaps with help from @Non-Contradiction, to work out some examples of fitting models using eitherlmm
orglmm
.I know how to call R from Julia, having been one of the authors of the
RCall
package, but I haven't had much experience calling Julia from R.One question would be what to do with the fitted model. The simplest thing is to pass the converged values of θ parameters back and use them in a
lmer
orglmer
structure but there may be more sophisticated approaches.