High performance modifications - Githubissues

patperry / r-mbest

Moment-Based Estimation for Hierarchical Models (R Package)

Apache License 2.0

24 stars 6 forks source link

High performance modifications #1

Closed hacktuarial closed 8 years ago

hacktuarial commented 8 years ago

This PR has two major modifications to the original mbest implementation.

Fitting group-specific models in parallel using the foreach package.
Rewriting the observation-group index mapping in C++

I had some difficulty incorporating C++ into the package, and ultimately used Roxygen to get it done. Consequently I had to move much of the NAMESPACE import/export functionality into Roxygen format. Unfortunately, this destroyed the ability to conditionally import the sigma function from lme4 or stats depending on the R version.

patperry commented 8 years ago

Wow! Thanks for this!

Rcpp and ROxygen seemed like overkill here, so I stripped that stuff out.

patperry commented 8 years ago

(I have a very strong aversion to C++)

hacktuarial commented 8 years ago

I didn't know any other way to incorporate my C loop, so thanks for showing the way. I am super excited about the methodology in your paper, and glad to provide these performance tweaks.

patperry commented 8 years ago

Glad to hear it!

Just a heads up: I moved the "parallel" option to mhglm.control, and I removed the verbose option. The logging is all still there, but if you want to see it, you need to call basicConfig("INFO").

I made some other minor changes, including switching the whitespace to be consistent with the rest of the package (indent by 4 spaces, not 2). I had to make a few other changes to get R CMD check --as-cran to pass.

I submitted version 0.5 of the package to CRAN; it will hopefully appear in a few days.

patperry commented 8 years ago

By the way, Ningshan Zhang and I are working on some extensions to allow for many more predictor variables. Stay tuned...

hacktuarial commented 8 years ago

Good idea - the parallel option belongs more naturally in mhglm.control. Sorry about the 2 spaces vs 4 spaces inconsistency. I'm excited for it to be on CRAN soon!

I imagine the difficulty with more predictor variables is estimating the large number of terms in the covariance matrix - is this the problem you're working on? Some folks at Stitch Fix are thinking about hacking the method to enforce a diagonal covariance matrix. This would be similar to lme4's "double bar" notation, ||, but we don't really have much theoretical basis for it.

patperry commented 8 years ago

@hacktuarial I just looking through your code again, and I'm confused about the line

 if(parallel) x <- bigmemory::as.big.matrix(x, shared=TRUE)

in rdglm.group.fit. What's the reason for this? In the inner loop, when you call model <- rdglm.fit(x = x[j,,drop=FALSE], ...), the subsetting operation x[j,,drop=FALSE] returns a vanilla R matrix.

patperry commented 8 years ago

oh wait, is this so the foreach loop doesn't copy the object?

hacktuarial commented 8 years ago

Right - without this, foreach makes ncores copies of the original object which can blow up available memory. It's OK to copy subsets of the original object as matrices because they're fairly small, and since I don't need any other functionality from bigmemory, they can be matrix class instead of big.matrix class.

patperry commented 8 years ago

FYI, ranef.mhglm is still using sequential code (which means that predict.mhglm is also sequential). Up for adding parallel support to these functions? To make this easier, I changed mhglm to return the control object as part of its result, so you can test object$control$parallel in these functions.

hacktuarial commented 8 years ago

I'll work on it!

patperry commented 8 years ago

Great! It's probably not worth it to change anything in the predict.mhglm function; it may actually end up being slower if you do. Just focus your efforts on ranef.mhglm.