stan-dev / loo

loo R package for approximate leave-one-out cross-validation (LOO-CV) and Pareto smoothed importance sampling (PSIS)
https://mc-stan.org/loo

Use JuliaCall to Speed Up PSIS #177

Open ParadaCarleton opened 3 years ago

ParadaCarleton commented 3 years ago

Julia is generally a much faster language than R, especially when doing lots of computations. Given that JuliaCall lets you use Julia very easily from within R, I'm wondering if it would be a good idea to recode the more computationally intensive parts of PSIS in Julia. I can fire up a pull request if @avehtari or someone else points me at the parts of the package that would benefit most from being rewritten in Julia.

yao-yl commented 3 years ago

What is the main scalability bottleneck for LOO in practice? I am asking because LOO typically relies on MCMC draws, and my understanding is that the LOO step is often much faster compared with the full MCMC step.

ParadaCarleton commented 3 years ago

What is the main scalability bottleneck for LOO in practice? I am asking because LOO typically relies on MCMC draws, and my understanding is that the LOO step is often much faster compared with the full MCMC step.

Relative to the model, PSIS is definitely a lot faster -- but "relative to the model" is doing a lot of work there, and for large models PSIS can still take a while. It's also worth noting that a couple of Julia packages already want to implement PSIS themselves anyway, so implementing PSIS in Julia is going to get done sooner or later by someone; the effort won't be wasted. We might as well use it to speed up our own version of LOO.

ParadaCarleton commented 3 years ago

I also think writing it in Julia is an easier way to get the improvements requested in #33, given that Julia code is usually easier to understand and write than C++ code; the Julia code could also be used by Julia packages like ArviZ.jl that are interested in reimplementing this.

jgabry commented 3 years ago

I think using Julia instead of C++ is an interesting idea. I certainly agree that it's easier to understand/write Julia code than C++ code (even given my limited experience with Julia). And even though MCMC is the main bottleneck, it's still the case that the loo/psis calculations can be annoyingly slow.

I think the main thing we need before doing this is for someone to do some detailed profiling of the R code to figure out precisely where the bottlenecks are. For example, I bet some of the code in https://github.com/stan-dev/loo/blob/master/R/gpdfit.R is pretty slow, but it's been a while since I actually ran a profiler on it and I don't remember if other parts of the codebase are even slower.
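
For instance, a minimal profiling sketch (the log-likelihood matrix here is simulated, and profvis is just one convenient front end to R's sampling profiler):

library(loo)
library(profvis)

# Fake log-likelihood matrix: 4000 posterior draws by 500 observations.
set.seed(123)
log_lik <- matrix(rnorm(4000 * 500), nrow = 4000, ncol = 500)

# Profile a full PSIS run; the flame graph shows how much time gpdfit()
# takes relative to everything else.
profvis({
  psis(-log_lik, r_eff = NA)
})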

ParadaCarleton commented 3 years ago

Worth noting that most of this is already implemented as part of a Statistical Rethinking package and can be found here. I think they'd be cool with us reusing their code if we ask them, and I can figure out how to hook it up to R.

MansMeg commented 3 years ago

Hi,

I think Julia is super-cool. A downside of including it, though, is that, if I'm not wrong, Julia would become a dependency of the loo package. This might complicate both maintenance for the developers and usability for the large number of current users. Using C++ would not introduce the same complications for users (but might add maintenance burden).

Sorry for being hesitant, but I have had some negative experiences with adding other languages (C++ included) to R packages. Of course, this is totally up to @jgabry, but I agree with his notion that profiling is probably the best next step if speed is the goal.

Although, please correct me if I'm wrong.

rok-cesnovar commented 3 years ago

Does Julia code also compile on CRAN for Windows/Mac binaries like it does for C++ code?

ParadaCarleton commented 3 years ago

Hi,

I think Julia is super-cool. A downside of including it, though, is that, if I'm not wrong, Julia would become a dependency of the loo package. This might complicate both maintenance for the developers and usability for the large number of current users. Using C++ would not introduce the same complications for users (but might add maintenance burden).

Sorry for being hesitant, but I have had some negative experiences with adding other languages (C++ included) to R packages. Of course, this is totally up to @jgabry, but I agree with his notion that profiling is probably the best next step if speed is the goal.

Although, please correct me if I'm wrong.

On the developer end, I think Julia is easier to maintain than C++ (and even R!) because of features like improved readability. On the user end, installing Julia isn't especially difficult (especially compared to installing and troubleshooting C++ compilers). I'm not certain, but it might also be possible to remove even this requirement by using PackageCompiler.jl, which precompiles Julia code ahead of time.

Does Julia code also compile on CRAN for Windows/Mac binaries like it does for C++ code?

Usually, Julia is JIT compiled -- a function is compiled the first time you run it, so that it can be optimized specifically for the types you call it on, letting it run faster. As a result, binaries probably wouldn't be available; however, it might be possible to add them with PackageCompiler.jl.
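
A quick way to see this from R, assuming JuliaCall and Julia are installed (the first call pays the compilation cost; later calls don't):

library(JuliaCall)
julia_setup()

# Define a trivial Julia function; it gets compiled on first use.
julia_command("f(x) = sum(abs2, x)")

x <- rnorm(1e6)
system.time(julia_call("f", x))  # first call: includes JIT compilation
system.time(julia_call("f", x))  # second call: already compiled, much faster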

paul-buerkner commented 3 years ago

I understand and share your excitement about Julia, but this feature would, as far as I understand, require Julia not only for loo itself but also for its various downstream packages (see https://cran.r-project.org/web/packages/loo/index.html). So you would suddenly require tens of thousands of users to install Julia. I agree that C++ compilers are tedious, but CRAN precompiles packages anyway for anything that is not Linux, so there is no actual requirement for a C++ compiler on the user side just because of potential new C++ code in loo.

Of course, this is totally up to @jgabry, but based on my current understanding of the situation (please correct me if I am wrong), I would be opposed to that dependency.

ParadaCarleton commented 3 years ago

I've verified that PackageCompiler.jl would allow us to use Julia code without making Julia a dependency, by precompiling the code into C. Once the code is precompiled, it can be called directly from R with .Call(). Precompiled binaries should be possible, since the Julia code compiles down to C, which CRAN already precompiles.

It's also worth noting that if we don't want to mess around with PackageCompiler.jl and instead want to take advantage of Julia's JIT improvements (which can sometimes optimize code better for specific arguments), this wouldn't add any difficulties for users. We can check whether the user has Julia installed, and install it if not, with two lines of code that can be run when library(loo) is called:

library(JuliaCall)
julia_setup(installJulia = TRUE)

Of course, PackageCompiler.jl is still the preferable option, since it doesn't require any Julia installation and removes the JIT compilation penalty by allowing for precompilation.
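
For concreteness, here's roughly what the JuliaCall route could look like once set up (the file gpdfit.jl and the function name gpd_fit are hypothetical):

library(JuliaCall)
julia_setup(installJulia = TRUE)

# Load a (hypothetical) Julia source file defining gpd_fit(x).
julia_source("gpdfit.jl")

# Call the Julia implementation on an R vector; JuliaCall converts it.
x <- sort(rexp(1000))
fit <- julia_call("gpd_fit", x)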

jgabry commented 3 years ago

I'm feeling a bit under the weather today and will try to respond more tomorrow or the next day, but just wanted to thank everyone for chiming in here. I consider this a decision that all loo package authors should make together, so I certainly don't plan on making any unilateral decision on this.

It does occur to me that one potential compromise here would be to branch on whether the user has Julia installed:

  • if Julia is installed we can use the faster computations (presumably this would just be in a few places, we wouldn't be rewriting the entire codebase)
  • if Julia is not installed we can fall back to using the existing R code

That way we don't force anyone to install Julia but can still provide the benefits to anyone who does want to install it.
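
Schematically, something like this (gpd_fit is a hypothetical Julia-side function; loo::gpdfit is the existing R implementation):

gpdfit_dispatch <- function(x) {
  if (requireNamespace("JuliaCall", quietly = TRUE)) {
    # Faster Julia path, used only when JuliaCall (and Julia) are available.
    JuliaCall::julia_call("gpd_fit", x)
  } else {
    # Fall back to the existing pure-R implementation.
    loo::gpdfit(x)
  }
}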

Thoughts on that option?

johnpauls commented 3 years ago

Thanks for making this great package. I just wanted to mention that, for some folks, employers pretty strictly regulate what can and cannot be installed on work computers. So you might be fine using R but unable to install Julia as a separate standalone program. It would be good if a Julia installation were not required, but rather just optional.

ParadaCarleton commented 3 years ago

I'm feeling a bit under the weather today and will try to respond more tomorrow or the next day, but just wanted to thank everyone for chiming in here. I consider this a decision that all loo package authors should make together, so I certainly don't plan on making any unilateral decision on this.

It does occur to me that one potential compromise here would be to branch on whether the user has Julia installed:

  • if Julia is installed we can use the faster computations (presumably this would just be in a few places, we wouldn't be rewriting the entire codebase)
  • if Julia is not installed we can fall back to using the existing R code

That way we don't force anyone to install Julia but can still provide the benefits to anyone who does want to install it.

Thoughts on that option?

I think this is a great idea if it turns out to be necessary. That being said, it should be possible to compile the Julia code down to C using PackageCompiler.jl and then use .Call(), which would avoid adding any new requirements. This would also save users the time spent waiting for Julia to compile the same code every time they want to use LOO (since Julia is JIT compiled).

I've also asked the people who built the LOO-CV code in the Statistical Rethinking package, and they've said they're good with us using it if we decide we want to.

MansMeg commented 3 years ago

I agree that Jonah's suggestion is a good way forward. If the Julia code can easily be compiled to C code, I think that is probably a good idea for the package on CRAN.

For other users, I think Jonah's dual-path idea is good. It would mean that people working with the dev version of the loo package would not need to have Julia installed. This would also address @johnpauls's comment.

My only worry is the potential maintenance burden of adding Julia. I cannot judge the extra burden, long and short term, that compiling the package before submission would add to the CRAN workflow.

But it would be super-interesting to see how large the performance gain could be!

paul-buerkner commented 3 years ago

Just a quick note that CRAN does not accept precompiled code for safety reasons; everything needs to be compiled by CRAN itself.

Is anyone already using Julia code without Julia in a CRAN package?


paul-buerkner commented 3 years ago

And in response to @jgabry's comment: I agree that, if we add a Julia version of PSIS that turns out to be much faster, we should keep the R version so that Julia code is not required for loo to fully work.

ParadaCarleton commented 3 years ago

It looks like PackageCompiler.jl might not work for our specific use case, so I think having both R and Julia versions is the best way to move ahead with this.

ParadaCarleton commented 3 years ago

I've built the code to fit the Generalized Pareto distribution in Julia; let me know if/how/where you'd like to see it.

jgabry commented 3 years ago

I've built the code to fit the Generalized Pareto distribution in Julia; let me know if/how/where you'd like to see it.

Awesome, thanks! Do you have it on GitHub somewhere?

ParadaCarleton commented 3 years ago

I've built the code to fit the Generalized Pareto distribution in Julia; let me know if/how/where you'd like to see it.

Awesome, thanks! Do you have it on GitHub somewhere?

The code fitting the distribution is here, but I haven't built any code hooking it up to anything else since I'm still trying to figure out the internals of the LOO package.

jgabry commented 3 years ago

The code fitting the distribution is here, but I haven't built any code hooking it up to anything else since I'm still trying to figure out the internals of the LOO package.

Cool, thanks.

Regarding figuring out the internals of the loo package, if you want we can go over that when we have our next call and I can answer any questions you have or just give you a general overview of the internals.

Regarding the Julia implementation for fitting the GPD, if it returns the same result as the R version, then I think the next step would be to do some speed comparisons to see how much of an efficiency improvement there is. One option would be to use the R package microbenchmark, although there are a bunch of other ways to do it too.
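
For example, a minimal microbenchmark sketch (julia_gpdfit() here stands in for whatever wrapper ends up binding the Julia version):

library(microbenchmark)

# Simulated tail sample; both implementations fit a GPD to the same data.
x <- sort(rexp(1000))

microbenchmark(
  r_version     = loo::gpdfit(x),
  julia_version = julia_gpdfit(x),  # hypothetical wrapper for the Julia code
  times = 100
)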

ParadaCarleton commented 3 years ago

The code fitting the distribution is here, but I haven't built any code hooking it up to anything else since I'm still trying to figure out the internals of the LOO package.

Cool, thanks.

Regarding figuring out the internals of the loo package, if you want we can go over that when we have our next call and I can answer any questions you have or just give you a general overview of the internals.

Regarding the Julia implementation for fitting the GPD, if it returns the same result as the R version, then I think the next step would be to do some speed comparisons to see how much of an efficiency improvement there is. One option would be to use the R package microbenchmark, although there are a bunch of other ways to do it too.

Small note -- since I used Julia's quantile function to get the first quartile, the answers will be very slightly different because it uses a different quantile interpolation. (It's the same algorithm as quantile(type = 6) in R, rather than the slightly weird choice of x[floor(n/4 + .5)], which left me scratching my head for a bit.) It shouldn't make a difference in practice, and it makes the code more readable; however, it might trip a couple of test failures if the approximation tolerance in your tests is set very low.
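
Concretely, the two first-quartile conventions side by side in R (type = 6 being the interpolation I'm saying Julia's quantile matches):

x <- sort(rexp(25))
n <- length(x)

x[floor(n / 4 + 0.5)]        # the index-based choice used in loo's gpdfit
quantile(x, 0.25, type = 6)  # the interpolated quartile the Julia code uses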

jgabry commented 3 years ago

That seems fine, I doubt it matters in practice. I forget why we used x[floor(n/4 + .5)]. @avehtari do you remember if we chose to use that instead of using quantile() for a particular reason?

ParadaCarleton commented 3 years ago

That seems fine, I doubt it matters in practice. I forget why we used x[floor(n/4 + .5)]. @avehtari do you remember if we chose to use that instead of using quantile() for a particular reason?

I believe it's because the code from the original paper by Zhang used that interpolation. Amusingly, I've seen other papers by Zhang use similarly unusual continuity corrections, so I'd give even odds on whether he knows about the quantile() function or is just very particular in his preferences about continuity corrections. (Although I'm clearly not one to talk -- I'm apparently quite stubborn about always using the unbiased ECDF, since I chose that option instead of the default mode-based one :p)

jgabry commented 3 years ago

I believe it was since the code from the original paper by Zhang used that interpolation.

Good point. I think that's right.

avehtari commented 3 years ago

I forget why we used x[floor(n/4 + .5)]. @avehtari do you remember if we chose to use that instead of using quantile() for a particular reason?

It's because I implemented Zhang et al.'s algorithm exactly; the only change I made was the vectorization. I would not change that in our code, as there is no speed difference and its meaning is clear to anyone reading Zhang et al. If we changed the code, we would need to mention in the documentation why the results differ and explain in the code why it differs. If the concern is that someone reading just the code and not the paper would not understand that line, then instead of changing the code it would be sufficient to add a comment saying that it gets the first quartile; then there would be no need to change the rest of the documentation.

I've seen other papers by Zhang using similar slightly unusual continuity corrections

I would assume good continuity corrections are different for different distributions, and on quick reflection this one looks reasonable for the GPD. Have you seen other continuity corrections for extreme value distributions?

ParadaCarleton commented 3 years ago

I forget why we used x[floor(n/4 + .5)]. @avehtari do you remember if we chose to use that instead of using quantile() for a particular reason?

It's because I implemented Zhang et al.'s algorithm exactly; the only change I made was the vectorization. I would not change that in our code, as there is no speed difference and its meaning is clear to anyone reading Zhang et al. If we changed the code, we would need to mention in the documentation why the results differ and explain in the code why it differs. If the concern is that someone reading just the code and not the paper would not understand that line, then instead of changing the code it would be sufficient to add a comment saying that it gets the first quartile; then there would be no need to change the rest of the documentation.

I can remove the change if you'd like, but I think this only requires changing one line in the documentation, and it does seem to give more accurate results. I experimented with it, and using quantile() seems to reduce the MSE by about 0.5% when using samples of size 1000 from a GPD(0.5, 1).

I've seen other papers by Zhang using similar slightly unusual continuity corrections

I would assume good continuity corrections are different for different distributions, and on quick reflection this one looks reasonable for the GPD. Have you seen other continuity corrections for extreme value distributions?

It looks reasonable enough; the comment was me poking a little fun at both him and myself for having a specific enough taste in continuity corrections not to just use the default provided by quantile().

ParadaCarleton commented 3 years ago

In any case, having benchmarked the Julia code, a run of gpdfit() in Julia takes about a third as long on average as in R when fitting samples of size 1000 taken from a GPD with a shape parameter of 0.5 and a scale of 1.

[benchmark screenshot]

avehtari commented 3 years ago

but I think this only requires changing one line in the documentation

It's implemented and documented in more than one place, so there is more work.

and it does seem to give more accurate results.

How did you test that? When I compared the current implementation to full Bayesian inference with Stan, I didn't see any meaningful difference.

a run of gpdfit() in Julia takes about a third as long on average as in R when fitting samples of size 1000 taken from a GPD with a shape parameter of 0.5 and a scale of 1.

1000 for gpdfit corresponds to a total sample size of 1.11e5 (loo fits the GPD to the M = ceiling(min(0.2 * S, 3 * sqrt(S))) largest importance ratios, so M = 1000 implies S ≈ 1.11e5), which would be an unusually big sample size for Stan, especially if the data are so big that loo computations are expected to be slow. The default Stan sample size of 4e3 would make gpdfit run on only about 190 draws. What is the timing difference in that case? And how much of the whole loo computation does gpdfit account for?

ParadaCarleton commented 3 years ago

but I think this only requires changing one line in the documentation

It's implemented and documented in more than one place, so there is more work.

Got it; I'll switch it back.

and it does seem to give more accurate results.

How did you test that? When I compared the current implementation to full Bayesian inference with Stan, I didn't see any meaningful difference.

I just ran a loop that generated 2^12 samples, then calculated and compared the mean squared errors from the two quantile interpolations. When I did this, I got a difference in MSE of ~0.5% for the estimates of the shape parameter.
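
Roughly like the following (the GPD sampler is a standard inverse-CDF construction; gpdfit_q() is a hypothetical variant of gpdfit using quantile(type = 6), and note the bug acknowledged in a later comment):

# Inverse-CDF sampler for GPD(k, sigma) with shape k > 0.
rgpd <- function(n, k, sigma) sigma * (runif(n)^(-k) - 1) / k

set.seed(1)
k_true <- 0.5
khat <- replicate(2^12, loo::gpdfit(rgpd(1000, k_true, 1))$k)
mse_floor <- mean((khat - k_true)^2)

# mse_quantile would come from gpdfit_q(), the hypothetical quantile(type = 6)
# variant; the figure above is then (mse_floor - mse_quantile) / mse_floor.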

a run of gpdfit() in Julia takes about a third as long on average as in R when fitting samples of size 1000 taken from a GPD with a shape parameter of 0.5 and a scale of 1.

1000 for gpdfit corresponds to a total sample size of 1.11e5, which would be an unusually big sample size for Stan, especially if the data are so big that loo computations are expected to be slow. The default Stan sample size of 4e3 would make gpdfit run on only about 190 draws. What is the timing difference in that case? And how much of the whole loo computation does gpdfit account for?

If I recall correctly, gpdfit takes up about half the total time used by LOO, but I didn't benchmark that personally -- Jonah showed me a profile run of a LOO fit. Sample size doesn't make much of a difference, with Julia taking about a quarter of the time R takes when using samples of size 64 or 256:

[benchmark screenshot]

ParadaCarleton commented 3 years ago

Whoops -- there was a small error in the code that found a 0.5% improvement in MSE; fixing that bug makes the two methods return essentially identical results.

ParadaCarleton commented 3 years ago

Some additional small improvements to the code have gotten it down to taking about 14% as long as R. The two methods now return identical results. I think this is about as far as I can push the code -- any additional improvements are likely to be minor.

jgabry commented 3 years ago

gotten it down to taking about 14% as long as R

Thanks for doing the benchmarking. That's a pretty huge difference, and it's good to know that it persists even with small sample sizes. I think the next step is to figure out how much of a speedup this gives us when running the full loo(). I think it will be substantial, because gpdfit is one of the slower parts of the code, but if we're going to potentially add this to the package it would be good to know (and if we do add it, it will be good to have numbers to use when announcing the faster version).

jgabry commented 3 years ago

I think the next step is now to figure out how much of a speedup this gives us when running the full loo().

To be clear, by this I mean swapping out the gpdfit for the julia version but keeping the rest of the current loo() code, not rewriting all of loo() in julia.

ParadaCarleton commented 3 years ago

I think the next step is now to figure out how much of a speedup this gives us when running the full loo().

To be clear, by this I mean swapping out the gpdfit for the julia version but keeping the rest of the current loo() code, not rewriting all of loo() in julia.

Looks like it takes about 33% less time. That said, I think there are a lot of possible speedups left from doing more of this in Julia. At the moment there's a lot of overhead from calling a Julia function for each data point rather than making a single Julia call, and from copying arrays over to Julia. I'm looking into whether it's possible to avoid this by using Arrow rather than copying arrays each time a function is called; even if not, copying a single array once would still be a pretty big improvement.
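
Schematically, the difference between the two call patterns (psis_smooth_col and psis_smooth_all are hypothetical Julia-side functions):

log_ratios <- matrix(rnorm(4000 * 500), nrow = 4000, ncol = 500)

# Current pattern: one Julia call (and one array copy) per observation.
w <- log_ratios
for (j in seq_len(ncol(log_ratios))) {
  w[, j] <- JuliaCall::julia_call("psis_smooth_col", log_ratios[, j])
}

# Cheaper pattern: copy the whole matrix to Julia once, make a single call.
w <- JuliaCall::julia_call("psis_smooth_all", log_ratios)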

jgabry commented 3 years ago

Looks like it takes about 33% less time.

Nice, that's pretty good!

How does everyone feel about adding the option to use Julia if the speedup is about 33%? Is that enough to warrant maintaining the extra code going forward? It also sounds like @ParadaCarleton has some ideas for getting an even larger speedup, but we can start thinking about the minimum speedup for which we'd be willing to add the Julia dependency and the additional code maintenance.

paul-buerkner commented 3 years ago

Nice! I would say how much speedup is required depends on how much code is added and how difficult it is to maintain. Perhaps we need to see a PR for this. It doesn't need to be polished; just give us a rough idea of how much is changing.

ParadaCarleton commented 3 years ago

I think the next step is now to figure out how much of a speedup this gives us when running the full loo().

To be clear, by this I mean swapping out the gpdfit for the julia version but keeping the rest of the current loo() code, not rewriting all of loo() in julia.

I did end up rewriting all of loo() in Julia -- though mostly not for the speed improvements to the R version (although those are a big bonus), but because there's no PSIS-LOO package in Julia at the moment.

Nice! I would say how much speedup is required depends on how much code is added and how difficult it is to maintain. Perhaps we need to see a PR for this. It doesn't need to be polished; just give us a rough idea of how much is changing.

So, it depends on what you mean. On the R side, I'll put up a PR in a couple of days (once the package has been registered) that will add something like 5 or 6 lines of code. On the Julia side, essentially all the code for the loo() function has been rewritten, and I'll be providing it as a new package for PSIS and LOO in Julia. I'll handle maintenance for that package, and if needed I'm sure I can find someone on the Turing or ArviZ.jl teams willing to step in.

jgabry commented 3 years ago

I did end up rewriting all of loo() in Julia -- though mostly not for the speed improvements to the R version (although those are a big bonus), but because there's no PSIS-LOO package in Julia at the moment.

Oh, that's awesome. It will be great to have this available as a Julia package too!

So, it depends on what you mean. On the R side, I'll put up a PR in a couple of days (once the package has been registered) that will add something like 5 or 6 lines of code.

Sounds great, thanks Carlos!

ParadaCarleton commented 3 years ago

An update -- I tested this, and I don't believe it'll work. The main problem is that startup times are very large (almost a full minute on a very good laptop when loading everything into R), and PSIS-LOO is too fast for this to be worth it, even if an iteration of PSIS-LOO is three times as fast in the Julia code. The good news is that none of the work I've put into this over the past month has gone to waste -- I've built ParetoSmooth.jl, a Julia package implementing most of the major features of the loo package. This has created some pretty major performance improvements for people working in Julia. It's fully compatible with the Turing PPL and should be fully compatible with the Stan interfaces to Julia by the end of the week. I'd encourage anybody who still wants to use this from R to take a look at JuliaCall, which would let you call ParetoSmooth.jl from R.
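
A hedged sketch of that route (psis_loo is a real ParetoSmooth.jl function, but the expected dimension ordering of the log-likelihood array is an assumption -- check the package docs):

library(JuliaCall)
julia_setup()
julia_install_package_if_needed("ParetoSmooth")
julia_library("ParetoSmooth")

# Simulated log-likelihood array; verify the expected
# (observations x draws x chains) layout against ParetoSmooth.jl's docs.
log_lik <- array(rnorm(100 * 1000 * 4), dim = c(100, 1000, 4))
res <- julia_call("psis_loo", log_lik)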

jgabry commented 3 years ago

Ok no worries, I'm glad to hear that the work wasn't a waste. It will be great to have ParetoSmooth.jl for Julia users!


ParadaCarleton commented 3 years ago

Ok no worries, I'm glad to hear that the work wasn't a waste. It will be great to have ParetoSmooth.jl for Julia users!

Yup, and I'm even trying to add some new features that aren't in loo, like first-class support for the Bayesian bootstrap as a model comparison tool. I'm trying to improve bias problems by testing out the double bootstrap, or drawing from a KDE instead of the observed sample itself. I haven't seen either of these in the literature, but they're not particularly complicated extensions. I'm also noticing this will definitely require split-moment matching, since my current implementation seems to produce a lot of bad Pareto k values. Which is a good thing, I suppose: the point of the BB instead of LOO is to get samples that are more unusually different from the actual sample, rather than points differing from the actual sample by at most one data point.

Mentioning @avehtari in case he has any comments, since I think he said something about this in another thread. I'm especially interested in good density estimation, which I know very little about at the moment. I'm thinking I'll probably use a KDE with unbiased variance, like my old stats professor taught me -- add normally-distributed noise equal to σ/sqrt(n), where n is the sample size -- but I'd definitely be interested in anything that maintains a mostly unbiased entropy instead, since I think entropy is the more relevant quantity (I want samples from my KDE to be about as predictable as samples from the real distribution). I don't really know how I'd do this, though, and I know unbiased estimation of the entropy isn't actually possible.

avehtari commented 3 years ago

I think the discussion about new features should happen somewhere other than this issue. If it's specific to a Julia package, you can invite me to discuss in a forum related to that package. If it's about algorithms in general, Stan Discourse or Zulip are fine for me. Based on the above, it's not completely clear what you are doing. @topipa did BB of data with split-moment matching, but BB of data didn't provide anything useful compared to other diagnostics. I would not use a KDE, as it adds smoothing prior information that can cause non-negligible bias, but maybe I misunderstood what you are trying to do.

ParadaCarleton commented 3 years ago

Yep, makes sense, I'll close this.

ParadaCarleton commented 1 year ago

Just a brief comment -- with native code caching in Julia 1.9, this should be viable now if someone else wants to give it a shot. I'm pretty busy, so I probably won't be able to help, but startup times should be much faster (just a few seconds rather than the several minutes it used to take for a package to load).