roualdes / bridgestan

BridgeStan provides efficient in-memory access through Python, Julia, and R to the methods of a Stan model.
https://roualdes.github.io/bridgestan
BSD 3-Clause "New" or "Revised" License

Minor edits to JOSS paper and adding self as coauthor #158

Closed sethaxen closed 1 year ago

sethaxen commented 1 year ago

This PR makes a number of minor corrections to grammar, punctuation, and a handful of phrasings.

As suggested by @roualdes, I also added myself as a coauthor and listed my affiliation and funding.

sethaxen commented 1 year ago

A minor nitpick is that I don't completely agree with this statement:

Existing tools with similar automatic differentiation functionality
include `JAX` [@Bradbury:2018] and `Turing.jl` via the `JuliaAD`
ecosystem [@Ge:2018].  `BridgeStan` differs from these tools by
providing access to the existing, well-known DSL for modeling and
highly efficient CPU computation of the Stan ecosystem.  The Stan
community predominantly uses CPU hardware, and since Stan has been
tuned for CPU performance, `BridgeStan` is more efficient than its
competitors in implementing differentiable log densities on CPUs
[@Carpenter:2015; @Radul:2020; @Tarek:2020].

I don't know that Stan outperforms other ADs because it's optimized for CPU hardware. Similar optimizations have I'm sure been undertaken for other ADs. I've always thought that Stan outperformed other PPLs in AD because it is as far as I know the only PPL that has its own dedicated AD. That combined with a minimal library of functions and a small set of applications (specifically, HMC, ADVI, and L-BFGS) means it can be fine-tuned to be very efficient at these tasks. Most PPLs are built on ML stacks, where the ADs were fine-tuned for ML applications, and the PPL just has to take what it gets. In Julia, ADs are generally designed to work well on both GPU and CPU but make different trade-offs. Because the ADs try to support huge subsets of the language, they are in general not co-optimized with the PPL, though this is certainly possible.

roualdes commented 1 year ago

It sounds like you only really disagree with the last sentence, yeah? There probably is better language out there, and I'm open to rephrasings if you'd suggest one.

minimal library of functions

That's a bit of a stretch, in my opinion, since JAX doesn't play nice with for-loops (although it can be convinced otherwise), and Turing.jl doesn't fully play nice with in-place mutations (although it, too, can be convinced otherwise). Stan's AD happily deals with for-loops and, I believe, mutations, which extends the set of functions it can be used for.

On the other hand, Stan doesn't play nice with changing data across gradient evaluations. So these are certainly design choices.

small set of applications

BridgeStan is a good attempt at helping with this!

bob-carpenter commented 1 year ago

I don't know that Stan outperforms other ADs because it's optimized for CPU hardware.

Maybe a better way to say this is that Stan is heavily tuned for CPU hardware, whereas other autodiff systems like PyTorch, TensorFlow, and JAX have concentrated on optimizing for GPU hardware. Specifically, they have a huge amount of overhead for doing simple things compared to what we do, overhead which pays off when things scale up to GPUs and out to multiple cores.

Similar optimizations have I'm sure been undertaken for other ADs.

Maybe JAX. I don't know much about Julia's many autodiff systems. You can see our evaluation paper, which shows the other C++ autodiff systems are pretty poorly optimized. Google put out a report a couple of years ago where Stan just crushed TensorFlow on CPU, and vice versa for GPU.

I've always thought that Stan outperformed other PPLs in AD because it is as far as I know the only PPL that has its own dedicated AD.

We're not exploiting the connection in any way. We literally just call the top-level AD functional. We even compile our inference routines in separate translation units.
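For concreteness, here's a rough standalone sketch of what "just call the top-level AD functional" looks like against Stan Math's `stan::math::gradient`. The toy log density functor is mine, purely for illustration; it is not code from Stan's inference routines or from BridgeStan.

```cpp
// Standalone sketch: an algorithm only needs the top-level functional.
// (Toy example; not Stan's or BridgeStan's actual code.)
#include <stan/math.hpp>
#include <Eigen/Dense>
#include <iostream>

// Any functor mapping an Eigen vector of scalars T to a scalar T will do;
// a Stan model's log density is exposed through an interface of this shape.
struct toy_lp {
  template <typename T>
  T operator()(const Eigen::Matrix<T, Eigen::Dynamic, 1>& theta) const {
    return stan::math::normal_lpdf(theta(0), 0.0, 1.0);
  }
};

int main() {
  Eigen::VectorXd theta(1);
  theta << 1.5;
  double lp;
  Eigen::VectorXd grad;
  // The top-level reverse-mode functional: value and gradient in one call,
  // with no knowledge of how the log density was produced.
  stan::math::gradient(toy_lp{}, theta, lp, grad);
  std::cout << "lp = " << lp << ", dlp/dtheta = " << grad(0) << "\n";
}
```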

That combined with a minimal library of functions and a small set of applications (specifically, HMC, ADVI, and L-BFGS) means it can be fine-tuned to be very efficient at these tasks.

I would say we have a more maximal set of functions compared to the big ML autodiff systems. As I say, I don't know much about Julia, but their devs are always saying they have a massive autodiff library. In particular, I don't know how to figure out which Julia functions would work in a Julia autodiff system like Zygote and which wouldn't. Or which would be efficient.

Most PPLs are built on ML stacks, where the ADs were fine-tuned for ML applications, and the PPL just has to take what it gets.

This is what I'm saying. The fine-tuning for ML applications is tuning for massive matrix operations on the GPU. TensorFlow and PyTorch are not so good at ML on the CPU. I'm less clear on how far JAX can be pushed on CPU.

In Julia, ADs are generally designed to work well on both GPU and CPU but make different trade-offs. Because the ADs try to support huge subsets of the language, they are in general not co-optimized with the PPL, though this is certainly possible.

Stan's autodiff is not co-optimized with the language or the inference algorithms in any meaningful way. To repeat myself, we compile them in separate translation units.

I think part of our optimization is that we are good at doing things like dropping constant terms, which we do with a lot of template traits programming.

I haven't done a thorough comparison, but @roualdes originally built BridgeStan because Julia's AD was so slow compared to Stan's. At least in the way we're using it.

We also have a disadvantage against many of the systems that run in single precision, but I'm assuming Julia's mainly running in double precision, too.

WardBrian commented 1 year ago

Similar optimizations have I'm sure been undertaken for other ADs

I don't know of any other system that has done the kinds of optimizations we have done for matrix functions with respect to memory/cache locality. Granted, these kinds of optimizations are only available to us because we have much finer control over the language implementation, so your point is definitely also correct. I believe they're also fairly specific to the kind of dynamic, tape-based AD Stan does, so they would not be available to, e.g., JAX.

Is there an easy line edit to acknowledge this? I'm thinking something like

The Stan community predominantly uses CPU hardware, and since Stan's automatic differentiation and language have been co-optimized for CPU performance, BridgeStan is more efficient than its competitors in implementing differentiable log densities on CPUs

But that might be a tad clunky.

bob-carpenter commented 1 year ago

This isn't worth getting hung up on, but I have a bunch of issues with this turn in language. Feel free to ignore this and say whatever. I personally tend to disregard authors' comments on their own system's performance.

I will stand by my statement that the big difference in performance is that PyTorch and TensorFlow were designed to optimize large-scale GPU operations, not ad-hoc CPU calculations. One way to show this, if you really care, would be to profile CPU vs. GPU calculations in Stan Math, TensorFlow, and PyTorch independently of a PPL.

The cleverness in Stan of turning off constant calculations has nothing to do with our autodiff. It's turned off with traits-based metaprogram branching in the code generated by our transpiler. It wouldn't matter what autodiff system we used; we'd still get the advantage of dropping constant terms.
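To make the mechanism concrete, here's a toy sketch under my own naming; it is not Stan's transpiler output or the real `normal_lpdf`. The proportionality flag is a template parameter, so the branch that skips constant terms is resolved by the compiler rather than at run time (all arguments here are plain doubles, standing in for data).

```cpp
// Toy sketch of template-parameter branching; not Stan's generated code or
// its real normal_lpdf. The Propto flag is known at compile time, so the
// "drop constants" decision costs nothing at run time.
#include <cmath>
#include <iostream>

template <bool Propto>
double normal_lpdf_sketch(double y, double mu, double sigma) {
  constexpr double pi = 3.14159265358979323846;
  const double z = (y - mu) / sigma;
  double lp = -0.5 * z * z;
  if constexpr (!Propto) {
    // Normalizing terms: only compiled in when the full density is requested.
    lp += -std::log(sigma) - 0.5 * std::log(2.0 * pi);
  }
  return lp;
}

int main() {
  // Proportional-to version: constant terms dropped entirely.
  std::cout << normal_lpdf_sketch<true>(1.0, 0.0, 2.0) << "\n";
  // Fully normalized version.
  std::cout << normal_lpdf_sketch<false>(1.0, 0.0, 2.0) << "\n";
}
```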

Another way of seeing this is that if we plugged in a CPU-friendly autodiff system like Adept or Sacado, it'd still beat TensorFlow on CPU (I have no idea how well PyTorch performs on CPU, but I imagine that, like TensorFlow, it wasn't really optimized there).

Also, I don't see that we're doing any more "co-optimization" with Stan Math than a project like PyMC is doing with their fork of Theano, or than ADMB/TMB does with CppAD. Both of those projects also control their autodiff packages.

Finally, I don't see that our var-mat is anything special for performance, given that it's the standard way a system like JAX represents matrix autodiff. You might want to say Stan's a bit more expressive because we also allow mat-var types, but var-mat is what gives you performance. I'm also not sure we're going to be any faster than PyMC with a JAX back end; I just can't figure out the PyMC docs well enough to tell what can run with what (they have several competing back ends now, much like the Julia offerings) or what I have to install to try everything.
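For anyone who wants to see the distinction concretely, here's a rough standalone sketch written against my reading of Stan Math's public types and its var-mat-capable overloads of `multiply` and `sum` (nothing BridgeStan-specific):

```cpp
// Rough sketch of the two matrix representations being contrasted, using
// Stan Math's public types; a toy example, not library or BridgeStan code.
#include <stan/math.hpp>
#include <Eigen/Dense>
#include <iostream>

int main() {
  Eigen::MatrixXd a = Eigen::MatrixXd::Random(3, 3);

  // "mat-var": a matrix whose entries are individual autodiff variables
  // (array-of-structs layout); shown here only for the type contrast.
  Eigen::Matrix<stan::math::var, Eigen::Dynamic, Eigen::Dynamic> mv
      = stan::math::to_var(a);
  std::cout << "mat-var entry value: " << mv(0, 0).val() << "\n";

  // "var-mat": one autodiff variable whose value and adjoint are each a
  // contiguous double matrix (struct-of-arrays layout).
  stan::math::var_value<Eigen::MatrixXd> vm(a);

  // The same math functions accept either representation.
  stan::math::var lp = stan::math::sum(stan::math::multiply(vm, vm));
  lp.grad();  // reverse pass
  std::cout << "adjoint of the whole matrix:\n" << vm.adj() << "\n";

  stan::math::recover_memory();
}
```

The performance point is the layout: var-mat keeps one dense value matrix and one dense adjoint matrix, while mat-var scatters a value/adjoint pair per entry across the autodiff stack.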

WardBrian commented 1 year ago

I’d also be happy to just leave the current language; the JOSS reviewers seemed happy with it as is.

aseyboldt commented 1 year ago

The Stan community predominantly uses CPU hardware, and since Stan has been tuned for CPU performance, BridgeStan is more efficient than its competitors in implementing differentiable log densities on CPUs [@Carpenter:2015; @Radul:2020; @Tarek:2020].

That sentence also raised my eyebrows a little. Overall, my impression from what I've seen where PyMC and Stan are concerned is that the variance between models, their implementations, and BLAS settings is much higher than the overall differences between the libraries. If I wanted to come up with benchmarks showing that either Stan or PyMC is faster, I don't think I'd have a hard time doing so for either one (although long term I think it might be hard to keep up with rewrites in MLIR, even on the CPU)...

But the overall point of CPU vs. GPU makes a lot of sense to me. Models that can take advantage of GPUs certainly exist as things currently stand, but I don't think they are anywhere near the majority, and I don't mind a bit of pride in the optimizations Stan has either. :-)

Also, as suggested by @roualdes, I made a commit adding myself as an author (either someone can pull it, or I can make a PR once this one is merged: https://github.com/aseyboldt/bridgestan/tree/joss_edits). I also added Rust to the list of language interfaces in a few places.

roualdes commented 1 year ago

I’d also be happy to just leave the current language; the JOSS reviewers seemed happy with it as is.

I too am in favor of just leaving the current language. So if there are no objections by the end of the day Eastern time, I'm moving on with the current language.

sethaxen commented 1 year ago

I'm traveling and will not be able to read or reply to all comments quickly. As I said, this is to me a minor issue. Since all other authors and the reviewers have approved the language, and there seem to be time constraints, I'm okay with leaving it as is.

WardBrian commented 1 year ago

@aseyboldt thanks for your branch. I think you're missing an affiliation for yourself. Also, I'm an Oxford-comma absolutist, and a few of the places where you added Rust to a list of languages ended up dropping those commas.

WardBrian commented 1 year ago

I think we're all happy with @sethaxen's edits here, so I'm going to merge this.