stan-dev / stan


General performance improvements discussion #3062

Open blackwer opened 3 years ago

blackwer commented 3 years ago

Summary:

At the suggestion of @bob-carpenter, I'm opening this as a discussion thread for how to improve the performance of Stan, both generally and on specific platforms. These improvements range from simple compile-time and run-time tricks the user can perform in certain environments, which can be addressed with documentation and basic automation, to deep overhauls of the internals to exploit modern processor architectures. While my experience with the code is currently minimal, Stan seems to already be at a very good performance baseline, but there is likely room for improvement (better exploiting vectorization and improving special-function performance, for starters).

Description:

Stan is great, and I want to help make it better. This issue is a place to discuss how best to do so, and to provide various benchmarks on common systems I have access to, to see where performance is lagging. The main machines I have readily available for trials are:

  * AMD Ryzen 5800X (Ubuntu 21.04, Windows 10)
  * 2X AMD EPYC 7742 64-Core Processor (CentOS 7)
  * 2X Intel(R) Xeon(R) Gold 6128 CPU (CentOS 7)
  * 2X Intel(R) Xeon(R) CPU E5-2680 v4 (CentOS 7)
  * 2X Intel(R) Xeon(R) Gold 6148 CPU (CentOS 7)
  * Fujitsu A64FX (CentOS 8)
  * ThunderX2 99xx (CentOS 8)
  * Intel(R) Core(TM) i9-10885H CPU (Ubuntu 21.04)
  * Intel(R) Core(TM) i7-6920HQ (Big Sur)
  * Apple M1 (Big Sur)

Benchmarking tools:

I've been using two programs to do some simple benchmarks: a simple cmdstan benchmark, and a comparison of amdlibm against libm (for the latter, you need to extract amd-libm somewhere and add the library to the library path, as in the example script).
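For reference, here is a minimal sketch of the kind of micro-benchmark involved (hypothetical code, not the actual harness): time many calls to one math function, built once against the stock libm and once with -lalm -lm so libalm's implementations are picked up first.

#include <chrono>
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
  const int n = 10000000;
  std::vector<double> x(n), y(n);
  for (int i = 0; i < n; ++i)
    x[i] = 1.0 + i * 1e-6;  // keep inputs in the function's domain
  const auto t0 = std::chrono::steady_clock::now();
  for (int i = 0; i < n; ++i)
    y[i] = std::log(x[i]);
  const auto t1 = std::chrono::steady_clock::now();
  const double secs = std::chrono::duration<double>(t1 - t0).count();
  // Print a checksum so the compiler can't dead-code-eliminate the loop.
  std::printf("log: %.4f s (checksum %.3f)\n", secs, y[n / 2]);
  return 0;
}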

Results

This is currently mostly a qualitative stub, with more results to come later. I'm only using cmdstan so far. On machines running CentOS 7, glibc is painfully out of date (2.17). From a practical standpoint, this means that libm (log, exp, etc.) is also incredibly out of date. AMD provides a performance-tuned version of libm which, honestly, is approximately the speed of current glibc implementations (given its permissive license, the code might have been absorbed into glibc). However, it absolutely crushes old libc versions.

./test_libm.sh 10
Model name:            Intel(R) Xeon(R) Gold 6128 CPU @ 3.40GHz
GNU libc version: 2.17
GCC version 7.4

func    libm.so libalm.so
log 0.3302  0.0555
exp 0.1533  0.0503
log1p   0.1225  0.0994
cos 0.1308  0.0613
sin 0.1379  0.0570
acos    0.1903  0.2063
asin    0.1731  0.2296
atan    0.2069  0.1938
exp10   0.2888  0.0562
exp2    0.1047  0.0518
log10   0.3369  0.0671
log2    0.0930  0.0605
cosh    0.1750  0.2473
sinh    0.1720  0.2661
tanh    0.1916  0.1154
expm1   0.1063  0.0997

Other notes:

  1. The code spends a non-negligible amount of time pushing things onto the vari_base* stack. This improves with more modern compilers, but is still probably ~5% of execution time on this example for the x86_64 machines. It could be better or worse; it's kinda painstaking to isolate because the stack is modified everywhere.
  2. Any BLAS implementation can be swapped in. This can be useful for conda packages, since I doubt you want to auto-pull and build openblas/blis/mkl/whatever.
  3. There are likely techniques for more rapid evaluation of heavy 1-dimensional functions, which can really hurt performance on certain problems. I have one such utility for this, but it works on a limited domain/range; there are still probably very good ways to re-tool it to handle these problems (see the sketch after this list).
  4. apply_scalar_unary can probably be vectorized for many functions. Vector instructions are currently under-utilized in my limited profiling, but I will investigate this more.
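To illustrate the kind of technique note 3 refers to, here is a hedged sketch (not the author's utility, whose details aren't given): a precomputed lookup table with linear interpolation for a heavy 1-D function on a fixed domain [a, b].

#include <cmath>
#include <vector>

// Precompute f on an even grid over [a, b]; evaluate by linear interpolation.
struct interp_table {
  double a, b, inv_h;
  std::vector<double> y;

  template <typename F>
  interp_table(F f, double a_, double b_, int n) : a(a_), b(b_), y(n + 1) {
    const double h = (b - a) / n;
    inv_h = 1.0 / h;
    for (int i = 0; i <= n; ++i) y[i] = f(a + i * h);
  }

  // Only valid for x in [a, b]; accuracy depends on n and f's smoothness.
  double operator()(double x) const {
    double t = (x - a) * inv_h;
    int i = static_cast<int>(t);
    if (i < 0) i = 0;
    const int last = static_cast<int>(y.size()) - 2;
    if (i > last) i = last;
    const double w = t - i;
    return (1.0 - w) * y[i] + w * y[i + 1];
  }
};

For example, interp_table tbl([](double v) { return std::lgamma(v); }, 1.0, 10.0, 4096); makes tbl(x) a cheap approximation of lgamma on [1, 10], which is exactly the limited domain/range trade-off mentioned above.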

Current suggestions

It might be worth detecting the libc implementation (when appropriate) and suggesting the AMD implementation be loaded. This could also be pulled in directly, given their permissive license. The make/local I'm using in cmdstan contains

LDFLAGS+= -L/mnt/home/rblackwell/tmp/amd-libm/lib -Wl,-rpath,"/mnt/home/rblackwell/tmp/amd-libm/lib" -lalm -lm

to enable this feature, but the hard-coded path could be changed to $(PWD) or whatever the make equivalent is ($(CURDIR), in GNU make).
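As one hedged sketch of the detection idea (glibc-specific, and the version threshold below is purely illustrative): glibc reports its own version at runtime via gnu_get_libc_version() from <gnu/libc-version.h>, so a tiny configure-time probe could decide whether to suggest or link libalm.

#include <cstdio>
#include <gnu/libc-version.h>

int main() {
  // Prints e.g. "2.17" on CentOS 7.
  int major = 0, minor = 0;
  std::sscanf(gnu_get_libc_version(), "%d.%d", &major, &minor);
  std::printf("glibc %d.%d\n", major, minor);
  // Exit 0 if libm is "new enough", 1 otherwise; a makefile could branch on this.
  return (major > 2 || (major == 2 && minor >= 28)) ? 0 : 1;
}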

On the same machine/OS, I got significant performance bumps from using a more modern compiler (gcc10 vs gcc7) and from swapping the Eigen backend out for MKL, both of which were relatively trivial. The combined effect of these three changes (newer compiler, MKL, and libalm) cut execution time down from ~75s to ~45s, and it was all very low-hanging fruit. I didn't see any obvious mention of this in the documentation. The same modifications on my modern Ryzen system didn't do much of anything, though (I didn't try an older compiler, and MKL and libalm didn't help noticeably); the code executed in about 28-30s, depending on configuration.

Current Version:

v2.27.0

syclik commented 3 years ago

Awesome! This is great and it all makes sense (using updated versions of libraries, especially those tuned to the system it's running on).

Where can we help with the discussion? I saw the current suggestion, but I'm not sure what could actually be configured as $(PWD) (or the make equivalent).

Maybe it'd help to know exactly what things you're suggesting?

One note: we've tested MKL in the past. It turns out that the implementation plays loose with IEEE floating-point rules, and although the wall time is faster, we found that:

  1. a lot of our tests fail due to numeric issues
  2. due to the numerical error, the effective sample size per unit time ended up being lower than with the current implementation, even though the per-iteration time was faster.

I'm wary of things that are faster but produce different results. It requires a bit more effort to sort through what that means and actually assess whether it's faster.

blackwer commented 3 years ago

Where can we help with the discussion? I saw the current suggestion, but I'm not sure what could actually be configured as $(PWD) (or the make equivalent).

Maybe it'd help to know exactly what things you're suggesting?

I made an example. Unfortunately I had to directly include the binary libalm.so object, since they don't do GitHub releases from their source and they don't provide a direct download. https://github.com/blackwer/cmdstan/commit/603cdd48d89f9764dee6bfeb433a60da18af01ef

This checks, when running Linux, whether glibc is older than some threshold, and automatically links in libalm.so if it is. There's lots of room for improvement, so it's more a proof of concept. For example, I wrote that simple benchmarking program, and a decision could just be made based on its output. I'm not sure how much you want to automate this kind of stuff in the build process, since it adds complexity and sometimes fragility. Simply detecting that things are slow, warning the user, and providing a link to a simple remedy might be reasonable. There are also ways to directly hot-swap certain calls in the C++ code itself, rather than doing anything at build time, though this is considerably more complicated.
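For the hot-swap route, a hedged sketch of one possible approach (this is not what the linked commit does, and the amd_log symbol name is an assumption about AMD's library): resolve an alternate implementation at runtime with dlopen/dlsym and fall back to the standard one.

#include <cmath>
#include <dlfcn.h>  // link with -ldl

using log_fn = double (*)(double);

static log_fn resolve_log() {
  // Try AMD's libm first; the amd_log symbol name is assumed here.
  if (void* handle = dlopen("libalm.so", RTLD_NOW | RTLD_LOCAL)) {
    if (void* sym = dlsym(handle, "amd_log")) {
      return reinterpret_cast<log_fn>(sym);
    }
  }
  // Fall back to the system libm.
  return [](double x) { return std::log(x); };
}

// Resolved once at startup; call sites use fast_log(x) instead of std::log(x).
static const log_fn fast_log = resolve_log();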

One note: we've tested MKL in the past. It turns out that the implementation plays loose with IEEE floating-point rules, and although the wall time is faster, we found that:

1. a lot of our tests fail due to numeric issues

I've had this experience with dgemm et al. on other projects. I never verified, but I thought it had more to do with operation ordering. My answers were always different when comparing Eigen to MKL, but not necessarily more or less right. That said...

2. due to the numerical error, the effective sample size per unit time ended up being lower than with the current implementation, even though the per-iteration time was faster.

That's interesting. I should investigate this further for other projects. This isn't really a problem for me, but I should try to figure out once and for all why they give different results.

I'm wary of things that are faster but produce different results. It requires a bit more effort to sort through what that means and actually assess whether it's faster.

Agree completely. Did you try other BLAS implementations to see? They likely also had different results, so it probably falls in the same bucket, but might be worth a gander.

SteveBronder commented 3 years ago

Thanks @blackwer!! A couple of little notes:

The code spends a non-negligible amount of time pushing things onto the vari_base* stack. This improves with more modern compilers, but is still probably ~5% of execution time on this example for the x86_64 machines. It could be better or worse; it's kinda painstaking to isolate because the stack is modified everywhere.

Yes, this has vexed us for a while. The reverse-mode stack is a std::vector<vari_base*>; each vari_base is a struct holding pointers (or smart-pointer-like objects) to the data needed for a reverse-pass calculation, plus a method for executing that calculation. The blog post here goes over this in more detail. Every time we use a var type in an operation, we place a type inheriting from vari_base onto that stack. We call the chain() method for each reverse-pass callback via a virtual function scheme, which feels suboptimal, but every time we've tried rewriting this we haven't found a way that's faster. vari_base is also the abstract base class of the vari that sits inside Stan's scalar type, var (our scalar type follows a PIMPL/smart-pointer form). The below is a pseudo-code example of the structure.

// abstract base class for everything on the reverse-mode stack
struct vari_base {
  virtual void chain() = 0;
  virtual ~vari_base() = default;
};

// internal impl: value and adjoint (plus the implicit vtable ptr for chain())
struct vari : vari_base {
  double val_;
  double adj_;
  void chain() override {}
};

// smart-pointer-like class; this is Stan's scalar type
struct var {
  vari* vi_;
};
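To make the virtual-dispatch cost concrete, here is a hedged sketch of the reverse-pass sweep (simplified; the names differ in the actual Stan Math implementation), using the vari_base above:

#include <vector>

// Walk the stack in reverse order, virtually dispatching chain() once per
// recorded operation -- this per-node indirect call is the overhead above.
inline void grad(std::vector<vari_base*>& stack) {
  for (auto it = stack.rbegin(); it != stack.rend(); ++it) {
    (*it)->chain();
  }
}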

There are likely techniques for more rapid evaluation of heavy 1-dimensional functions, which can really hurt performance on certain problems. I have one such utility for this, but it works on a limited domain/range; there are still probably very good ways to re-tool it to handle these problems.

apply_scalar_unary can probably be vectorized for many functions.

So these two sort of go hand in hand. apply_scalar_unary is much more for convenience than performance. It lets us write far fewer functions that work on objects that are containers of containers, like std::vector<std::vector<Eigen::Matrix<var, -1, -1>>>. It would be better if we had reverse-mode specializations for all unary functions that take in Eigen matrices. That would certainly help out with memory, but then we hit your next point.
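For illustration, a hedged sketch of the recursion idea behind apply_scalar_unary (heavily simplified; the real implementation is trait-based and also handles Eigen types): recurse through the container structure and apply the scalar function at the leaves.

#include <cmath>
#include <utility>
#include <vector>

// Base case: apply the functor to a single scalar.
template <typename F>
double apply_unary(const F& f, double x) {
  return f(x);
}

// Recursive case: map over any std::vector, however deeply nested.
template <typename F, typename T>
auto apply_unary(const F& f, const std::vector<T>& xs) {
  std::vector<decltype(apply_unary(f, std::declval<T>()))> out;
  out.reserve(xs.size());
  for (const auto& x : xs) out.push_back(apply_unary(f, x));
  return out;
}

So apply_unary([](double v) { return std::exp(v); }, xs) works whether xs is a std::vector<double> or a std::vector<std::vector<double>>; in Stan the leaf scalar is var, which is where the vectorization limits below come in.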

Vector instructions are currently under-utilized in my limited profiling, but I will investigate this more.

This is where all of the above culminates in not being able to get vectorized instructions. Since our base var scalar type is not POD, Eigen (and any autovectorizer in a compiler) will not be able to auto-generate vectorized instructions. Our approach instead has been to add a template to vari that enables the value and adjoint to be Eigen matrices/arrays instead of single scalars, as in the pseudocode below. An easy way to think of this is that normally we take an Array-of-Structs approach to matrices, but in the below we are taking a Struct-of-Arrays approach.

struct vari {
  // implicit vtable pointer to the chain() method
  Eigen::MatrixXd val_;  // values for the whole matrix
  Eigen::MatrixXd adj_;  // adjoints for the whole matrix
};

So then, when we do the reverse pass for matrix operations and access the values and adjoints of the matrix, we can actually get vectorized instructions.
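As a hedged sketch of what that buys (a hypothetical node type; the real Stan Math classes differ), here is a reverse-pass node for elementwise exp over a whole matrix, whose adjoint update is a single Eigen expression over contiguous doubles that the compiler can vectorize:

#include <Eigen/Dense>

struct vari_base {
  virtual void chain() = 0;
  virtual ~vari_base() = default;
};

// One node for exp() of an entire matrix (hypothetical name).
struct exp_matrix_vari : vari_base {
  Eigen::MatrixXd val_;         // exp(input), saved on the forward pass
  Eigen::MatrixXd adj_;         // adjoints of the output
  Eigen::MatrixXd* input_adj_;  // adjoints of the input operand
  void chain() override {
    // d exp(x)/dx = exp(x): one vectorizable array expression for the whole
    // matrix instead of one virtual call per scalar element.
    input_adj_->array() += adj_.array() * val_.array();
  }
};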

The only alternative to the above would involve adding packet math for var types for each instruction set Eigen has specializations for. This would involve unpacking the values and adjoints from the varis of the vars of the matrix into separate packets, storing those packets on our arena allocator, then using intrinsics over those packets when running the forward and reverse passes. If we didn't have the vtable pointer inside of the vari this might not actually be that bad, but my ballpark estimate has always been that, because we'd have to work around each of the vtable pointers inside the vari when unpacking, the rest of the custom packet math would not be worth it.

syclik commented 3 years ago

Did you try other BLAS implementations to see? They likely also had different results, so it probably falls in the same bucket, but might be worth a gander.

I haven't tried personally, but I do think others have.

If you end up with any results (positive, negative, same), please let us know!