lgamma is not thread safe

wds15 commented 5 years ago

Description

We have moved away from using boost functions for most math specialty functions in commit 01bb6645815cd5546701c0575fcb4f38d4a41cca. This affected in particular the log gamma function. Instead we now use the std::lgamma functions which turn out to cause problems in threading contexts. The behaviour is generally undefined wrt to threading safety, see here.

On macOS Mojave the lgamma implementation of clang will introduce locking of a global mutex as can be seen by terrible scaling of multi-threaded benchmarks, see here.

This is a large problem for any threading based applications involving Stan (parallel chains or httpstan).

Example

See the discourse thread linked above for an example.

Expected Output

We should use a thread-safe version of the lgamma - and ideally one which does not lock a global mutex which drastically limits parallelisation performance under threading. At the worst we should distinguish between threading and non-threading.

Right now the behaviour of lgamma when using threads is implementation specific and generally undefined to my understanding.

Current Version:

v2.19.1

bob-carpenter commented 5 years ago

Are you suggesting using the Boost version of lgamma? It's an easy fix, but we have to be careful again about edge conditions and where we want to throw. I still don't know that exception behavior for our math library has been determined, so I'm punting to @seantalts and @syclik for clarification.

seantalts commented 5 years ago

It's whatever the doc string for lgamma says: http://mc-stan.org/math/d4/d84/namespacestan_1_1math.html#aead76f03bdbc60484ad760fc31bad40f

bob-carpenter commented 5 years ago

I'm confused. The doc string for lgamma doesn't say whether to use Boost or not to implement it.

bob-carpenter commented 5 years ago

Or do you mean which errors it throws, which it is also silent about.

Are you saying that our API is defined by these incomplete doc strings?

On May 23, 2019, at 2:37 PM, Bob Carpenter carp@alias-i.com wrote:

I'm confused. The doc string for lgamma doesn't say whether to use Boost or not to implement it.

seantalts commented 5 years ago

Yeah, what we promise in the doxygen comments is what we're holding ourselves to in theory.

Since it doesn't mention exceptions it seems like we have made no promises there. Would it be hard to match whatever behavior we currently have?

wds15 commented 5 years ago

Yes, I am suggesting to revert to commit hash I am quoting above where we replace boost with the std version. However, another area for potential trouble is performance... from my quick tests the boost version is 3x slower than the std version. While the boost version has a well defined precision according to a policy, the std version does not let us control this in any way. Thus, we may have to balance performance vs accuracy, but we don't know where we stand right now.

I can start on a PR where I revert the above commit.

seantalts commented 5 years ago

We should make a note of how the behavior changes just to be polite and add it to the release notes one day (maybe put a section on that in the top of the PR or issue, something like "for release notes: lgamma now throws exceptions..." or whatever. I think that seems reasonable. Another weird thing we could do would be to use std::lgamma if we're not in threading mode, but I think that would be too confusing.

bob-carpenter commented 5 years ago

I was hoping instead of laying down a letter of the law based on our spotty code documentation, that we'd be moving to consistent treatment of edge cases (+inf, -inf, poles, NaN, and out-of-domain inputs). I also hoped we'd be adding documentation to clarify behavior rather than avoiding it to allow us wiggle room w.r.t. backward compatibility.

seantalts commented 5 years ago

That sounds good, why don't you write up a proposal for how you want to do that as a design doc?

sakrejda commented 5 years ago

Just linking previous discussion

syclik commented 5 years ago

@wds15: does std::lgamma misbehave with threading? Can you write a test that exhibits the failure? For a bigger question: what other parts of the library aren't really going to work with threading?

I think the issue itself could use more work. What is the ideal outcome? Will using std::lgamma with our current compilers fail us? (Or is it just slow?) This should only affect threaded applications where a Stan program is written with lgamma in it?

That sounds good, why don't you write up a proposal for how you want to do that as a design doc?

@seantalts: I'm not sure how a policy discussion would fall into a design doc (for code)? Maybe my imagination isn't as good as yours.

seantalts commented 5 years ago

@seantalts: I'm not sure how a policy discussion would fall into a design doc (for code)? Maybe my imagination isn't as good as yours.

Yeah, design doc (for code) is not the right kind - but some kind of Request For Comments draft would probably be a good way to get traction on that issue.

wds15 commented 5 years ago

@syclik std::lgamma is not guaranteed to be thread-safe as per documentation. In fact you get some implementation on a given platform which will behave in a non-defined behaviour wrt to threading. On my macOS system the implementation obviously uses internally a global mutex to access a global variable. I am not sure how I can write a test for such an internal mechanism of the function (maybe by using negative and positive arguments, but I do not know that). Writing a test for this is very hard - right now I could only detect it using performance benchmarks which do not scale with multiple threads, but that isn't a good unit test, right?

Using std::lgamma has the disadvantages of

having undefined behaviour across platforms wrt to threading. On macOS performance is seriously hampered when used in parallel due to an internal mutex.
the output accuracy is uncontrolled - at least I cannot find any notes on the accuracy of that function

However, when I drop-in replaced std::lgamma with the boost version, then the boost version was ~3x slower than std.

The best outcome is to use the boost version of lgamma for all our applications and hopefully when the boost version is asked to output similar accuracy as the std version then the speed is the same of both functions.

And yes, this problem is with probably a few more gamma functions and maybe even with more functions which we do not know about yet.

bob-carpenter commented 5 years ago

It's not clear what we're talking about for thread safety here. You mean the ability to concurrently call a function?

The container classes (std::vector, Eigen::Matrix) are not thread safe, for example, but I don't think our library ever accesses a container concurrently. The log density accumulator isn't thread safe.

None of our autodiff classes are thread safe. Our autodiff functionals are not thread safe.

Static constant initialization is guaranteed to be thread safe. But nothing with a mutable static is going to be thread safe.

Is the Boost version of lgamma thread safe? Does it allow accuracy control? I don't think Stan has any requirements for accuracy.

Here's what the spec says for lgamma:

The POSIX version of lgamma is not thread-safe: each execution of the function stores the sign of the gamma function of arg in the static external variable signgam. Some implementations provide lgamma_r, which takes a pointer to user-provided storage for singgam as the second parameter, and is thread-safe.

Neither tgamma nor digamma have similar qualifications about thread safety.

On May 27, 2019, at 4:00 AM, wds15 notifications@github.com wrote:

@syclik std::lgamma is not guaranteed to be thread-safe as per documentation. In fact you get some implementation on a given platform which will behave in a non-defined behaviour wrt to threading. On my macOS system the implementation obviously uses internally a global mutex to access a global variable. I am not sure how I can write a test for such an internal mechanism of the function (maybe by using negative and positive arguments, but I do not know that). Writing a test for this is very hard - right now I could only detect it using performance benchmarks which do not scale with multiple threads, but that isn't a good unit test, right?

Using std::lgamma has the disadvantages of

• having undefined behaviour across platforms wrt to threading. On macOS performance is seriously hampered when used in parallel due to an internal mutex. • the output accuracy is uncontrolled - at least I cannot find any notes on the accuracy of that function However, when I drop-in replaced std::lgamma with the boost version, then the boost version was ~3x slower than std.

The best outcome is to use the boost version of lgamma for all our applications and hopefully when the boost version is asked to output similar accuracy as the std version then the speed is the same of both functions.

And yes, this problem is with probably a few more gamma functions and maybe even with more functions which we do not know about yet.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

wds15 commented 5 years ago

That's right - all of stan-math is not thread safe unless you run it with STAN_THREADS turned on. In that case we can run things concurrently with map_rect. In this setting we have a problem with std::lgamma for these reasons:

There is no thread safety guarantee. So concurrent calls to std::lgamma from concurrent map_rect context's will be problematic.
It turns out that on macOS the behavior of std::lgamma whenever threads are used is such that a global mutex is used to synchronize the calls (likely to deal with the sign of the gamma function). This synchronization makes the implementation on macOS per se thread safe, but it kills off any performance gains through parallelization.
The worst thing is that the thread behavior of that function is implementation dependent (OS, compiler, libstdc++, whatever)

Does that explain?

bob-carpenter commented 5 years ago

I still have the question of whether Boost's lgamma is thread safe. Is the proposal just to use that instead of std::lgamma?

STAN_THREADS only makes the memory access thread safe. It doesn't make things like containers (Eigen::Matrix or std::vector) or our stan::math::vari instances thread safe for mutability.

On May 28, 2019, at 1:37 PM, wds15 notifications@github.com wrote:

That's right - all of stan-math is not thread safe unless you run it with STAN_THREADS turned on. In that case we can run things concurrently with map_rect. In this setting we have a problem with std::lgamma for these reasons:

• There is no thread safety guarantee. So concurrent calls to std::lgamma from concurrent map_rect context's will be problematic. • It turns out that on macOS the behavior of std::lgamma whenever threads are used is such that a global mutex is used to synchronize the calls (likely to deal with the sign of the gamma function). This synchronization makes the implementation on macOS per se thread safe, but it kills off any performance gains through parallelization. • The worst thing is that the thread behavior of that function is implementation dependent (OS, compiler, libstdc++, whatever) Does that explain?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

seantalts commented 5 years ago

STAN_THREADS only makes the memory access thread safe. It doesn't make things like containers (Eigen::Matrix or std::vector) or our stan::math::vari instances thread safe for mutability.

That's why we only support threading via map_rect :D

bob-carpenter commented 5 years ago

On May 28, 2019, at 2:00 PM, seantalts notifications@github.com wrote:

STAN_THREADS only makes the memory access thread safe. It doesn't make things like containers (Eigen::Matrix or std::vector) or our stan::math::vari instances thread safe for mutability.

That's why we only support threading via map_rect :D

My understanding is that std::lgamma's lack of thread safety becomes a problem precisely because map_rect might call it from multiple threads. So I believe map_rect imposes thread-safety requirements on every single Stan function that might be evaluated in a Stan program. Generally, we haven't allowed side effects other than the autodiff stack, so that's not a problem. But this is partly why proposals to do things like statically store state to restart solvers is going to be tricky. And why we better make sure every static local variable is const.

As things stand, the output of print(...) will get garbled as the underlying I/O is thread safe but not atomic at the message level.

Aren't other issues/PRs in the works to relax the restriction of threading to map_rect? Such as allowing parallel for loops?

seantalts commented 5 years ago

My understanding is that std::lgamma's lack of thread safety becomes a problem precisely because map_rect might call it from multiple threads.

Yep. Just highlighting that the only sense of "thread safety" we need to worry about is via map_rect. If the design doc for parallel for goes through, that will need to use the same mechanisms, which I believe is the plan.

wds15 commented 5 years ago

Things need to be safe if concurrently executed in different local contexts....and static variables can be a source of problems if these are not const, yes.

syclik commented 5 years ago

@bob-carpenter is right. map_rect is imposing thread-safety requirements on any Stan function that might be evaluated in the function passed to map_rect. We haven't been so careful about this since thread safety wasn't a requirement, but I think we should be much more careful going forward.

Fortunately, I think lgamma is one of the few functions we would use that are known to not be thread-safe. See this list http://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html#tag_15_09_01. (I see it linked everywhere... I don't know if it covers all of the standard library or if we really need to be digging through the doc there.)

We don't have this problem with MPI because it uses separate processes.

Part of me thinks we should reevaluate including threading for now while we sort out these problems; we essentially need to search through all library functions we use to see if they're not thread-safe. But we can pick that discussion up on discourse.

syclik commented 5 years ago

For now, I think it's reasonable to describe the qualities of lgamma that we want, to create some tests for it, and to verify that the boost version has the characteristics that we want. Then switch to that implementation.

I am not sure how I can write a test for such an internal mechanism of the function (maybe by using negative and positive arguments, but I do not know that). Writing a test for this is very hard - right now I could only detect it using performance benchmarks which do not scale with multiple threads, but that isn't a good unit test, right?

@wds15: we need to write a test that flexes what we think could be an issue when running the function with threading. Yes, it might not be easy, but this is a necessary condition. How else will you prevent someone in the future from looking at boost::lgamma and thinking... "oh, I'll switch that to std::lgamma"? And maybe we can replace it in the future. Or we might be able to avoid a problem with a future implementation.

Re: nomenclature. I don't care if it's not called a unit test, but it's what we need to test.

bgoodri commented 5 years ago

The link says

The POSIX version of lgamma is not thread-safe: each execution of the function stores the sign of the gamma function of arg in the static external variable signgam.

which I have always thought was a non-issue for us because nothing in Stan Math is or should be writing or reading signgam.

syclik commented 5 years ago

@bgoodri: that makes sense. Does that mean it's the results will be safe with multiple threads? Does it explain why @wds15 has seen a performance hit?

bgoodri commented 5 years ago

The title of this issue shouldn't be "lgamma is not thread safe"; rather it should be "signgam is not thread safe". So, if it is true that no one is using signgam then the results are safe and I wouldn't be surprised if the std version were faster than the Boost version. The C++ standards gurus also require accuracy within 1 ULP or something, so we are probably fine on that front.

For statistics in general and HMC in particular, I have a hard time thinking of an example where someone would want to access signgam. Typically, we already are requiring the input to be non-negative; otherwise HMC would have the problems it already has with abs or square on input that can be either negative or positive. Worse, the gamma function is crazy for negative input with oscillations and singularities, so I don't see how anyone would use Stan to do statistical inference unless the input was already constrained to be non-negative.

We could introduce a lgamma_r function to Stan Math and / or have stan::math::lgamma set signgam to -2147483647.

sakrejda commented 5 years ago

@bob-carpenter re: boost::math and thread safety: https://www.boost.org/doc/libs/1_70_0/libs/math/doc/html/math_toolkit/threads.html

bob-carpenter commented 5 years ago

Tests are great if feasible. It can be really hard to test these threading bugs---I've seen cases that take 48 hours on 72 concurrent threads to arise.

How else will you prevent someone in the future from looking at boost::lgamma and thinking... "oh, I'll switch that to std::lgamma"?

lgamma isn't special---we have the same requirements on all of our functions. The only difference is that it has a tempting replacement in std::lgamma. My suggestion is to doc where Boost's version is used saying that the std:: version isn't thread safe.

I don't know how to stop people from calling std::lgamma elsewhere when they shouldn't. Currently, the only place std::lgamma is used is in stan/math/prim/scal/fun/lgamma.hpp and that should probably be the only place Boost's gamma is used.

bob-carpenter commented 5 years ago

https://www.boost.org/doc/libs/1_70_0/libs/math/doc/html/math_toolkit/threads.html

That's convenient. It just says:

The library is fully thread safe and re-entrant for all functions regard[les]s of the data type they are instantiated on. Thread safety limitations relating to user defined types present in previous releases (prior to 1.50.0) have been removed.

bgoodri commented 5 years ago

lgamma isn't special---we have the same requirements on all of our functions. The only difference is that it has a tempting replacement in std::lgamma.

What other standard functions have the side effect of setting a global variable like signgam?

bob-carpenter commented 5 years ago

How else will you prevent someone in the future from looking at boost::lgamma and thinking... "oh, I'll switch that to std::lgamma"?

By imposing thread safety requirements and assuming people know the language. It's how we do everything. We can't guard against generic programmer error. We have exactly the same problem with every one of our functions, though they don't come with such a tempting replacement.

If you can formulate a test that's reliable, that's great, but it's a notoriously hard problem in a multithreaded context to trigger bugs. I've seen threading bugs that took on the order of days to trigger given dozens of concurrent threads.

bob-carpenter commented 5 years ago

None that I can find searching for thread safety. None of the other gamma functions have the same qualification about threading in their doc. I have no idea why lgamma is different than the other lib functions.

On May 30, 2019, at 2:17 PM, bgoodri notifications@github.com wrote:

lgamma isn't special---we have the same requirements on all of our functions. The only difference is that it has a tempting replacement in std::lgamma.

What other standard functions have the side effect of setting a global variable like signgam?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

syclik commented 5 years ago

What other standard functions have the side effect of setting a global variable like signgam?

@bgoodri: it looks like functions that are related (I'm seeing the same list on different sites and can find a source if you want it): gamma(), lgamma(),lgammaf(),lgammal()

I don't know if there are other functions in Eigen or elsewhere that have these behaviors.

More importantly, this means that any Math function implementation can't really use static variables to compute things. This is a new requirement. That would have been bad practice, but we haven't explicitly checked for it or forbidden it. The only places where I could think we might even attempt to do that is in optimized code. There's a lot of cleverness there with separate memory management and we should probably comb through that very carefully.

By imposing thread safety requirements and assuming people know the language. It's how we do everything. We can't guard against generic programmer error. We have exactly the same problem with every one of our functions, though they don't come with such a tempting replacement.

Yup. What's the best way to impose thread safety requirements?

If you can formulate a test that's reliable, that's great, but it's a notoriously hard problem in a multithreaded context to trigger bugs. I've seen threading bugs that took on the order of days to trigger given dozens of concurrent threads.

@wds15 was able to demonstrate that std::lgamma performs differently from boost::lgamma with a few threads. That's what generated this issue. That should be one of the tests, if not the test.

bob-carpenter commented 5 years ago

@bgoodri: it looks like functions that are related (I'm seeing the same list on different sites and can find a source if you want it): gamma(), lgamma(),lgammaf(),lgammal()

gamma() isn't in the standard library in C++.

lgammaf() and lgammal() are just the float and long form argument forms of lgamma().

By imposing thread safety requirements and assuming people know the language. It's how we do everything. We can't guard against generic programmer error. We have exactly the same problem with every one of our functions, though they don't come with such a tempting replacement.

Yup. What's the best way to impose thread safety requirements?

My response was above and I'll stick by it. We can't test this programmatically in a proactive way.

If you can formulate a test that's reliable, that's great, but it's a notoriously hard problem in a multithreaded context to trigger bugs. I've seen threading bugs that took on the order of days to trigger given dozens of concurrent threads.

@wds15 was able to demonstrate that std::lgamma performs differently from boost::lgamma with a few threads. That's what generated this issue. That should be one of the tests, if not the test.

Great---that's what I meant by being able to formulate a reliable test.

wds15 commented 5 years ago

A test for this will make everything brittle...I need to launch 4 threads and observe that the speed needs to be 4x as fast. Doubling was not enough...this implies I need 4 idle cores for the test to run fine...on top of that this the behavior on mac...I don’t know how things are under Linux or windows...it could be different there.

stan-dev / math