I suspect the issue is with a shared `GradientConfig`. AbstractDifferentiation creates a `GradientConfig` in its `gradient` function, and I don't think it suffers from this issue: https://github.com/JuliaDiff/AbstractDifferentiation.jl/blob/eb5d913b9e4cbd31465af4ee2a75acb7f69ded91/src/forwarddiff.jl#LL47C57-L47C57
IMO that seems to be a ForwardDiff issue? At least it seems somewhat surprising that gradient evaluation mutates the gradient config, but that seems to be what happens here: https://github.com/JuliaDiff/ForwardDiff.jl/blob/4b143a199f541e7c1248dc5bc29b397be938e81d/src/apiutils.jl#L34-L46
Regarding AbstractDifferentiation: if I remember correctly, the only reason it creates a config at all is to set the chunk size. Otherwise it would not be needed.
> At least it seems somewhat surprising that gradient evaluation mutates the gradient config, but that seems to be what happens here: https://github.com/JuliaDiff/ForwardDiff.jl/blob/4b143a199f541e7c1248dc5bc29b397be938e81d/src/apiutils.jl#L34-L46
This is the documented behavior. It seems the main purpose of the config is to allocate work buffers: https://juliadiff.org/ForwardDiff.jl/stable/user/api/#Preallocating/Configuring-Work-Buffers. See also this comment: https://github.com/JuliaDiff/ForwardDiff.jl/issues/573#issuecomment-1410379617
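For concreteness, here is a minimal illustration (with a toy `f` and `x`, not taken from this thread) of what sharing those work buffers means in practice: every `ForwardDiff.gradient(f, x, cfg)` call writes into the buffers owned by `cfg`, so concurrent calls that share one config race on them.

```julia
using ForwardDiff

f(x) = sum(abs2, x)
x = randn(8)

# The config preallocates the work buffers (seeded dual numbers etc.) that
# every ForwardDiff.gradient(f, x, cfg) call writes into.
cfg = ForwardDiff.GradientConfig(f, x)

ForwardDiff.gradient(f, x, cfg)   # fine on its own

# Concurrent calls sharing `cfg` all mutate the same buffers, so the results
# can be corrupted; this is the race discussed above.
Threads.@threads for i in 1:100
    ForwardDiff.gradient(f, x, cfg)
end
```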
I see. Then we should probably add a note in LogDensityProblemsAD as well that it is not thread-safe.
Since it is much more efficient to preallocate a config, as stated in the ForwardDiff docs:

> ForwardDiff's basic API methods will allocate these types automatically by default, but you can drastically reduce memory usage if you preallocate them yourself.

I think generally one wants to create and use the config object, also within LogDensityProblemsAD. That it is not thread-safe seems to be a design decision (issue?) of ForwardDiff, so users should be aware of it; but making it inefficient for everyone - similar to the code in AbstractDifferentiation - seems undesirable to me, especially since here, in contrast to AbstractDifferentiation, we know both the function and the size of the input.
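The two patterns being compared could be sketched as follows; `grad_fresh_config` and `SharedConfigGradient` are illustrative names, not code from either package.

```julia
using ForwardDiff

# AbstractDifferentiation-style pattern: build a fresh config inside every
# call. Thread-safe, but re-allocates the work buffers each time.
grad_fresh_config(f, x) = ForwardDiff.gradient(f, x, ForwardDiff.GradientConfig(f, x))

# Preallocated pattern: reuse one config, which is possible when the function
# and the input size are known up front (as in LogDensityProblemsAD). Faster,
# but concurrent calls then share and mutate the same buffers.
struct SharedConfigGradient{F,C}
    f::F
    cfg::C
end
SharedConfigGradient(f, dim::Integer) =
    SharedConfigGradient(f, ForwardDiff.GradientConfig(f, zeros(dim)))
(g::SharedConfigGradient)(x) = ForwardDiff.gradient(g.f, x, g.cfg)

f(x) = sum(abs2, x)
g = SharedConfigGradient(f, 10)
g(randn(10))   # fast, but not safe to call concurrently from several tasks
```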
I can think of the following (not mutually exclusive) solutions:
Or perhaps allocate a buffer for each thread, and have the call write to it based on `Threads.threadid()`, but I have to think through the invariants.
> Or perhaps allocate a buffer for each thread, and have the call write to it based on Threads.threadid(), but I have to think through the invariants.
That is not guaranteed to work anymore in newer Julia versions; generally it is not safe to use `Threads.threadid()` in such cases. See e.g. https://juliafolds.github.io/FLoops.jl/stable/explanation/faq/#faq-state-threadid and the discussion in https://discourse.julialang.org/t/behavior-of-threads-threads-for-loop/76042?u=devmotion.
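A rough sketch of both the `threadid`-indexed pattern discussed above and a per-task alternative (toy `f` and `x`; the structure is illustrative only):

```julia
using ForwardDiff

f(x) = sum(abs2, x)
x = randn(8)

# The pattern sketched above: one config per thread id. This is no longer a
# safe assumption, because iterations are not pinned to a fixed thread id.
cfgs = [ForwardDiff.GradientConfig(f, x) for _ in 1:Threads.nthreads()]
Threads.@threads for i in 1:100
    ForwardDiff.gradient(f, x, cfgs[Threads.threadid()])  # may still race
end

# Safer pattern: give each spawned task its own config instead of indexing
# shared state by thread id.
tasks = map(1:Threads.nthreads()) do _
    Threads.@spawn ForwardDiff.gradient(f, x, ForwardDiff.GradientConfig(f, x))
end
grads = fetch.(tasks)
```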
> provide an option w/o preallocated buffers, slower but safe
Maybe the easiest option here would be to support passing `gradientconfig = nothing` in `ADgradient`, and handling it by constructing a new config object (to support setting the chunk size nevertheless) when calling `logdensity_and_gradient`.
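A hypothetical sketch of that proposal; the type and field names are invented for illustration, chunk-size handling is omitted, and this is not the package's actual implementation:

```julia
using ForwardDiff, LogDensityProblems

# Illustrative wrapper; names do not correspond to LogDensityProblemsAD internals.
struct ForwardDiffGradientSketch{L,C}
    ℓ::L
    gradientconfig::C  # a ForwardDiff.GradientConfig, or `nothing`
end

function logdensity_and_gradient_sketch(∇ℓ::ForwardDiffGradientSketch, x)
    f = Base.Fix1(LogDensityProblems.logdensity, ∇ℓ.ℓ)
    # `nothing` means: build a fresh config for this call (slower, but nothing
    # is shared between concurrent calls); otherwise reuse the stored one.
    cfg = ∇ℓ.gradientconfig === nothing ? ForwardDiff.GradientConfig(f, x) : ∇ℓ.gradientconfig
    LogDensityProblems.logdensity(∇ℓ.ℓ, x), ForwardDiff.gradient(f, x, cfg)
end
```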
I would suggest the following solution:

1. document `ADgradient` for the relevant backends, explicitly mentioning that it is not thread-safe,
2. add a `Base.copy` method that allows independently usable copies where applicable (eg gradients created with the ForwardDiff backend), and is otherwise a no-op.

This would allow users to take care of threading as they wish; most MCMC applications would just create as many copies as they need and sample in threads.
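From the user side, the proposed `copy`-based workflow might look roughly like this (assuming the proposed `Base.copy` method exists; `∇ℓ` is an `ADgradient` object and `xs` a collection of points):

```julia
using LogDensityProblems

# Each spawned task works with its own copy, so no buffers are shared.
function gradients_with_copies(∇ℓ, xs)
    chunks = Iterators.partition(xs, cld(length(xs), Threads.nthreads()))
    tasks = map(chunks) do chunk
        ∇ℓ_local = copy(∇ℓ)  # independent buffers (a no-op for backends without them)
        Threads.@spawn [LogDensityProblems.logdensity_and_gradient(∇ℓ_local, x) for x in chunk]
    end
    reduce(vcat, fetch.(tasks))
end
```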
Incidentally, I do not understand how ReverseDiff is not affected. It uses the same kind of buffers.
When using multiple threads, the gradient computations disagree with the gradients one gets when using a single thread. Here's an MWE:
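A minimal sketch of such an MWE (a toy standard-normal log density; not necessarily the exact reproducer used here) might look like:

```julia
using LogDensityProblems, LogDensityProblemsAD, ForwardDiff

# A toy log density problem (standard normal).
struct TestProblem end
LogDensityProblems.logdensity(::TestProblem, x) = -sum(abs2, x) / 2
LogDensityProblems.dimension(::TestProblem) = 5
LogDensityProblems.capabilities(::Type{TestProblem}) = LogDensityProblems.LogDensityOrder{0}()

∇ℓ = ADgradient(:ForwardDiff, TestProblem())
xs = [randn(5) for _ in 1:100]

# Single-threaded reference gradients.
serial = [LogDensityProblems.logdensity_and_gradient(∇ℓ, x)[2] for x in xs]

# The same evaluations from multiple threads, sharing one ∇ℓ (and hence one
# preallocated GradientConfig).
threaded = similar(serial)
Threads.@threads for i in eachindex(xs)
    threaded[i] = LogDensityProblems.logdensity_and_gradient(∇ℓ, xs[i])[2]
end

# With more than one thread, some threaded gradients can disagree with the
# serial ones, which is the reported problem.
count(serial .!= threaded)
```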
On my machine (using 4 threads), I get the following result:
This appears to be the same issue as https://github.com/JuliaDiff/ForwardDiff.jl/issues/573, where the signature `ForwardDiff.gradient(f, x, cfg)`, which is used here, is shown to be not thread-safe. By comparison, ReverseDiff seems to be fine.