I suspect the issue is with a shared `GradientConfig`. AbstractDifferentiation creates a `GradientConfig` in its `gradient` function, and I don't think it suffers from this issue: https://github.com/JuliaDiff/AbstractDifferentiation.jl/blob/eb5d913b9e4cbd31465af4ee2a75acb7f69ded91/src/forwarddiff.jl#LL47C57-L47C57
IMO that seems to be a ForwardDiff issue? At least it seems somewhat surprising that gradient evaluation mutates the gradient config, but that seems to be what happens here: https://github.com/JuliaDiff/ForwardDiff.jl/blob/4b143a199f541e7c1248dc5bc29b397be938e81d/src/apiutils.jl#L34-L46
Regarding AbstractDifferentiation: if I remember correctly, the only reason it creates a config at all is to set the chunk size. Otherwise it would not be needed.
> At least it seems somewhat surprising that gradient evaluation mutates the gradient config, but that seems to be what happens here: https://github.com/JuliaDiff/ForwardDiff.jl/blob/4b143a199f541e7c1248dc5bc29b397be938e81d/src/apiutils.jl#L34-L46
This is the documented behavior. It seems the main purpose of the config is to allocate work buffers: https://juliadiff.org/ForwardDiff.jl/stable/user/api/#Preallocating/Configuring-Work-Buffers. See also this comment: https://github.com/JuliaDiff/ForwardDiff.jl/issues/573#issuecomment-1410379617
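For concreteness, here is a minimal illustration (with a toy `f` and `x`, not taken from this thread) of what sharing those work buffers means in practice: every `ForwardDiff.gradient(f, x, cfg)` call writes into the buffers owned by `cfg`, so concurrent calls that share one config race on them.

```julia
using ForwardDiff

f(x) = sum(abs2, x)
x = randn(8)

# The config preallocates the work buffers (seeded dual numbers etc.) that
# every ForwardDiff.gradient(f, x, cfg) call writes into.
cfg = ForwardDiff.GradientConfig(f, x)

ForwardDiff.gradient(f, x, cfg)   # fine on its own

# Concurrent calls sharing `cfg` all mutate the same buffers, so the results
# can be corrupted; this is the race discussed above.
Threads.@threads for i in 1:100
    ForwardDiff.gradient(f, x, cfg)
end
```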
I see. Then we should probably add a note in LogDensityProblemsAD as well that it is not thread-safe.
Since it is much more efficient to preallocate a config, as stated in the ForwardDiff docs:

> ForwardDiff's basic API methods will allocate these types automatically by default, but you can drastically reduce memory usage if you preallocate them yourself.

I think generally one wants to create and use the config object, also within LogDensityProblemsAD. That it is not thread-safe seems to be a design decision (issue?) of ForwardDiff, so users should be aware of it; but making it inefficient for everyone - similar to the code in AbstractDifferentiation - seems undesirable to me, especially since here, in contrast to AbstractDifferentiation, we know both the function and the size of the input.
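The two patterns being compared could be sketched as follows; `grad_fresh_config` and `SharedConfigGradient` are illustrative names, not code from either package.

```julia
using ForwardDiff

# AbstractDifferentiation-style pattern: build a fresh config inside every
# call. Thread-safe, but re-allocates the work buffers each time.
grad_fresh_config(f, x) = ForwardDiff.gradient(f, x, ForwardDiff.GradientConfig(f, x))

# Preallocated pattern: reuse one config, which is possible when the function
# and the input size are known up front (as in LogDensityProblemsAD). Faster,
# but concurrent calls then share and mutate the same buffers.
struct SharedConfigGradient{F,C}
    f::F
    cfg::C
end
SharedConfigGradient(f, dim::Integer) =
    SharedConfigGradient(f, ForwardDiff.GradientConfig(f, zeros(dim)))
(g::SharedConfigGradient)(x) = ForwardDiff.gradient(g.f, x, g.cfg)

f(x) = sum(abs2, x)
g = SharedConfigGradient(f, 10)
g(randn(10))   # fast, but not safe to call concurrently from several tasks
```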
I can think of the following (not mutually exclusive) solutions:
Or perhaps allocate a buffer for each thread, and have the call write to it based on `Threads.threadid()`, but I have to think through the invariants.
> Or perhaps allocate a buffer for each thread, and have the call write to it based on Threads.threadid(), but I have to think through the invariants.
That is not guaranteed to work anymore in newer Julia versions; generally it is not safe to use `Threads.threadid()` in such cases. See e.g. https://juliafolds.github.io/FLoops.jl/stable/explanation/faq/#faq-state-threadid and the discussion in https://discourse.julialang.org/t/behavior-of-threads-threads-for-loop/76042?u=devmotion.
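A rough sketch of both the `threadid`-indexed pattern discussed above and a per-task alternative (toy `f` and `x`; the structure is illustrative only):

```julia
using ForwardDiff

f(x) = sum(abs2, x)
x = randn(8)

# The pattern sketched above: one config per thread id. This is no longer a
# safe assumption, because iterations are not pinned to a fixed thread id.
cfgs = [ForwardDiff.GradientConfig(f, x) for _ in 1:Threads.nthreads()]
Threads.@threads for i in 1:100
    ForwardDiff.gradient(f, x, cfgs[Threads.threadid()])  # may still race
end

# Safer pattern: give each spawned task its own config instead of indexing
# shared state by thread id.
tasks = map(1:Threads.nthreads()) do _
    Threads.@spawn ForwardDiff.gradient(f, x, ForwardDiff.GradientConfig(f, x))
end
grads = fetch.(tasks)
```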
> provide an option w/o preallocated buffers, slower but safe
Maybe the easiest option here would be to support passing `gradientconfig = nothing` in `ADgradient`, and handling it by constructing a new config object (to support setting the chunk size nevertheless) when calling `logdensity_and_gradient`.
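A hypothetical sketch of that proposal; the type and field names are invented for illustration, chunk-size handling is omitted, and this is not the package's actual implementation:

```julia
using ForwardDiff, LogDensityProblems

# Illustrative wrapper; names do not correspond to LogDensityProblemsAD internals.
struct ForwardDiffGradientSketch{L,C}
    ℓ::L
    gradientconfig::C  # a ForwardDiff.GradientConfig, or `nothing`
end

function logdensity_and_gradient_sketch(∇ℓ::ForwardDiffGradientSketch, x)
    f = Base.Fix1(LogDensityProblems.logdensity, ∇ℓ.ℓ)
    # `nothing` means: build a fresh config for this call (slower, but nothing
    # is shared between concurrent calls); otherwise reuse the stored one.
    cfg = ∇ℓ.gradientconfig === nothing ? ForwardDiff.GradientConfig(f, x) : ∇ℓ.gradientconfig
    LogDensityProblems.logdensity(∇ℓ.ℓ, x), ForwardDiff.gradient(f, x, cfg)
end
```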
I would suggest the following solution:

1. document `ADgradient` for the relevant backends, explicitly mentioning that it is not thread-safe,
2. add a `Base.copy` method that allows independently usable copies where applicable (eg gradients created with the ForwardDiff backend), and is otherwise a no-op.

This would allow users to take care of threading as they wish; most MCMC applications would just create as many copies as they need and sample in threads.
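From the user side, the proposed `copy`-based workflow might look roughly like this (assuming the proposed `Base.copy` method exists; `∇ℓ` is an `ADgradient` object and `xs` a collection of points):

```julia
using LogDensityProblems

# Each spawned task works with its own copy, so no buffers are shared.
function gradients_with_copies(∇ℓ, xs)
    chunks = Iterators.partition(xs, cld(length(xs), Threads.nthreads()))
    tasks = map(chunks) do chunk
        ∇ℓ_local = copy(∇ℓ)  # independent buffers (a no-op for backends without them)
        Threads.@spawn [LogDensityProblems.logdensity_and_gradient(∇ℓ_local, x) for x in chunk]
    end
    reduce(vcat, fetch.(tasks))
end
```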
Incidentally, I do not understand how ReverseDiff is not affected. It uses the same kind of buffers.
When using multiple threads, the gradient computations disagree with the gradients one gets when using a single thread. Here's an MWE:
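A minimal sketch of such an MWE (a toy standard-normal log density; not necessarily the exact reproducer used here) might look like:

```julia
using LogDensityProblems, LogDensityProblemsAD, ForwardDiff

# A toy log density problem (standard normal).
struct TestProblem end
LogDensityProblems.logdensity(::TestProblem, x) = -sum(abs2, x) / 2
LogDensityProblems.dimension(::TestProblem) = 5
LogDensityProblems.capabilities(::Type{TestProblem}) = LogDensityProblems.LogDensityOrder{0}()

∇ℓ = ADgradient(:ForwardDiff, TestProblem())
xs = [randn(5) for _ in 1:100]

# Single-threaded reference gradients.
serial = [LogDensityProblems.logdensity_and_gradient(∇ℓ, x)[2] for x in xs]

# The same evaluations from multiple threads, sharing one ∇ℓ (and hence one
# preallocated GradientConfig).
threaded = similar(serial)
Threads.@threads for i in eachindex(xs)
    threaded[i] = LogDensityProblems.logdensity_and_gradient(∇ℓ, xs[i])[2]
end

# With more than one thread, some threaded gradients can disagree with the
# serial ones, which is the reported problem.
count(serial .!= threaded)
```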
On my machine (using 4 threads), I get the following result:
This appears to be the same issue as https://github.com/JuliaDiff/ForwardDiff.jl/issues/573, where the signature `ForwardDiff.gradient(f, x, cfg)`, which is used here, is shown to be not thread-safe. By comparison, ReverseDiff seems to be fine.