tpapp / LogDensityProblemsAD.jl

AD backends for LogDensityProblems.jl.

A tutorial on how to use threads with preallocation would be good for the documentation #25

Closed: FelixNoessler closed this 8 months ago

FelixNoessler commented 9 months ago

Hello,

thanks for creating LogDensityProblems.jl!

I saw in issue #6 that the automatic differentiation is not thread safe. In my implementation, however, using ADgradient with threads inside the log density (likelihood + prior) calculation now seems to be thread safe. I have to run my simulation model for several sites with the same parameter set and calculate the total likelihood. The problem is that I then get many allocations, and the garbage collection time is above 80 %.

Usually I preallocate many vectors in a named tuple to avoid allocations. Of course, if I just use my named tuple, I get an error from ForwardDiff.jl. I tried GeneralLazyBufferCache from PreallocationTools.jl to wrap my named tuple, but this does not seem to be thread safe. I also used the old pattern of indexing with Threads.threadid() to get the preallocated named tuple for each thread; maybe the problem lies there as well.
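
(For context, a minimal reproduction of the kind of error ForwardDiff.jl raises when a fixed Float64 buffer is written to inside the differentiated function; the names here are illustrative:)

using ForwardDiff

buf = zeros(3)                      # preallocated Float64 buffer
f(x) = (buf .= 2 .* x; sum(buf))

# throws: ForwardDiff evaluates f with Dual numbers, which cannot be
# stored in a Float64 array
ForwardDiff.gradient(f, ones(3))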

I am a bit confused. I think it would be nice to have a tutorial that shows how to use LogDensityProblems with preallocation (with PreallocationTools.jl?) and threads (maybe with FLoops.jl?).

Is the documentation up to date? There it says that I should copy the gradient, but I want to use threads inside the calculation of the likelihood, so I am not sure whether this makes sense for me. The documentation says:

  ADgradient(Val(:ForwardDiff), ℓ; chunk, tag, x)

  Wrap a log density that supports evaluation of Value to handle ValueGradient, using ForwardDiff.

  Keyword arguments:

    •  chunk can be used to set the chunk size, an integer or a ForwardDiff.Chunk

    •  tag (default: nothing) can be used to set a tag for ForwardDiff. If tag is neither nothing nor a ForwardDiff.Tag, the tag for
       ForwardDiff is set to ForwardDiff.Tag(tag, eltype(x)) where x is the vector at which the logdensity and its gradient are
       evaluated.

    •  x (default: nothing) will be used to preallocate a ForwardDiff.GradientConfig with the given vector. With the default, one is
       created for each evaluation.

  Note that pre-allocating a ForwardDiff.GradientConfig is not thread-safe. You can copy the results for concurrent evaluation:

  ∇ℓ1 = ADgradient(:ForwardDiff, ℓ; x = zeros(dimension(ℓ)))
  ∇ℓ2 = copy(∇ℓ1) # you can now use both, in different threads

  See also the ForwardDiff documentation regarding ForwardDiff.GradientConfig
  (https://juliadiff.org/ForwardDiff.jl/stable/user/api/#Preallocating/Configuring-Work-Buffers) and chunks and tags
  (https://juliadiff.org/ForwardDiff.jl/stable/user/advanced/).
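
(For reference, a sketch of what concurrent evaluation with such copies might look like, assuming ℓ and ∇ℓ1 from the docstring above and a hypothetical collection of evaluation points xs:)

using LogDensityProblems: dimension, logdensity_and_gradient

xs = [randn(dimension(ℓ)) for _ in 1:8]   # hypothetical evaluation points
# one independent copy of the wrapper (and hence of its GradientConfig) per task
tasks = [Threads.@spawn logdensity_and_gradient(copy(∇ℓ1), x) for x in xs]
results = fetch.(tasks)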

Best, Felix

tpapp commented 9 months ago

The docs above refer to callers of the interface. The thread safety of your log density implementation is orthogonal to this.

That said, if you want to use pre-allocated buffers inside your log density implementation, you could

  1. use a struct that has buffers,
  2. call ADgradient with a deep copy of this struct for each thread, so that each gets an independent structure (see the sketch below).
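
A minimal sketch of this pattern, with a hypothetical problem type and log density (the details are model-specific):

using LogDensityProblems, LogDensityProblemsAD, ForwardDiff

# hypothetical problem type that carries its own work buffer
struct MyPosterior
    data::Vector{Float64}
    buffer::Vector{Float64}   # scratch space reused across evaluations
end

LogDensityProblems.dimension(::MyPosterior) = 1
LogDensityProblems.capabilities(::Type{<:MyPosterior}) = LogDensityProblems.LogDensityOrder{0}()

function LogDensityProblems.logdensity(p::MyPosterior, θ)
    # only θ-independent values are written to the buffer here, so Float64 works;
    # storing θ-dependent intermediates needs a Dual-typed buffer (see below)
    p.buffer .= p.data .^ 2
    return -sum(p.buffer) * abs2(θ[1]) / 2
end

ℓ = MyPosterior(randn(100), zeros(100))
# one deep copy, and hence one independent buffer, per thread
∇ℓs = [ADgradient(:ForwardDiff, deepcopy(ℓ)) for _ in 1:Threads.nthreads()]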

But note that for larger dimensions, reverse mode AD (e.g. Enzyme) is much more efficient and has no similar issues.

FelixNoessler commented 9 months ago

thanks for your answer!

I have a function that preallocates all vectors; it creates buffers both for ForwardDiff and for normal evaluations:

function preallocate(; Tdiff)
    normal = preallocate_vectors(; T = Float64)  # buffers for ordinary Float64 evaluations
    diff = preallocate_vectors(; T = Tdiff)      # buffers for ForwardDiff's Dual numbers
    return (; normal, diff)
end
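
(preallocate_vectors is a model-specific helper not shown in the thread; a hypothetical sketch of its shape:)

# hypothetical: the actual fields and sizes are model-specific
function preallocate_vectors(; T)
    return (; state = Vector{T}(undef, 100), rates = Vector{T}(undef, 100))
end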

For ForwardDiff I use the element type of the gradient config (this was the missing step!):

∇ℓ = ADgradient(:ForwardDiff, ℓ; chunk = 5, x = init_x)

# Preallocation: one buffer set per thread
prealloc = NamedTuple[]
for _ in 1:Threads.nthreads()
    # the eltype of the GradientConfig is the Dual type ForwardDiff evaluates with
    c = preallocate(; Tdiff = eltype(∇ℓ.gradient_config))
    push!(prealloc, c)
end

and then I have a function to get the buffer depending on the type T of the parameter:

# select the buffer set matching the element type of the parameter vector
function get_buffer(buffer, T)
    if T <: ForwardDiff.Dual
        return buffer.diff
    elseif T <: Float64
        return buffer.normal
    end
end
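
(For example, a plain evaluation picks the Float64 buffers, and an evaluation inside ForwardDiff picks the Dual buffers:)

get_buffer(prealloc[1], Float64)                     # -> the .normal buffers
get_buffer(prealloc[1], eltype(∇ℓ.gradient_config))  # -> the .diff buffers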

and my loglikelihood function looks like this:

function my_loglik(θ)
    θ_type = eltype(θ)
    loglik_vec = Array{θ_type}(undef, length(training_plots))

    Threads.@threads for i in eachindex(training_plots)
        loglik_vec[i] = loglikelihood_model(
            plotID = training_plots[i],
            θ, buffer = get_buffer(prealloc[Threads.threadid()], θ_type))
    end

    return sum(loglik_vec)
end

Is this general approach good?

I have a follow-up question that is not related to the LogDensityProblemsAD.jl package: is it ok to just use Threads.threadid() to index the buffer? The loglikelihood evaluations do not depend on each other, so the order of the evaluations does not matter. I always get the same results, and they match the serial implementation, but the documentation says that users should avoid using Threads.threadid() to index buffers. Sorry for asking the question here.

https://docs.julialang.org/en/v1/manual/multi-threading/#man-task-migration

After a task starts running on a certain thread it may move to a different thread if the task yields.
...
This means that in most cases threadid() should not be treated as constant within a task, and therefore should not be used to index into a vector of buffers or stateful objects.
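
(A pattern consistent with this advice is to chunk the indices and give each task a fixed buffer, indexed by chunk rather than by threadid(); a sketch reusing the names from the snippets above:)

function my_loglik_tasked(θ)
    θ_type = eltype(θ)
    # one chunk of plots per preallocated buffer; each chunk is owned by exactly one task
    chunks = collect(Iterators.partition(eachindex(training_plots),
                                         cld(length(training_plots), length(prealloc))))
    tasks = map(eachindex(chunks)) do c
        buffer = get_buffer(prealloc[c], θ_type)   # fixed per task, no threadid()
        Threads.@spawn begin
            s = zero(θ_type)
            for i in chunks[c]
                s += loglikelihood_model(plotID = training_plots[i], θ, buffer = buffer)
            end
            s
        end
    end
    return sum(fetch.(tasks))
end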