giordano / Cuba.jl

Library for multidimensional numerical integration with four independent algorithms: Vegas, Suave, Divonne, and Cuhre.
https://giordano.github.io/Cuba.jl/stable
MIT License

Combining Cuba.jl with GPU #36

Closed CharlesRSmith44 closed 3 years ago

CharlesRSmith44 commented 3 years ago

Hello,

I'm trying to use GPU computation via CUDA.jl to speed up the calculation of integrals. Is it possible to do so? If so, how? Here is my example code:

using Cuba, Distributions
using BenchmarkTools, Test, CUDA
using FLoops, FoldsCUDA
using SpecialFunctions

# user input
M = 10                  # number of independent uniform random variables
atol = 1e-6
rtol = 1e-3
nvec = 1_000_000
maxevals = 100_000_000

# setting up integration function
function int_thread_col_gpu_precompile_test(x, f)
    w = CuArray(x)
    @floop CUDAEx() for i in 1:size(x, 2)
    # @floop for i in 1:size(x, 2)
        val = reduce(*, @view(w[:, i]))
        f[i] = val
    end
    return f
end

cuhre(int_thread_col_gpu_precompile_test, M, 1, atol=atol, rtol=rtol)
suave(int_thread_col_gpu_precompile_test, M, 1, atol=atol, rtol=rtol, nvec = nvec, maxevals = maxevals)

If I use the CUDAEx() call, the code errors. If I don't, the code works fine but isn't exploiting the GPU effectively.

If I include the CUDAEx() call, the error message is

GPU compilation of kernel transduce_kernel!(Nothing, Transducers.Reduction{Transducers.Map{typeof(first)}, Transducers.BottomRF{Transducers.AdHocRF{var"#__##oninit_function#434#22", typeof(identity), InitialValues.AdjoinIdentity{var"#__##reducing_function#435#23"{Matrix{Float64}, CuDeviceMatrix{Float64, 1}}}, typeof(identity), var"#__##combine_function#436#24"}}}, Transducers.InitOf{Transducers.DefaultInitOf}, Int64, Base.OneTo{Int64}, UnitRange{Int64}) failed
KernelError: passing and using non-bitstype argument

Argument 3 to your kernel function is of type Transducers.Reduction{Transducers.Map{typeof(first)}, Transducers.BottomRF{Transducers.AdHocRF{var"#__##oninit_function#434#22", typeof(identity), InitialValues.AdjoinIdentity{var"#__##reducing_function#435#23"{Matrix{Float64}, CuDeviceMatrix{Float64, 1}}}, typeof(identity), var"#__##combine_function#436#24"}}}, which is not isbits:
  .inner is of type Transducers.BottomRF{Transducers.AdHocRF{var"#__##oninit_function#434#22", typeof(identity), InitialValues.AdjoinIdentity{var"#__##reducing_function#435#23"{Matrix{Float64}, CuDeviceMatrix{Float64, 1}}}, typeof(identity), var"#__##combine_function#436#24"}} which is not isbits.
    .inner is of type Transducers.AdHocRF{var"#__##oninit_function#434#22", typeof(identity), InitialValues.AdjoinIdentity{var"#__##reducing_function#435#23"{Matrix{Float64}, CuDeviceMatrix{Float64, 1}}}, typeof(identity), var"#__##combine_function#436#24"} which is not isbits.
      .next is of type InitialValues.AdjoinIdentity{var"#__##reducing_function#435#23"{Matrix{Float64}, CuDeviceMatrix{Float64, 1}}} which is not isbits.
        .op is of type var"#__##reducing_function#435#23"{Matrix{Float64}, CuDeviceMatrix{Float64, 1}} which is not isbits.
          .f is of type Matrix{Float64} which is not isbits.
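The `KernelError` arises because the `@floop CUDAEx()` body captures `f`, a CPU-side `Matrix{Float64}`, and GPU kernels only accept isbits arguments. One possible workaround (a sketch, not an API the thread confirms) is to skip the explicit loop entirely: do the column-wise product with GPU broadcasting/reduction and copy the whole batch of results back into `f` in a single transfer.

```julia
# Sketch of a workaround (untested assumption): avoid capturing the CPU
# Matrix `f` inside a GPU kernel by computing the column-wise product with
# a GPU-side reduction, then copying all results back at once.
using Cuba, CUDA

function int_gpu_noclosure(x, f)
    w = CuArray(x)            # one host→device copy per batch
    g = prod(w, dims = 1)     # column-wise product, computed on the GPU
    f .= Array(g)             # one device→host copy per batch
    return f
end

cuhre(int_gpu_noclosure, 10, 1, atol = 1e-6, rtol = 1e-3, nvec = 1_000_000)
```

This keeps every intermediate on the GPU, so no non-isbits closure ever reaches a kernel; the only transfers are the two explicit copies per batch.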
CharlesRSmith44 commented 3 years ago

Alternatively, I can bind arrays as CuArrays after drawing the random variables, but the time cost of doing so seems prohibitive. Instead, it seems much faster to avoid the GPU and use the CPU exclusively. Is there any way to decrease the time of the GPU option? I.e., could I draw the random variables on the GPU initially rather than having to transfer them over? I think the transferring from the GPU to the CPU is what makes the code so slow for the GPU option.

For example:

import Pkg
Pkg.activate("joint_timing")
Pkg.instantiate()
using Cuba, Distributions
using BenchmarkTools, Test, CUDA
using FLoops, FoldsCUDA
using SpecialFunctions

@test Threads.nthreads()>1

# User Inputs
M = 5                   # number of independent uniform random variables
atol = 1e-6
rtol = 1e-3
nvec = 1_000_000
maxevals = 100_000_000

# Initializing Functions
function int_cpu(x, f)
    f[1] = pdf(Product(Beta.(1.0, 2.0 * ones(M))), x)
end

function int_cpu2(x, f)
    f[1] = vec(prod(x' .^ (1.0 - 1.0) .* (1.0 .- x') .^ (2.0 - 1.0) ./ (gamma(1.0) * gamma(2.0) / gamma(3.0)), dims = 2))[1]
end

function beta_pdf_gpu(x, a, b)
    prod(x .^ (a - 1.0f0) .* (1.0f0 .- x) .^ (b - 1.0f0) ./ (gamma(a) * gamma(b) / gamma(a + b)), dims = 1)
end

function int_gpu(x, f)
    f[1] = vec(beta_pdf_gpu(CuArray(x), 1.0f0, 2.0f0))[1]
end

display(@benchmark cuhre($int_gpu, $M, 1, atol = $atol, rtol = $rtol))   # 70 ms for M = 5, 11.7 s for M = 15
display(@benchmark cuhre($int_cpu, $M, 1, atol = $atol, rtol = $rtol))   # 2.0 ms for M = 5, 650 ms for M = 15
display(@benchmark cuhre($int_cpu2, $M, 1, atol = $atol, rtol = $rtol))  # 500 µs for M = 5, 100 ms for M = 15, 38 s for M = 25
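The per-point transfer cost in `int_gpu` comes from calling `CuArray(x)` on a single point at a time. A hedged sketch of an alternative (assuming Cuba.jl's vectorized interface, where with `nvec > 1` the integrand receives `x` as an `M × nvec` matrix) amortizes each host↔device round-trip over a whole batch:

```julia
# Hedged sketch: with nvec > 1, Cuba passes an M×nvec batch of points, so
# each GPU round-trip covers nvec evaluations instead of one.
using Cuba, CUDA

M = 5

function int_gpu_batched(x, f)                 # x :: M×nvec, f :: 1×nvec
    w = CuArray{Float32}(x)                    # one host→device copy per batch
    # product of Beta(1, 2) densities: pdf(x) = 2(1 - x) in each dimension
    g = prod(2.0f0 .* (1.0f0 .- w), dims = 1)
    f .= Array(g)                              # one device→host copy per batch
    return f
end

cuhre(int_gpu_batched, M, 1, atol = 1e-6, rtol = 1e-3, nvec = 1_000_000)
```

Whether the batched copies beat the pure-CPU versions above would still need benchmarking; the sketch only reduces the number of transfers, not their total volume.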
giordano commented 3 years ago

I'm not really sure what you want to do here. Cuba.jl calls into a C shared library which runs on the CPU, so you're bound to constantly move data between the GPU and the CPU; there is really no way around it. Whether this is worth it or not depends on your specific problem, but in the vast majority of (or all?) cases I'd assume it isn't. I'm going to close this ticket because I don't think there is any action to take here.

It may be interesting to see whether HCubature.jl works out of the box with CuArrays: that's a pure-Julia package which implements the same algorithm as the cuhre function here.
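For reference, a minimal sketch of the HCubature.jl route (assuming its `hcubature(f, lo, hi)` API, where `f` takes a point and returns a scalar; whether it also accepts CuArray inputs is untested here), using the same product-of-Beta(1, 2) integrand as above:

```julia
# Sketch under assumptions: HCubature.jl's hcubature(f, lo, hi) returns
# (integral, error_estimate). Since the integrand is a product of
# Beta(1, 2) pdfs, its integral over [0, 1]^M is exactly 1.
using HCubature

M = 5
f(x) = prod(2 .* (1 .- x))        # ∏ᵢ 2(1 - xᵢ), the Beta(1, 2) pdf product
I, err = hcubature(f, zeros(M), ones(M); rtol = 1e-3, atol = 1e-6)
```

Being pure Julia, `f` here could in principle be written against GPU arrays directly, with no C boundary forcing a copy on every evaluation.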

CharlesRSmith44 commented 3 years ago

Thank you for the quick response! I will look into that.