Error in scalar reduction on the GPU

ali-ramadhan commented 3 years ago

I hope I'm not doing it wrong but I've been playing around with Tullio.jl (super neat package!) and I can't seem to get this simple expression to work on the GPU:

julia> using CUDA, KernelAbstractions, Tullio

julia> A = rand(5, 5) |> CuArray;

julia> B = rand(5, 5) |> CuArray;

julia> @tullio C = A[i, j] / 2 + B[i, j] / 3

produces this stacktrace

ERROR: MethodError: Cannot `convert` an object of type Nothing to an object of type Float64
Closest candidates are:
  convert(::Type{T}, ::T) where T<:Number at number.jl:6
  convert(::Type{T}, ::Number) where T<:Number at number.jl:7
  convert(::Type{T}, ::Base.TwicePrecision) where T<:Number at twiceprecision.jl:250
  ...
Stacktrace:
 [1] thread_scalar at /home/alir/.julia/packages/Tullio/bgqFi/src/threads.jl:237 [inlined]
 [2] (::var"#ℳ𝒶𝓀ℯ#10"{var"#𝒜𝒸𝓉!#1"})(::CuArray{Float64,2}, ::CuArray{Float64,2}) at /home/alir/.julia/packages/Tullio/bgqFi/src/macro.jl:805
 [3] (::Tullio.Eval{var"#ℳ𝒶𝓀ℯ#10"{var"#𝒜𝒸𝓉!#1"},var"#38#∇ℳ𝒶𝓀ℯ#9"{var"#∇𝒜𝒸𝓉!#5"}})(::CuArray{Float64,2}, ::Vararg{CuArray{Float64,2},N} where N) at /home/alir/.julia/packages/Tullio/bgqFi/src/eval.jl:20
 [4] top-level scope at /home/alir/.julia/packages/Tullio/bgqFi/src/macro.jl:975

with

Julia Version 1.5.2
Commit 539f3ce943 (2020-09-23 23:17 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) Silver 4214 CPU @ 2.20GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, cascadelake)

and

Status `~/.julia/environments/v1.5/Project.toml`
  [052768ef] CUDA v2.4.1
  [63c18a36] KernelAbstractions v0.5.3
  [bc48ee85] Tullio v0.2.12

ali-ramadhan commented 3 years ago

I should have also mentioned that the CPU version works

julia> A = rand(5, 5);

julia> B = rand(5, 5);

julia> @tullio C = A[i, j] / 2 + B[i, j] / 3
9.800577323796814

MasonProtter commented 3 years ago

It seems to be an error with a total scalar reduction. If there's a leftover index, it's okay:

julia> let U = cu(rand(3, 3, 3)), Δ = cu(rand(3))
           @tullio (max) out[i] := U[i, j, k] / Δ[k]
       end
3-element CuArray{Float32, 1}:
 2.7319489
 2.084404
 3.1774845

julia> let U = cu(rand(3, 3, 3)), Δ = cu(rand(3))
           @tullio (max) out := U[i, j, k] / Δ[k]
       end
ERROR: MethodError: Cannot `convert` an object of type Nothing to an object of type Float32
Closest candidates are:
  convert(::Type{T}, ::Base.TwicePrecision) where T<:Number at twiceprecision.jl:250
  convert(::Type{T}, ::AbstractChar) where T<:Number at char.jl:180
  convert(::Type{T}, ::CartesianIndex{1}) where T<:Number at multidimensional.jl:136
  ...
Stacktrace:
 [1] thread_scalar
   @ ~/.julia/packages/Tullio/bgqFi/src/threads.jl:237 [inlined]
 [2] (::var"#ℳ𝒶𝓀ℯ#201"{var"#𝒜𝒸𝓉!#192"})(U::CuArray{Float32, 3}, Δ::CuArray{Float32, 1})
   @ Main ~/.julia/packages/Tullio/bgqFi/src/macro.jl:805
 [3] (::Tullio.Eval{var"#ℳ𝒶𝓀ℯ#201"{var"#𝒜𝒸𝓉!#192"}, var"#1697#∇ℳ𝒶𝓀ℯ#200"{var"#∇𝒜𝒸𝓉!#196"}})(::CuArray{Float32, 3}, ::Vararg{Any, N} where N)
   @ Tullio ~/.julia/packages/Tullio/bgqFi/src/eval.jl:20
 [4] top-level scope
   @ REPL[86]:2

I guess one hack you could do as a stopgap before this is fixed is

julia> let U = cu(rand(3, 3, 3)), Δ = cu(rand(3))
           @tullio (max) out[l] := U[i, j, k] / Δ[k] l ∈ (1:1)
           sum(out)
       end
1.8847318f0

but it's annoying that you'd need to unwrap the array.

mcabbott commented 3 years ago

Yes, reductions to one scalar won't work on the GPU, I'm sorry.

The current KernelAbstractions.jl code is best for broadcasting-like operations, which are done in parallel over the output array. It is surely possible to do efficient scalar reductions on the GPU, but I have not looked into how. If someone can point me to a KA kernel for this, for some case, it might not be hard to make Tullio.jl generate something similar, in other cases.

A variant of Mason's stop-gap might be to write something like maximum(@tullio (max) out[k] := U[i, j, k] / Δ[k]), so that there is one nontrivial index to parallelise over. But I haven't tried this.

ali-ramadhan commented 3 years ago

Thank you for the replies! I was able to get it to work for my use case. Scales quite well on the CPU but not super great on the GPU (although certainly usable considering we now have features we didn't have before).

I learned that GPU reduction is quite non-trivial and that an optimized kernel can be ~30x faster than a naive implementation: https://developer.download.nvidia.com/compute/cuda/1.1-Beta/x86_website/projects/reduction/doc/reduction.pdf

Not sure if such an optimized reduction kernel exists in Julia/CUDA.jl but one might appear in the future.

I'll close this issue since my question was answered. Feel free to reopen it for GPU scalar reductions (I'm also happy to open a new issue to track it).

AriMKatz commented 3 years ago

@tkf were you working on GPU reductions? I think I saw something on github, but may be wrong

MasonProtter commented 3 years ago

I think this issue should stay open as it’s not fixed.

piever commented 3 years ago

If someone can point me to a KA kernel for this, for some case, it might not be hard to make Tullio.jl generate something similar, in other cases.

There is an implementation here used to reduce transducers on the GPU. I think it implements the last strategy of the pdf linked above without the template metaprogramming bit.

mcabbott / Tullio.jl

Error in scalar reduction on the GPU #91