ali-ramadhan opened 3 years ago
I should have also mentioned that the CPU version works:
julia> A = rand(5, 5);
julia> B = rand(5, 5);
julia> @tullio C := A[i, j] / 2 + B[i, j] / 3
9.800577323796814
It seems to be an error with a total scalar reduction. If there's a leftover index, it's okay:
julia> let U = cu(rand(3, 3, 3)), Ξ = cu(rand(3))
@tullio (max) out[i] := U[i, j, k] / Ξ[k]
end
3-element CuArray{Float32, 1}:
2.7319489
2.084404
3.1774845
julia> let U = cu(rand(3, 3, 3)), Ξ = cu(rand(3))
@tullio (max) out := U[i, j, k] / Ξ[k]
end
ERROR: MethodError: Cannot `convert` an object of type Nothing to an object of type Float32
Closest candidates are:
convert(::Type{T}, ::Base.TwicePrecision) where T<:Number at twiceprecision.jl:250
convert(::Type{T}, ::AbstractChar) where T<:Number at char.jl:180
convert(::Type{T}, ::CartesianIndex{1}) where T<:Number at multidimensional.jl:136
...
Stacktrace:
[1] thread_scalar
@ ~/.julia/packages/Tullio/bgqFi/src/threads.jl:237 [inlined]
[2] (::var"#ℳ𝒶𝓀ℯ#201"{var"#𝒜𝒸𝓉!#192"})(U::CuArray{Float32, 3}, Ξ::CuArray{Float32, 1})
@ Main ~/.julia/packages/Tullio/bgqFi/src/macro.jl:805
[3] (::Tullio.Eval{var"#ℳ𝒶𝓀ℯ#201"{var"#𝒜𝒸𝓉!#192"}, var"#1697#∇ℳ𝒶𝓀ℯ#200"{var"#∇𝒜𝒸𝓉!#196"}})(::CuArray{Float32, 3}, ::Vararg{Any, N} where N)
@ Tullio ~/.julia/packages/Tullio/bgqFi/src/eval.jl:20
@ Tullio ~/.julia/packages/Tullio/bgqFi/src/eval.jl:20
[4] top-level scope
@ REPL[86]:2
I guess one hack you could use as a stopgap until this is fixed is
julia> let U = cu(rand(3, 3, 3)), Ξ = cu(rand(3))
@tullio (max) out[l] := U[i, j, k] / Ξ[k] (l ∈ 1:1)
sum(out)
end
1.8847318f0
but it's annoying that you'd need to unwrap the array.
Yes, reductions to one scalar won't work on the GPU, I'm sorry.
The current KernelAbstractions.jl code is best for broadcasting-like operations, which are done in parallel over the output array. It is surely possible to do efficient scalar reductions on the GPU, but I have not looked into how. If someone can point me to a KA kernel for this, for some case, it might not be hard to make Tullio.jl generate something similar, in other cases.
A variant of Mason's stop-gap might be to write something like maximum(@tullio (max) out[k] := U[i, j, k] / Ξ[k]), so that there is one nontrivial index to parallelise over. But I haven't tried this.
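The idea behind this variant can be illustrated outside Julia. The sketch below uses NumPy as a CPU stand-in for a GPU array (the array names U and xi mirror the example above, but everything here is illustrative, not Tullio's actual generated code): reduce over all indices except one in a parallel-friendly first stage, then finish the tiny leftover reduction on the host.

```python
import numpy as np

rng = np.random.default_rng(0)
U = rng.random((3, 3, 3))
xi = rng.random(3)  # stands in for the Xi vector in the example above

# Stage 1: like `@tullio (max) out[k] := U[i, j, k] / xi[k]` -- one
# nontrivial output index k remains, so each out[k] can be computed
# by an independent parallel worker (this is what runs well on the GPU).
out = (U / xi[None, None, :]).max(axis=(0, 1))

# Stage 2: finish the scalar reduction over the small per-k array.
result = out.max()

# Same answer as the direct one-shot scalar reduction:
assert np.isclose(result, (U / xi[None, None, :]).max())
```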
Thank you for the replies! I was able to get it to work for my use case. Scales quite well on the CPU but not super great on the GPU (although certainly usable considering we now have features we didn't have before).
I learned that GPU reduction is quite non-trivial and that an optimized kernel can be ~30x faster than a naive implementation: https://developer.download.nvidia.com/compute/cuda/1.1-Beta/x86_website/projects/reduction/doc/reduction.pdf
Not sure if such an optimized reduction kernel exists in Julia/CUDA.jl but one might appear in the future.
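For intuition, the core strategy in the NVIDIA slides linked above is a tree reduction: each step halves the number of active elements, so n values are reduced in O(log n) parallel steps rather than n sequential ones. Below is a minimal sequential sketch of that access pattern (Python for illustration only; in a real CUDA kernel each iteration of the inner loop would be a separate thread, with further tricks like sequential addressing and loop unrolling accounting for the ~30x gap mentioned above).

```python
def tree_reduce_max(values):
    # Sketch of a tree reduction. On a GPU, all the updates at a given
    # stride happen in parallel; here they run sequentially.
    x = list(values)
    n = len(x)
    stride = 1
    while stride < n:
        # Each active slot combines itself with a partner `stride` away.
        for i in range(0, n - stride, 2 * stride):
            x[i] = max(x[i], x[i + stride])
        stride *= 2
    return x[0]

vals = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]
assert tree_reduce_max(vals) == max(vals)
```

The same pattern works for any associative operator (sum, min, max), which is why one generic kernel can cover all of Tullio's reductions.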
I'll close this issue since my question was answered. Feel free to reopen it for GPU scalar reductions (I'm also happy to open a new issue to track it).
@tkf were you working on GPU reductions? I think I saw something on GitHub, but I may be wrong.
I think this issue should stay open as it's not fixed.
If someone can point me to a KA kernel for this, for some case, it might not be hard to make Tullio.jl generate something similar, in other cases.
There is an implementation here used to reduce transducers on the GPU. I think it implements the last strategy of the pdf linked above without the template metaprogramming bit.
I hope I'm not doing it wrong, but I've been playing around with Tullio.jl (super neat package!) and I can't seem to get this simple expression to work on the GPU; it produces the stacktrace above.