mcabbott / Tullio.jl


Slower performance on GPU than CPU with Zygote #30

Open lpjiang97 opened 3 years ago

lpjiang97 commented 3 years ago

I'm experiencing slower performance when taking gradients with Zygote on CuArrays than on normal CPU Arrays. Here's an MWE (the Tullio operations implement a mixture linear regression):

using Tullio
using Zygote
using CUDA, KernelAbstractions

CUDA.allowscalar(false)

function prediction(V, W, r)
    @tullio mixture[a, b, d] := V[a, b, c] * W[d, c] 
    @tullio r̂[a, d] := mixture[a, b, d] * r[b, d] 
    return r̂
end

### CuArray version
V = CuArray(randn(15, 15, 3))
W = CuArray(randn(150, 3))

@time begin 
    grad = gradient(Params([V])) do 
        l = 0
        for t = 1:5
            r = CuArray(randn(15, 150))
            r̂ = prediction(V, W, r) 
            l += sum(r - r̂)
        end
        l
    end
end
# output:   0.304601 seconds (516.80 k allocations: 15.125 MiB)

If I replace all the arrays to CPU, I get 0.151464 seconds (209.86 k allocations: 13.321 MiB).

I'm wondering if this is expected? Or am I forgetting something when using CUDA arrays that slows down the performance?

mcabbott commented 3 years ago

I'm away from my GPU, but there are indeed some performance problems which I haven't got around to looking into. IIRC matmul!(a, b, c) from https://github.com/JuliaGPU/KernelAbstractions.jl/blob/master/examples/matmul.jl was faster than t_mul!(a, b, c) = @tullio c[i,j] = a[i,k] * b[k, j], although they should be more or less identical.

But it's also possible that randn(15, 15, 3) is just too small to benefit much. And finally, does CuArray(randn(15, 150)) make a CuArray{Float64}, or Float32?
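
For anyone following along, here is a rough sketch of how one might check both points: the element type produced by CuArray(randn(...)), and a head-to-head race of the quoted @tullio matmul against CuBLAS via mul!. The sizes, variable names, and N below are arbitrary choices for illustration, not anything from the original benchmarks.

using Tullio, KernelAbstractions, CUDA, LinearAlgebra, BenchmarkTools

eltype(CuArray(randn(15, 150)))           # Float64 -- randn defaults to Float64
eltype(CuArray(randn(Float32, 15, 150)))  # Float32, usually much faster on GPU

t_mul!(a, b, c) = @tullio c[i, j] = a[i, k] * b[k, j]   # the definition quoted above

N = 1000
a, b, c = ntuple(_ -> CUDA.rand(Float32, N, N), 3)
@btime CUDA.@sync t_mul!($a, $b, $c)   # Tullio's KernelAbstractions kernel
@btime CUDA.@sync mul!($c, $a, $b)     # CuBLAS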

lpjiang97 commented 3 years ago

I tried larger sizes for the tensors, but the CuArray version just takes too long to run, so it does look like the runtime gets worse with size. The arrays were initialized as Float64, yes, but Float32 didn't make much of a difference in my testing environment (Ubuntu 18.04, CUDA 10.2).

mcabbott commented 3 years ago

I can reproduce this on Google Colab now. I ran Mason's test, with a third line added for the KernelAbstractions matmul example linked above. Results are:

N = 100
  0.000479 seconds (184 CPU allocations: 5.250 KiB)  # @tullio
  0.000186 seconds (14 CPU allocations: 400 bytes)   # mul!, i.e. CuBLAS
  0.000285 seconds (233 CPU allocations: 5.531 KiB)  # KA's example code
N = 200
  0.000818 seconds (698 CPU allocations: 13.281 KiB)
  0.000214 seconds (14 CPU allocations: 400 bytes)
  0.000654 seconds (743 CPU allocations: 13.500 KiB)
N = 500
  0.005852 seconds (8.60 k CPU allocations: 136.688 KiB)
  0.000415 seconds (14 CPU allocations: 400 bytes)
  0.005665 seconds (8.51 k CPU allocations: 134.781 KiB)
N = 1000
  0.044196 seconds (66.68 k CPU allocations: 1.020 MiB)
  0.001492 seconds (14 CPU allocations: 400 bytes)
  0.041200 seconds (57.60 k CPU allocations: 916.719 KiB)
N = 2000
  0.319958 seconds (430.11 k CPU allocations: 6.573 MiB)
  0.007682 seconds (14 CPU allocations: 400 bytes)
  0.274934 seconds (384.18 k CPU allocations: 5.872 MiB)
N = 5000
  4.505518 seconds (6.34 M CPU allocations: 96.697 MiB, 1.09% gc time)
  0.084628 seconds (14 CPU allocations: 400 bytes)
  4.367660 seconds (6.16 M CPU allocations: 93.929 MiB, 0.48% gc time)

So it looks like @tullio is correctly producing the most naive KA code (maybe with a few hundred μs of overhead), but this is very slow right now compared to optimised kernels. Slower than the CPU, too. I don't really know anything about GPUs, so I can't promise to fix that soon. I guess the present state is a bit like using @einsum on the CPU.

In your example, mixture[a, b, d] := V[a, b, c] * W[d, c] is an ordinary matrix multiplication, with W transposed and V reshaped. And r̂[a, d] := mixture[a, b, d] * r[b, d] is a batched multiplication over d, so it can be written with NNlib.batched_mul after reshaping r. If this is the real operation, that's probably what you should do. (OMEinsum.jl might be able to write these reshapes for you, with the same notation.)
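
For reference, a sketch of that reshape-based rewrite, assuming the array shapes from the MWE above (prediction_reshaped is just an illustrative name, not part of the original code):

using NNlib: batched_mul

function prediction_reshaped(V, W, r)
    a, b, c = size(V)          # V is (a, b, c), W is (d, c), r is (b, d)
    d = size(W, 1)
    # mixture[a, b, d] := V[a, b, c] * W[d, c] is one matmul after reshaping V
    mixture = reshape(reshape(V, a * b, c) * transpose(W), a, b, d)
    # r̂[a, d] := mixture[a, b, d] * r[b, d] is a batched matrix-vector product over d
    r̂ = batched_mul(mixture, reshape(r, b, 1, d))
    return reshape(r̂, a, d)
end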

lpjiang97 commented 3 years ago

Thanks for testing! I did rewrite the code with reshape and batched_mul, which is faster on the GPU. I guess I will leave this open for now :)

ChrisRackauckas commented 3 years ago

Is it possible to assume no aliasing? It would be good to add some @Const declarations to the generated kernel.
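
For context, this is roughly what such annotations look like in a hand-written KernelAbstractions kernel (a sketch of a naive matmul, not Tullio's generated code); @Const marks arguments as read-only, so the compiler may assume they don't alias the output:

using KernelAbstractions

@kernel function matmul_naive!(c, @Const(a), @Const(b))
    i, j = @index(Global, NTuple)   # one thread per output element
    acc = zero(eltype(c))
    for k in 1:size(a, 2)
        acc += a[i, k] * b[k, j]
    end
    c[i, j] = acc
end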

mcabbott commented 3 years ago

Yes I meant to do that. Any idea if this speeds up the examples/matmul.jl linked above?

ChrisRackauckas commented 3 years ago

I don't know about that example, but I know we about halved the time in DiffEqGPU by adding those designations.

mcabbott commented 3 years ago

OK, good to hear.

Some possible improvements in #32, but CI isn't happy, and my computer with a GPU isn't answering the phone.

ChrisRackauckas commented 3 years ago

You should join the JuliaGPU group which has a bunch of GPU CIs for the community.

mcabbott commented 3 years ago

Yes I am a member, it's a great resource to have. https://gitlab.com/JuliaGPU/Tullio.jl/-/jobs/725368493 is my failed test. But further debugging may have to wait until I can try things locally.

jumerckx commented 3 years ago

More seems to be going on when using Zygote. In this example Tullio on the GPU beats the CPU version in the forward pass, but the gradient calculation is much slower. Don't Tullio gradients come down to generating another Tullio expression?

using Tullio, KernelAbstractions, Flux, BenchmarkTools, CUDA

a = rand(100, 200, 3); b = rand(200, 100, 3);

f(a, b) = begin
    @tullio out[m, n, batch] := a[m, i, batch] * b[i, n, batch]
    sum(out)
end

# CPU
@btime f(a, b) # 601.801 μs (201 allocations: 248.08 KiB)

@btime Flux.gradient(a) do a
    f(a, b)
end # 2.499 ms (515 allocations: 1.18 MiB)

#GPU
a = gpu(a); b = gpu(b);

@btime CUDA.@sync f(a, b) # 238.800 μs (224 allocations: 7.50 KiB)

@btime CUDA.@sync Flux.gradient(a) do a
    f(a, b)
end # 4.745 ms (527 allocations: 20.33 KiB)

(Transformer) pkg> st
Status `C:\Users\jules\OneDrive - UGent\Documenten\code\Transformer\Project.toml`
  [fbb218c0] BSON v0.2.6
  [a4280ba5] BytePairEncoding v0.1.1
  [052768ef] CUDA v1.3.3
  [864edb3b] DataStructures v0.17.20
  [587475ba] Flux v0.11.1
  [63c18a36] KernelAbstractions v0.3.3
  [899adc3e] TensorBoardLogger v0.1.11
  [bc48ee85] Tullio v0.2.5 `C:\Users\jules\.julia\dev\Tullio`
  [e88e6eb3] Zygote v0.5.6

mcabbott commented 3 years ago

Yes, the gradients are computed by very similar loops; here are the details from verbose=true:

┌ Info: symbolic gradients
│   inbody =
│    2-element Array{Any,1}:
│     :(𝛥a[m, i, batch] = 𝛥a[m, i, batch] + conj(conj(𝛥ℛ[m, n, batch]) * b[i, n, batch]))
└     :(𝛥b[i, n, batch] = 𝛥b[i, n, batch] + conj(conj(𝛥ℛ[m, n, batch]) * a[m, i, batch]))
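
(For reference, that printout can be reproduced by passing the verbose keyword to the same call; the trailing placement below is how I believe the option is written:)

using Tullio

a = rand(100, 200, 3); b = rand(200, 100, 3);
@tullio out[m, n, batch] := a[m, i, batch] * b[i, n, batch] verbose=true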

The whole gradient calculation is about 4x slower than the forward pass alone for me too, both with and without LoopVectorization. That doesn't seem unreasonable. But 20x slower on the GPU isn't good.

Compared to:

julia> using NNlib

julia> @btime batched_mul($a, $b); # about the same as avx=true, 146 μs
  137.310 μs (2 allocations: 234.45 KiB)

julia> fwd, back = Zygote.pullback(batched_mul, a, b); 

julia> @btime back($fwd);
  309.617 μs (5 allocations: 937.69 KiB)

julia> (309.617 + 137.310) / 137.310  # instead of 4x, roughly?
3.2548758284174495

julia> @btime Zygote.gradient(sum∘batched_mul, $a, $b);  # FillArrays go to generic_matmul
  10.543 ms (24 allocations: 1.15 MiB)

Adding nograd=b ought to speed up the calculation of the a gradient alone, if that is all you need. But here it appears to break @avx; I'm not sure why.
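
Concretely, that would look something like the sketch below (f_agrad is just an illustrative name):

using Tullio, KernelAbstractions

# mark b as constant so Tullio only writes a gradient for a
f_agrad(a, b) = begin
    @tullio out[m, n, batch] := a[m, i, batch] * b[i, n, batch] nograd=b
    sum(out)
end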

jumerckx commented 3 years ago

What actually causes the 4x difference between the forward and backward passes? Is it 4x more computation, or is there a large overhead?

I don't really know what the variables in Tullio's grad expression mean or what their sizes are, but if I wrote these rules manually, do you think they would also run 20x slower on the GPU? Or is there something else going on?

mcabbott commented 3 years ago

The naive expectation (for square matrices) would be 3x, as gradient first runs the forward pass, and then the reverse pass has two similar multiplications, one to make 𝛥a and one to make 𝛥b. Here the matrices aren't square: the forward pass contracts the longer index i and so allocates a smaller result, which maybe changes things a little.
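
For reference, a minimal sketch of where that factor comes from, using plain dense matrices and loss = sum(C) (standard matrix-calculus rules, not Tullio's exact generated code):

using LinearAlgebra

A, B = randn(100, 200), randn(200, 100)
∇C = ones(100, 100)        # gradient of sum(C) with respect to C

C  = A * B                 # forward pass: one matmul
∇A = ∇C * transpose(B)     # reverse pass: one matmul of similar cost...
∇B = transpose(A) * ∇C     # ...and another, hence roughly 3x the forward pass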

I am not so sure why the GPU code is fast or slow in different cases. KernelAbstractions has a lot of black magic to map loops onto GPU hardware; maybe this case is genuinely harder, or maybe the forward pass is just closer to ordinary matrix multiplication, which was presumably used as a test case when writing it. The code which @tullio produces is like the simplest example here: https://github.com/JuliaGPU/KernelAbstractions.jl/blob/master/examples/performance.jl and perhaps it needs to learn to generate something like the more complicated versions there. I haven't looked into this at all. (On the CPU, it already does some kind of memory tiling.)

Doing these races against matrix multiplication is a good way to find performance bugs, and I remain amazed at how close to BLAS it sometimes gets on the CPU. But the intention was more to cover cases for which there isn't a great library function. The CuBLAS routine called by batched_mul is pretty quick! It's possible that @tullio could end up faster than something like permutedims + batched_mul, because permutedims is expensive and @tullio can fuse the two steps; I think the readme has a CPU example of this.
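
For instance, something like this sketch (CPU arrays with arbitrary sizes; the two should agree up to floating-point error, and the @tullio version avoids materialising the permuted copy):

using Tullio, NNlib

x, y = rand(100, 200, 16), rand(100, 300, 16)

f_permute(x, y) = batched_mul(permutedims(x, (2, 1, 3)), y)       # materialise the transpose first
f_fused(x, y) = @tullio out[i, j, b] := x[k, i, b] * y[k, j, b]   # fuse it into the contraction

f_permute(x, y) ≈ f_fused(x, y)   # true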

simeonschaub commented 3 years ago

Since the leading dimensions here are quite small, #82 should improve this a bit, as it enables threading over more than just one dimension. It will probably still be nowhere near batched_mul, though.