mcabbott / Tullio.jl

MIT License

Fused contraction + gather operation, CPU & especially GPU #107

Open mkschleg opened 3 years ago

mkschleg commented 3 years ago

I've been benchmarking packages for tensor operations to use w/ Flux. I noticed that OMEinsum was faster than Tullio for a particular operation on the CPU, but that Tullio was faster on the GPU with the exact same operation. While this isn't a blocker, since I could use dispatch to specialize, I would like to use a single package for this if possible, for simplicity. Maybe there is something I can do to speed up the CPU version of Tullio? (A lot of my research uses small models almost entirely on the CPU, but I also do scaling experiments where appropriate using the GPU, so this is a must.)

Setup:

using CUDA, CUDAKernels
using LoopVectorization, KernelAbstractions
using Flux, Tullio, OMEinsum

contract_WA(W, a::AbstractVector{Int}, x) = begin
    @tullio ret[i, k] := W[a[k], i, j] * x[j, k]
end

contract_WA_ein(W, a::AbstractVector{Int}, x) = begin
    OMEinsum.@ein ret[i, k] := W[a[k], i, j] * x[j, k]
end

using BenchmarkHistograms, LinearAlgebra

LinearAlgebra.BLAS.set_num_threads(Threads.nthreads())

W = rand(Float32, 18, 1024, 1024)
x = rand(Float32, 1024, 64)
a = rand(1:18, 64)

W_gpu = W |> gpu
x_gpu = x |> gpu
a_gpu = a |> gpu  # needed to not use scalar operations w/ Tullio
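For reference, here is the operation both macros are being asked to compute, written as explicit loops (a hypothetical helper, just to pin down the semantics; not part of the benchmark):

```julia
# Plain-loop reference for ret[i, k] = sum over j of W[a[k], i, j] * x[j, k].
# Hypothetical helper name; only useful for checking results, not for speed.
function contract_WA_loops(W, a::AbstractVector{<:Integer}, x)
    ret = zeros(eltype(W), size(W, 2), length(a))
    for k in axes(ret, 2), i in axes(ret, 1), j in axes(x, 1)
        ret[i, k] += W[a[k], i, j] * x[j, k]
    end
    return ret
end
```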

With results on CPU with a single thread (Intel i9 something something...):

julia> @benchmark contract_WA($W, $a, $x)
BenchmarkTools.Trial:
  memory estimate:  256.19 KiB
  allocs estimate:  8
  --------------
  minimum time:     48.804 ms (0.00% GC)
  median time:      48.873 ms (0.00% GC)
  mean time:        48.882 ms (0.00% GC)
  maximum time:     49.189 ms (0.00% GC)
  --------------
  samples:          103
  evals/sample:     1

julia> @benchmark contract_WA_ein($W, $a, $x)
BenchmarkTools.Trial:
  memory estimate:  4.26 MiB
  allocs estimate:  103
  --------------
  minimum time:     11.698 ms (0.00% GC)
  median time:      11.774 ms (0.00% GC)
  mean time:        11.854 ms (0.36% GC)
  maximum time:     13.448 ms (7.07% GC)
  --------------
  samples:          422
  evals/sample:     1

And GPU (Nvidia 2080):

julia> @benchmark contract_WA($W_gpu, $a_gpu, $x_gpu)
BenchmarkTools.Trial:
  memory estimate:  7.97 KiB
  allocs estimate:  274
  --------------
  minimum time:     261.116 μs (0.00% GC)
  median time:      1.477 ms (0.00% GC)
  mean time:        1.492 ms (0.41% GC)
  maximum time:     19.496 ms (52.51% GC)
  --------------
  samples:          3346
  evals/sample:     1

julia> @benchmark contract_WA_ein($W_gpu, $a_gpu, $x_gpu)
BenchmarkTools.Trial:
  memory estimate:  8.09 KiB
  allocs estimate:  203
  --------------
  minimum time:     33.598 μs (0.00% GC)
  median time:      3.208 ms (0.00% GC)
  mean time:        2.940 ms (0.07% GC)
  maximum time:     8.422 ms (44.48% GC)
  --------------
  samples:          1700
  evals/sample:     1
mkschleg commented 3 years ago

Also, as the number of threads increases (BLAS threads for OMEinsum, Julia threads for Tullio), Tullio starts approaching OMEinsum in performance.

mcabbott commented 3 years ago

Thanks for the example, this looks like the sort of thing they ought to be good at, although in the end... we will see.

I wonder what OMEinsum is doing here? (Edit -- judging by the memory, perhaps it's doing contract_WA_3? Now I tried an older version, too.) I get quite a large difference between the two results. Cc @GiggleLiu?

julia> using Statistics, TensorCore

julia> r1 = @btime contract_WA($W, $a, $x);
  13.796 ms (52 allocations: 258.41 KiB)

julia> r2 = @btime contract_WA_ein($W, $a, $x);
  6.653 ms (102 allocations: 4.26 MiB)  # OMEinsum v0.4
  1.558 s (6 allocations: 256.28 KiB)   # OMEinsum v0.3.3 

julia> contract_WA_2(W, a::AbstractVector{Int}, x) = begin  # a bad idea
           @tullio mid[j, i, k] := W[a[k], i, j]
           @tullio ret[i, k] := mid[j, i, k] * x[j, k]
       end
contract_WA_2 (generic function with 1 method)

julia> @btime contract_WA_2($W, $a, $x);
  109.873 ms (100 allocations: 256.25 MiB)

julia> contract_WA_3(W, a::AbstractVector{Int}, x) = begin  # a better idea?
           # @tullio mid[μ, i, k] := W[μ, i, j] * x[j, k]
           mid = W ⊡ x
           @tullio ret[i, k] := mid[a[k], i, k]
       end
contract_WA_3 (generic function with 1 method)

julia> r3 = @btime contract_WA_3($W, $a, $x);
  33.620 ms (54 allocations: 4.75 MiB) # all Tullio
  22.286 ms (8 allocations: 4.75 MiB)  # with ⊡, BLAS

julia> r1 ≈ r3
true

julia> r1 ≈ r2
false

julia> mean(abs.(r1 .- r2))
4362.551f0  # similar on both versions

All on an M1 mac, 1.6 + rosetta, 4 threads:

(jl_1fKnTY) pkg> st OMEinsum
      Status `/private/var/folders/yq/4p2zwd614y59gszh7y9ypyhh0000gn/T/jl_1fKnTY/Project.toml`
  [ebe7aa44] OMEinsum v0.4.0  # and I tried v0.4.2 too, similar results.
  [ebe7aa44] OMEinsum v0.3.3  # pasted in below, above!

(jl_1fKnTY) pkg> st Tullio
      Status `/private/var/folders/yq/4p2zwd614y59gszh7y9ypyhh0000gn/T/jl_1fKnTY/Project.toml`
  [bc48ee85] Tullio v0.2.14
chriselrod commented 3 years ago

On my "i9 whatever", Tullio is 5x faster:

julia> using OMEinsum, Tullio

julia> contract_WA_ein(W, a::AbstractVector{<:Integer}, x) = begin
           OMEinsum.@ein ret[i, k] := W[a[k], i, j] * x[j, k]
       end
contract_WA_ein (generic function with 1 method)

julia> contract_WA(W, a::AbstractVector{<:Integer}, x) = begin
           @tullio ret[i, k] := W[a[k], i, j] * x[j, k]
       end
contract_WA (generic function with 1 method)

julia> W = rand(Float32, 18, 1024, 1024);

julia> x = rand(Float32, 1024, 64);

julia> a = rand(Int32(1):Int32(18), 64);

julia> contract_WA(W,a,x) ≈ contract_WA_ein(W,a,x)
false

julia> @benchmark contract_WA($W, $a, $x)
samples: 483; evals/sample: 1; memory estimate: 268.59 KiB; allocs estimate: 246
ns

 (1.0e7  - 1.03e7]  ██████████████████████████████ 278
 (1.03e7 - 1.06e7]  ███████████████████▋181
 (1.06e7 - 1.09e7]  ▊6
 (1.09e7 - 1.12e7]  ▌4
 (1.12e7 - 1.15e7]  ▎2
 (1.15e7 - 1.18e7]  ▍3
 (1.18e7 - 1.2e7 ]  ▏1
 (1.2e7  - 1.23e7]  ▏1
 (1.23e7 - 1.26e7]  ▌4
 (1.26e7 - 1.29e7]  ▍3

                  Counts

min: 10.021 ms (0.00% GC); mean: 10.355 ms (0.05% GC); median: 10.295 ms (0.00% GC); max: 12.911 ms (0.00% GC).

julia> @benchmark contract_WA_ein($W, $a, $x)
samples: 377; evals/sample: 1; memory estimate: 4.26 MiB; allocs estimate: 99
ns

 (1.29e7 - 1.31e7]  ██████████████████████████████ 164
 (1.31e7 - 1.33e7]  █████████████▍73
 (1.33e7 - 1.35e7]  █████████████▎72
 (1.35e7 - 1.37e7]  ████▋25
 (1.37e7 - 1.39e7]  ███16
 (1.39e7 - 1.41e7]  █▍7
 (1.41e7 - 1.43e7]  █▍7
 (1.43e7 - 1.45e7]  ▊4
 (1.45e7 - 1.48e7]  █5
 (1.48e7 - 1.5e7 ]  ▊4

                  Counts

min: 12.857 ms (0.00% GC); mean: 13.260 ms (0.52% GC); median: 13.165 ms (0.00% GC); max: 14.969 ms (10.30% GC).

julia> versioninfo()
Julia Version 1.7.0-DEV.1124
Commit d18cf93bac* (2021-05-19 16:11 UTC)
Platform Info:
  OS: Linux (x86_64-generic-linux)
  CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, cascadelake)
Environment:
  JULIA_NUM_THREADS = 36

Using 4 threads:

julia> using LinearAlgebra

julia> BLAS.set_num_threads(4);

julia> Threads.nthreads()
4

julia> using LoopVectorization

julia> using OMEinsum, Tullio

julia> contract_WA_ein(W, a::AbstractVector{<:Integer}, x) = begin
           OMEinsum.@ein ret[i, k] := W[a[k], i, j] * x[j, k]
       end
contract_WA_ein (generic function with 1 method)

julia> contract_WA(W, a::AbstractVector{<:Integer}, x) = begin
           @tullio ret[i, k] := W[a[k], i, j] * x[j, k]
       end
contract_WA (generic function with 1 method)

julia> W = rand(Float32, 18, 1024, 1024);

julia> x = rand(Float32, 1024, 64);

julia> a = rand(Int32(1):Int32(18), 64);

julia> contract_WA(W,a,x) ≈ contract_WA_ein(W,a,x)
false

julia> @benchmark contract_WA($W, $a, $x)
samples: 721; evals/sample: 1; memory estimate: 258.59 KiB; allocs estimate: 52
ns

 (6.87e6 - 6.98e6]  ██████████████████████████████ 661
 (6.98e6 - 7.1e6 ]  ██▋57
 (7.1e6  - 7.21e6]  ▏1
 (7.21e6 - 7.33e6]   0
 (7.33e6 - 7.44e6]   0
 (7.44e6 - 7.56e6]   0
 (7.56e6 - 7.67e6]   0
 (7.67e6 - 7.79e6]   0
 (7.79e6 - 7.9e6 ]   0
 (7.9e6  - 8.02e6]   0
 (8.02e6 - 8.13e6]  ▏2

                  Counts

min: 6.867 ms (0.00% GC); mean: 6.930 ms (0.04% GC); median: 6.918 ms (0.00% GC); max: 8.134 ms (13.24% GC).

julia> @benchmark contract_WA_ein($W, $a, $x)
samples: 443; evals/sample: 1; memory estimate: 4.26 MiB; allocs estimate: 105
ns

 (1.11e7 - 1.13e7]  ██████████████████████████████ 393
 (1.13e7 - 1.16e7]  █13
 (1.16e7 - 1.18e7]  ▌5
 (1.18e7 - 1.21e7]  ▎2
 (1.21e7 - 1.23e7]  █▏14
 (1.23e7 - 1.26e7]  ▉11
 (1.26e7 - 1.28e7]  ▎2
 (1.28e7 - 1.31e7]  ▏1
 (1.31e7 - 1.34e7]  ▏1
 (1.34e7 - 1.36e7]  ▏1

                  Counts

min: 11.090 ms (0.00% GC); mean: 11.304 ms (0.56% GC); median: 11.207 ms (0.00% GC); max: 13.602 ms (9.18% GC).

Using 1 thread:

julia> using LinearAlgebra

julia> BLAS.set_num_threads(1);

julia> Threads.nthreads()
1

julia> using LoopVectorization

julia> using OMEinsum, Tullio

julia> contract_WA_ein(W, a::AbstractVector{<:Integer}, x) = begin
           OMEinsum.@ein ret[i, k] := W[a[k], i, j] * x[j, k]
       end
contract_WA_ein (generic function with 1 method)

julia> contract_WA(W, a::AbstractVector{<:Integer}, x) = begin
           @tullio ret[i, k] := W[a[k], i, j] * x[j, k]
       end
contract_WA (generic function with 1 method)

julia> W = rand(Float32, 18, 1024, 1024);

julia> x = rand(Float32, 1024, 64);

julia> a = rand(Int32(1):Int32(18), 64);

julia> contract_WA(W,a,x) ≈ contract_WA_ein(W,a,x)
false

julia> @benchmark contract_WA($W, $a, $x)
samples: 185; evals/sample: 1; memory estimate: 256.08 KiB; allocs estimate: 2
ns

 (2.702e7 - 2.711e7]  ██████████████████████████████ 169
 (2.711e7 - 2.719e7]  ██▌14
 (2.719e7 - 2.728e7]   0
 (2.728e7 - 2.736e7]  ▎1
 (2.736e7 - 2.745e7]   0
 (2.745e7 - 2.753e7]   0
 (2.753e7 - 2.762e7]   0
 (2.762e7 - 2.77e7 ]   0
 (2.77e7  - 2.779e7]  ▎1

                  Counts

min: 27.020 ms (0.00% GC); mean: 27.069 ms (0.00% GC); median: 27.060 ms (0.00% GC); max: 27.789 ms (0.00% GC).

julia> @benchmark contract_WA_ein($W, $a, $x)
samples: 415; evals/sample: 1; memory estimate: 4.26 MiB; allocs estimate: 105
ns

 (1.18e7 - 1.2e7 ]  ████████████ 108
 (1.2e7  - 1.21e7]  ██████████████████████████████ 270
 (1.21e7 - 1.22e7]  █8
 (1.22e7 - 1.24e7]  ▋5
 (1.24e7 - 1.25e7]   0
 (1.25e7 - 1.26e7]   0
 (1.26e7 - 1.28e7]   0
 (1.28e7 - 1.29e7]   0
 (1.29e7 - 1.31e7]  █▉16
 (1.31e7 - 1.32e7]  █8

                  Counts

min: 11.829 ms (0.00% GC); mean: 12.064 ms (0.52% GC); median: 12.009 ms (0.00% GC); max: 13.196 ms (9.13% GC).

Hmm. Obviously they're not doing the same thing (results differ), but whatever OMEinsum is doing is >2x faster on a single thread on my computer.

GiggleLiu commented 3 years ago

I do not think OMEinsum can handle this contraction pattern.

julia> using OMEinsum

julia> @macroexpand contract_WA_ein(W, a::AbstractVector{Int}, x) = begin
           OMEinsum.@ein ret[i, k] := W[a[k], i, j] * x[j, k]
       end

:(contract_WA_ein(W, a::AbstractVector{Int}, x) = begin
          #= REPL[2]:1 =#
          #= REPL[2]:2 =#
          ret = OMEinsum.einsum(OMEinsum.EinCode{((1, 2, 3), (3, 4)), (2, 4)}(), (W, x))
      end)

It is generating the wrong contraction for you. Maybe we should make the @ein interface more restrictive so that it errors on patterns like this.

mcabbott commented 3 years ago

Thanks, I was just digging into macroexpand too:

julia> @ein ret[i, k] := W[a[k], i, j] * x[j, k];

julia> ret ≈ @tullio ret_2[i, k] := W[α, i, j] * x[j, k]
true

or with the string macro:

julia> ret ≈ ein"αij, jk -> ik"(W,x)
true

julia> ein"A[k]ij, jk -> ik"(W,x)
ERROR: ArgumentError: indices ('i', 'j') invalid for tensor with ndims = 3
mcabbott commented 3 years ago

Trying to time the GPU things, I think you may need CUDA.@sync to get real times? (Or maybe the mean is reliable?) I get the following:

julia> @btime contract_WA($W_gpu, $a_gpu, $x_gpu); # not sure this times correctly
  610.445 μs (260 allocations: 7.75 KiB)

julia> r1_gpu = @btime CUDA.@sync contract_WA($W_gpu, $a_gpu, $x_gpu);
  1.452 ms (730 allocations: 15.09 KiB)

julia> contract_WA_3(W, a::AbstractVector{<:Integer}, x) = begin 
           # @tullio mid[μ, i, k] := W[μ, i, j] * x[j, k]
           mid = W ⊡ x
           @tullio ret[i, k] := mid[a[k], i, k]
       end
contract_WA_3 (generic function with 2 methods)

julia> r3_gpu = @btime CUDA.@sync contract_WA_3($W_gpu, $a_gpu, $x_gpu);
  398.541 μs (359 allocations: 9.56 KiB)

julia> r1_gpu ≈ r3_gpu ≈ cu(contract_WA(W, a, x))
true

julia> @btime CUDA.@sync ($W_gpu ⊡ $x_gpu);  # just the matmul part
  242.413 μs (185 allocations: 3.38 KiB)

julia> CUDA.device()
CuDevice(0): NVIDIA Tesla V100-PCIE-32GB

So in this case the mostly-BLAS contract_WA_3 does beat Tullio, by a factor of 4, while on the CPU it was slower by a factor of 2. Which means there's probably a lot of room for improvement. I think there is a lot of wizardry to writing good GPU code, but I haven't been working on this.

mkschleg commented 3 years ago

@mcabbott You are probably correct about the CUDA.@sync. I'm not well versed in CUDA-ese, so my simple benchmarking is likely pretty inadequate. Thank you for the tip.

In any case, wow. I did not even consider that OMEinsum might be doing something wrong(ish?) here... I agree there seems to be some room for improvement in the CPU case for Tullio vs BLAS. I'll use contract_WA_3 w/ GPUs for now, and hopefully later on replace it with a full Tullio implementation.

I'm not sure if I should close this, because it seems like a good test case for getting Tullio to match BLAS performance. But I would say my original inquiry is solved. Thank you so much for your help!

mkschleg commented 3 years ago

Unfortunately, while incorporating this into my system, we get gradient code which does scalar operations on the GPU (and so is painfully slow). The forward pass is fine; just the backward pass:

julia> using Flux, GPUArrays

julia> allowscalar(false)

julia> grads = gradient(params(W_gpu)) do
           sum(contract_WA_3(W_gpu, a_gpu, x_gpu))
       end
ERROR: Scalar getindex disallowed

On the CPU it sped up my current system 3-fold though, so at least that is good. Maybe this should be in a different issue though?

mkschleg commented 3 years ago

Again, putting this here because it is related:

When taking the gradient, the KernelAbstractions kernel that is generated for the backward pass is:

KernelAbstractions.@kernel function var"##🇨🇺#473"(𝛥mid, 𝛥a_gpu, 𝛥ℛ::Zygote.Fill{𝒯}, ℛ, mid, a_gpu, 𝒶𝓍i, 𝒶𝓍k, ♻️, 💀) where 𝒯
    (i,) = @index(Global, NTuple)
    𝛥ℛ_value = 𝛥ℛ.value
    for k = 𝒶𝓍k
        𝛥mid[a_gpu[k], i, k] = 𝛥mid[a_gpu[k], i, k] + 𝛥ℛ_value
    end
end

The problem is almost certainly the inner loop over k. I guess we would want Tullio to somehow figure out that k also needs to be a global index. Right?

mcabbott commented 3 years ago

Thanks for digging. Am a little surprised this gives scalar operations, when the forward pass did not, but maybe, as you suggest, the problem is that the loop over k isn't pulled out.

I'm almost sure it's marking k as "unsafe right", meaning that it should not be threaded over in the gradient, because it appears on the right-hand side inside an array of indices. The concern is that, if a_gpu had repeats, different threads would write to the same address. But here they can't, because k also appears as an ordinary index, so it would in fact be safe to do k in parallel. But teaching it to recognise this... will try to take a look.
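For what it's worth, here is a sketch (hypothetical, not what Tullio currently emits) of a kernel with k pulled into the global index. Since each (i, k) work-item writes only to 𝛥mid[a_gpu[k], i, k], repeats in a_gpu cannot make two work-items hit the same address:

```julia
# Hypothetical sketch only: make k a global index alongside i.
# Each (i, k) work-item touches exactly one element, 𝛥mid[a_gpu[k], i, k],
# so the writes are race-free even if a_gpu contains repeated entries.
KernelAbstractions.@kernel function ∇mid_kernel!(𝛥mid, 𝛥ℛ_value, a_gpu)
    i, k = @index(Global, NTuple)
    𝛥mid[a_gpu[k], i, k] += 𝛥ℛ_value
end
```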

I think the gradient code could do with some substantial re-working, also to solve #70. And perhaps to work out each array's gradient in a separate function, so that these can be @thunk-ed for ChainRules.

mkschleg commented 3 years ago

That is very reasonable logic, as this is a bit of an odd operation, and likely not used very often. I'm wondering if there is a way for me to override this through Tullio? My first instinct is to copy-paste what Tullio generates (+ some work to make it usable on its own) and then rewrite the parts that need rewriting (i.e. just this loop, I think; everything else looks ok).

Although, since this is such a simple operation, I'm wondering if I can just use Tullio for the CPU and then hand-write the small pieces I need for the GPU operation.

Any suggestions?

mcabbott commented 3 years ago

Doing reasonably well on weird operations is the goal, since straightforward matmul (or gather) already has good implementations. But there is a long tail of odd things...

It would not be crazy to copy what the macro generates & edit it to be better. But this case might not be hard to fix: #180 is an attempt, though for now it has one other weird test error.