mkschleg opened this issue 3 years ago

I've been benchmarking packages for tensor operations to use w/ Flux. I noticed that OMEinsum was faster than Tullio for a particular operation, but Tullio was faster on the GPU with the exact same operation. While this isn't an issue, as I could use dispatch to specialize, I would like to use a single package for this if possible, for simplicity. Maybe there is something I can do to speed up the CPU version of Tullio? (A lot of my research uses small models almost entirely on the CPU, but I also do scaling experiments where appropriate using the GPU, so this is a must.)

Setup:

With results on CPU with a single thread (Intel i9 something something...):

And GPU (Nvidia 2080):
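Presumably (judging by the definitions repeated in the comments below) the setup was along these lines:

using OMEinsum, Tullio, LoopVectorization

contract_WA(W, a::AbstractVector{<:Integer}, x) = begin
    @tullio ret[i, k] := W[a[k], i, j] * x[j, k]
end

contract_WA_ein(W, a::AbstractVector{<:Integer}, x) = begin
    OMEinsum.@ein ret[i, k] := W[a[k], i, j] * x[j, k]
end

W = rand(Float32, 18, 1024, 1024);
x = rand(Float32, 1024, 64);
a = rand(Int32(1):Int32(18), 64);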
Also, as the number of threads increases (BLAS threads for OMEinsum, Julia threads for Tullio), Tullio starts approaching OMEinsum in performance.
Thanks for the example; this looks like the sort of thing these packages ought to be good at, although in the end... we will see.
I wonder what OMEinsum is doing here? (Edit -- judging by the memory, perhaps it's doing contract_WA_3? Now I tried an older version, too.) I get quite a large difference between the two results. Cc @GiggleLiu?
julia> using Statistics, TensorCore
julia> r1 = @btime contract_WA($W, $a, $x);
13.796 ms (52 allocations: 258.41 KiB)
julia> r2 = @btime contract_WA_ein($W, $a, $x);
6.653 ms (102 allocations: 4.26 MiB) # OMEinsum v0.4
1.558 s (6 allocations: 256.28 KiB) # OMEinsum v0.3.3
julia> contract_WA_2(W, a::AbstractVector{Int}, x) = begin  # a bad idea
           @tullio mid[j, i, k] := W[a[k], i, j]
           @tullio ret[i, k] := mid[j, i, k] * x[j, k]
       end
contract_WA_2 (generic function with 1 method)
julia> @btime contract_WA_2($W, $a, $x);
109.873 ms (100 allocations: 256.25 MiB)
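(That 256.25 MiB is essentially the materialised mid array: 1024 × 1024 × 64 Float32s is exactly 256 MiB, which is why this is a bad idea.)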
julia> contract_WA_3(W, a::AbstractVector{Int}, x) = begin  # a better idea?
           # @tullio mid[μ, i, k] := W[μ, i, j] * x[j, k]
           mid = W ⊡ x
           @tullio ret[i, k] := mid[a[k], i, k]
       end
contract_WA_3 (generic function with 1 method)
julia> r3 = @btime contract_WA_3($W, $a, $x);
33.620 ms (54 allocations: 4.75 MiB) # all Tullio
22.286 ms (8 allocations: 4.75 MiB) # with ⊡, BLAS
julia> r1 ≈ r3
true
julia> r1 ≈ r2
false
julia> mean(abs.(r1 .- r2))
4362.551f0 # similar on both versions
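(That magnitude is roughly 17 × 1024 × 0.25 ≈ 4350, i.e. about what you'd expect if the extra 17 slices of W were being summed in as well.)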
All on an M1 Mac, Julia 1.6 under Rosetta, 4 threads:
(jl_1fKnTY) pkg> st OMEinsum
Status `/private/var/folders/yq/4p2zwd614y59gszh7y9ypyhh0000gn/T/jl_1fKnTY/Project.toml`
[ebe7aa44] OMEinsum v0.4.0 # and I tried v0.4.2 too, similar results.
[ebe7aa44] OMEinsum v0.3.3 # added afterwards; its timing is pasted in above
(jl_1fKnTY) pkg> st Tullio
Status `/private/var/folders/yq/4p2zwd614y59gszh7y9ypyhh0000gn/T/jl_1fKnTY/Project.toml`
[bc48ee85] Tullio v0.2.14
On my "i9 whatever", Tullio is 5x faster:
julia> using OMEinsum, Tullio
julia> contract_WA_ein(W, a::AbstractVector{<:Integer}, x) = begin
           OMEinsum.@ein ret[i, k] := W[a[k], i, j] * x[j, k]
       end
contract_WA_ein (generic function with 1 method)
julia> contract_WA(W, a::AbstractVector{<:Integer}, x) = begin
           @tullio ret[i, k] := W[a[k], i, j] * x[j, k]
       end
contract_WA (generic function with 1 method)
julia> W = rand(Float32, 18, 1024, 1024);
julia> x = rand(Float32, 1024, 64);
julia> a = rand(Int32(1):Int32(18), 64);
julia> contract_WA(W,a,x) ≈ contract_WA_ein(W,a,x)
false
julia> @benchmark contract_WA($W, $a, $x)
samples: 483; evals/sample: 1; memory estimate: 268.59 KiB; allocs estimate: 246
ns
(1.0e7 - 1.03e7] ██████████████████████████████ 278
(1.03e7 - 1.06e7] ███████████████████▋ 181
(1.06e7 - 1.09e7] ▊ 6
(1.09e7 - 1.12e7] ▌ 4
(1.12e7 - 1.15e7] ▎ 2
(1.15e7 - 1.18e7] ▍ 3
(1.18e7 - 1.2e7 ] ▏ 1
(1.2e7 - 1.23e7] ▏ 1
(1.23e7 - 1.26e7] ▌ 4
(1.26e7 - 1.29e7] ▍ 3
Counts
min: 10.021 ms (0.00% GC); mean: 10.355 ms (0.05% GC); median: 10.295 ms (0.00% GC); max: 12.911 ms (0.00% GC).
julia> @benchmark contract_WA_ein($W, $a, $x)
samples: 377; evals/sample: 1; memory estimate: 4.26 MiB; allocs estimate: 99
ns
(1.29e7 - 1.31e7] ██████████████████████████████ 164
(1.31e7 - 1.33e7] █████████████▍ 73
(1.33e7 - 1.35e7] █████████████▎ 72
(1.35e7 - 1.37e7] ████▋ 25
(1.37e7 - 1.39e7] ███ 16
(1.39e7 - 1.41e7] █▍ 7
(1.41e7 - 1.43e7] █▍ 7
(1.43e7 - 1.45e7] ▊ 4
(1.45e7 - 1.48e7] █ 5
(1.48e7 - 1.5e7 ] ▊ 4
Counts
min: 12.857 ms (0.00% GC); mean: 13.260 ms (0.52% GC); median: 13.165 ms (0.00% GC); max: 14.969 ms (10.30% GC).
julia> versioninfo()
Julia Version 1.7.0-DEV.1124
Commit d18cf93bac* (2021-05-19 16:11 UTC)
Platform Info:
OS: Linux (x86_64-generic-linux)
CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-11.0.1 (ORCJIT, cascadelake)
Environment:
JULIA_NUM_THREADS = 36
Using 4 threads:
julia> using LinearAlgebra
julia> BLAS.set_num_threads(4);
julia> Threads.nthreads()
4
julia> using LoopVectorization
julia> using OMEinsum, Tullio
julia> contract_WA_ein(W, a::AbstractVector{<:Integer}, x) = begin
           OMEinsum.@ein ret[i, k] := W[a[k], i, j] * x[j, k]
       end
contract_WA_ein (generic function with 1 method)
julia> contract_WA(W, a::AbstractVector{<:Integer}, x) = begin
           @tullio ret[i, k] := W[a[k], i, j] * x[j, k]
       end
contract_WA (generic function with 1 method)
julia> W = rand(Float32, 18, 1024, 1024);
julia> x = rand(Float32, 1024, 64);
julia> a = rand(Int32(1):Int32(18), 64);
julia> contract_WA(W,a,x) ≈ contract_WA_ein(W,a,x)
false
julia> @benchmark contract_WA($W, $a, $x)
samples: 721; evals/sample: 1; memory estimate: 258.59 KiB; allocs estimate: 52
ns
(6.87e6 - 6.98e6] ██████████████████████████████ 661
(6.98e6 - 7.1e6 ] ██▋ 57
(7.1e6 - 7.21e6] ▏ 1
(7.21e6 - 7.33e6] 0
(7.33e6 - 7.44e6] 0
(7.44e6 - 7.56e6] 0
(7.56e6 - 7.67e6] 0
(7.67e6 - 7.79e6] 0
(7.79e6 - 7.9e6 ] 0
(7.9e6 - 8.02e6] 0
(8.02e6 - 8.13e6] ▏ 2
Counts
min: 6.867 ms (0.00% GC); mean: 6.930 ms (0.04% GC); median: 6.918 ms (0.00% GC); max: 8.134 ms (13.24% GC).
julia> @benchmark contract_WA_ein($W, $a, $x)
samples: 443; evals/sample: 1; memory estimate: 4.26 MiB; allocs estimate: 105
ns
(1.11e7 - 1.13e7] ██████████████████████████████ 393
(1.13e7 - 1.16e7] █ 13
(1.16e7 - 1.18e7] ▌ 5
(1.18e7 - 1.21e7] ▎ 2
(1.21e7 - 1.23e7] █▏ 14
(1.23e7 - 1.26e7] ▉ 11
(1.26e7 - 1.28e7] ▎ 2
(1.28e7 - 1.31e7] ▏ 1
(1.31e7 - 1.34e7] ▏ 1
(1.34e7 - 1.36e7] ▏ 1
Counts
min: 11.090 ms (0.00% GC); mean: 11.304 ms (0.56% GC); median: 11.207 ms (0.00% GC); max: 13.602 ms (9.18% GC).
Using 1 thread:
julia> using LinearAlgebra
julia> BLAS.set_num_threads(1);
julia> Threads.nthreads()
1
julia> using LoopVectorization
julia> using OMEinsum, Tullio
julia> contract_WA_ein(W, a::AbstractVector{<:Integer}, x) = begin
           OMEinsum.@ein ret[i, k] := W[a[k], i, j] * x[j, k]
       end
contract_WA_ein (generic function with 1 method)
julia> contract_WA(W, a::AbstractVector{<:Integer}, x) = begin
           @tullio ret[i, k] := W[a[k], i, j] * x[j, k]
       end
contract_WA (generic function with 1 method)
julia> W = rand(Float32, 18, 1024, 1024);
julia> x = rand(Float32, 1024, 64);
julia> a = rand(Int32(1):Int32(18), 64);
julia> contract_WA(W,a,x) ≈ contract_WA_ein(W,a,x)
false
julia> @benchmark contract_WA($W, $a, $x)
samples: 185; evals/sample: 1; memory estimate: 256.08 KiB; allocs estimate: 2
ns
(2.702e7 - 2.711e7] ██████████████████████████████ 169
(2.711e7 - 2.719e7] ██▌ 14
(2.719e7 - 2.728e7] 0
(2.728e7 - 2.736e7] ▎ 1
(2.736e7 - 2.745e7] 0
(2.745e7 - 2.753e7] 0
(2.753e7 - 2.762e7] 0
(2.762e7 - 2.77e7 ] 0
(2.77e7 - 2.779e7] ▎ 1
Counts
min: 27.020 ms (0.00% GC); mean: 27.069 ms (0.00% GC); median: 27.060 ms (0.00% GC); max: 27.789 ms (0.00% GC).
julia> @benchmark contract_WA_ein($W, $a, $x)
samples: 415; evals/sample: 1; memory estimate: 4.26 MiB; allocs estimate: 105
ns
(1.18e7 - 1.2e7 ] ████████████ 108
(1.2e7 - 1.21e7] ██████████████████████████████ 270
(1.21e7 - 1.22e7] █ 8
(1.22e7 - 1.24e7] ▋ 5
(1.24e7 - 1.25e7] 0
(1.25e7 - 1.26e7] 0
(1.26e7 - 1.28e7] 0
(1.28e7 - 1.29e7] 0
(1.29e7 - 1.31e7] █▉ 16
(1.31e7 - 1.32e7] █ 8
Counts
min: 11.829 ms (0.00% GC); mean: 12.064 ms (0.52% GC); median: 12.009 ms (0.00% GC); max: 13.196 ms (9.13% GC).
Hmm. Obviously they're not doing the same thing (results differ), but whatever OMEinsum is doing is >2x faster on a single thread on my computer.
I do not think OMEinsum can handle this contraction pattern.
julia> using OMEinsum
julia> @macroexpand contract_WA_ein(W, a::AbstractVector{Int}, x) = begin
           OMEinsum.@ein ret[i, k] := W[a[k], i, j] * x[j, k]
       end
:(contract_WA_ein(W, a::AbstractVector{Int}, x) = begin
      #= REPL[2]:1 =#
      #= REPL[2]:2 =#
      ret = OMEinsum.einsum(OMEinsum.EinCode{((1, 2, 3), (3, 4)), (2, 4)}(), (W, x))
  end)
It is generating the wrong contraction for you. Maybe we should make the @ein interface more restrictive, so that this errors instead.
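For instance (a hypothetical sketch, not OMEinsum's actual parsing code), the check could be as simple as rejecting any index expression that isn't a plain symbol:

# hypothetical validation while parsing `ret[i, k] := W[a[k], i, j] * x[j, k]`;
# `indexexprs` stands for the parsed index expressions of one tensor
all(ix -> ix isa Symbol, indexexprs) ||
    throw(ArgumentError("@ein only supports plain symbols as indices, got $indexexprs"))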
Thanks, I was just digging into macroexpand too:
julia> @ein ret[i, k] := W[a[k], i, j] * x[j, k];
julia> ret ≈ @tullio ret_2[i, k] := W[α, i, j] * x[j, k]
true
or with the string macro:
julia> ret ≈ ein"αij, jk -> ik"(W,x)
true
julia> ein"A[k]ij, jk -> ik"(W,x)
ERROR: ArgumentError: indices ('i', 'j') invalid for tensor with ndims = 3
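So to get the intended result out of OMEinsum, the gather step would have to be done separately, along the lines of contract_WA_3 above (a sketch, not benchmarked):

julia> mid_ein = ein"αij, jk -> αik"(W, x);  # contract over j, keeping the first axis of W

julia> ret_ein = [mid_ein[a[k], i, k] for i in axes(mid_ein, 2), k in axes(mid_ein, 3)];  # then gather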
Trying to time the GPU things, I think you may need CUDA.@sync to get real times? (Or maybe the mean is reliable?) I get the following:
julia> @btime contract_WA($W_gpu, $a_gpu, $x_gpu); # not sure this times correctly
610.445 μs (260 allocations: 7.75 KiB)
julia> r1_gpu = @btime CUDA.@sync contract_WA($W_gpu, $a_gpu, $x_gpu);
1.452 ms (730 allocations: 15.09 KiB)
julia> contract_WA_3(W, a::AbstractVector{<:Integer}, x) = begin
           # @tullio mid[μ, i, k] := W[μ, i, j] * x[j, k]
           mid = W ⊡ x
           @tullio ret[i, k] := mid[a[k], i, k]
       end
contract_WA_3 (generic function with 2 methods)
julia> r3_gpu = @btime CUDA.@sync contract_WA_3($W_gpu, $a_gpu, $x_gpu);
398.541 μs (359 allocations: 9.56 KiB)
julia> r1_gpu ≈ r3_gpu ≈ cu(contract_WA(W, a, x))
true
julia> @btime CUDA.@sync ($W_gpu ⊡ $x_gpu); # just the matmul part
242.413 μs (185 allocations: 3.38 KiB)
julia> CUDA.device()
CuDevice(0): NVIDIA Tesla V100-PCIE-32GB
So in this case the mostly-BLAS contract_WA_3 does beat Tullio, by a factor of 4, even though the full matmul W ⊡ x does 18× the arithmetic of the fused version. While on the CPU it was slower, by a factor of 2. So there's probably a lot of room for improvement. I think there is a lot of wizardry to writing good GPU stuff, but I haven't been working on this.
@mcabbott You are probably correct about the CUDA.@sync. I'm not well versed in CUDA-ese, so my simple benchmarking was likely pretty inadequate. Thank you for the tip.
In any case, wow. I did not even consider that OMEinsum might be doing something wrong(ish?) here... I agree there seems to be some room for improvement in the CPU case for Tullio vs BLAS. I'll use contract_WA_3 w/ GPUs for now, and hopefully replace it later with a full Tullio implementation.
I'm not sure if I should close this, because it seems like a good test case for getting Tullio to match BLAS performance. But I would say my original inquiry is solved. Thank you so much for your help!
Unfortunately, while incorporating this into my system, I get gradient code which does scalar operations on the GPU (and so is painfully slow). The forward pass is fine; it's just the backward pass:
julia> using Flux, GPUArrays
julia> allowscalar(false)
julia> grads = gradient(params(W_gpu)) do
           sum(contract_WA_3(W_gpu, a_gpu, x_gpu))
       end
ERROR: Scalar getindex disallowed
On the CPU it sped up my current system 3-fold, though, so at least that is good. Maybe this should be in a different issue though?
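One possible workaround for the GPU gradient (a sketch, untested and unbenchmarked; contract_WA_4 and A_gpu are made-up names) is to write the gather itself as a contraction against a dense one-hot mask:

julia> A_gpu = Float32.((1:18) .== a_gpu');  # dense 18×64 mask, A_gpu[μ, k] = (a_gpu[k] == μ)

julia> contract_WA_4(W, A, x) = begin
           mid = W ⊡ x                                    # batched matmul, as in contract_WA_3
           @tullio ret[i, k] := A[μ, k] * mid[μ, i, k]    # gather written as a contraction
       end

This does a little extra arithmetic, but the gradient with respect to mid then has no array-valued indices, so Tullio should be able to thread it.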
Again putting this here because it is related: the KernelAbstractions kernel that is generated for the gradient is:
KernelAbstractions.@kernel function var"##🇨🇺#473"(𝛥mid, 𝛥a_gpu, 𝛥ℛ::Zygote.Fill{𝒯}, ℛ, mid, a_gpu, 𝒶𝓍i, 𝒶𝓍k, ♻️, 💀) where 𝒯
    (i,) = @index(Global, NTuple)
    𝛥ℛ_value = 𝛥ℛ.value
    for k = 𝒶𝓍k
        𝛥mid[a_gpu[k], i, k] = 𝛥mid[a_gpu[k], i, k] + 𝛥ℛ_value
    end
end
The problem is almost certainly the inner loop over k. I guess we would want Tullio to somehow figure out that k also needs to be a global index, right?
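Something like this hand-edited version is what I have in mind (a hypothetical sketch, with the unused arguments dropped and launch details omitted; gather_grad! is a made-up name):

KernelAbstractions.@kernel function gather_grad!(𝛥mid, 𝛥ℛ_value, a_gpu)
    (i, k) = @index(Global, NTuple)  # k promoted to a global index, no serial loop
    # no atomics needed: for fixed (i, k) the write target is unique even when
    # a_gpu has repeats, because k itself appears as an index of 𝛥mid
    𝛥mid[a_gpu[k], i, k] += 𝛥ℛ_value
end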
Thanks for digging. I am a little surprised this gives scalar operations when the forward pass did not, but maybe, as you suggest, the problem is that the loop over k isn't pulled out.
I'm almost sure it's marking k as "unsafe right", meaning that it should not be threaded over in the gradient, because it appears on the right inside an array of indices. The concern is that, if a_gpu had repeats, different threads would write to the same address. But here they can't, because k also appears as an ordinary index, so it would in fact be safe to run k in parallel. Teaching it to recognise this, though... I will try to take a look.
I think the gradient code could do with some substantial re-working, also to solve #70. And perhaps to work out each array's gradient in a separate function, so that these can be @thunk-ed for ChainRules.
That is very reasonable logic, as this is a somewhat odd operation, and likely not used very often. I'm wondering if there is a way for me to override this through Tullio? My first instinct is to copy-paste what Tullio generates (plus the work to make it usable on its own) and then rewrite the parts that need rewriting (i.e. just this loop, I think; everything else looks OK).
Although, since this is such a simple operation, I'm wondering if I can just use Tullio for the CPU and then hand-write the small pieces I need for the GPU.
Any suggestions?
Doing reasonably well on weird operations is the goal, since straightforward matmul (or gather) already has good implementations. But there is a long tail of odd things...
It would not be crazy to copy what the macro generates and edit it to be better. But this case might not be hard to fix: #180 is an attempt, though for now it has one other weird test error.
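If you do go the copy-and-edit route, the macro's own expansion is the natural starting point:

julia> @macroexpand @tullio ret[i, k] := mid[a[k], i, k]  # prints everything Tullio generates, kernels included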