mcabbott / Tullio.jl


substantial performance issues when combining Tullio and Zygote #154

Open tmbdev opened 2 years ago

tmbdev commented 2 years ago

I'm comparing the performance of the built-in Dense layer with an equivalent Tullio-based Linear layer and a Dist layer. There are a couple of problems.

Full code, output, Manifest, Project: https://gist.github.com/12b7cfc21482419aa39e7e9dfbb7cc32

The most important definitions are these:

function linear_kernel(w, b, x)
    @tullio result[i, j] := w[k, i] * x[k, j]
    return result .+ b
end

struct Linear
    w
    b
end

function Linear(in::Integer, out::Integer; scale=1.0)
    return Linear(scale*randn(in, out) |> fp32, scale*randn(out) |> fp32)
end

Flux.@functor Linear

function (self::Linear)(x::AbstractArray)
    return linear_kernel(self.w, self.b, x)
end

...

model = Linear(784, 1024) |> gpu
function loss(x)
    return Flux.Losses.mse(model(x), y)
end 
CUDA.@time model(x);
CUDA.@time Zygote.pullback(model, x)[2](y);
CUDA.@time gradient(loss, x)
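
For reference, the built-in layer being compared against is timed the same way. This is only a minimal sketch of that baseline, assuming the same Flux/CUDA setup as the gist (the sizes are illustrative; the gist has the exact script):

dense = Flux.Dense(784, 1024) |> gpu  # built-in counterpart to Linear(784, 1024)
dense_loss(x) = Flux.Losses.mse(dense(x), y)
CUDA.@time dense(x);
CUDA.@time Zygote.pullback(dense, x)[2](y);
CUDA.@time gradient(dense_loss, x)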

PROBLEM 1 -- CPU PERFORMANCE

The Tullio-based Linear layer is significantly slower than the built-in Dense layer, but that is perhaps understandable, and it's fast enough to be useful.

But while the pullback/gradient are within a factor of two of the forward computation for the Dense layer, they are both 20x slower for the Tullio-based layers.

PROBLEM 2 -- GPU PERFORMANCE

On GPU, the pattern is different: the Zygote pullback computation is within a factor of two of the forward computation, but the Zygote gradient computation is three orders of magnitude slower than the forward computation. It is baffling to me that a fast pullback can result in such a slow gradient computation.
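
For what it's worth, the three timed lines above measure slightly different things, which matters when comparing ratios. Roughly, using the same model, x, y and loss:

out, back = Zygote.pullback(model, x)  # forward pass plus construction of the pullback
CUDA.@time back(y);                    # backward pass alone, seeded with y as the cotangent
CUDA.@time gradient(loss, x);          # forward + backward through the mse loss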

mcabbott commented 2 years ago

Unfortunately this is a known problem, e.g. https://github.com/mcabbott/Tullio.jl/issues/30 . I have not worked much on making this faster since then.

tmbdev commented 2 years ago

OK, thanks. It turns out I needed to use CUDA.@time; with that change the data makes more sense, and pullback and gradient are about equally slow, roughly 20x-50x the time of the forward computation. I've updated the bug description.

mcabbott commented 2 years ago

For the forward pass it has a loop nest like this (from verbose=2 output):

for j = 𝒢𝓍j
    for i = 𝒢𝓍i
        π’œπ’Έπ’Έ = ♻️ === nothing ? zero(𝒯) : β„›[i, j]  # If it breaks the iteration on k, it re-starts with value.
        for k = 𝒢𝓍k  # Innermost iteration is along 1st index
            π’œπ’Έπ’Έ = π’œπ’Έπ’Έ + w[k, i] * x[k, j]  # Does not write to output in the accumulation loop over k
        end  
        β„›[i, j] = π’œπ’Έπ’Έ  # This is safe to multi-thread over i, j

...
# This is how it encodes threading over i,j but not k (262144 is a threshold for threading)
threader(π’œπ’Έπ“‰!, storage_type(result, w, x), result, tuple(w, x), tuple(𝒢𝓍i, 𝒢𝓍j), tuple(𝒢𝓍k), +, 262144, nothing)

The gradient looks more like this:

for k = 𝒢𝓍k  # Outermost loop is the 1st index
    for j = 𝒢𝓍j
        for i = 𝒢𝓍i
            ℰ𝓍1 = conj(x[k, j])  # (CSE is generating these, although no help here)
            ℰ𝓍2 = π›₯β„›[i, j] * ℰ𝓍1
            ℰ𝓍3 = conj(w[k, i])
            ℰ𝓍4 = π›₯β„›[i, j] * ℰ𝓍3
            π›₯w[k, i] = π›₯w[k, i] + ℰ𝓍2  # Accumulates directly into output arrays, at every step
            π›₯x[k, j] = π›₯x[k, j] + ℰ𝓍4  # These are safe to multi-thread only along k, and it knows
        end
...        
βˆ‡threader(βˆ‡π’œπ’Έπ“‰!, storage_type(π›₯w, π›₯x, w, x), tuple(π›₯w, π›₯x, π›₯β„›, β„›, w, x), tuple(𝒢𝓍k), tuple(𝒢𝓍i, 𝒢𝓍j), 262144)

One thing which could be investigated here is how inefficient this write-to-two-arrays loop nest is. The same expression with nograd=x generates only π›₯w, and it appears that it's smart enough to then thread both i and k.

for i = 𝒢𝓍i
    for k = 𝒢𝓍k  # Now first index is 2nd loop
        for j = 𝒢𝓍j   # reduction index
            ℰ𝓍1 = conj(x[k, j])
            ℰ𝓍2 = π›₯β„›[i, j] * ℰ𝓍1
            π›₯w[k, i] = π›₯w[k, i] + ℰ𝓍2  # Still reads & writes π›₯w at every j. But safe to thread i & k
        end 
...
βˆ‡threader)(βˆ‡π’œπ’Έπ“‰!, storage_type(π›₯w, π›₯x, w, x), tuple(π›₯w, π›₯x, π›₯β„›, β„›, w, x), tuple(𝒢𝓍k, 𝒢𝓍i), tuple(𝒢𝓍j), 262144)

If the nograd=x pullback is only twice as fast, that's no help, but if it's 10x faster that would motivate changing the macro to generate a separate loop nest per gradient. Not sure I ever did this comparison (a rough sketch of it follows below). Another step would be to see how much it matters to change the pattern to something like the following; that's harder to check, as you'd have to edit the code the macro generates to try it:

for i = 𝒢𝓍i
    for k = 𝒢𝓍k
        acc = zero(T)
        for j = 𝒢𝓍j
            ℰ𝓍1 = conj(x[k, j])
            ℰ𝓍2 = π›₯β„›[i, j] * ℰ𝓍1
            acc += ℰ𝓍2
        end 
        π›₯w[k, i] + π›₯w[k, i] + acc  # read & write once

Changing the macro to implement either of these is unfortunately not a small job. It's not as modular as it ought to be... I knew precisely how to do a better job around the time I stopped working much on this, of course.
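
For what it's worth, the first comparison (the full pullback against the nograd=x version) can be tried without touching the macro at all. A rough sketch, assuming Tullio, Zygote and BenchmarkTools are installed; the sizes are only illustrative:

using Tullio, Zygote, BenchmarkTools

mul_full(w, x)  = @tullio result[i, j] := w[k, i] * x[k, j]
mul_wonly(w, x) = @tullio nograd=x result[i, j] := w[k, i] * x[k, j]  # generates only π›₯w

w = randn(Float32, 784, 1024); x = randn(Float32, 784, 512)
Ξ” = randn(Float32, 1024, 512)

_, back_full  = Zygote.pullback(mul_full, w, x)
_, back_wonly = Zygote.pullback(mul_wonly, w, x)
@btime $back_full($Ξ”)    # π›₯w and π›₯x from the shared loop nest
@btime $back_wonly($Ξ”)   # π›₯w only, threaded over i and k

If the π›₯w-only pullback turns out to be dramatically faster, that would support generating one loop nest per gradient.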

Less ambitiously, notice that the order of the loops changes between these expressions. There is some chance it's picking a bad one, and that would be relatively easy to fix. To time this you should probably @macroexpand1 @tullio ... and then just find the loops and edit them. Tullio does not know that arrays are column-major; the order is chosen based on which indices are reduction indices, etc. It's still possible that it makes a pessimal choice for the gradient. Whether KernelAbstractions cares about or fixes this, I don't know.
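
A sketch of that inspection workflow, in case it is useful: verbose=2 prints the generated expressions (including loop nests like the ones quoted above) while the macro expands, and @macroexpand1 returns the surrounding code. Only Tullio needs to be loaded; expansion is purely syntactic, so w and x need not exist yet.

using Tullio

ex = @macroexpand1 @tullio verbose=2 result[i, j] := w[k, i] * x[k, j]
# The printed loop nests can be copied out, reordered by hand, and timed
# against the original ordering; ex holds the surrounding generated code.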

tmbdev commented 2 years ago

Tullio returns symbolic derivatives that look like they are almost directly usable as Tullio expressions; why aren't those used directly?

I seem to get decent performance now with a simple rrule using Tullio itself:

function dist_kernel(w::AbstractArray{Float32}, x::AbstractArray{Float32})
    @tullio grad=false result[i, j] := (w[k, i] - x[k, j])^2 
end

function dist_kernel_pb(w, x)
    @tullio grad=false result[i, j] := (w[k, i] - x[k, j])^2
    return result, (Ξ”result) -> begin
        @tullio grad=false Ξ”w[k, i] := + Ξ”result[i, j] * 2 * (w[k, i] - x[k, j])
        @tullio grad=false Ξ”x[k, j] := - Ξ”result[i, j] * 2 * (w[k, i] - x[k, j])
        return (ChainRules.NoTangent(), Ξ”w, Ξ”x)
    end
end

ChainRules.rrule(::typeof(dist_kernel), w, x) = dist_kernel_pb(w, x)
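
A quick way to sanity-check the hand-written rule is to compare it against a reference kernel where Tullio derives the gradient itself. This is a sketch only; dist_kernel_ref is a made-up name, and it assumes the definitions above plus Zygote are loaded:

using Zygote

function dist_kernel_ref(w, x)  # same kernel, but let Tullio generate the gradient
    @tullio result[i, j] := (w[k, i] - x[k, j])^2
end

w = randn(Float32, 16, 8); x = randn(Float32, 16, 4)
y1, back1 = Zygote.pullback(dist_kernel, w, x)      # hits the custom rrule above
y2, back2 = Zygote.pullback(dist_kernel_ref, w, x)  # uses Tullio's own gradient
Ξ” = randn(Float32, size(y1))
@assert y1 β‰ˆ y2
@assert all(map(β‰ˆ, back1(Ξ”), back2(Ξ”)))             # Ξ”w and Ξ”x should agree
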
mcabbott commented 2 years ago

I think the reason it doesn't directly write the gradients as separate calls is that it doesn't know how to solve for the shifts necessary for convolutions etc. The present re-writing just accumulates into arrays at weird indices, but it will tend to get things wrong if used more directly, like in https://github.com/mcabbott/Tullio.jl/issues/136 . Maybe that could be solved somehow.

tmbdev commented 2 years ago

Wouldn't that be fixable by either disallowing index computations on the LHS, or at least special-casing expressions that have no index computations on the LHS and warning about the other case?

Is there some formal specification/analysis of the expressions that Tullio accepts? It's perhaps not surprising that it's difficult to reason about these expressions otherwise.

mcabbott commented 2 years ago

They should be disallowed on the LHS; that they aren't is a bug. But the gradient calculation always wants to move them to the left, as you're then writing into the gradient of an input array.

Agree you could special-case the easy cases. So far this package hasn't done that at all, just to avoid duplication... it's already far too complicated.

I have not tried to write anything up. The whole package was an exploration of what's possible / what's fast. Ideally v2 would have a more fixed scope, and could be much cleaner, but I'd need to clone myself first... I believe https://github.com/mcabbott/Tullio.jl/issues/70 is a logic mistake in how it handles min/max gradients; I had a fix but didn't get to implementing it. (Nor a gradient for reductions over *.)

ToucheSir commented 2 years ago

I was able to speed up the CPU variants ~5x by enabling threading on the Julia side (i.e. julia -t auto) to match the BLAS threadpool Dense is using. Also using BenchmarkTools.jl and CUDA.@sync for more reliable timings:

=== Dense cpu
  4.393 ms (4 allocations: 8.00 MiB)
  12.858 ms (16 allocations: 14.13 MiB)
  18.108 ms (68 allocations: 46.13 MiB)
=== Dense gpu
  1.725 ms (113 allocations: 5.61 KiB)
  5.804 ms (214 allocations: 8.70 KiB)
  7.904 ms (630 allocations: 33.77 KiB)
=== Linear cpu
  22.505 ms (140 allocations: 8.01 MiB)
  333.240 ms (388 allocations: 14.15 MiB)
  336.375 ms (448 allocations: 46.15 MiB)
=== Linear gpu
  141.146 ms (194 allocations: 9.50 KiB)
  948.487 ms (454 allocations: 21.42 KiB)
  950.427 ms (876 allocations: 46.17 KiB)
=== Dist cpu
  24.443 ms (142 allocations: 12.01 MiB)
  400.947 ms (396 allocations: 22.15 MiB)
  403.849 ms (453 allocations: 54.15 MiB)
=== Dist gpu
  142.127 ms (226 allocations: 11.23 KiB)
  1.057 s (521 allocations: 24.95 KiB)
  1.059 s (942 allocations: 49.67 KiB)

Also, perhaps this and/or #30 should be updated to reflect that the behaviour occurs with all reverse-mode ADs Tullio is compatible with? Then people who might be using e.g. ReverseDiff wouldn't accidentally miss it.

tmbdev commented 2 years ago

@ToucheSir Oops, sorry, I thought I had updated the bug report with CUDA.@sync etc. I have updated the gist link now. In any case, I still get a 40x ratio of dist-gpu-gradient to dist-gpu-forward on my A6000, while it looks like you only get a 7.5x ratio on your GPU. I wonder whether that's due to performance differences between the GPUs.

ToucheSir commented 2 years ago

Make sure you're using the macros from https://github.com/JuliaCI/BenchmarkTools.jl instead of @time to remove noise and get a better statistical estimate of the true time (whether that's the mean, median or minimum). CUDA.@time is still nice for measuring GPU-side memory allocations, though. But yes, my GPU is very underpowered.
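
For anyone reproducing these numbers, the timing pattern described here is roughly the following (a sketch, assuming BenchmarkTools, CUDA and Zygote are loaded and that model, x, y and loss are defined as in the gist; CUDA.@sync keeps the timer from stopping before the GPU kernels have finished):

using BenchmarkTools, CUDA

@btime CUDA.@sync $model($x);
@btime CUDA.@sync Zygote.pullback($model, $x)[2]($y);
@btime CUDA.@sync gradient($loss, $x);

CUDA.@time model(x);  # CUDA.@time still reports GPU-side memory allocations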