auto switch from broadcast to broadcast! ?

Currently broadcast allocates its result. The allocation is small, but getting rid of it gives a speedup of roughly 5x, at least in the following case (reduced from a more complicated use case; in my code the for loop is actually a sampling procedure):

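using AutoPreallocation, BenchmarkTools, Cassette

y = rand(10)
x = rand(10)
dest = zeros(10)
# Record the allocations made by y + x, then build a context that replays them.
_, record = record_allocations(+, y, x);
ctx = AutoPreallocation.new_replay_ctx(record)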

julia> @benchmark for _ in 1:1000_00
           ctx.metadata.step[] = 1
           Cassette.overdub(ctx, +, $y, $x)
       end
BenchmarkTools.Trial: 
  memory estimate:  3.05 MiB
  allocs estimate:  100000
  --------------
  minimum time:     11.319 ms (0.00% GC)
  median time:      12.124 ms (0.00% GC)
  mean time:        12.644 ms (2.65% GC)
  maximum time:     21.369 ms (35.00% GC)
  --------------
  samples:          396
  evals/sample:     1

julia> @benchmark for _ in 1:1000_00
           broadcast!(+, $dest, $y, $x)
       end
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.479 ms (0.00% GC)
  median time:      2.630 ms (0.00% GC)
  mean time:        2.680 ms (0.00% GC)
  maximum time:     4.824 ms (0.00% GC)
  --------------
  samples:          1864
  evals/sample:     1

I was wondering whether this switch could be made automatically. It seems broadcast is lowered to broadcasted followed by materialize, while broadcast! is lowered to broadcasted followed by materialize!, so I wrote the following overdub methods:

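using Cassette
using AutoPreallocation  # brings the module name into scope for the qualified calls below
using AutoPreallocation: ReplayCtx, RecordingCtx

# Recording pass: materialize as usual and record the array it allocated.
@inline function Cassette.overdub(ctx::RecordingCtx, ::typeof(Broadcast.materialize), bc::Broadcast.Broadcasted)
    ret = Broadcast.materialize(bc)
    AutoPreallocation.record_alloc!(ctx, ret)
    return ret
end

# Replay pass: write into the recorded array instead of allocating a new one.
@inline function Cassette.overdub(ctx::ReplayCtx, ::typeof(Broadcast.materialize), bc::Broadcast.Broadcasted)
    scheduled = AutoPreallocation.next_scheduled_alloc!(ctx)
    return Broadcast.materialize!(scheduled, bc)
end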

But... this allocates even more memory! Any ideas why?
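
For reference, here is a minimal sketch of the two paths I am trying to swap between, using only Base.Broadcast and no Cassette or AutoPreallocation. It only checks that materialize and materialize! applied to the same Broadcasted object give the same values as broadcast and broadcast!; it does not reproduce or explain the extra allocation above.

```julia
using Base.Broadcast: broadcasted, materialize, materialize!

y = rand(10)
x = rand(10)
dest = zeros(10)

# Lazy representation of the broadcast; nothing is computed yet.
bc = broadcasted(+, y, x)

# Allocating path: the same steps broadcast(+, y, x) takes.
out = materialize(bc)

# In-place path: the same steps broadcast!(+, dest, y, x) takes.
materialize!(dest, bc)

@assert out == broadcast(+, y, x)
@assert dest == out
```

So in isolation the replacement I want is just materialize(bc) -> materialize!(preallocated, bc), where preallocated has the same element type and shape as the recorded result.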

oxinabox / AutoPreallocation.jl

auto switch from broadcast to broadcast! ? #7