Currently `broadcast` allocates. The allocation is not large, but getting rid of it would give at least a 5x speedup in the following case (it comes from a more complicated use case; in my code the for loop is actually a sampling procedure):
```julia
y = rand(10)
x = rand(10)
dest = zeros(10)
_, record = record_allocations(+, y, x);
ctx = AutoPreallocation.new_replay_ctx(record)
```
```julia
julia> @benchmark for _ in 1:100_000
           ctx.metadata.step[] = 1
           Cassette.overdub(ctx, +, $y, $x)
       end
BenchmarkTools.Trial:
  memory estimate:  3.05 MiB
  allocs estimate:  100000
  --------------
  minimum time:     11.319 ms (0.00% GC)
  median time:      12.124 ms (0.00% GC)
  mean time:        12.644 ms (2.65% GC)
  maximum time:     21.369 ms (35.00% GC)
  --------------
  samples:          396
  evals/sample:     1
```
```julia
julia> @benchmark for _ in 1:100_000
           broadcast!(+, $dest, $y, $x)
       end
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.479 ms (0.00% GC)
  median time:      2.630 ms (0.00% GC)
  mean time:        2.680 ms (0.00% GC)
  maximum time:     4.824 ms (0.00% GC)
  --------------
  samples:          1864
  evals/sample:     1
```
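For reference, this is the allocating vs. non-allocating pair the two benchmarks above compare, stripped of the Cassette machinery (plain Base only; the sizes are just the ones from the snippet above):

```julia
# `broadcast` allocates a fresh result array each call;
# `broadcast!` reuses a preallocated destination and allocates nothing.
out_alloc(y, x) = broadcast(+, y, x)
out_inplace!(dest, y, x) = broadcast!(+, dest, y, x)

y = rand(10); x = rand(10); dest = zeros(10)
out_alloc(y, x); out_inplace!(dest, y, x)          # warm up (compile first)
@assert @allocated(out_alloc(y, x)) > 0            # fresh 10-element Vector each call
@assert @allocated(out_inplace!(dest, y, x)) == 0  # writes into dest, no allocation
@assert dest == y .+ x
```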
I was wondering whether there is a way to do this. It seems `broadcast` is lowered to `broadcasted` then `materialize`, while `broadcast!` is lowered to `broadcasted` then `materialize!`, so I wrote the following overdub:
```julia
using Cassette
using AutoPreallocation: ReplayCtx, RecordingCtx

@inline function Cassette.overdub(ctx::RecordingCtx, ::typeof(Broadcast.materialize), bc::Broadcast.Broadcasted)
    ret = Broadcast.materialize(bc)
    AutoPreallocation.record_alloc!(ctx, ret)
    return ret
end

@inline function Cassette.overdub(ctx::ReplayCtx, ::typeof(Broadcast.materialize), bc::Broadcast.Broadcasted)
    scheduled = AutoPreallocation.next_scheduled_alloc!(ctx)
    return Broadcast.materialize!(scheduled, bc)
end
```
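The lowering claim can be checked directly against the `Base.Broadcast` API, with no Cassette involved: `broadcast(f, args...)` is `materialize(broadcasted(f, args...))`, and `broadcast!(f, dest, args...)` is `materialize!(dest, broadcasted(f, args...))`. A minimal sketch:

```julia
using Base.Broadcast: broadcasted, materialize, materialize!

y = rand(10); x = rand(10); dest = zeros(10)

bc = broadcasted(+, y, x)         # lazy Broadcasted container; no computation yet
@assert bc isa Base.Broadcast.Broadcasted
@assert materialize(bc) == y .+ x # allocates the result, like broadcast
materialize!(dest, bc)            # fills dest in place, like broadcast!
@assert dest == y .+ x
```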
I think the 3.05 MiB comes from allocating the `Broadcasted` object. But shouldn't `Broadcasted` be allocated on the stack? Somehow `broadcast!` has no allocation at all.
But... this allocates more memory! Any idea?
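For what it's worth, the `Broadcasted` wrapper itself need not hit the heap: when it is built and consumed inside one function (so it does not escape), the compiler can keep it on the stack, which is why the plain in-place path measures zero allocations. A minimal check:

```julia
using Base.Broadcast: broadcasted, materialize!

# Build and consume the Broadcasted inside one function so it cannot escape.
add_inplace!(dest, y, x) = materialize!(dest, broadcasted(+, y, x))

y = rand(10); x = rand(10); dest = zeros(10)
add_inplace!(dest, y, x)                           # warm up (compile first)
@assert @allocated(add_inplace!(dest, y, x)) == 0  # the wrapper stays off the heap
@assert dest == y .+ x
```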