Closed dcrc2 closed 2 years ago
@dcrc2 I've just realised: I don't see code in this PR to test/benchmark the elementwise calls? Can you make another PR if so to show that?
The existing benchmarks for vrelu3 now run this code. Do you mean that you'd like to be able to compare it to the previous method (where ksc generated the code for map)? I have some results for this above, but we could maintain both versions as separate benchmarks if we wanted.
Oh of course, sorry for the noise.
When the ks entry point is an elementwise function, generate its code in python rather than via ksc. The main purpose of this is to allow the loop to be parallelized in future (for GPU). It also avoids copying the output tensor.
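To illustrate the idea only, here is a minimal pure-Python sketch of wrapping a scalar kernel in an elementwise loop that writes directly into a preallocated output. It is not the code this PR generates: `make_elementwise` is a hypothetical helper, the body of `relu3` is an assumed definition, and plain lists stand in for tensors.

```python
from typing import Callable, List


def relu3(x: float) -> float:
    # Assumed scalar kernel, for illustration only.
    if x < 0.0:
        return 0.0
    if x < 1.0:
        return x ** 3 / 3.0
    return x - 2.0 / 3.0


def make_elementwise(scalar_fn: Callable[[float], float]) -> Callable[[List[float]], List[float]]:
    # Hypothetical helper: builds the elementwise loop on the Python side
    # instead of asking the compiler to generate code for a map.
    def vectorized(xs: List[float]) -> List[float]:
        out = [0.0] * len(xs)  # allocate the output once
        for i, x in enumerate(xs):
            out[i] = scalar_fn(x)  # write in place, no intermediate copy
        return out

    return vectorized


vrelu3 = make_elementwise(relu3)
result = vrelu3([-1.0, 0.5, 2.0])
```

Each iteration of the loop is independent of the others, which is the property that would let it be parallelized later.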
Current limitations:
It doesn't look like it will be hard to generalize either of those things.
The following code is generated for vrelu3:
As things stand, this improves the performance of backwards but not forwards. (Calling [sufrev relu3] on each element is better optimized than doing one loop of [suffwdpass relu3] and another loop of [sufrevpass relu3].)
Before:
After: