tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.
Apache License 2.0

Optimize bcast op #9545

Open johanna-rock-tt opened 2 weeks ago

johanna-rock-tt commented 2 weeks ago

Describe the bug: the bcast op is not optimized.

The sharded bcast_h variant was optimized to reuse in0. New timings:

block sharded (8x8) cores, H=256 W=9182: 0.0225 ms
block sharded (8x8) cores, H=2048 W=9182: 0.0359 ms
width sharded (8x8) cores, H=32 W=9182: 0.005 ms
width sharded (8x8) cores, H=2048 W=9182: 0.025 ms

Other variants (sharded h / hw and interleaved variants) need to be revisited for similar optimizations.
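For readers unfamiliar with the op, here is a rough NumPy sketch of what bcast_h is assumed to compute, plus a simplified count of how many in1 tile reads reuse could save. The function names, the 32x32 tile size, the choice of add as the math op, and the single-reader cost model are all assumptions for illustration; none of this is the tt-metal kernel code.

```python
import numpy as np

TILE = 32  # assumed tile edge length (hypothetical for this sketch)


def bcast_h_reference(in0, in1, op=np.add):
    """Reference semantics (assumed): in1 is a single row of shape (1, W)
    that is applied to every row of in0, which has shape (H, W)."""
    assert in1.shape == (1, in0.shape[1])
    return op(in0, in1)


def in1_tile_reads(h, w, reuse):
    """Simplified model: number of in1 tile reads for an (h, w) input.

    Without reuse, the in1 tile is re-read for every in0 tile.
    With reuse, each in1 tile is read once and kept for all tile rows.
    """
    tiles_h = -(-h // TILE)  # ceiling division
    tiles_w = -(-w // TILE)
    return tiles_w if reuse else tiles_h * tiles_w


in0 = np.arange(12, dtype=np.float32).reshape(4, 3)
in1 = np.array([[10.0, 20.0, 30.0]], dtype=np.float32)
out = bcast_h_reference(in0, in1)

reads_without_reuse = in1_tile_reads(256, 9182, reuse=False)
reads_with_reuse = in1_tile_reads(256, 9182, reuse=True)
```

Under this model the saving grows linearly with H, which is one reason a variant that does not reuse the broadcast operand can fall behind as H increases.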

bcast_h interleaved timings:

interleaved (8x8) cores, H=256 W=9182: 0.255 ms

This is very slow for a bcast. Interleaved bcast_h is also not reusing in1.
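To put the gap in numbers, a quick calculation using only the H=256, W=9182 timings reported in this issue (the dictionary keys are just labels chosen for this snippet):

```python
# Timings in milliseconds, copied from the measurements in this issue
# (8x8 cores, H=256, W=9182).
timings_ms = {
    "block_sharded": 0.0225,
    "interleaved": 0.255,
}

slowdown = timings_ms["interleaved"] / timings_ms["block_sharded"]
print(f"interleaved is {slowdown:.1f}x slower than block sharded")
```

So for the same shape, the interleaved variant takes roughly eleven times as long as the optimized block sharded one.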

Other possible optimizations to consider (e.g. for sharded bcast_h, but possibly also applicable to other variants):

johanna-rock-tt commented 2 weeks ago

FYI: @TT-BrianLiu @shwetankTT @tt-aho @davorchap @uaydonat