Open dmakoviichuk-tt opened 1 month ago
@yan-zaretskiy outlined the algorithm necessary to implement this feature and improve performance. See https://hackmd.io/@segfault2024/rktH6lkgJe
@yan-zaretskiy any updates on this..? :)
@yan-zaretskiy @eyonland any updates here? I think we're treating this like an op improvement, not a P0 bug?
This turned out to be a much more complicated problem to solve. @yan-zaretskiy will be driving this work starting tomorrow.
Describe the bug Broadcasting over the batch dimension makes ops run much slower. We use this pattern in the optimizer step for each layer: https://github.com/tenstorrent/TT-Tron/blob/main/sources/ttml/optimizers/sgd.cpp.
SGD performance is 10 times slower than the PyTorch CPU version. To Reproduce Run any op that requires a broadcast over the batch dimension.
Expected behavior It should use a fused broadcast op instead of two separate calls.
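For anyone reproducing outside ttnn: the issue is the difference between materializing the broadcast as a separate step versus fusing it into the op. Here's a minimal numpy sketch of the two patterns (shapes are hypothetical, and numpy stands in for the ttnn ops; the actual kernels live in tt-metal):

```python
import numpy as np

# Hypothetical shapes: a batched activation and a per-layer tensor
# that must be broadcast over the batch dimension.
batch, h, w = 8, 32, 32
x = np.random.rand(batch, h, w).astype(np.float32)
b = np.random.rand(1, h, w).astype(np.float32)

# Unfused (the slow path): materialize the broadcast first, paying for
# an extra full-size copy, then run the elementwise op as a second call.
b_repeated = np.repeat(b, batch, axis=0)  # explicit copy over the batch dim
y_unfused = x * b_repeated

# Fused (the expected path): the op broadcasts b over the batch
# dimension in a single pass, with no intermediate tensor.
y_fused = x * b

assert np.allclose(y_unfused, y_fused)
```

Both paths compute the same result; the fused one avoids the intermediate allocation and the second kernel launch, which is where the slowdown reported above comes from.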
Additional context @eyonland I assigned this ticket to you as the elementwise owner. My expectation is that you can drive it with the LLK team and make sure they and your team can add the needed changes at both the metal and ttnn levels. If you cannot do this for some reason, please let me know and I'll find a new owner. It significantly reduces the performance of our training code.