Open dmakoviichuk-tt opened 1 month ago
@yan-zaretskiy outlined the algorithm necessary to implement this feature and improve performance. See https://hackmd.io/@segfault2024/rktH6lkgJe
@yan-zaretskiy any updates on this..? :)
@yan-zaretskiy @eyonland any updates here? I think we're treating this like an op improvement, not a P0 bug?
This turned out to be a much more complicated problem to solve. @yan-zaretskiy will be driving this work starting tomorrow.
Describe the bug Broadcasting over the batch dimension makes ops run much slower. We use this pattern in the optimizer step for each layer: https://github.com/tenstorrent/TT-Tron/blob/main/sources/ttml/optimizers/sgd.cpp.
SGD performance is 10 times slower than the PyTorch CPU version. To Reproduce Run any op that requires a broadcast over the batch dimension.
Expected behavior It should use a fused broadcast op instead of two separate calls.
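For anyone reproducing outside ttnn: the issue is the difference between materializing the broadcast as a separate step versus fusing it into the op. Here's a minimal numpy sketch of the two patterns (shapes are hypothetical, and numpy stands in for the ttnn ops; the actual kernels live in tt-metal):

```python
import numpy as np

# Hypothetical shapes: a batched activation and a per-layer tensor
# that must be broadcast over the batch dimension.
batch, h, w = 8, 32, 32
x = np.random.rand(batch, h, w).astype(np.float32)
b = np.random.rand(1, h, w).astype(np.float32)

# Unfused (the slow path): materialize the broadcast first, paying for
# an extra full-size copy, then run the elementwise op as a second call.
b_repeated = np.repeat(b, batch, axis=0)  # explicit copy over the batch dim
y_unfused = x * b_repeated

# Fused (the expected path): the op broadcasts b over the batch
# dimension in a single pass, with no intermediate tensor.
y_fused = x * b

assert np.allclose(y_unfused, y_fused)
```

Both paths compute the same result; the fused one avoids the intermediate allocation and the second kernel launch, which is where the slowdown reported above comes from.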
Additional context @eyonland I assigned this ticket to you as the elementwise owner. My expectation is that you can drive it with the LLK team and make sure they and your team can add the needed changes at both the metal and ttnn levels. If you cannot do this for some reason, please let me know and I'll find a new owner. It significantly reduces the performance of our training code.