Open dmakoviichuk-tt opened 2 weeks ago
@dmakoviichuk-tt Can you provide details on how you collected the performance numbers?
@dmakoviichuk-tt , my assumption is that you swapped out the ttnn function for a direct PyTorch function that runs on host and observed the overall perf difference. If a small tensor is touched many times during training, I can see how CPU branch prediction would be blazingly fast compared to pushing the tensor on and off device. Did you measure it via the overall training time?
@dmakoviichuk-tt , what was the size of the Tensor?
@umadevimcw With a timer. @eyonland It doesn't matter. As I mentioned, in the optimizer we need to multiply gradients by scalars. Gradients have the shape of the weights, so it could be something like (1, 1, 512, 1024). But we use these ops not only with gradients; in that case the shape could be, for example, (64, 1, 256, 2048).
@eyonland This is obviously really bad, slow code for such a simple operation. Why ask questions like that? I've already demonstrated the two problems that make it so slow.
@dmakoviichuk-tt , my assumption is that you swapped out the ttnn function with a direct pytorch function that runs on host and saw the overall perf difference.
Your assumption is wrong in all possible ways. How can I swap something for a PyTorch call if I don't use PyTorch?
Please be respectful to your colleagues. Right now it looks like you are trying to avoid fixing that obvious issue!
Sorry for the misunderstanding here. I was trying to figure out how you measured it originally.
We absolutely should be passing the scalar as a runtime arg and never create a tensor. My time has been stretched thin on this issue, as well as on rebuilding eltwise ops to properly handle broadcasting, since bcast does not do this adequately and the use of repeat is terrible given that we make multiple calls.
Describe the bug
Every time we call a binary op with a scalar, we create a tensor from the scalar and then also call ttnn::repeat:
We are using this in the optimizer step for each layer: https://github.com/tenstorrent/TT-Tron/blob/main/sources/ttml/optimizers/sgd.cpp.
SGD performance is 10 times slower than the PyTorch CPU version.

To Reproduce
Just run any binary op with a tensor and a scalar.
Expected behavior
The scalar parameter should be passed as a runtime arg to the program. We should never create a new tensor on host on every call.
Additional context
@eyonland I assigned this ticket to you as the elementwise owner. My expectation is that you can drive it with the LLK team and make sure they and your team can add the needed changes at both the metal and ttnn levels. If you cannot do it for some reason, please let me know and I'll find a new owner. This significantly reduces the performance of our training code.