Open kpaigwar opened 2 months ago
fyi @esmalTT @uaydonat
Are different batches running on different cores?
What's the size of the weights?
Does it mean that, in the unoptimized case, every core is reading the same data from DRAM? So there are num_batches times more DRAM reads?
The weights shape is [1, 1, 1, hidden_size], where hidden_size = 5120 and batch_size = 32. From the perf sheet, the multiply kernel is using all 64 cores. I think the work is split along hidden_dim, with each core holding an activation slice of size [1, 1, 32, 80].
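The split described above can be sketched with plain arithmetic (the even split along hidden_dim is my assumption based on the shapes quoted; the actual ttnn sharding may differ):

```python
# Hypothetical sketch of the work split: hidden_size divided evenly
# across 64 cores gives each core a [1, 1, batch_size, 80] slice.
hidden_size = 5120
batch_size = 32
num_cores = 64

per_core_width = hidden_size // num_cores  # 5120 // 64 = 80
per_core_activation_shape = (1, 1, batch_size, per_core_width)

print(per_core_activation_shape)  # (1, 1, 32, 80)
```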
I don't think batch_size-times more DRAM reads is what makes it slow, because in the pre-broadcasted case there is the same number of DRAM reads and it is faster.
In some models such as Mamba, there are a few eltwise operations (add/multiply) between input tensors and weights. Typically, input tensors have a batch_dim > 1 while weights have no notion of a batch_dim, so the weights must be broadcast along batch_dim before the eltwise operation.
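A minimal NumPy analogue of the shapes involved (NumPy here is only an illustration; ttnn has to perform the equivalent broadcast inside the eltwise kernel):

```python
import numpy as np

batch_size, hidden_size = 32, 5120
x = np.ones((1, 1, batch_size, hidden_size), dtype=np.float32)  # input tensor
w = np.ones((1, 1, 1, hidden_size), dtype=np.float32)           # weights, no batch dim

# The size-1 batch dim of w is broadcast up to batch_size during the multiply.
y = x * w
print(y.shape)  # (1, 1, 32, 5120)
```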
Presently, we support internal broadcasting of weights in ttnn, but these operations are pretty slow. This has been validated by comparing performance against pre-broadcasted weights in the unit test below. As can be seen in the table, multiply with pre-broadcasted weights is 32x faster. However, pre-broadcasting the weights is not a tractable solution, as storing the broadcasted copies consumes batch_size times more DRAM.
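The DRAM cost of pre-broadcasting can be sketched as follows (using `np.tile` to materialise the copies, which is what a pre-broadcast in DRAM amounts to):

```python
import numpy as np

batch_size, hidden_size = 32, 5120
w = np.ones((1, 1, 1, hidden_size), dtype=np.float32)

# Materialising the broadcast copies the weights batch_size times,
# so the pre-broadcasted tensor is batch_size times larger.
w_pre = np.tile(w, (1, 1, batch_size, 1))
print(w_pre.nbytes // w.nbytes)  # 32
```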
Perf Sheet ttnn_multply_perf.csv