
Llama3 Decode Perf Burndown #12103


cglagovichTT commented 2 months ago

Decode 128

We are aiming to hit 20 t/s/u end to end for Llama3 decode on t3k. These are the remaining issues:


Device Perf

Target: 625 µs per layer on device. Today we are at 685 µs per layer. To hit the target, we need to resolve three issues:

  1. https://github.com/tenstorrent/tt-metal/issues/11853 Use double buffering in ReduceScatter (@caixunshiren)
    • Saves 25 µs
  2. https://github.com/tenstorrent/tt-metal/issues/10415 Overlap AllGather and matmul_1d for dense_out (@avoraTT)
    • Saves 28 µs
  3. https://github.com/tenstorrent/tt-metal/issues/12107 Achieve expected bfp8_b AllGather perf with tracing enabled (@SeanNijjar)
    • Saves 32 µs

If we resolve these three issues and achieve the expected speedup for each, we will be comfortably beyond 20 t/s/u on device.
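As a quick sanity check on the arithmetic (a minimal sketch; the 80-layer count assumes Llama3-70B, which this issue does not state explicitly):

```python
# Back-of-envelope check of the per-layer savings (a sketch; the 80-layer
# count assumes Llama3-70B and is not stated in this issue).
current_us = 685.0               # measured device time per layer
savings_us = [25.0, 28.0, 32.0]  # issues 11853, 10415, 12107
target_us = 625.0                # device-perf target per layer
num_layers = 80                  # Llama3-70B decoder layers (assumed)

projected_us = current_us - sum(savings_us)  # 685 - 85 = 600 µs/layer
tokens_per_s = 1e6 / (projected_us * num_layers)

print(f"projected per-layer time: {projected_us:.0f} µs (target {target_us:.0f} µs)")
print(f"projected device-only throughput: {tokens_per_s:.1f} t/s/u")
# -> 600 µs/layer and ~20.8 t/s/u, i.e. beyond the 20 t/s/u goal on device
```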

E2E Perf

Dispatch adds up to 162 µs per layer, of which CCL accounts for 84 µs. We have one item of work which optimizes CCL dispatch and brings the total dispatch time per layer down to 108 µs:

  1. https://github.com/tenstorrent/tt-metal/pull/11957, https://github.com/tenstorrent/tt-metal/pull/11967 Reduce CCL dispatch time (@caixunshiren)

Beyond that, we need zero-latency dispatch in order to achieve 20 t/s/u end to end; https://github.com/tenstorrent/tt-metal/issues/12074 has information on that request.
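To see why the optimized dispatch alone is not enough, stack the 108 µs of dispatch on top of the 625 µs device target (same 80-layer assumption as above):

```python
# Rough end-to-end projection with the CCL dispatch optimization applied
# (a sketch; 80 Llama3-70B decoder layers assumed, not stated in this issue).
device_us = 625.0    # device-perf target per layer
dispatch_us = 108.0  # dispatch per layer after the CCL optimization
num_layers = 80

e2e_us_per_layer = device_us + dispatch_us
tokens_per_s = 1e6 / (e2e_us_per_layer * num_layers)
print(f"{tokens_per_s:.1f} t/s/u")  # ~17.1 t/s/u -- still short of 20,
                                    # hence the need for zero-latency dispatch
```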

cglagovichTT commented 2 months ago

Sheet: https://docs.google.com/spreadsheets/d/1dF0eT33eh6d5OURSMSiWv5mIBReE4-DvJ82_nao7G_E/edit?gid=95984160#gid=95984160

uaydonat commented 2 months ago

fyi @davorchap

SeanNijjar commented 1 month ago

I've found a core issue in the dispatch path that really hurts CCL performance (especially for these smaller CCLs): https://github.com/tenstorrent/tt-metal/issues/12395.

With this change, and everything in master, I was seeing some all-gather instances with cycle counts as low as 21k (several thousand cycles better than the current target).
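For reference, converting that cycle count to wall-clock time (a rough sketch; the ~1 GHz device clock is my assumption and is not stated in this thread):

```python
# Convert an all-gather cycle count to wall-clock time
# (assumes a ~1 GHz device clock, which is not stated in the thread).
clock_hz = 1.0e9
cycles = 21_000
print(f"{cycles / clock_hz * 1e6:.0f} µs")  # ~21 µs per all-gather
```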

cglagovichTT commented 1 month ago

Updated sheet after applying the fused AllGather+matmul and the CCL team's various optimizations. Summary: device dispatch fully explains the difference between device perf and e2e perf.

  * Device: 19.3 t/s/u
  * With dispatch: 15.5 t/s/u
  * With host <-> device data transfer: 15.1 t/s/u
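Decomposing those numbers per layer supports the summary (a sketch, under the same 80-layer assumption as above):

```python
# Decompose per-token latency to attribute the device-vs-e2e gap to dispatch
# (assumes 80 Llama3-70B decoder layers; the layer count is not stated here).
num_layers = 80
per_token_ms = lambda tsu: 1e3 / tsu  # t/s/u -> ms per token

gap_ms = per_token_ms(15.5) - per_token_ms(19.3)  # overhead added by dispatch
print(f"dispatch overhead: {gap_ms:.1f} ms/token "
      f"(~{gap_ms * 1e3 / num_layers:.0f} µs/layer)")
# -> ~12.7 ms/token, i.e. ~159 µs/layer of dispatch overhead
```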


cglagovichTT commented 1 month ago

  * Low-latency dispatch drop on t3k (ring buffer): 2.5-3 weeks out.
  * Turn off eth dispatch for a win (8x7 grid): reduces prefetcher and dispatch buffering, but hurts BW to DRAM.

cglagovichTT commented 1 month ago

Paul merged the worst-case dispatch optimization last night. Aditya is going to merge a change in a few days which undoes that optimization.