tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.

[Falcon40b] Prefill perf burndown #8049

Open · johanna-rock-tt opened this issue 4 months ago

johanna-rock-tt commented 4 months ago

Tracks the open issues for Falcon40b prefill to hit target perf.

Last updated: May 27th

Prefill

bfp8

Seqlen=2048

1 layer currently takes 12.54 ms. One-time ops: +2.2 ms.

Open Issues (bfp8 version):

| milestone | single layer latency (ms) | full model latency (ms) | t/s | notes |
| --- | --- | --- | --- | --- |
| current performance (bfp8) | 12.53 | 752.62 | 2721 | May 22nd, main + residual in fp16 |
| target @50% | 5.47 | 328.2 | 6240 | |
| nvidia benchmark | 10.9 | 654.6 | 3128 | |
| projected after burndown (bfp8) | 8.08 | 487 | 4201 | Considered in projection: hiding CCL time, SDPA |

Target estimates: see https://docs.google.com/spreadsheets/d/1LawF5YIbAQC1c7vMJG7z-_qh1YXojpnWNvgLMBv5jC8
Assuming HiFi2, S=2048 has a target of 6240 tok/sec.

6240 tok/sec target
6240 / 2048 = 3.047 batches / sec
3.047 / 1000 = 0.003047 batches / ms
1 / 0.003047 = 328.2 ms per inference
328.2 ms / 60 layers = 5.47 ms per layer
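
The same arithmetic as a small sketch (a hypothetical helper, assuming Falcon40b's 60 decoder layers and approximating full-model latency as layers × per-layer latency, i.e. ignoring one-time ops):

```python
# Hypothetical helper for the burndown math above; not part of the model code.
NUM_LAYERS = 60  # Falcon40b decoder layers

def per_layer_budget_ms(target_tok_per_s: float, seq_len: int, num_layers: int = NUM_LAYERS) -> float:
    """Per-layer latency budget (ms) implied by a tokens/sec target."""
    batches_per_s = target_tok_per_s / seq_len   # 6240 / 2048 = 3.047
    full_model_ms = 1000.0 / batches_per_s       # 328.2 ms per inference
    return full_model_ms / num_layers            # 5.47 ms per layer

def tok_per_s(per_layer_ms: float, seq_len: int, num_layers: int = NUM_LAYERS) -> float:
    """Throughput implied by a measured per-layer latency (one-time ops ignored)."""
    full_model_ms = per_layer_ms * num_layers
    return seq_len / (full_model_ms / 1000.0)

print(per_layer_budget_ms(6240, 2048))  # ~5.47 ms, the target @50% row
print(tok_per_s(12.53, 2048))           # ~2724 tok/s, close to the 2721 in the current-performance row
```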

Seqlen=128

Assuming the same optimizations as above. 2k is the harder regime, but the optimizations should also apply to seqlen=128.

1 layer currently takes 1.80 ms. One-time ops: +0.87 ms.

WIP Projections:

| milestone | single layer latency (ms) | full model latency (ms) | t/s | notes |
| --- | --- | --- | --- | --- |
| current performance (bfp8) | 1.80 | 108.06 | 1184 | May 2nd, main@61b385, unidirectional AllGather |
| target @50% | 0.83 | 50 | 2560 | |
| nvidia benchmark | 1.4 | 84.04 | 1523 | |
| projected after burndown (bfp8) | 1.48 | 90 | 1705 | Considered in projection: hiding CCL time |

Assuming HiFi2 here, S=128 has a target of 1423 tok/sec.

2560 tok/sec target
2560 / 128 = 20 batches / sec
20 / 1000 = 0.02 batches / ms
1 / 0.02 = 50 ms per inference
50 ms / 60 layers = 0.83 ms per layer
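
Using the same hypothetical helper from the sketch above for S=128:

```python
print(per_layer_budget_ms(2560, 128))  # ~0.83 ms per layer
print(tok_per_s(1.80, 128))            # ~1185 tok/s, close to the 1184 in the table
```
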
johanna-rock-tt commented 4 months ago

Current perf breakdown: https://docs.google.com/spreadsheets/d/1YCIqrqJ6cPd0AVZs1DiXuEOOrPL9boYmEIFSDufKj6M/edit?usp=sharing
Branch: main@61b3856b7d42daf1888604b53a6090733e52f5be
System: t-3012 @ 1 GHz

johanna-rock-tt commented 4 months ago

Most costly ops for S=2k

(Screenshot 2024-05-03 13:12:13: per-op cost breakdown for S=2k)

LN accumulated (slices): 3.8% per layernorm

johanna-rock-tt commented 4 months ago

Most costly ops for S=128

(Screenshot 2024-05-03 13:13:42: per-op cost breakdown for S=128)
johanna-rock-tt commented 4 months ago

fyi @pavlejosipovic @s-jovic @pavlepopovic @uaydonat

djordje-tt commented 3 months ago

After a couple of weeks and several changes, here is the new perf burndown:

Breakdown per type

| Type | % | Time (ms) |
| --- | --- | --- |
| AllGather | 48.5 | 9.57 |
| Matmuls | 35.9 | 7.08 |
| Layernorm | 6.7 | 1.31 |

Seqlen=128 (bfloat16)

| milestone | single layer latency (ms) | full model latency (ms) | t/s | notes |
| --- | --- | --- | --- | --- |
| current performance (bfloat16) | 2.24 | 134.5 | 951.6 | May 16th, main, bidirectional AllGather |
| target @50% | 0.83 | 50 | 2560 | |
| nvidia benchmark | 1.4 | 84.04 | 1523 | |
| projected after burndown (bfloat16) | 1.68 | 101.4 | 1262 | |

Breakdown per type

| Type | % | Time (ms) |
| --- | --- | --- |
| AllGather | 30.9 | 0.69 |
| Matmuls | 56.5 | 1.27 |
| Layernorm | 5.9 | 0.13 |

Next steps for seq_len=2048:

I believe the tasks should be executed in the listed order, so that we optimize the model in the areas where, according to the breakdown, the largest impact can be made.

johanna-rock-tt commented 3 months ago

We already tested casting the activations for AllGather to bfp4_b; that resulted in bad PCC (already for a single layer), while bfp8_b gave good PCC.
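
For context, PCC here is the Pearson correlation coefficient between a reference output and the output after the precision change; a minimal numpy sketch of such a check (hypothetical tensors and threshold, not the actual test harness):

```python
import numpy as np

def pcc(a: np.ndarray, b: np.ndarray) -> float:
    """Pearson correlation coefficient between two tensors, flattened."""
    a = a.reshape(-1).astype(np.float64)
    b = b.reshape(-1).astype(np.float64)
    return float(np.corrcoef(a, b)[0, 1])

# Hypothetical check: compare a layer's reference output against the output
# produced when the AllGather activations are cast to a lower-precision format.
reference_out = np.random.randn(1, 128, 8192)                        # placeholder reference
lowprec_out = reference_out + 1e-3 * np.random.randn(1, 128, 8192)   # placeholder "cast" output
assert pcc(reference_out, lowprec_out) > 0.99                        # threshold is an assumption
```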

djordje-tt commented 3 months ago

e2e perf issue: #8866