tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.

[Falcon40b] Prefill perf burndown #8049

Open · johanna-rock-tt opened this issue 4 months ago

johanna-rock-tt commented 4 months ago

Tracks the open issues for Falcon40b prefill to hit target perf.

Last updated: May 27th

Prefill

bfp8

Seqlen=2048

1 layer currently takes 12.54 ms. One-time ops: +2.2 ms.

Open Issues (bfp8 version):

| milestone | single layer latency (ms) | full model latency (ms) | t/s | notes |
| --- | --- | --- | --- | --- |
| current performance (bfp8) | 12.53 | 752.62 | 2721 | May 22nd, main + residual in fp16 |
| target @50% | 5.47 | 328.2 | 6240 | |
| nvidia benchmark | 10.9 | 654.6 | 3128 | |
| projected after burndown (bfp8) | 8.08 | 487 | 4201 | Considered in projection: hiding CCL time, SDPA |

Target estimates: see https://docs.google.com/spreadsheets/d/1LawF5YIbAQC1c7vMJG7z-_qh1YXojpnWNvgLMBv5jC8
Assuming HiFi2, S=2048 has a target of 6240 tok/sec.

6240 tok/sec target
6240 / 2048 = 3.047 batches / sec
3.047 / 1000 = 0.003047 batches / ms
1 / 0.003047 = 328.2 ms per inference
328.2 ms / 60 layers = 5.47 ms per layer
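
The same arithmetic as a small sketch (a hypothetical helper, assuming Falcon40b's 60 decoder layers and approximating full-model latency as layers × per-layer latency, i.e. ignoring one-time ops):

```python
# Hypothetical helper for the burndown math above; not part of the model code.
NUM_LAYERS = 60  # Falcon40b decoder layers

def per_layer_budget_ms(target_tok_per_s: float, seq_len: int, num_layers: int = NUM_LAYERS) -> float:
    """Per-layer latency budget (ms) implied by a tokens/sec target."""
    batches_per_s = target_tok_per_s / seq_len   # 6240 / 2048 = 3.047
    full_model_ms = 1000.0 / batches_per_s       # 328.2 ms per inference
    return full_model_ms / num_layers            # 5.47 ms per layer

def tok_per_s(per_layer_ms: float, seq_len: int, num_layers: int = NUM_LAYERS) -> float:
    """Throughput implied by a measured per-layer latency (one-time ops ignored)."""
    full_model_ms = per_layer_ms * num_layers
    return seq_len / (full_model_ms / 1000.0)

print(per_layer_budget_ms(6240, 2048))  # ~5.47 ms, the target @50% row
print(tok_per_s(12.53, 2048))           # ~2724 tok/s, close to the 2721 in the current-performance row
```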

Seqlen=128

Assuming the same optimizations as above. 2k is the harder regime, but the optimizations should also apply to seqlen=128.

1 layer currently takes 1.80 ms. One-time ops: +0.87 ms.

WIP Projections:

| milestone | single layer latency (ms) | full model latency (ms) | t/s | notes |
| --- | --- | --- | --- | --- |
| current performance (bfp8) | 1.80 | 108.06 | 1184 | May 2nd, main@61b385, unidirectional AllGather |
| target @50% | 0.83 | 50 | 2560 | |
| nvidia benchmark | 1.4 | 84.04 | 1523 | |
| projected after burndown (bfp8) | 1.48 | 90 | 1705 | Considered in projection: hiding CCL time |

Assuming HiFi2 here, S=128 has a target of 1423 tok/sec.

2560 tok/sec target
2560 / 128 = 20 batches / sec
20 / 1000 = 0.02 batches / ms
1 / 0.02 = 50 ms per inference
50 ms / 60 layers = 0.83 ms per layer
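
Using the same hypothetical helper from the sketch above for S=128:

```python
print(per_layer_budget_ms(2560, 128))  # ~0.83 ms per layer
print(tok_per_s(1.80, 128))            # ~1185 tok/s, close to the 1184 in the table
```
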
johanna-rock-tt commented 4 months ago

Current perf breakdown: https://docs.google.com/spreadsheets/d/1YCIqrqJ6cPd0AVZs1DiXuEOOrPL9boYmEIFSDufKj6M/edit?usp=sharing
Branch: main@61b3856b7d42daf1888604b53a6090733e52f5be
System: t-3012 @ 1 GHz

johanna-rock-tt commented 4 months ago

Most costly ops for S=2k

(Screenshot 2024-05-03 13:12:13: per-op cost breakdown for S=2k)

LN accumulated (slices): 3.8% per layernorm

johanna-rock-tt commented 4 months ago

Most costly ops for S=128

(Screenshot 2024-05-03 13:13:42: per-op cost breakdown for S=128)
johanna-rock-tt commented 4 months ago

fyi @pavlejosipovic @s-jovic @pavlepopovic @uaydonat

djordje-tt commented 3 months ago

After a couple of weeks and several changes, here is the new perf burndown:

Breakdown per type

| Type | % | Time (ms) |
| --- | --- | --- |
| AllGather | 48.5 | 9.57 |
| Matmuls | 35.9 | 7.08 |
| Layernorm | 6.7 | 1.31 |

Seqlen=128 (bfloat16)

| milestone | single layer latency (ms) | full model latency (ms) | t/s | notes |
| --- | --- | --- | --- | --- |
| current performance (bfloat16) | 2.24 | 134.5 | 951.6 | May 16th, main, bidirectional AllGather |
| target @50% | 0.83 | 50 | 2560 | |
| nvidia benchmark | 1.4 | 84.04 | 1523 | |
| projected after burndown (bfloat16) | 1.68 | 101.4 | 1262 | |

Breakdown per type

| Type | % | Time (ms) |
| --- | --- | --- |
| AllGather | 30.9 | 0.69 |
| Matmuls | 56.5 | 1.27 |
| Layernorm | 5.9 | 0.13 |

Next steps for seq_len=2048:

I believe the tasks should be executed in the listed order, so that we optimize the model in the areas where, according to the breakdown, the largest impact can be made.

johanna-rock-tt commented 3 months ago

We already tested casting the activations for AllGather to bfp4_b; that resulted in bad PCC (already for a single layer), while bfp8_b gave good PCC.
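
For context, PCC here is the Pearson correlation coefficient between a reference output and the output after the precision change; a minimal numpy sketch of such a check (hypothetical tensors and threshold, not the actual test harness):

```python
import numpy as np

def pcc(a: np.ndarray, b: np.ndarray) -> float:
    """Pearson correlation coefficient between two tensors, flattened."""
    a = a.reshape(-1).astype(np.float64)
    b = b.reshape(-1).astype(np.float64)
    return float(np.corrcoef(a, b)[0, 1])

# Hypothetical check: compare a layer's reference output against the output
# produced when the AllGather activations are cast to a lower-precision format.
reference_out = np.random.randn(1, 128, 8192)                        # placeholder reference
lowprec_out = reference_out + 1e-3 * np.random.randn(1, 128, 8192)   # placeholder "cast" output
assert pcc(reference_out, lowprec_out) > 0.99                        # threshold is an assumption
```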

djordje-tt commented 3 months ago

e2e perf issue: #8866