tenstorrent / tt-metal

:metal: TT-NN operator library and TT-Metalium low-level kernel programming model.

Enable stochastic rounding in FPU (wormhole_b0) #6094

Open · skhorasganiTT opened this issue 5 months ago

skhorasganiTT commented 5 months ago

LLMs such as Falcon accumulate error when performing large matmuls over many layers. This has been traced to an accumulation of rounding error in destination registers when there is a large number of spills and reloads. The error can be mitigated by enabling stochastic rounding in the FPU (see issue #5729). In addition, enabling stochastic rounding has been observed to improve model PCC even with packer L1 accumulation and fp32 dest accumulation already enabled.
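
For intuition, here is a minimal NumPy sketch (illustrative only, not tt-metal code) of the underlying effect: when a low-precision accumulator is rounded to nearest on every step, as happens on repeated dest-register spills/reloads, increments smaller than half an ULP are silently dropped, while stochastic rounding keeps the sum unbiased in expectation. The fp16 accumulator and step size below are stand-ins chosen to make the effect visible.

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_round_fp16(x):
    """Round float32 x to fp16, picking between the two bracketing fp16
    values with probability proportional to proximity (unbiased)."""
    lo = np.float16(x)  # round-to-nearest result, one bracketing value
    if np.float32(lo) == x:
        return lo
    # The bracketing fp16 value on the other side of x.
    direction = np.float16(np.inf) if np.float32(lo) < x else np.float16(-np.inf)
    hi = np.nextafter(lo, direction)
    p = (x - np.float32(lo)) / (np.float32(hi) - np.float32(lo))  # in (0, 1)
    return hi if rng.random() < p else lo

step, n = np.float32(1e-4), 10_000  # true sum: 1.0
acc_rn = np.float16(0.0)
acc_sr = np.float16(0.0)
for _ in range(n):
    acc_rn = np.float16(np.float32(acc_rn) + step)             # round-to-nearest
    acc_sr = stochastic_round_fp16(np.float32(acc_sr) + step)  # stochastic

# acc_rn stalls at 0.25, where the step falls below half an fp16 ULP;
# acc_sr lands near the true sum of 1.0 on average.
print(f"round-to-nearest: {acc_rn}  stochastic: {acc_sr}")
```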

Due to the improvements mentioned above, and since stochastic rounding is expected to improve PCC in general (as confirmed by @acejkov), it should be enabled by default.

FYI @uaydonat

ttmtrajkovic commented 5 months ago

The Tensix RTL team hasn't verified fp32 accumulation together with FPU stochastic rounding, so, although it might work, it shouldn't be used. I'd like to see how much it really improves PCC; please also confirm how it affects PCC with real data.

We can enable this by default for all matmul and conv operations with fp16 accumulation. The drawbacks are increased power (due to the random number generators toggling) and loss of exact reproducibility of results.

Given that we don't worry about power yet, we can just enable this as part of configure for any matmul with fp16 accumulation and add an op hook to control it later.
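
As a sketch only, such an op-level hook might look like the following in the ttnn Python API. The `stoch_rnd_mode` field is hypothetical (no such parameter exists today); the surrounding calls mirror the existing compute-kernel config used to toggle fp32 dest accumulation and packer L1 accumulation.

```python
import torch
import ttnn

device = ttnn.open_device(device_id=0)

a = ttnn.from_torch(torch.randn(32, 32), dtype=ttnn.bfloat16,
                    layout=ttnn.TILE_LAYOUT, device=device)
b = ttnn.from_torch(torch.randn(32, 32), dtype=ttnn.bfloat16,
                    layout=ttnn.TILE_LAYOUT, device=device)

compute_config = ttnn.WormholeComputeKernelConfig(
    math_fidelity=ttnn.MathFidelity.HiFi2,
    fp32_dest_acc_en=False,   # fp16 dest accumulation, the case proposed above
    packer_l1_acc=True,
    # stoch_rnd_mode="fpu",   # hypothetical hook to toggle stochastic rounding
)

out = ttnn.matmul(a, b, compute_kernel_config=compute_config)
ttnn.close_device(device)
```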

ttmtrajkovic commented 5 months ago

@skhorasganiTT, what is the priority of this? Are you able to make forward progress with just L1 accumulation?

skhorasganiTT commented 5 months ago

@ttmtrajkovic, these are the PCC numbers for Falcon7b-decode-kvcache2047 with packer_l1_acc already enabled (after 32 layers, before the lm head):

| Accumulation | Stochastic rounding | Output | K cache | V cache |
|--------------|---------------------|--------|---------|---------|
| FP32         | off                 | 0.9896 | 0.898   | 0.887   |
| FP32         | on                  | 0.997  | 0.981   | 0.972   |
| BF16         | off                 | 0.989  | 0.913   | 0.900   |
| BF16         | on                  | 0.993  | 0.954   | 0.949   |
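
For reference, PCC here is the Pearson correlation coefficient between a golden (reference) output and the device output. A minimal NumPy sketch of how such a number is computed (the helper name is illustrative):

```python
import numpy as np

def compute_pcc(golden: np.ndarray, actual: np.ndarray) -> float:
    """Pearson correlation coefficient between two flattened tensors.
    Values near 1.0 indicate the device output tracks the reference."""
    g = golden.astype(np.float64).ravel()
    a = actual.astype(np.float64).ravel()
    return float(np.corrcoef(g, a)[0, 1])
```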

In terms of priority, it is not urgent right now, since the models team can temporarily use an LLK override hack to enable stochastic rounding if needed.