Open caixunshiren opened 2 weeks ago
Update:
`mul_block_inplace` and `add_block_inplace` need to operate on bf16 cbs during im and stat accumulation/updates; otherwise we get pcc degradation compared to bf16. My WIP is on branch `sdpa-fp32-investigations`.
`reconfig_dataformat` doesn't work as expected and can hang or produce wrong results for in/out_cb. This explains the flash decode fp32 accumulate hang in reducer cores that I saw earlier: https://github.com/tenstorrent/tt-metal/issues/9608
Description
We do not support fp32 accumulate in the sdpa family of kernels. This becomes a problem when the number of chunks gets large, because pcc diverges from ground truth. For models that require a 128K sequence length, this is problematic.
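To illustrate why the accumulator format matters as the chunk count grows, here is a toy sketch in plain Python. It is not the kernel code: `to_bf16`, `to_fp32`, and `accumulate` are hypothetical helpers that emulate rounding in software, and the per-chunk contribution of 1.0 is an assumption chosen to make the effect easy to see.

```python
import struct


def to_bf16(x: float) -> float:
    """Round x to the nearest bfloat16 value (round-to-nearest-even).

    Software emulation for illustration only; NaN/Inf are not handled.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Add the rounding bias, then drop the low 16 bits of the fp32 pattern.
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def to_fp32(x: float) -> float:
    """Round x to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def accumulate(values, cast):
    """Sum values, rounding the running total with `cast` after every add."""
    acc = 0.0
    for v in values:
        acc = cast(acc + v)
    return acc


# Assumption: each chunk contributes exactly 1.0 to a running statistic.
num_chunks = 4096
contributions = [to_bf16(1.0)] * num_chunks

bf16_total = accumulate(contributions, to_bf16)
fp32_total = accumulate(contributions, to_fp32)

print("bf16 accumulator:", bf16_total)
print("fp32 accumulator:", fp32_total)
```

In this sketch the bf16 accumulator stops growing once it reaches 256, because 257 is not representable with bf16's 8 bits of significand and the tie rounds back down to 256. The fp32 accumulator reaches the exact total of 4096. Real kernels accumulate different values, so the size of the error will differ, but the trend with more chunks is the same.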
This issue tracks the enabling of fp32 accumulate in the following kernels:
round 1:
round 2:
FYI @cglagovichTT