tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.

Add support for FP32/TF32 data formats in WH #4686

Open davorchap opened 10 months ago

davorchap commented 10 months ago
davorchap commented 9 months ago

This is a dependency for training / fine-tuning a model.

yugaoTT commented 9 months ago

Current issues:

  1. Host-side conversion bug from fp32 to uint32 and vice versa: fixed (see the sketch after this list).
  2. Matmul with small tensors (64, 64, 64) passes, but large tensors fail with bad PCC: investigating host side.
  3. LLK third-party changes causing PCC errors: fixed by @rtawfik01.
  4. torch2tt_tensor: when the tensor is stored in DRAM, the reader reads out wrong values; when in L1, the values are correct: debugging host side.
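A minimal host-side sketch (my own illustration, not the actual fix) of the kind of fp32-to-uint32 round trip that issue 1 refers to: the conversion has to be bit-preserving (e.g. via memcpy), not a numeric cast.

#include <cassert>
#include <cstdint>
#include <cstring>

// Pack an fp32 value into uint32 storage bit-for-bit (no numeric conversion).
inline std::uint32_t fp32_to_u32_bits(float f) {
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof(u));
    return u;
}

// Recover the original fp32 value from its uint32 bit pattern.
inline float u32_bits_to_fp32(std::uint32_t u) {
    float f;
    std::memcpy(&f, &u, sizeof(f));
    return f;
}

int main() {
    float x = 1.5f;
    assert(u32_bits_to_fp32(fp32_to_u32_bits(x)) == x);  // lossless round trip
    return 0;
}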
yugaoTT commented 9 months ago

Update: this function was causing issues on the device side; it did not have a Float32 case before the change.

// Returns index * tile_size_in_bytes for the given data format, computed with shifts/adds.
// The Float32 case was added as part of this change; previously fp32 fell through to the default.
inline __attribute__((always_inline)) constexpr static std::uint32_t MUL_WITH_TILE_SIZE(uint format, uint index) {
    switch (format & 0x1F) {
        case ((uint8_t)DataFormat::Float32): return (index << 12);    // 4096 bytes per 32x32 fp32 tile
        case ((uint8_t)DataFormat::Float16):
        case ((uint8_t)DataFormat::Float16_b): return (index << 11);  // 2048 bytes per fp16/bf16 tile
        case ((uint8_t)DataFormat::Bfp8_b):
        // Keep default as Bfp8?
        default: return ((index << 10) + (index << 6));               // 1088 bytes per Bfp8 tile (data + exponents)
    };
}
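The tile-size arithmetic behind those shifts can be sanity-checked on the host. A minimal sketch, assuming 32x32-element tiles and, for Bfp8_b, a 64-byte shared-exponent section per tile:

#include <cstdint>

// Bytes per 32x32 tile for each format; the shift/add forms match MUL_WITH_TILE_SIZE above.
constexpr std::uint32_t tile_bytes_fp32  = 32 * 32 * 4;       // 4096 == 1 << 12
constexpr std::uint32_t tile_bytes_fp16  = 32 * 32 * 2;       // 2048 == 1 << 11
constexpr std::uint32_t tile_bytes_bfp8b = 32 * 32 * 1 + 64;  // 1088 == (1 << 10) + (1 << 6)

static_assert(tile_bytes_fp32  == (1u << 12), "Float32 case");
static_assert(tile_bytes_fp16  == (1u << 11), "Float16/Float16_b case");
static_assert(tile_bytes_bfp8b == ((1u << 10) + (1u << 6)), "Bfp8_b / default case");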
yugaoTT commented 9 months ago

The fp32 path (input, output, fp32_acc_en) now works for matmul, matmul_no_mcast, and matmul_1d. I think the ops that need fp32 support added (at least to support Falcon 40B) are as follows:

rotary_embedding
matmul
matmul_1d
group_attn_matmul
nlp_create_qkv_heads
nlp_concat_heads
update_cache
unpad
transpose
scale_mask_softmax_in_place
layernorm
all_gather
binary add
interleaved_to_sharded
sharded_to_interleaved
embeddings
fill_cache

@tt-aho, please let me know whether the above are all the ops required in Falcon, and whether there are any redundant ones.

tt-aho commented 9 months ago

There is also layernorm, binary add, interleaved_to_sharded, and sharded_to_interleaved (which may be fused into one reshard op or optimized away), and potentially embeddings (currently on host, but planned to move to device for inference at some point). For F40B prefill I think the ops are the same, with the exception of using fill_cache instead of update_cache.

yugaoTT commented 9 months ago

Thanks, I updated it.

ttmtrajkovic commented 9 months ago

Met with @yugaoTT, the following are notes:

Milos

jliangTT commented 9 months ago

@ttmtrajkovic, can we stagger the merge so that matmul support for this feature lands in main first, and roll out every other op incrementally as we go?

yugaoTT commented 9 months ago

@jliangTT yes, I can merge matmul only, then work on the other ops. For the other ops, do we want math_approx_mode, fp32_dest_acc_en, and math_fidelity all exposed, or just fp32_dest_acc_en?

jliangTT commented 9 months ago

Every op has to be updated to take an extra fp32 dest accumulation parameter. It would be good to maintain a spreadsheet of which ops support this and which do not.

I like @ttmtrajkovic's idea here; we can reason through exposing math_approx_mode, fp32_dest_acc_en, and math_fidelity vs. just fp32_dest_acc_en on an op-by-op basis.
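One way to reason about this op by op (a hypothetical sketch only; the struct and enum names below are my own, not necessarily the repo's actual API) is to group the three knobs into a single per-op compute config, with fp32_dest_acc_en defaulting to off:

// Hypothetical grouping of the three knobs into one per-op config struct.
// Names (ComputeConfig, MathFidelity values) are illustrative, not the actual API.
enum class MathFidelity { LoFi, HiFi2, HiFi3, HiFi4 };

struct ComputeConfig {
    MathFidelity math_fidelity = MathFidelity::HiFi4;
    bool math_approx_mode = false;
    bool fp32_dest_acc_en = false;  // the parameter being rolled out per op in this issue
};

// An op would then take the struct instead of three loose parameters, e.g.:
// Tensor matmul(const Tensor& a, const Tensor& b, const ComputeConfig& config = {});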

ttmtrajkovic commented 9 months ago

@jliangTT, that topic is being discussed. I need some time to figure out how the ops are defined and then propose some updates to make these params common.

@yugaoTT, @jliangTT The rollout can and should be incremental, but it would be great to see a spreadsheet of the ops that support fp32 accumulation and the ones that don't.

Milos

yugaoTT commented 9 months ago

Here is the op list: https://tenstorrent.sharepoint.com/:x:/r/sites/Jasmina/_layouts/15/Doc.aspx?sourcedoc=%7B08306401-7C55-4F8C-BD7A-8F2C4B8EB9C7%7D&file=Book.xlsx&action=default&mobileredirect=true

yugaoTT commented 8 months ago

Update: the list above is done; it needs model-side testing, then merge. Next step: add fp32 to the convs:

  1. optimized_conv
  2. groupnorm
jliangTT commented 8 months ago

@razorback3, fp32 support has been added for matmul. Please check it in main.

ttmtrajkovic commented 8 months ago

@yugaoTT, could you please update the ticket with the status of adding fp32 to the remaining ops? Thanks.

yugaoTT commented 8 months ago

The list above is done but hasn't been merged yet. Since all the ops that need immediate fp32 support (convs, matmul, group_attn_matmul) are merged, can we downgrade the urgency of this issue?

jliangTT commented 8 months ago

OK, I will lower it to P2 for now, as this has met the urgent need of adding fp32 to matmul.

ttmtrajkovic commented 8 months ago

Thanks, although I am not sure I fully understand. So all the ops from the list have fp32 added, either in a branch (unmerged) or in main (merged), is that correct? If yes, please specify the status of merged vs. non-merged ops.

Other than the ops on the list, are there others that would need fp32 accumulation?

My suggestion would be to wrap up the task and enable this across the board, since there will be users who, for training, need to propagate fp32 through pretty much all the ops, especially if they create more complex fused ops. All compute ops that operate on DST should have this parameter.
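As a toy host-side illustration of why fp32 accumulation in DST matters for training (my own example, not device code): once a running sum held in a 16-bit format grows large enough, small increments stop registering at all, whereas an fp32 accumulator keeps them.

#include <cstdint>
#include <cstdio>
#include <cstring>

// Emulate bfloat16 storage by truncating the low 16 bits of a float's bit pattern.
static float to_bf16(float f) {
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof(u));
    u &= 0xFFFF0000u;  // keep sign, exponent, and the top 7 mantissa bits
    std::memcpy(&f, &u, sizeof(f));
    return f;
}

int main() {
    float acc_bf16 = 0.0f;
    float acc_fp32 = 0.0f;
    for (int i = 0; i < 100000; ++i) {
        acc_bf16 = to_bf16(acc_bf16 + 0.01f);  // stalls once the sum's ulp exceeds the increment
        acc_fp32 += 0.01f;                     // stays near the exact value of 1000
    }
    std::printf("bf16 accumulator: %f, fp32 accumulator: %f\n", acc_bf16, acc_fp32);
    return 0;
}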

Milos

ttmtrajkovic commented 8 months ago

@jliangTT, I'd prefer that this stay P1, since training will rely heavily on fp32. In addition, this is still a feature that is needed, rather than a nice-to-have.

yugaoTT commented 8 months ago

It is in the yugao/fp32_nlp_debug branch.

Merged: matmul, matmul_1d, group_attn_matmul, optimized_conv

Unmerged: rotary_embedding, nlp_create_qkv_heads, nlp_concat_heads, update_cache, unpad, transpose, scale_mask_softmax_in_place, layernorm, all_gather, binary add, interleaved_to_sharded, sharded_to_interleaved, embeddings, fill_cache

This is not the full list, but it covers most ops; there are other ops in tt_eager/tt_dnn/op_library that are not on the list.

I can add them after developing GN and benchmarking DRAM. When will model training start?

ttmtrajkovic commented 8 months ago

Thanks @yugaoTT. Training is already in progress at Moreh, and an explicit request has been made for eltwise unary ops (to manipulate fp32 data in the SFPU). I will work on adding that support. Let's revisit the status of this next week.

jliangTT commented 8 months ago

I am okay with keeping it as P1. From a tactical angle, Moreh is likely to integrate the top few ops already in main now, and more later.

yugaoTT commented 8 months ago

Thanks @ttmtrajkovic.

yugaoTT commented 8 months ago

@ttmtrajkovic I can make a few PRs to roll out the ops above batch by batch this week.