tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.

Add support for FP32/TF32 data formats in WH #4686

Open davorchap opened 10 months ago

davorchap commented 10 months ago
davorchap commented 9 months ago

This is a dependency for training / fine-tuning a model.

yugaoTT commented 9 months ago

Current issues:

  1. Host-side conversion bug from fp32 to uint32 and vice versa: fixed (see the sketch after this list).
  2. Matmul with small tensors (64, 64, 64) passes, but large tensors fail with bad PCC: investigating host side.
  3. LLK third-party changes causing PCC errors: fixed by @rtawfik01.
  4. torch2tt_tensor: when the tensor is stored in DRAM, the reader reads out wrong values; when in L1, the values are correct: debugging host side.
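A minimal host-side sketch (my own illustration, not the actual fix) of the kind of fp32-to-uint32 round trip that issue 1 refers to: the conversion has to be bit-preserving (e.g. via memcpy), not a numeric cast.

#include <cassert>
#include <cstdint>
#include <cstring>

// Pack an fp32 value into uint32 storage bit-for-bit (no numeric conversion).
inline std::uint32_t fp32_to_u32_bits(float f) {
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof(u));
    return u;
}

// Recover the original fp32 value from its uint32 bit pattern.
inline float u32_bits_to_fp32(std::uint32_t u) {
    float f;
    std::memcpy(&f, &u, sizeof(f));
    return f;
}

int main() {
    float x = 1.5f;
    assert(u32_bits_to_fp32(fp32_to_u32_bits(x)) == x);  // lossless round trip
    return 0;
}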
yugaoTT commented 9 months ago

Update: this function was causing issues on the device side; it did not have a Float32 case before the change.

// Returns index * tile_size_in_bytes for the given data format, computed with shifts/adds.
// The Float32 case was added as part of this change; previously fp32 fell through to the default.
inline __attribute__((always_inline)) constexpr static std::uint32_t MUL_WITH_TILE_SIZE(uint format, uint index) {
    switch (format & 0x1F) {
        case ((uint8_t)DataFormat::Float32): return (index << 12);    // 4096 bytes per 32x32 fp32 tile
        case ((uint8_t)DataFormat::Float16):
        case ((uint8_t)DataFormat::Float16_b): return (index << 11);  // 2048 bytes per fp16/bf16 tile
        case ((uint8_t)DataFormat::Bfp8_b):
        // Keep default as Bfp8?
        default: return ((index << 10) + (index << 6));               // 1088 bytes per Bfp8 tile (data + exponents)
    };
}
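The tile-size arithmetic behind those shifts can be sanity-checked on the host. A minimal sketch, assuming 32x32-element tiles and, for Bfp8_b, a 64-byte shared-exponent section per tile:

#include <cstdint>

// Bytes per 32x32 tile for each format; the shift/add forms match MUL_WITH_TILE_SIZE above.
constexpr std::uint32_t tile_bytes_fp32  = 32 * 32 * 4;       // 4096 == 1 << 12
constexpr std::uint32_t tile_bytes_fp16  = 32 * 32 * 2;       // 2048 == 1 << 11
constexpr std::uint32_t tile_bytes_bfp8b = 32 * 32 * 1 + 64;  // 1088 == (1 << 10) + (1 << 6)

static_assert(tile_bytes_fp32  == (1u << 12), "Float32 case");
static_assert(tile_bytes_fp16  == (1u << 11), "Float16/Float16_b case");
static_assert(tile_bytes_bfp8b == ((1u << 10) + (1u << 6)), "Bfp8_b / default case");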
yugaoTT commented 9 months ago

The fp32 path (input, output, fp32_acc_en) now works for matmul, matmul_no_mcast, and matmul_1d. I think the ops that need fp32 support added (at least to support Falcon 40B) are as follows:

rotary_embedding
matmul
matmul_1d
group_attn_matmul
nlp_create_qkv_heads
nlp_concat_heads
update_cache
unpad
transpose
scale_mask_softmax_in_place
layernorm
all_gather
binary add
interleaved_to_sharded
sharded_to_interleaved
embeddings
fill_cache

@tt-aho, please let me know whether the above are all the ops required in Falcon, and whether there are any redundant ones.

tt-aho commented 9 months ago

There is also layernorm, binary add, interleaved_to_sharded, and sharded_to_interleaved (which may be fused into one reshard op or optimized away), and potentially embeddings (currently on host, but planned to move to device for inference at some point). For F40B prefill I think the ops are the same, with the exception of using fill_cache instead of update_cache.

yugaoTT commented 9 months ago

Thanks, I updated it.

ttmtrajkovic commented 9 months ago

Met with @yugaoTT, the following are notes:

Milos

jliangTT commented 9 months ago

@ttmtrajkovic, can we stagger the merge so that matmul support for this feature lands in main first, and roll out every other op incrementally as we go?

yugaoTT commented 9 months ago

@jliangTT yes, I can merge matmul only, then work on the other ops. For the other ops, do we want math_approx_mode, fp32_dest_acc_en, and math_fidelity all exposed, or just fp32_dest_acc_en?

jliangTT commented 9 months ago

Every op has to be updated to take an extra fp32 dest accumulation parameter. It would be good to maintain a spreadsheet of which ops support this and which do not.

I like @ttmtrajkovic's idea here; we can reason through exposing math_approx_mode, fp32_dest_acc_en, and math_fidelity vs. just fp32_dest_acc_en on an op-by-op basis.
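One way to reason about this op by op (a hypothetical sketch only; the struct and enum names below are my own, not necessarily the repo's actual API) is to group the three knobs into a single per-op compute config, with fp32_dest_acc_en defaulting to off:

// Hypothetical grouping of the three knobs into one per-op config struct.
// Names (ComputeConfig, MathFidelity values) are illustrative, not the actual API.
enum class MathFidelity { LoFi, HiFi2, HiFi3, HiFi4 };

struct ComputeConfig {
    MathFidelity math_fidelity = MathFidelity::HiFi4;
    bool math_approx_mode = false;
    bool fp32_dest_acc_en = false;  // the parameter being rolled out per op in this issue
};

// An op would then take the struct instead of three loose parameters, e.g.:
// Tensor matmul(const Tensor& a, const Tensor& b, const ComputeConfig& config = {});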

ttmtrajkovic commented 9 months ago

@jliangTT, that topic is being discussed. I need some time to figure out how the ops are defined and then propose some updates to make these params common.

@yugaoTT, @jliangTT The rollout can and should be incremental, but it would be great to see a spreadsheet of the ops that support fp32 accumulation and the ones that don't.

Milos

yugaoTT commented 9 months ago

Here is the op list: https://tenstorrent.sharepoint.com/:x:/r/sites/Jasmina/_layouts/15/Doc.aspx?sourcedoc=%7B08306401-7C55-4F8C-BD7A-8F2C4B8EB9C7%7D&file=Book.xlsx&action=default&mobileredirect=true

yugaoTT commented 8 months ago

Update: the list above is done; it needs model-side testing, then merge. Next step: add fp32 to the convs:

  1. optimized_conv
  2. groupnorm
jliangTT commented 8 months ago

@razorback3, fp32 support has been added for matmul. Please check it in main.

ttmtrajkovic commented 8 months ago

@yugaoTT, could you please update the ticket with the status of adding fp32 to the remaining ops? Thanks.

yugaoTT commented 8 months ago

The list above is done but hasn't been merged yet. Since all the ops that need immediate fp32 support (convs, matmul, group_attn_matmul) are merged, can we downgrade the urgency of this issue?

jliangTT commented 8 months ago

OK, I will lower it to P2 for now, as this has met the urgent need of adding fp32 to matmul.

ttmtrajkovic commented 8 months ago

Thanks, although I am not sure I fully understand. So all the ops from the list have fp32 added, either in a branch (unmerged) or in main (merged), is that correct? If yes, please specify the status of merged vs. non-merged ops.

Other than the ops on the list, are there others that would need fp32 accumulation?

My suggestion would be to wrap up the task and enable this across the board, since there will be users who, for training, need to propagate fp32 through pretty much all the ops, especially if they create more complex fused ops. All compute ops that operate on DST should have this parameter.
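As a toy host-side illustration of why fp32 accumulation in DST matters for training (my own example, not device code): once a running sum held in a 16-bit format grows large enough, small increments stop registering at all, whereas an fp32 accumulator keeps them.

#include <cstdint>
#include <cstdio>
#include <cstring>

// Emulate bfloat16 storage by truncating the low 16 bits of a float's bit pattern.
static float to_bf16(float f) {
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof(u));
    u &= 0xFFFF0000u;  // keep sign, exponent, and the top 7 mantissa bits
    std::memcpy(&f, &u, sizeof(f));
    return f;
}

int main() {
    float acc_bf16 = 0.0f;
    float acc_fp32 = 0.0f;
    for (int i = 0; i < 100000; ++i) {
        acc_bf16 = to_bf16(acc_bf16 + 0.01f);  // stalls once the sum's ulp exceeds the increment
        acc_fp32 += 0.01f;                     // stays near the exact value of 1000
    }
    std::printf("bf16 accumulator: %f, fp32 accumulator: %f\n", acc_bf16, acc_fp32);
    return 0;
}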

Milos

ttmtrajkovic commented 8 months ago

@jliangTT, I'd prefer that this stay P1, since training will rely heavily on fp32. In addition, this is still a feature that is needed, rather than a nice-to-have.

yugaoTT commented 8 months ago

It is in the yugao/fp32_nlp_debug branch.

Merged: matmul, matmul_1d, group_attn_matmul, optimized_conv

Unmerged: rotary_embedding, nlp_create_qkv_heads, nlp_concat_heads, update_cache, unpad, transpose, scale_mask_softmax_in_place, layernorm, all_gather, binary add, interleaved_to_sharded, sharded_to_interleaved, embeddings, fill_cache

This is not the full list, but it covers most ops; there are other ops in tt_eager/tt_dnn/op_library that are not on the list.

I can add them after developing GN and benchmarking DRAM. When will model training start?

ttmtrajkovic commented 8 months ago

Thanks @yugaoTT. Training is already in progress at Moreh, and an explicit request has been made for eltwise unary ops (to manipulate fp32 data in the SFPU). I will work on adding that support. Let's revisit the status of this next week.

jliangTT commented 8 months ago

I am okay with keeping it as P1. From a tactical angle, Moreh is likely to integrate the top few ops already in main now, and more later.

yugaoTT commented 8 months ago

Thanks @ttmtrajkovic.

yugaoTT commented 8 months ago

@ttmtrajkovic I can make a few PRs to roll out the ops above batch by batch this week.