davorchap opened this issue 10 months ago (status: Open)
fp32 dest accumulation support: a dependency to train / fine-tune a model
Current issues:
Update: this function was causing issues on the device side; it did not have a Float32 case before the change below.
// Converts a tile index into a byte offset by multiplying with the tile size
// (in bytes) for the given data format; a 32x32 tile holds 1024 elements.
inline __attribute__((always_inline)) constexpr static std::uint32_t MUL_WITH_TILE_SIZE(uint format, uint index) {
    switch (format & 0x1F) {
        // Float32: 4 B/element -> 4096 B per tile
        case ((uint8_t)DataFormat::Float32): return (index << 12);
        // Float16 / Float16_b: 2 B/element -> 2048 B per tile
        case ((uint8_t)DataFormat::Float16):
        case ((uint8_t)DataFormat::Float16_b): return (index << 11);
        // Bfp8_b: 1024 B of mantissas + 64 B of shared exponents -> 1088 B per tile
        case ((uint8_t)DataFormat::Bfp8_b):
        // Keep default as Bfp8?
        default: return ((index << 10) + (index << 6));
    }
}
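Without the Float32 case, a Float32 format fell through to the default Bfp8 branch and produced offsets of index * 1088 bytes instead of index * 4096 bytes, which would explain the device-side addressing issues. Below is a minimal standalone sketch of the expected byte offsets; the DataFormat values are mocked for illustration (not the real enum), and it assumes C++17 for the message-less static_asserts:

#include <cstdint>

// Mocked DataFormat values for illustration only -- not the real enum.
enum class DataFormat : std::uint8_t { Float32 = 0, Float16 = 1, Bfp8_b = 2, Float16_b = 3 };

constexpr std::uint32_t MUL_WITH_TILE_SIZE(unsigned format, unsigned index) {
    switch (format & 0x1F) {
        case (std::uint8_t)DataFormat::Float32: return index << 12;
        case (std::uint8_t)DataFormat::Float16:
        case (std::uint8_t)DataFormat::Float16_b: return index << 11;
        case (std::uint8_t)DataFormat::Bfp8_b:
        default: return (index << 10) + (index << 6);
    }
}

// Expected tile sizes for a 32x32 tile: Float32 = 4096 B, Float16 = 2048 B,
// Bfp8_b = 1024 B of mantissas + 64 B of shared exponents = 1088 B.
static_assert(MUL_WITH_TILE_SIZE((unsigned)DataFormat::Float32, 3) == 3 * 4096);
static_assert(MUL_WITH_TILE_SIZE((unsigned)DataFormat::Float16_b, 3) == 3 * 2048);
static_assert(MUL_WITH_TILE_SIZE((unsigned)DataFormat::Bfp8_b, 3) == 3 * 1088);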
The fp32 path (input, output, fp32_acc_en) works now for matmul, matmul_no_mcast, and matmul_1d. I think the OPs that need fp32 support added (at least to support Falcon 40B) are as follows:
rotary_embedding
matmul
matmul_1d
group_attn_matmul
nlp_create_qkv_heads
nlp_concat_heads
update_cache
unpad
transpose
scale_mask_softmax_in_place
layernorm
all_gather
binary add
interleaved_to_sharded
sharded_to_interleaved
embeddings
fill_cache
@tt-aho pls let me know if the above are all the ops required in Falcon, and if there are any redundant ones.
There are also layernorm, binary add, interleaved_to_sharded, and sharded_to_interleaved (which may be fused into one reshard op or optimized away), and potentially embeddings (currently on host, but we plan to move it to device for inference at some point). For F40B prefill I think the ops are the same, with the exception of using fill_cache instead of update_cache.
Thanks, I updated it.
Met with @yugaoTT; the following are notes:
Milos
@ttmtrajkovic, can we stagger the merge such that the support for matmul for this feature is in main, and roll out the other ops incrementally as we go?
@jliangTT yes, I can merge the matmul support only, then work on the other ops. For the other ops, do we want math_approx_mode, fp32_dest_acc_en, and math_fidelity all exposed, or just fp32_dest_acc_en?
Every op has to be updated to have an extra fp32 dest accumulation parameter. It would be good to maintain a spreadsheet of which ops support this and which do not.
I like @ttmtrajkovic's idea here, and we can reason through exposing math_approx_mode, fp32_dest_acc_en, and math_fidelity vs. just fp32_dest_acc_en op by op.
@jliangTT, that topic is being discussed. I need some time to figure out how ops are defined and then propose some updates to make these params common.
@yugaoTT, @jliangTT The rollout can and should be incremental, but it would be great to see a spreadsheet of the ops that support fp32 accumulation and the ones that don't.
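To make the "common params" idea concrete, here is a hypothetical sketch of a shared compute-kernel config that every DST-using op could accept; the struct and field names are assumptions for illustration, not the actual tt_eager API:

#include <cstdint>

// Hypothetical shared config -- an illustration of the proposal only, not the
// real tt_eager interface.
enum class MathFidelity : std::uint8_t { LoFi = 0, HiFi2 = 2, HiFi3 = 3, HiFi4 = 4 };

struct DeviceComputeKernelConfig {
    MathFidelity math_fidelity = MathFidelity::HiFi4;  // multiplier fidelity
    bool math_approx_mode = false;                     // allow approximate SFPU math
    bool fp32_dest_acc_en = false;                     // accumulate in DST registers at fp32
};

// Example op signature (names illustrative): instead of each op growing its
// own ad-hoc flag, every compute op forwards the same config to its program
// factory, which plugs it into the compute kernel's compile-time args/defines.
//
//   Tensor matmul(const Tensor& a, const Tensor& b,
//                 const DeviceComputeKernelConfig& config = {});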
Milos
Update: the list above is done; it needs model-side testing, then merge. Next step: add fp32 to convs:
- optimized_conv
- groupnorm
@razorback3, fp32 support has been added for matmul. Please check it out in main.
@yugaoTT, could you please update the ticket with the status of adding fp32 to the remaining ops? Thanks.
The list above is done, but it hasn't been merged yet. Since all the ops (convs, matmul, group_attn_matmul) that need immediate fp32 support are merged, can we downgrade the urgency of this issue?
OK, I will lower it to P2 for now, as this has met the urgent need to add this to matmul.
Thanks, although I am not sure I fully understand. So all the ops from the list have fp32 added, either in a branch (unmerged) or in main (merged), is that correct? If yes, please specify the status of merged vs. non-merged.
Other than the ops on the list, are there others that would need fp32 accumulation?
My suggestion would be to wrap up the task and enable this across the board, since there will be users that, for training, need to propagate fp32 to pretty much all the ops, especially if they create more complex fused ops. All compute OPs that operate on DST should have this parameter.
Milos
OK, I will lower it to P2 for now, as this has met the urgent need to add this to matmul.
@jliangTT, I'd prefer if this is P1 - the feature is needed, as training will heavily rely on fp32. In addition, this is still a feature that is needed rather than nice to have.
It is in yugao/fp32_nlp_debug.
Merged: matmul, matmul_1d, group_attn_matmul, optimized_conv
Unmerged: rotary_embedding, nlp_create_qkv_heads, nlp_concat_heads, update_cache, unpad, transpose, scale_mask_softmax_in_place, layernorm, all_gather, binary add, interleaved_to_sharded, sharded_to_interleaved, embeddings, fill_cache
This is not the full list, but it covers most ops; there are other ops in tt_eager/tt_dnn/op_library that are not in the list.
I can add them after developing GN (groupnorm) and benchmarking DRAM. When will the model training start?
Thanks @yugaoTT. The training is already in progress by Moreh, and an explicit request has been made for eltwise unary ops (to manipulate fp32 data in the SFPU). I will work on adding that support. Let's revisit the status of this next week.
I am okay with keeping it as P1. From a tactical angle, Moreh is likely to integrate the top few ops already in main now, then more later.
thanks @ttmtrajkovic
@ttmtrajkovic I can make a few PRs to roll out the ops above batch by batch this week.