tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.
Apache License 2.0
424 stars 54 forks source link

Hang in pre-tilize in conv compute kernel when in0 is BFLOAT8_B and in1 is BFLOAT16 #4280

Closed TT-BrianLiu closed 7 months ago

TT-BrianLiu commented 9 months ago

In tt_eager/tt_dnn/op_library/conv/kernels/conv_bmm_tilize_col_major_out_blocks.cpp, line 169 should work but hangs if in1_cb is BFLOAT16 and the in0 for matmul (here, tilized_in0_cb_id) is BFLOAT8_B. in0_pretilize_cb_id is ROW_MAJOR and always BFLOAT16, so I don't think we even need sraA reconfig in this case?

image

To get the test to pass, you can force the reconfig to happen with: unpack_reconfig_data_format_srca(in1_cb_id);

For reference, there are other convs where we don't have the pre-tilize block and instead tilize once we get input from reader. The dataformats in the llk APIs should be the same here, but we don't need to force the reconfig.

image

Branch: bliu/issue-4280

Test:

pytest models/demos/resnet/tests/test_resnet50_untilize_with_halo_and_conv_v2.py::test_resnet50_conv -k HiFi4-activations_BFLOAT8_B-weights_BFLOAT16-512-512-7-7-3-3-1-1-1-1-batch_8
davorchap commented 8 months ago

@TT-BrianLiu is this still relevant?

TT-BrianLiu commented 8 months ago

@TT-BrianLiu is this still relevant?

I think so. Definitely something fishy going on when we switch between tilize and matmul. @tt-aho saw something similar when uplifting attention matmul to support mixed precision. He had to force the srca reconfig when it wasn't needed / shouldn't be doing anything.

rtawfik01 commented 7 months ago

I have a fix here: #5536 The issue is the tilize_uninit call reprograms the unpack config back to what was stored before tilize was executed. In this case, the default operand passed to tilize uninit is 0, when in reality it should have reverted back to the config of operand 1. I passed in the correct operand to the tilize uninit function so the test can pass. Ill debug the rest of these cases as well and try to make that operand not take in a default

rtawfik01 commented 7 months ago

Merged fix here: https://github.com/tenstorrent-metal/tt-metal/pull/5536