tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.
Apache License 2.0
366 stars 36 forks source link

[Blackhole Bringup] Pack untilize kernel #10052

Open rtawfik01 opened 1 month ago

rtawfik01 commented 1 month ago

Pack untilize kernel for blackhole is incomplete, due to packer implementation changing for blackhole. This issue is 2 parts:

fyi @abhullar-tt @ttmtrajkovic

rtawfik01 commented 1 month ago

Added the standalone test here: #10057

abhullar-tt commented 1 month ago

Does this manifest as

"compute_kernel_api/pack_untilize.h:28:48: error: too many arguments to function 'void llk_pack_untilize_init() [with long unsigned int block_ct_dim = 1]'
   28 |     PACK(( llk_pack_untilize_init<block_ct_dim>(ocb) ));"
rtawfik01 commented 1 month ago

Does this manifest as

"compute_kernel_api/pack_untilize.h:28:48: error: too many arguments to function 'void llk_pack_untilize_init() [with long unsigned int block_ct_dim = 1]'
   28 |     PACK(( llk_pack_untilize_init<block_ct_dim>(ocb) ));"

That's the first part, the second part is after the compilation error it mismatches, but that's because pack_untilize algorithm was not ported to the new blackhole packer functionality

rtawfik01 commented 1 month ago

Pushed here: https://github.com/tenstorrent/tt-metal/pull/10422 and now all of these tests pass:

pytest -svv tests/ttnn/unit_tests/operations/test_activation.py
pytest -svv tests/ttnn/unit_tests/test_to_layout.py
pytest -svv tests/ttnn/unit_tests/operations/test_elt_binary.py
TT_METAL_WATCHER=5 TT_METAL_DPRINT_CORES=0,0 TT_METAL_SLOW_DISPATCH_MODE=1 ./build/test/tt_metal/test_untilize

Will remove "P1_critical" label. But @ttmtrajkovic there is still some testing missing:

  1. The above tests never have block_rt_dim > 1
  2. The above tests do not have output of sizes of 1x16 or 1x32 (what fused reduce tests expect)
  3. The above tests do not have fused pack_untilize, when pack_untilize if fused, a decision needs to be made on how
    MATH(( llk_math_hw_configure_disaggregated<true>() ));

    should be called, because it needs to be on the math thread, for max pool tests the above config needs to be done when initializing the reduce

rtawfik01 commented 20 hours ago

Update on this issue.

Have some fixes on this branch: rtawfik/reduce_untilize_a_b

With the testing that is missing:

  1. Don't see any WHB0 tests that use block_rt_dim>1, we should add unit testing for all devices for this since no model tests this use case currently.
  2. There is also block_ct_dim!=full_ct_dim. Currently conv tests have block_ct_dim=full_ct_dim, and this is not tested for WHB0 either, unit testing should be added for all devices, but not currently needed for CNNs for Blackhole bringup.
  3. 1x32 are used in maxpool tests and will have a unit test added for this (Specifically see 1x32 used in maxpool tests, have not seen compilation for 1x16).
  4. MATH(( llk_math_hw_configure_disaggregated<true>() )); will be added to the api init calls for the pack_untilize & fused reduce