tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.
https://docs.tenstorrent.com/ttnn/latest/index.html
Apache License 2.0
481 stars 78 forks source link

Reinitialization of Tilize/Untilize Operations Fails When Changing # Out Tiles #14973

Open wransom-TT opened 1 week ago

wransom-TT commented 1 week ago

Describe the bug When processing tensors with C dimensions that are larger than 8 but not multiples of 8 tiles, the resulting output is wrong. This seems to be due to incomplete re-initialization of the tilize/untilize operations in the compute kernel since for instance if we have 12 tiles, we need to first process 8 tiles, then 4 tiles.

To Reproduce

  1. checkout wransom/TwelveTileBug
  2. build the code ./build_metal.sh --debug
  3. run the max pool unit test pytest tests/ttnn/unit_tests/operations/test_maxpool2d.py which runs a simple test case on 1 core
  4. the test should fail, but should also print register output from right after the reduce operation (see line 80 of max_pool_multi_core.cpp)

From here we need to set the stage a bit to understand the current debugging information. Because we have 12 C tiles, reduce_h_fused in max_pool_multi_core.cpp will be called twice. The first time it should process the first 8 tiles, the second time it should process the remaining 4 tiles. (both C = 4 tiles and C = 16 tile cases work, but C = 12 does not). Furthermore, we have 4 sticks since HW=2x2. So, we will see a total of 8 tiles printed to the console from the reduce output registers. In the branches default state it is set to print register data from tile 0 so, for the 12 tile test case we are working with, we will see all 8 printed tiles with a top row equal to "4" (the correct value). If we change line 80 of max_pool_multi_core.cpp to print tile 4 instead, we will see every other tile printed with 0 as the top row. This is what we should expect since we only have 12 tiles and are printing tile index 4, which on the second pass of reduce_h_fused exceeds the total of 12 tiles.

So all this seems to imply that the output of reduce is correct, and that the issue may lie in the untilize procedure.

Note in the branches default state, we are not reinitializing the tilize operation (we are only reinitializing the utilize operation (see lines 138, 144, 150, 156 of max_pool_multi_core.cpp. Also note that the commented out calls to tilizeA_B_reduce_init_short on lines 138 and 150 reference a newly implemented, untested function which may not behave as intended. Nonetheless it is expected that the tilize operation will need to be re-initialized as well, but the results are the same at least for this simple case whether these calls to tilize are used or not.

Expected behavior With the updates in the referenced branch, a tensor with a 12 tile C dimensions should work.

Please complete the following environment information:

rtawfik01 commented 1 week ago

@ttmtrajkovic This issue is about being able to use tilize + untilize combination for different number of tiles in one fused kernel. To recap:

  1. Tilize + reduce + untilize for 4 tiles works standalone
  2. Tilize + reduce + untilize for 8 tiles works standalone
  3. Initializing tilize + reduce + untilize for 4 tiles, then re-initializing for 8 tiles does not work (or vice versa).

The above is a new addition for Wormhole B0, and I think we should isolate each of these ops into respective tests to check whether re-initialization in the same kernel can work, so that means:

  1. Adding a Tilize unit test that does 4 tiles, then re-initializing to 8 tiles (or vice versa)
  2. Adding a Untilize unit test that does 4 tiles, then re-initializing to 8 tiles (or vice versa)
  3. Same thing for reduce for sanity

@rdjogoTT @amahmudTT fyi for wormhole kernels.

In the meantime @wransom-TT @mywoodstock , please make the smallest possible version of the test above, so that would be with single core, and only 12 tiles in the test (1 iteration), and please add the appropriate priority.

wransom-TT commented 1 week ago

@rtawfik01 the test in this branch is the smallest possible failing example. I have set it to 1 core (this is done in the python test it's easy to miss see line 163 of tests/ttnn/unit_tests/operations/test_maxpool2d.py), 12 tiles, N = 1 and HW=2x2 (we don't see the issue with 2x1 or 1x1).

Do you know what the ETA looks like for this fix? We may try to implement a workaround in the meantime.

rtawfik01 commented 1 week ago

@wransom-TT thanks for the smaller test, currently I don't have an ETA since most of the team has P0 bugs they are working on, I'll let you know once someone is available to look at this issue.

mywoodstock commented 1 week ago

@rtawfik01 Thanks! Please let us know when this will be visited -- meanwhile we will try to use workarounds.