tenstorrent / tt-metal

:metal: TT-NN operator library and TT-Metalium low-level kernel programming model.
https://docs.tenstorrent.com/ttnn/latest/index.html
Apache License 2.0

Untilize LLK produces incorrect results on BH #14594

Open jaykru-tt opened 3 weeks ago

jaykru-tt commented 3 weeks ago

I've got what looks like an issue with `pack_untilize_block` on BH. I'm untilizing a 261-tile tensor (shape is `[1, 1, 261*32, 32]`) and seeing wrong results in the second tile (starting at the 33rd row). I've confirmed that the correct data is read into the input CB that `pack_untilize_block` consumes; after it runs, I see 10 bf16s that are all zero, followed by correct data (starting from the 11th element of the row).

I've added debug prints to the program factory and kernels, as well as minimized test cases in `test_untilize_test.py`. These can be found on the branch `jkruer/untilize_debug`.
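For reference, the tiled-to-row-major conversion the op is expected to perform can be sketched as a pure-Python golden model. This is a hypothetical sketch, assuming row-major 32x32 tiles stored tile-by-tile (tile rows first) and ignoring the 16x16 face subdivision real TT tiles have; it is not the actual golden used in the test:

```python
# Hypothetical golden untilize: convert a flat buffer of row-major
# 32x32 tiles (stored tile-by-tile, tile rows first) back into a
# row-major H x W matrix. Real TT tiles also contain 16x16 faces,
# which this sketch deliberately ignores.
T = 32  # tile side length

def untilize_golden(flat, H, W):
    assert H % T == 0 and W % T == 0
    tiles_per_row = W // T
    out = [[0.0] * W for _ in range(H)]
    for r in range(H):
        for c in range(W):
            tile = (r // T) * tiles_per_row + (c // T)   # which tile holds (r, c)
            offset = (r % T) * T + (c % T)               # position inside the tile
            out[r][c] = flat[tile * T * T + offset]
    return out
```

Comparing the device output against a model like this row by row is what pinpoints the first bad row (here, row 33, i.e. the start of the second tile).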

jaykru-tt commented 3 weeks ago

@nvelickovicTT @ncvetkovicTT did either of you have a chance to check this out? Any idea of a rough ETA?

ncvetkovicTT commented 3 weeks ago

> @nvelickovicTT @ncvetkovicTT did either of you have a chance to check this out? Any idea of a rough ETA?

Hey @jaykru-tt, I will take a look at it today. Once I have a better understanding of the issue, I will let you know the ETA.

ncvetkovicTT commented 3 weeks ago

Hi @jaykru-tt, I tried setting `use_multicore` to `False`, and all the shapes that you provided pass for both data formats. Could you tell me more about how the input gets mapped to the grid? For instance, I see that the grid is 13x10, but why is that? If we could somehow make the test fail on a single core, or isolate the issue better, it would cut down the time it takes us to find issues on the LLK side, if any exist.

ntarafdar commented 1 week ago

@jaykru-tt could you look into this

jaykru-tt commented 1 week ago

@ncvetkovicTT were you able to reproduce the failure in a minimized test case with `use_multicore=True`? Also, I don't fully follow your question about the mapping of the input to the grid. I'm going to try running this with `use_multicore=False` and see if we can at least use that for now.

ncvetkovicTT commented 1 week ago

> @ncvetkovicTT were you able to reproduce the failure in a minimized test case with use_multicore=true? Also, I don't fully follow your question about the mapping of the input to grid. I'm going to try running this with use_multicore=false and see if we can at least use that for now.

I wasn't. I set `use_multicore=True` and tried out a lot of input shapes, from a single tile onwards, and I didn't see the failure. Is this the case on your end as well when you use a single core?

jaykru-tt commented 1 week ago

@ncvetkovicTT

> I put use_multicore=true and tried out a lot of input shapes, including a single tile onwards, and I didn't see the fail.

This was with the test I sent over? I.e., something like `b1 = ttnn.untilize(a, memory_config=out_mem_config, use_multicore=True, use_pack_untilize=True)`, comparing against the pure-Python golden untilize? And on a BH? If so, I'm very confused about why we're seeing different results.

> Is this the case on your end as well when you use single core?

Finally got around to testing this. No, I'm not seeing it on single core. It looks like the single-core untilize op uses the same compute kernels, and hence the same LLKs. Is there any chance of interference between the cores for this LLK?

ncvetkovicTT commented 1 week ago

Hey @jaykru-tt, sorry for not documenting my findings in a more elaborate way; I will correct that now. Let me answer your questions: the first part of your previous reply is covered by my explanation below. As for the second part: no, I don't think the LLKs have any connection to the core layout, the number of cores used, or anything similar; I would rather look at the compute, reader, and writer kernels.

Now for the long explanation. I ran the tests with the part of the code that checks for `first_non_equal_row` commented out, because the `our_res`/`golden_res` difference is quite small and is probably caused by rounding errors. Although the tensors are not literally equal, the `torch.equal` and `comp_pcc` calls set `passing1` to `True`, making the overall test pass.
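As a rough illustration (not tt-metal's actual implementation), a PCC-style comparison in the spirit of `comp_pcc` can be sketched in pure Python; the function names and threshold below are illustrative. A check like this tolerates the small rounding differences that an exact `first_non_equal_row` scan would flag:

```python
import math

# Sketch of a Pearson-correlation (PCC) check in the spirit of
# comp_pcc; names and threshold are illustrative, not tt-metal's code.
def pcc(a, b):
    n = len(a)
    ma = sum(a) / n
    mb = sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def passes(golden, result, threshold=0.9999):
    # Exact equality OR high correlation counts as a pass, so tiny
    # rounding errors do not fail the test.
    return golden == result or pcc(golden, result) >= threshold
```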

With that `first_non_equal_row` check out of the way, and everything else intact, I ran the original shapes from your branch. This is the result on WH (I have the full output, but I will shorten it for readability):

```
=== 4 failed, 32 passed, 36 warnings in 14.08s ===
FAILED tests/tt_eager/python_api_testing/unit_testing/misc/test_untilize_test.py::test_run_untilize_test[5-2-4-8-bfloat16] - RuntimeError: Boolean value of Tensor with more than one value is ambiguous
FAILED tests/tt_eager/python_api_testing/unit_testing/misc/test_untilize_test.py::test_run_untilize_test[5-2-4-8-float] - RuntimeError: Boolean value of Tensor with more than one value is ambiguous
FAILED tests/tt_eager/python_api_testing/unit_testing/misc/test_untilize_test.py::test_run_untilize_test[5-2-4-7-bfloat16] - RuntimeError: Boolean value of Tensor with more than one value is ambiguous
FAILED tests/tt_eager/python_api_testing/unit_testing/misc/test_untilize_test.py::test_run_untilize_test[5-2-4-7-float] - RuntimeError: Boolean value of Tensor with more than one value is ambiguous
```

Now for BH, the same test conditions cause some tests to actually fail because `passing1` is `False`:

```
=== 12 failed, 24 passed, 36 warnings in 14.65s ===
FAILED tests/tt_eager/python_api_testing/unit_testing/misc/test_untilize_test.py::test_run_untilize_test[1-1-348-1-bfloat16] - assert False
FAILED tests/tt_eager/python_api_testing/unit_testing/misc/test_untilize_test.py::test_run_untilize_test[1-1-348-1-float] - assert False
FAILED tests/tt_eager/python_api_testing/unit_testing/misc/test_untilize_test.py::test_run_untilize_test[1-1-261-1-bfloat16] - assert False
FAILED tests/tt_eager/python_api_testing/unit_testing/misc/test_untilize_test.py::test_run_untilize_test[1-1-261-1-float] - assert False
FAILED tests/tt_eager/python_api_testing/unit_testing/misc/test_untilize_test.py::test_run_untilize_test[5-2-4-8-bfloat16] - RuntimeError: Boolean value of Tensor with more than one value is ambiguous
FAILED tests/tt_eager/python_api_testing/unit_testing/misc/test_untilize_test.py::test_run_untilize_test[5-2-4-8-float] - RuntimeError: Boolean value of Tensor with more than one value is ambiguous
FAILED tests/tt_eager/python_api_testing/unit_testing/misc/test_untilize_test.py::test_run_untilize_test[5-2-4-7-bfloat16] - RuntimeError: Boolean value of Tensor with more than one value is ambiguous
FAILED tests/tt_eager/python_api_testing/unit_testing/misc/test_untilize_test.py::test_run_untilize_test[5-2-4-7-float] - RuntimeError: Boolean value of Tensor with more than one value is ambiguous
FAILED tests/tt_eager/python_api_testing/unit_testing/misc/test_untilize_test.py::test_run_untilize_test[1-1-784-2-bfloat16] - assert False
FAILED tests/tt_eager/python_api_testing/unit_testing/misc/test_untilize_test.py::test_run_untilize_test[1-1-784-2-float] - assert False
FAILED tests/tt_eager/python_api_testing/unit_testing/misc/test_untilize_test.py::test_run_untilize_test[1-1-3136-2-bfloat16] - assert False
FAILED tests/tt_eager/python_api_testing/unit_testing/misc/test_untilize_test.py::test_run_untilize_test[1-1-3136-2-float] - assert False
```

Okay, let's tweak the shapes a bit, as I did in my branch (`ncvetkovic/14594-pack_untilize_block_issue`), and think about them. The first few shapes demonstrate that the size of the block (determined by `nw`) allocated to a single core is limited by L1/CB capacity. Then we increase the number of cores used by increasing `nh` while keeping the block size at 1x1 tiles; once we exceed 130 cores, we expect the next block to be picked up by core (0,0) (or whatever the logic is; the point is that once all cores finish their first block, they move on to a second one if there is one). We can confirm that the cores are told to process up to 2 blocks by reading `nblocks_per_core`, which is 1 for `nh` < 131, 2 for `nh` < 261, and so on. Now, when we want the cores to process up to three 1x1 blocks, the test breaks on Blackhole but passes on Wormhole.

If we continue to play around with the shape dimensions and increase `nw` > 2 while keeping `nh` = 261, the test passes. In any case, we need to understand how the tiles are mapped to cores, because with `use_multicore=False` I couldn't find a shape that fails the test.
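Under that reading, `nblocks_per_core` is just a ceiling division of the total number of 1x1 blocks over the available cores. A small sketch, assuming the 13x10 grid (130 cores) and `nw` = 1 as described above; the function name is mine, not the program factory's:

```python
# Sketch of the block-to-core math described above, assuming a 13x10
# grid (130 cores) and nh 1x1 blocks to distribute (nw = 1).
NCORES = 13 * 10

def nblocks_per_core(nh, ncores=NCORES):
    return -(-nh // ncores)  # ceiling division
```

This gives 1 block per core up to `nh` = 130, 2 from 131 to 260, and 3 at `nh` = 261, which matches where the BH failures start.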

I am not familiar with our whole SW stack, nor do I reject the possibility that this might be an LLK issue, but the only way I can see the compute LLKs being faulty is if there are several sequential kernel calls and the HW state is preserved between them, so that something is not initialized correctly. To be sure, I really need a case where a single tile processed on a single core fails, or at least some number of tiles with `use_multicore=False`.

ncvetkovicTT commented 2 days ago

@jaykru-tt PR #15398 makes this test behave the same as on WH. Please see `ncvetkovic/14594-pack_untilize_block_issue`, or rebase `jkruer/untilize_debug` onto the latest tt-metal main, to see the same result. After the rebase, here is the output of the tests with all of the shapes from the `ncvetkovic/14594-pack_untilize_block_issue` branch (notice that the same shapes fail on both WH and BH; at this point I am not sure why, and it may be due to the test implementation or something similar):

```
================================================================================== short test summary info ==================================================================================
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_untilize_test.py::test_run_untilize_test[1-1-1-10-bfloat16]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_untilize_test.py::test_run_untilize_test[1-1-1-10-float]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_untilize_test.py::test_run_untilize_test[1-1-32-1-bfloat16]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_untilize_test.py::test_run_untilize_test[1-1-32-1-float]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_untilize_test.py::test_run_untilize_test[1-1-32-32-bfloat16]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_untilize_test.py::test_run_untilize_test[1-1-32-32-float]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_untilize_test.py::test_run_untilize_test[1-1-130-128-bfloat16]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_untilize_test.py::test_run_untilize_test[1-1-130-128-float]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_untilize_test.py::test_run_untilize_test[1-1-131-128-bfloat16]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_untilize_test.py::test_run_untilize_test[1-1-131-128-float]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_untilize_test.py::test_run_untilize_test[1-1-260-1-bfloat16]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_untilize_test.py::test_run_untilize_test[1-1-260-1-float]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_untilize_test.py::test_run_untilize_test[1-1-260-3-bfloat16]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_untilize_test.py::test_run_untilize_test[1-1-260-3-float]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_untilize_test.py::test_run_untilize_test[1-1-260-5-bfloat16]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_untilize_test.py::test_run_untilize_test[1-1-260-5-float]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_untilize_test.py::test_run_untilize_test[1-1-261-1-bfloat16]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_untilize_test.py::test_run_untilize_test[1-1-261-1-float]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_untilize_test.py::test_run_untilize_test[1-1-261-3-bfloat16]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_untilize_test.py::test_run_untilize_test[1-1-261-3-float]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_untilize_test.py::test_run_untilize_test[1-1-261-5-bfloat16]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_untilize_test.py::test_run_untilize_test[1-1-261-5-float]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_untilize_test.py::test_run_untilize_test[1-1-391-5-bfloat16]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_untilize_test.py::test_run_untilize_test[1-1-391-5-float]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_untilize_test.py::test_run_untilize_test[1-1-521-5-bfloat16]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_untilize_test.py::test_run_untilize_test[1-1-521-5-float]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_untilize_test.py::test_run_untilize_test[1-1-521-3-bfloat16]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_untilize_test.py::test_run_untilize_test[1-1-521-3-float]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_untilize_test.py::test_run_untilize_test[1-1-651-3-bfloat16]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_untilize_test.py::test_run_untilize_test[1-1-651-3-float]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_untilize_test.py::test_run_untilize_test[1-1-1-2-bfloat16]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_untilize_test.py::test_run_untilize_test[1-1-1-2-float]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_untilize_test.py::test_run_untilize_test[1-1-1-11-bfloat16]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_untilize_test.py::test_run_untilize_test[1-1-1-11-float]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_untilize_test.py::test_run_untilize_test[1-1-7-8-bfloat16]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_untilize_test.py::test_run_untilize_test[1-1-7-8-float]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_untilize_test.py::test_run_untilize_test[1-1-49-1-bfloat16]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_untilize_test.py::test_run_untilize_test[1-1-49-1-float]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_untilize_test.py::test_run_untilize_test[1-1-49-16-bfloat16]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_untilize_test.py::test_run_untilize_test[1-1-49-16-float]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_untilize_test.py::test_run_untilize_test[1-1-49-32-bfloat16]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_untilize_test.py::test_run_untilize_test[1-1-49-32-float]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_untilize_test.py::test_run_untilize_test[1-1-196-4-bfloat16]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_untilize_test.py::test_run_untilize_test[1-1-196-4-float]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_untilize_test.py::test_run_untilize_test[1-1-196-8-bfloat16]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_untilize_test.py::test_run_untilize_test[1-1-196-8-float]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_untilize_test.py::test_run_untilize_test[1-1-196-16-bfloat16]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_untilize_test.py::test_run_untilize_test[1-1-196-16-float]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_untilize_test.py::test_run_untilize_test[1-1-784-2-bfloat16]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_untilize_test.py::test_run_untilize_test[1-1-784-2-float]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_untilize_test.py::test_run_untilize_test[1-1-784-4-bfloat16]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_untilize_test.py::test_run_untilize_test[1-1-784-4-float]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_untilize_test.py::test_run_untilize_test[1-1-784-8-bfloat16]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_untilize_test.py::test_run_untilize_test[1-1-784-8-float]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_untilize_test.py::test_run_untilize_test[1-1-3136-2-bfloat16]
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_untilize_test.py::test_run_untilize_test[1-1-3136-2-float]
FAILED tests/tt_eager/python_api_testing/unit_testing/misc/test_untilize_test.py::test_run_untilize_test[5-2-4-8-bfloat16] - RuntimeError: Boolean value of Tensor with more than one value is ambiguous
FAILED tests/tt_eager/python_api_testing/unit_testing/misc/test_untilize_test.py::test_run_untilize_test[5-2-4-8-float] - RuntimeError: Boolean value of Tensor with more than one value is ambiguous
FAILED tests/tt_eager/python_api_testing/unit_testing/misc/test_untilize_test.py::test_run_untilize_test[5-2-4-7-bfloat16] - RuntimeError: Boolean value of Tensor with more than one value is ambiguous
FAILED tests/tt_eager/python_api_testing/unit_testing/misc/test_untilize_test.py::test_run_untilize_test[5-2-4-7-float] - RuntimeError: Boolean value of Tensor with more than one value is ambiguous
======================================================================== 4 failed, 56 passed, 60 warnings in 30.74s =========================================================================
```

Let me know if I can close the issue, but as far as I'm concerned, the LLKs no longer cause WH and BH to behave differently in this test.