Open hschoi4448 opened 2 months ago
@hschoi4448 your intuition about what is causing the problem is correct, the issue is indeed due to the fact that you are copying a tile to the 9th tile spot of the Dest register. I will try to clarify what is going wrong:
The destination register (Dest) is one big register that can fit a total of 16 tiles (assuming 16bit datums). In most cases we want to use double-buffering for Dest to improve performance, so this reduces the number of usable tiles to 8 at a time. The way double-buffering works is that tiles 0-7 are used by math, and then we switch to using tiles 8-15 while 0-7 are being packed out. Then we switch back to 0-7 while packing 8-15, etc. The tile_regs_commit();
and tile_regs_release();
calls are what switch which half of Dest is in use by the Math/Packer.
This is important because by copying a tile to index 8 of Dest, you are actually putting it into index "0" of the following operation, which causes the corruption you are seeing in this test. If you change const uint32_t mask_dst = 8;
to be any other value other than 8, you would no longer see the data corruption, because only index "0" of Dest is used by mul_tiles
. This is why when you are using double-buffering for dest, you can't use more than 8 tiles of dest for your operation without causing issues for the next op.
*Note that this is all for 16bit data, and if you are using 32bit data then all the number get divided by 2 (8 total tiles in Dest with 4 usable at one time if using double-buffering).
Let me know if you have further questions or need more explanation
Thank you for explaining the cause of the issue. However, I didn't want to know the cause of the problem; I created the issue because I believe that errors like this can negatively affect the development environment.
Here are the reasons why I think this is an issue:
Accessing an unusable register doesn't explicitly trigger an error, but silently corrupts the value.
According to the current copy_tile
documentation, it states that there are 16 dst registers.
It seems that some form of improvement regarding this problem is necessary.
@rdjogoTT @razorback3
I agree with your point regarding the need for better clarity regarding this issue, I'll defer the decision regarding the priority of this issue to @ttmtrajkovic.
hello @hschoi4448, @razorback3,
Thank you for flagging the problem, it's been basically baked into our programming model from day 1, since all acquire/release functions always operate on halfs only because of the parallel processing of packer and math thread and double-buffering scheme on dest registers (pack thread operates on one half while math thread operates on other). Having asserts for out of bounds tile id is something we'd definitely want to have, but I would argue the priority of this given that it has been part of the model for a long time and it shouldn't be gating progress.
Could you please downgrade to P2 bug that we will address once all P0/P1 bugs are cleared out. Let me know what you think.
Milos
Describe the bug While debugging the op, I discovered that the result of
mul_tiles
is incorrect.compute code: https://github.com/tenstorrent/tt-metal/blob/hyungsuk/debug_mul_tiles/ttnn/cpp/ttnn/deprecated/tt_dnn/op_library/moreh_layernorm/kernels/moreh_layernorm_small_kernel.cpp#L41
The test scenario is as follows:
input tensor
xinput tensor
This is output.
My guess is that the issue occurred because there are only 8 available destination registers, but a copy was made to the 9th register.
To Reproduce Steps to reproduce the behavior:
pytest ./tests/tt_eager/python_api_testing/unit_testing/misc/test_moreh_groupnorm.py
Please complete the following environment information: