@nvelickovicTT @ncvetkovicTT do you guys know when you'll be able to get to this issue?
@sjameelTT we are currently looking into a related issue with tilize on BH. Most likely it has the same underlying cause as in your case, but either way we'll know more once we fix the one currently under investigation. I am hoping to figure it out by the end of the week (it's a tricky bug that requires the simulator).
Thanks @nvelickovicTT , hoping that fixing the other issue solves this one!
From #12349, this blocks:
Escalating this issue to P0 due to blocking multiple items.
Hey @zzigler-tt @sjameelTT, I have pushed a workaround for this issue to the sjameel/transpose_tilize_bug branch. The issue is that MATH and PACK are not synced after tilize_block and before transpose_wh_init_short. This can be resolved by using the full transpose_wh_init for now, but I am still investigating how we can fix this properly and why transpose_wh_init_short modifies the content of the intermed0 circular buffer. Please take a look and run some more tests on it, or tell me what I can do to test it more. Cheers!
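For concreteness, a minimal sketch of what the workaround looks like in a compute kernel. The CB names are placeholders and the exact signatures may differ between tt-metal versions:

```cpp
// Sketch only: swap the short re-init after tilize_block for the full init,
// which performs the extra MATH/PACK synchronization. CB indices are placeholders.
tilize_block(cb_in, num_tiles_per_block, cb_intermed0);

// transpose_wh_init_short(cb_intermed0);   // short init: hits the sync bug
transpose_wh_init(cb_intermed0, cb_out);    // full init: the workaround
```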
thanks @ncvetkovicTT , let us know if you can get an ETA on how long a proper fix might take. If it is long then we will go with the workaround; if it's a couple of days we can wait. Either is fine, it's just that knowing the ETA will help us make the right decision.
You'll have the ETA on Monday <3
TL;DR: I don't think I will find a proper fix next week, so expect the solution after 18 Nov.
Hey @sjameelTT @ntarafdar, I tried out some potential solutions for other problems that I thought might help with this one, but to no avail.
To give you a brief overview, there seem to be issues with synchronizing the Tensix cores in the kernel that's in use. I tried moving around some synchronization points in the underlying LLKs, but I have to do proper debugging. Instead of analyzing the LLKs visually and trying to figure out where the sync issue is, I'll probably have to look at some waveforms, which will take some time to set up.
Introducing a couple of NOPs in transpose_wh_init_short fixes the issue. The way I would debug this further is to have one waveform dump with a working number of NOPs and another with a failing number. I hope this is possible to extract, but more likely there will be a gradual increase in failed runs as I reduce the number of NOPs. Since I have to set everything up, there will be a day or two of overhead work. @nvelickovicTT do you agree on the ETA?
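To illustrate the bisection idea, a hypothetical sketch; it assumes a Tensix NOP macro along the lines of TTI_NOP is available in the ckernel headers, and NUM_NOPS is just a compile-time knob to bisect on:

```cpp
// Hypothetical debugging aid: pad the suspect init with a configurable
// number of NOPs and bisect on NUM_NOPS to find the pass/fail boundary.
#ifndef NUM_NOPS
#define NUM_NOPS 8  // assumption, not a tuned value
#endif

inline void nop_pad() {
    for (int i = 0; i < NUM_NOPS; ++i) {
        TTI_NOP;  // assumes a TTI-style NOP macro from the ckernel headers
    }
}
```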
Could the issue be arising from Blackhole being able to bundle 4 tensix instructions into a single packet?
@abhullar-tt I am not sure, but it seems like we're really missing some STALLWAIT somewhere.
@abhullar-tt are you referring to coalesced writes? That is an interesting perspective; we haven't tried disabling that and testing. @ncvetkovicTT there is a disable_gathering() / enable_gathering() function pair in the code. We can try using it before/after the kernel, just for sanity.
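A minimal sketch of that sanity experiment, using the disable_gathering()/enable_gathering() pair mentioned above; the kernel-body call is a hypothetical stand-in:

```cpp
// Sanity check: bracket the kernel body with the gathering toggle to rule
// out instruction bundling as the culprit. Placement is an assumption.
disable_gathering();   // turn off instruction gathering/bundling
run_kernel_body();     // hypothetical stand-in for the actual compute work
enable_gathering();    // restore the default behavior
```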
Hey @abhullar-tt, @sjameelTT and @zzigler-tt, you can find another branch called ncvetkovic/14609_transpose_tilize_bug which proposes another fix for the issue.
@rtawfik01 remembered that there was already an issue behaving like this. The real cause of the bug is not the out-of-sync PACK/MATH as I initially thought, but the fact that the DEST_ACCESS_CFG_remap_addrs_RMW and DEST_ACCESS_CFG_remap_addrs_RMW bits that are accessed in _llk_math_hw_configure_ cannot be toggled due to poor latching of the register. By default we pass false as the untilize_en argument when performing unary_op_init_common, and then after tilize_block we want to set it to true when we call pack_untilize_dst_init_short. The value of these bits doesn't really matter for our software (they don't affect the LLKs or anything), but they do make our lives miserable by disallowing us to print the dest properly.
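A rough sketch of the call sequence described above, as it would appear in a fused compute kernel; CB indices are placeholders and template/parameter details vary by tt-metal version:

```cpp
constexpr uint32_t block_ct = 8;            // placeholder block width in tiles

// untilize_en defaults to false here, so the DEST_ACCESS_CFG bits start low.
unary_op_init_common(cb_in, cb_out);

tilize_block(cb_in, block_ct, cb_intermed0);

// This init wants the same bits flipped to true, which is where the poorly
// latched register bites (sketch; template args are placeholders).
pack_untilize_dst_init_short<block_ct>(cb_out);
```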
I think that the issue where this was first mentioned is #12234, but I am not sure if it was really fixed or not. Do you perhaps know, @ttmtrajkovic?
@nvelickovicTT FYI
@ttmtrajkovic just to recap, DEST_ACCESS_CFG_remap_addrs_RMW and DEST_ACCESS_CFG_remap_addrs_RMW are the registers needed to enable strided packer operations. These registers are both set to true currently in the pack_untilize and reduce inits, which are the places pack_untilize was used or fused, but those flags are not set to true in other kernel inits such as binary ops, transpose_wh, etc.
Those flags can only be toggled when both math & pack are fully idle, which means that in fused kernels with pack_untilize, each init function will need syncs for both math & pack to be idle in order to reconfigure those bits. Enabling those bits caused issues with destination register debug prints, and also a potential increase in power draw for the destination register. If the debug print issue is fixed, the above bits can be set to true for all ops and it will not impact functionality. Otherwise, if we don't want these bits set to true for all ops, the workaround here is to add CSR waits to make sure all in-flight transactions for both pack & math are complete.
@ncvetkovicTT, @rtawfik01, thanks for the explanations.
> I think that the issue where this was first mentioned is #12234, but I am not sure if this was really fixed or not, do you perhaps know @ttmtrajkovic?

@miacim has closed this bug and I believe there was a problem with the repro. I don't think these are related, although they are referring to the same bits.
I am not in favour of a solution where these bits would be set all the time and draw higher power. Power may be critical for BH (we are not sure yet), which is why I would prefer to properly craft a reconfigure sequence that is safe. An API function that changes these bits should have built-in semaphores/CSR waits (whatever works) to guarantee a safe change of state. The perf cost is understandable, but it should be limited, as we shouldn't be doing this back and forth all the time. This way, any reconfiguration of these bits will be safe, and the perf cost would only be incurred if we do it in the middle of an op.
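A sketch of the kind of API being asked for here; the name is hypothetical, and the waits reuse the tensix_sync/semaphore pattern discussed later in this thread:

```cpp
// Hypothetical safe-reconfigure entry point: drain MATH and PACK before
// performing the DEST_ACCESS_CFG read-modify-write.
inline void safe_reconfig_dest_access_bits(bool enable) {
    tensix_sync();                                      // flush this TRISC
    while (semaphore_read(semaphore::MATH_PACK) > 0) {  // packs still in flight
    }
    // ... perform the DEST_ACCESS_CFG RMW here, only once both are idle ...
}
```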
@ncvetkovicTT, I've glanced through your workaround but I am not sure if it's final or just something to unblock the OP team? I'd prefer to close this with a proper safe mechanism so we don't stumble upon the same problem again.
@ttmtrajkovic alright, thanks. We were not sure whether the performance hit or the power hit is less desirable here. Let me then introduce the waits instead. I have provided two workarounds: one that prohibits changing the register bits, and another that introduces additional waits through the full transpose_init API.
@rtawfik01 @nvelickovicTT do you think that the waits should be added in _llk_math_hwconfig or somewhere else? What kind of waits should we use? We saw that tensix_sync + pc_buf_read[sem] works, but tensix_sync + a CSR delay works as well. I don't really see a neat place to put this. Reem, remember how we got a couple of test iterations failing when we let the input go wild with randomness? Well, that's not happening if we don't toggle the bits. I should try again with the waits and see if that scenario repeats for 500 runs or so.
@ncvetkovicTT, how do you plan to do inter-thread synchronization with delays only? What does the CSR delay wait on? Maybe BH has new types of waits that can make a TRISC from one thread wait on an event of another thread, but from what I know, just delaying within a thread won't work. Keep in mind that adding delays in a thread can make it seem like the issue is gone because the timing has changed and you're no longer hitting the race.
Milos
Initially the problem was that I could not repro the issue with printing dest, but then I figured it out, added tests that expose the issue, and fixed the bug with printing the dest register.
Test works with the [1, 1, 32, 32] shape on the transpose but not with [16, 128, 8, 256]. Going to investigate to see if that's still an issue with tilize_block or something else.
I suggest taking a look at how many tiles are processed in the compute kernel per single LLK call; see the sketch below. Keep in mind that you cannot fit more than 4 fp32 tiles in half of the DEST at a time.
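A minimal sketch of that constraint, assuming the usual tt-metal compute-kernel helpers (acquire_dst/release_dst, transpose_wh_tile, pack_tile); names and signatures may differ between versions:

```cpp
// With fp32 dest in DstSync::SyncHalf, at most 4 tiles fit in half of DEST,
// so a block has to be processed in chunks of <= 4 tiles per acquire.
constexpr uint32_t MAX_FP32_TILES_HALF_DEST = 4;

for (uint32_t t = 0; t < num_tiles; t += MAX_FP32_TILES_HALF_DEST) {
    uint32_t remaining = num_tiles - t;
    uint32_t chunk = remaining < MAX_FP32_TILES_HALF_DEST
                         ? remaining : MAX_FP32_TILES_HALF_DEST;
    acquire_dst();                              // claim half of DEST
    for (uint32_t i = 0; i < chunk; ++i) {
        transpose_wh_tile(cb_in, t + i, i);     // math fills DEST tile i
    }
    for (uint32_t i = 0; i < chunk; ++i) {
        pack_tile(i, cb_out);                   // pack DEST tile i out
    }
    release_dst();
}
```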
@ncvetkovicTT I think having a cfg stall on math & pack, then having CSR waits until all in-flight pack & math transactions are completed, would be the safe option. I remember that is what the hardware team mentioned, but I will double check with them. And I think all of those additions should remain in the _llk_math_hw_config_ function; keep them where the 2 registers are being modified.
Extra note here: we might also need to wait for unpacker transactions to not be in-flight if we have unpack-to-dest enabled & fused with pack_untilize, so this is another thing we would have to be aware of.
@ttmtrajkovic does the above sound alright? Or would you prefer @ncvetkovicTT's initial solution, which was waiting until the semaphores were decremented back to 0? Using the semaphores that way ensured that the packer TRISC was done with the destination register, but there was still a possibility of packer & math instructions being popped out of the instruction buffer while their transactions were still in flight.
Hm, so introducing a CFG stall on MATH and PACK together with a CSR wait loop causes around 5-6% of test runs to fail. Waiting on both pack and sfpu/fpu reduces this to 2-3%, but there's no busy bit for math specifically. Here's what I put there:
```cpp
TTI_STALLWAIT(p_stall::STALL_CFG, p_stall::MATH | p_stall::PACK);
bstatus_u bs;
do {
    bs.val = csr_read<CSR::tensix_busy_status>();
} while ((bs.pack | bs._sfpu | bs.fpu) != 0);
```
Putting tensix_sync and waiting on the MATH_PACK semaphore from TRISC (by reading pc_buf) causes 100% of test runs to pass. That's this code:
```cpp
tensix_sync();
// Wait for previous packs to finish before claiming all of dest.
while (semaphore_read(semaphore::MATH_PACK) > 0) {
}
```
I thought that doing the same with SEMWAIT would also work, but that's not the case. Putting this causes the same behavior as CSR polling:
```cpp
TTI_SEMWAIT(p_stall::STALL_MATH | p_stall::STALL_SFPU, semaphore::t6_sem(semaphore::MATH_PACK), p_stall::STALL_ON_MAX);
```
None of these observations depend on the input values; I removed the fixed seed just to check.
Is there any difference between Wormhole and Blackhole here? This works on Wormhole. This test also uses pack_untilize btw; is there anything different about that?
I modified the test to find the smallest shape I could that still has PCC errors: [8, 64, 8, 256]. On sjameel/transpose_wh_testing, you can test the exact method I used in the original issue. You'll see a print before and after tilize_block. Before tilize_block, all the data for the tensor inside the [0,1,8,256] slice view is there, yet when we call tilize_block it results in 0s being pushed out.
I will take another look just to confirm, but as you can maybe conclude from the discussion above, the out-of-sync Tensix cores introduce an error which reconfigures DEST register access before the tilization is done. Note that tilize_block doesn't work the same on WH and BH.
As for DEST capacity, WH and BH are the same: both fit 8 fp32 tiles in DstSync::SyncFull mode and 4 fp32 tiles in DstSync::SyncHalf mode.
Edit: Yes, the shapes behave similarly to the ones in this issue. I have an explanation for some of the failing shapes, but I am not completely sure about all of them. Anyway, I believe we can't continue until we decide how to resolve the 1x1x32x32 issue first; we have to keep all possible variables fixed.
Hey @sjameelTT @abhullar-tt, please find the workaround with proper syncs on your branch sjameel/transpose_tilize_bug. Once you guys test it out, I can merge it to the tt-llk-bh repo.
As you can see from my log, 100% of the test runs pass:
```
PASSED tests/tt_eager/python_api_testing/unit_testing/misc/test_transpose.py::test_tranpose_hw_rm_no_padding[499-32-8-1-1]
====================== 500 passed, 500 warnings in 7.95s =======================
Device | INFO | Closing user mode device drivers
```
Works for the [1, 1, 32, 32] case at least, though not for the original case that's on main.
@sjameelTT, @nvelickovicTT and I have agreed to create an umbrella issue for different problems that behave similarly. On multiple occasions we have observed that tests fail when the product of the N and C tensor dimensions exceeds the core grid size (usually 130). Here's the issue: #15326.
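As a worked example of that constraint, using the shapes from this thread: the failing [16, 128, 8, 256] shape gives N * C = 16 * 128 = 2048, far above a 130-core grid, while the passing [1, 1, 32, 32] shape gives N * C = 1.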
We have agreed that once it's approved, we will merge PR #50 on tt-llk-bh as a fix for the main cause of this issue, which is out-of-sync MATH and PACK.
To recreate:

```sh
git checkout sjameel/transpose_tilize_bug
# build tt-metal as usual
TT_METAL_DPRINT_ONE_FILE_PER_RISC=1 TT_METAL_DPRINT_CORES=0,0 pytest tests/tt_eager/python_api_testing/unit_testing/misc/test_transpose.py::test_tranpose_hw_rm_no_padding
```
It will print these dprint outputs in generated/dprint:
The dumps show the data before tilize_block and after tilize (in human-readable form); the bottom 4 rows have incorrect values in the last 8 or so entries.