tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.
Apache License 2.0

Block Matmul non-deterministic hang with WH - back to 1GHz #4968

Open yugaoTT opened 6 months ago

yugaoTT commented 6 months ago

When using the block-based API matmul_block for WH, we see a non-deterministic hang when the subblock dim > 1. When subblock_h (rt_dim) and subblock_w (ct_dim) are both equal to 1, there is no hang; when subblock_w > 1, it hangs randomly.

To Reproduce

Go to branch `main` and run: `pytest tests/tt_eager/python_api_testing/unit_testing/test_bert_ops.py`

jliangTT commented 6 months ago

@yugaoTT, any update here? Since it is a P0_showstopper, I don't think it is great to leave it for 5 days without an update.

yugaoTT commented 6 months ago

@jliangTT Here is what we have found since then: with the interm/output CB inplace removed, the L1 corruption is gone. We think the sync between packer and math is causing the hang. Experiment results (a sketch of how such delays can be inserted in the compute kernel follows the list):

- no cb_reserve, no pack_tile: no hang
- no pack_tile: no hang
- no partial pack_tile: hangs

- Delay matmul_block by 500 iterations: no hang
- Delay matmul_block by 100 iterations: no hang
- Delay matmul_block by 50 iterations: no hang
- Delay matmul_block by 10 iterations: hang

- Delay matmul_pack by 500 iterations: no hang
- Delay matmul_pack by 100 iterations: hang
- Delay matmul_pack by 10 iterations: hang

- Delay in0 mcast by 500 iterations: hang
- Delay in0 mcast by 100 iterations: hang
- Delay in0 mcast by 10 iterations: hang
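For context, here is a minimal sketch of how a delay of N iterations can be inserted ahead of the math or pack calls in a compute kernel. This is not the actual test kernel: the DELAY_ITERS knob, the commented-out matmul_block/pack_tile calls, and their arguments are placeholders.

```cpp
#include <cstdint>

#ifndef DELAY_ITERS
#define DELAY_ITERS 500   // number of busy-wait iterations for the experiment
#endif

// Crude busy-wait; the volatile loop counter keeps the compiler from removing it.
inline void spin_delay(uint32_t iters) {
    for (volatile uint32_t i = 0; i < iters; ++i) {
        // spin
    }
}

void compute_inner_loop_sketch() {
    // ... wait on input CBs, acquire dst, etc. ...

    spin_delay(DELAY_ITERS);   // "Delay matmul_block by N iterations"
    // matmul_block(/* cb_in0, cb_in1, tile indices, dst index, transpose,
    //                 ct_dim, rt_dim, kt_dim */);   // placeholder call

    spin_delay(DELAY_ITERS);   // "Delay matmul_pack by N iterations"
    // cb_reserve_back(cb_out, out_subblock_num_tiles);   // placeholder
    // pack_tile(dst_index, cb_out);                      // placeholder

    // ... release dst, push output CB ...
}
```
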
jliangTT commented 6 months ago

Is this an LLK team work item? Is it still a P0?

yugaoTT commented 6 months ago

@jliangTT It is still being debugged by me and @rtawfik01; we have narrowed it down to a single-core hang.

jliangTT commented 6 months ago

Sounds good. Following it on Slack now. Thanks for driving this.

yugaoTT commented 6 months ago

Updates on ND hangs:

  1. We were able to slow down the AI clk to 500 MHz and the matmul test passed for 5000 iterations in fast dispatch mode. Previously, at 1000 MHz it hung at 100-200 iterations.
  2. @ttmtrajkovic and @acejkov found the frequency range: 918 MHz and higher hangs, 909 MHz and lower passes.
  3. Next steps: more test runs to confirm the frequency range; add staggering in the dispatch cores to see if that affects fmax; provide repro instructions to the syseng team so they can check on their end.

SeanNijjar commented 6 months ago

With very little information to go on, this sounds like the hang I'm looking at on the Buda side. It hasn't been root-caused, but while investigating I found a bug in a workaround for a hardware bug, tracked in GitLab issue budabackend/issues/2348.

This bug has to do with RAW hazard scenarios. For sub-word stores and loads, there is a hardware bug that can accidentally reorder the read before the write, even if they target the same address. @pgkeller put in a GCC pass to kill the reordering inside the RISC, but the pass missed some cases. We're waiting on a fix to close those gaps.
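
For illustration only, here is a minimal sketch of the access pattern the hazard concerns; the address, names, and function are hypothetical, and the real partial workaround is the compiler pass mentioned above rather than anything written in source code.

```cpp
#include <cstdint>

// Hypothetical L1 location accessed with a sub-word (byte) store and load.
volatile uint8_t* sem = reinterpret_cast<volatile uint8_t*>(0x10000);

uint8_t raw_hazard_sketch() {
    *sem = 1;          // sub-word store
    uint8_t v = *sem;  // load of the same address; the hardware bug can effectively
                       // order this read before the store, so v may see the stale value
    return v;
}
// The partial workaround referenced above is a toolchain (GCC) pass that avoids
// emitting the store/load sequences that trigger the reordering.
```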

The one thing I'm seeing that is different here is that on BUDA I was decoupling math and pack, and doing so didn't resolve the hang - so maybe it's different. Also, so far I'm only able to reproduce it on Galaxy.

@yugaoTT are you always seeing the hang on the same core?

yugaoTT commented 6 months ago

@SeanNijjar The hang is not always on the same core, although from what I have seen, it is usually at a specific row (row 4). Were you able to repro this hang on a single core? Also, if the ND hang we are seeing is also related to the RAW hazard, does reducing the frequency affect it?

SeanNijjar commented 6 months ago

> @SeanNijjar The hang is not always on the same core, although from what I have seen, it is usually at a specific row (row 4). Were you able to repro this hang on a single core?

I was never able to reproduce this on a small setup with a single chip and/or core. However, given that the kernel code paths we are working with are probably quite different, I wouldn't necessarily find it that surprising.

All hangs were on Galaxy machines with large-grid ops (8x8), and anywhere from 1 to 4 cores would hang. Sometimes those cores were in a row, but not always. The hung chip could change from run to run. The way the test is set up, it wouldn't really be possible for multiple chips to be hung at the same time.

There were definitely some rows that seemed more likely to hit the hang (like noc0 rows 4, 5, 7, and 9).

> Also, if the ND hang we are seeing is also related to the RAW hazard, does reducing the frequency affect it?

I'm not sure. I think this would depend on clock domains between noc, tensix, triscs, L1, which I have no idea about. At the very least, slower clocks change timing of communication with host which can change when different event sequences kick off.

One other nugget which may apply: when a given core is hung, usually only a couple of tiles have made their way through math from the looks of it (say < 10), so it's not really in the "middle" of a given kernel. Some other obvious differences: we are using the overlay, but I'm not seeing any issues whatsoever on the overlay side. Also on that note, it's typically a "middle" op in the workload, so other similar ops (actually basically identical) have already run on the hung cores prior, most of the time.

Finally, one other thing to share: I originally didn't have the partial workaround for the hardware bug, and I would always hit the hang (still non-deterministic as to when during the workload). After the partial workaround, I started to see passing runs. The partial workaround is available in sfpi master. Maybe it's worth dropping the last 2-3 commits to see if your behaviour changes (gets worse). It's not a bulletproof data point, but it would be interesting if it hung more reliably (mirroring my hang).

Message me on a side channel for details. I'm not sure what the right process is for sharing cross-repo links/references on GitHub.

ttmtrajkovic commented 6 months ago

@SeanNijjar, @TTDRosen,

One important difference between the two failures (Sean's and this one) is that Sean has tried to reproduce the problem at a lower clock and it was still failing. With this test case, early results (still to be confirmed) are that frequency sensitivity exists and that the failing frequency has been determined.

milos

yugaoTT commented 6 months ago

Update on reproducing: go to branch yugao/gs_wh_block_matmul_hang.

To run the test with hangs (AI clk set to 1000 MHz): run `pytest tests/tt_eager/python_api_testing/unit_testing/test_bert_ops.py --timeout 6000`

To run the test without hangs (AI clk set to 500 MHz):

  1. In `tt_metal/third_party/umd/device/tt_silicon_driver.cpp`, line 4362, comment out `set_power_state(tt_DevicePowerState::BUSY);` (see the sketch after these steps).
  2. `make clean`, then `make build`; this should bring the AI clk down to 500 MHz.
  3. Run `pytest tests/tt_eager/python_api_testing/unit_testing/test_bert_ops.py --timeout 6000`
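
Roughly, the change in step 1 looks like the following (surrounding context paraphrased; only the commented-out call is taken from the steps above):

```cpp
// tt_metal/third_party/umd/device/tt_silicon_driver.cpp, around line 4362.
// Leaving the device out of the BUSY power state keeps the AI clock at its
// idle frequency (~500 MHz) instead of ramping up to 1 GHz.

// set_power_state(tt_DevicePowerState::BUSY);
```
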
pgkeller commented 6 months ago

I pushed a change to branch pkeller/wh-hang-4968 to allow staggering the start of each core at kernel launch. Cherry-pick the last commit onto the branch of interest. The diffs are trivial; the functionality is: set DELAY_CYCLES_PER_Y to n to have each row stall by an extra n cycles. No need to rebuild after each change (the file is built at runtime).
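
A rough sketch of what that per-row stagger amounts to; the actual commit may differ. DELAY_CYCLES_PER_Y is the knob named above, while my_logical_y and the busy-wait loop are placeholders.

```cpp
#include <cstdint>

#ifndef DELAY_CYCLES_PER_Y
#define DELAY_CYCLES_PER_Y 0   // extra cycles of stall per core row
#endif

// Sketch only: delay each core's kernel start by DELAY_CYCLES_PER_Y * (its row index),
// so successive rows ramp up their work at different times.
inline void stagger_start(uint32_t my_logical_y) {
    uint32_t delay = DELAY_CYCLES_PER_Y * my_logical_y;
    for (volatile uint32_t i = 0; i < delay; ++i) {
        // burn roughly one loop trip per "cycle"; a real implementation would
        // read a hardware cycle counter instead
    }
}

// Called at the top of the kernel, before any CB waits or NOC traffic:
//   stagger_start(my_y);
```
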

yugaoTT commented 6 months ago

@pgkeller Thank you, I'll do experiments based on the changes

pgkeller commented 6 months ago

BTW: I was wondering yesterday whether a more robust solution would be to increase the current draw when running any application. On device init we could start the TRISCs in a loop that sends some work to the Tensix while waiting for the next kernel. This isn't fully robust, though, as we could have a kernel with a low workload followed by a kernel with a high workload.
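
A very rough sketch of that idea, with entirely hypothetical names (next_kernel_ready and issue_dummy_tensix_work do not exist in tt-metal), just to make the shape concrete:

```cpp
// Hypothetical sketch only; these functions do not exist in tt-metal.
extern bool next_kernel_ready();        // has the host queued the next kernel?
extern void issue_dummy_tensix_work();  // issue a small amount of throwaway math work

// Idea: started on the TRISCs at device init so the Tensix keeps drawing current
// between kernels, shrinking the di/dt step when a real kernel starts.
void trisc_idle_heater_loop() {
    while (!next_kernel_ready()) {
        issue_dummy_tensix_work();
    }
    // fall through to normal kernel dispatch
}
```
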

yugaoTT commented 6 months ago

I tried increasing the initial stagger to 10K cycles; it still hangs at around 150 iterations. One weird thing: when I enable DPRINT on one core, the DPRINT server seems to fail much sooner (around 30 iterations) and stops printing.

ttmtrajkovic commented 6 months ago

I'd suggest we try an extreme number, like a million cycles, as 10k cycles is only 10us (at 1 GHz) and that may not be enough.

On another note, are the individual rows of cores doing multicasts or gathers for activations to make sure the entire inner dim is collected on every core? If every core did the gather, then we could stagger the gather commands from every core, which would effectively ensure that data arrives later on some cores in a row.

Is that an option @davorchap, @pgkeller?

yugaoTT commented 6 months ago

I tried 100k cycles and it still hangs.

The cores in a row are doing mcast send/recv for activations. If we change to gather, then each core would fetch blocks from all the other cores in its row; even with stagger, I think at some point all the cores will perform the gather at the same time.

yugaoTT commented 6 months ago

1M cycles hangs too.

ttmtrajkovic commented 6 months ago

Thanks @yugaoTT. I believe we are not doing exactly what slow dispatch does, and there has to be something that helps in the slow dispatch case. I'd focus the next experiments there.

Regarding gather vs. mcast: if there were a way to control the start of the gather, then cores starting the gather later would effectively get data later and therefore start doing math later. I may be missing something.
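
As a sketch of what controlling the start of the gather could look like: the stagger knob, the core-index arguments, and the commented-out NOC read below are placeholders, not the existing mcast kernel.

```cpp
#include <cstdint>

#ifndef GATHER_STAGGER_CYCLES
#define GATHER_STAGGER_CYCLES 0   // extra delay per column before starting the gather
#endif

inline void spin(uint32_t iters) {
    for (volatile uint32_t i = 0; i < iters; ++i) { /* burn cycles */ }
}

// Sketch only: each core in a row delays the start of its gather by an amount
// proportional to its column, so activation data (and therefore math) ramps up
// at different times across the row.
void gather_activations_sketch(uint32_t my_col, uint32_t row_width) {
    spin(GATHER_STAGGER_CYCLES * my_col);    // later columns start the gather later
    for (uint32_t src = 0; src < row_width; ++src) {
        if (src == my_col) continue;         // local block is already resident
        // noc_async_read(...activation block from core `src` in this row...);
    }
    // noc_async_read_barrier();             // wait for all gathered blocks
}
```
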

ttmtrajkovic commented 6 months ago

@davorchap,

This should be downgraded from P0_Showstopper to a P1 item where different staggering methods need to be explored.

davorchap commented 6 months ago

> @davorchap,
>
> This should be downgraded from P0_Showstopper to a P1 item where different staggering methods need to be explored.

Done & re-assigned.

ttmtrajkovic commented 3 months ago

I have the SPI ROM firmware ready from the syseng team, with added voltage margin to run at 1 GHz. They have tested it against the failing stressful matmul 1d/2d across SS, TT, and FF parts, and also confirmed with scope captures.

I'd like to test this a bit more with some full model/demo workloads, so I'd appreciate it if someone could provide me with a list of tests to run. I can run ResNet, but it's currently not running well on Wormhole (~1k fps), so I am not sure how stressful it's going to be.

The bottom line is that this is not a bulletproof solution, as we've established that not only the severity of the workload matters but also the frequency of droops. For example, when running the matmul 2d test in a loop, allowing more time between iterations did not cause the failure, while more frequent iterations would cause a fail. The explanation is that we could be hitting multiple levels of droop, and if the workload stops and then restarts while the board droop is still happening, we are more likely to cause a fail.

This means that as fast dispatch becomes available and demos become more performant end to end, we need to make sure they don't fail. It's hard to differentiate between a functional hang and a di/dt hang, but that doesn't mean every hang should be attributed to this, as it's unlikely that the voltage margin is not enough. Sensitivity to clocks and voltage will give the ultimate answer.