tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.
Apache License 2.0

Galaxy Device ARC Hangs on Long Idle #9728

Closed ubcheema closed 2 months ago

ubcheema commented 3 months ago

When running the Falcon7B decoder on Galaxy in a loop, the ARC on a Galaxy device hangs and stops responding to NOC read/write requests. The hang is observed when host SW tries to communicate with the ARC to put the device into the Long Idle state at driver close.

Branch: ucheema/galaxy-driver-close-debug

The following bash loop will encounter a hang at some iteration:

#!/bin/bash

for i in {1..50}
do
 echo "Iteration: $i"
 pytest -svv "models/demos/ttnn_falcon7b/tests/multi_chip/test_falcon_decoder.py::test_falcon_decoder[wormhole_b0-True-False-16-BFLOAT16-DRAM-tiiuae/falcon-7b-instruct-0.98-decode_batch32]"
done
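
The bash loop above blocks forever once an iteration hangs. As a rough sketch of one way to make the hang visible, the same loop can be driven from Python with a per-iteration timeout. The helper name `run_until_hang` and the 600 s timeout are illustrative assumptions, not part of the repro on the branch.

```python
import subprocess
import sys

# Test ID copied from the bash loop above.
TEST_ID = (
    "models/demos/ttnn_falcon7b/tests/multi_chip/test_falcon_decoder.py::"
    "test_falcon_decoder[wormhole_b0-True-False-16-BFLOAT16-DRAM-"
    "tiiuae/falcon-7b-instruct-0.98-decode_batch32]"
)
PYTEST_CMD = [sys.executable, "-m", "pytest", "-svv", TEST_ID]


def run_until_hang(cmd, iterations=50, timeout_s=600):
    """Run cmd up to `iterations` times.

    Returns the 1-based iteration that timed out or exited non-zero,
    or None if every iteration passed. The timeout is a placeholder;
    pick one comfortably above a normal run time.
    """
    for i in range(1, iterations + 1):
        print(f"Iteration: {i}")
        try:
            result = subprocess.run(cmd, timeout=timeout_s)
        except subprocess.TimeoutExpired:
            print(f"Iteration {i} exceeded {timeout_s}s; treating it as a hang")
            return i
        if result.returncode != 0:
            print(f"Iteration {i} exited with code {result.returncode}")
            return i
    return None

# Usage (not executed here): run_until_hang(PYTEST_CMD)
```

Stopping at the first failing iteration leaves the device in the failing state, so it can still be inspected afterwards.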

ubcheema commented 3 months ago

So, reading all ethernet cores is fine, but reading the ARC core hangs. I get the hang before I go into the long idle message sequence. I am just reading the heartbeat from all ethernet cores, followed by a register read from ARC X0Y10. The ARC read never completes.

         Metal | INFO    | Cluster Destructor: Checking Device 11 Heartbeat Before Closing
         Metal | INFO    | Device 11 Eth (x=9,y=0) is Alive Before Closing. 2882356432 - 2882356448
         Metal | INFO    | Device 11 Eth (x=1,y=0) is Alive Before Closing. 2882358800 - 2882358814
         Metal | INFO    | Device 11 Eth (x=8,y=0) is Alive Before Closing. 2882357180 - 2882357193
         Metal | INFO    | Device 11 Eth (x=2,y=0) is Alive Before Closing. 2882364022 - 2882364035
         Metal | INFO    | Device 11 Eth (x=7,y=0) is Alive Before Closing. 2882390997 - 2882391011
         Metal | INFO    | Device 11 Eth (x=3,y=0) is Alive Before Closing. 2882362657 - 2882362671
         Metal | INFO    | Device 11 Eth (x=6,y=0) is Alive Before Closing. 2882366901 - 2882366914
         Metal | INFO    | Device 11 Eth (x=4,y=0) is Alive Before Closing. 2882361851 - 2882361866
         Metal | INFO    | Device 11 Eth (x=9,y=6) is Alive Before Closing. 2864419925 - 2864422241
         Metal | INFO    | Device 11 Eth (x=1,y=6) is Alive Before Closing. 2864384694 - 2864387202
         Metal | INFO    | Device 11 Eth (x=8,y=6) is Alive Before Closing. 2864433734 - 2864436049
         Metal | INFO    | Device 11 Eth (x=2,y=6) is Alive Before Closing. 2864399369 - 2864401875
         Metal | INFO    | Device 11 Eth (x=7,y=6) is Alive Before Closing. 2882363267 - 2882363280
         Metal | INFO    | Device 11 Eth (x=3,y=6) is Alive Before Closing. 2864414223 - 2864416727
         Metal | INFO    | Device 11 Eth (x=6,y=6) is Alive Before Closing. 2864382014 - 2864384523
         Metal | INFO    | Device 11 Eth (x=4,y=6) is Alive Before Closing. 2864439218 - 2864441721
         Metal | INFO    | Cluster Destructor: Checking Device 11 ARC Before Closing
         Metal | INFO    | Device 11 ARC (x=0,y=10) is Alive Before Closing. 82
         Metal | INFO    | Cluster Destructor: Checking Device 12 Heartbeat Before Closing
         Metal | INFO    | Device 12 Eth (x=9,y=0) is Alive Before Closing. 2864436385 - 2864440973
         Metal | INFO    | Device 12 Eth (x=1,y=0) is Alive Before Closing. 2864420272 - 2864424859
         Metal | INFO    | Device 12 Eth (x=8,y=0) is Alive Before Closing. 2864401286 - 2864405874
         Metal | INFO    | Device 12 Eth (x=2,y=0) is Alive Before Closing. 2864415114 - 2864419512
         Metal | INFO    | Device 12 Eth (x=7,y=0) is Alive Before Closing. 2864428216 - 2864432990
         Metal | INFO    | Device 12 Eth (x=3,y=0) is Alive Before Closing. 2864401872 - 2864406834
         Metal | INFO    | Device 12 Eth (x=6,y=0) is Alive Before Closing. 2864399327 - 2864403620
         Metal | INFO    | Device 12 Eth (x=4,y=0) is Alive Before Closing. 2864404464 - 2864409049
         Metal | INFO    | Device 12 Eth (x=9,y=6) is Alive Before Closing. 2882386808 - 2882386832
         Metal | INFO    | Device 12 Eth (x=1,y=6) is Alive Before Closing. 2882391773 - 2882391797
         Metal | INFO    | Device 12 Eth (x=8,y=6) is Alive Before Closing. 2882392944 - 2882392968
         Metal | INFO    | Device 12 Eth (x=2,y=6) is Alive Before Closing. 2882389077 - 2882389102
         Metal | INFO    | Device 12 Eth (x=7,y=6) is Alive Before Closing. 2882359779 - 2882359804
         Metal | INFO    | Device 12 Eth (x=3,y=6) is Alive Before Closing. 2882360045 - 2882360071
         Metal | INFO    | Device 12 Eth (x=6,y=6) is Alive Before Closing. 2882361934 - 2882361959
         Metal | INFO    | Device 12 Eth (x=4,y=6) is Alive Before Closing. 2882375557 - 2882375582
         Metal | INFO    | Cluster Destructor: Checking Device 12 ARC Before Closing

On Device 11, everything is good. On Device 12, all ethernet NOC reads return, but the ARC NOC read hangs.
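
A small script can pull the same conclusion out of a log like the one above. This is only a sketch: it assumes the two numbers on each Eth line are two successive heartbeat samples (so equal values would mean a stalled core), and that a "Checking Device N ARC" line with no matching "ARC ... is Alive" line means the ARC read never returned. The function name `triage` is hypothetical.

```python
import re

ETH_RE = re.compile(
    r"Device (\d+) Eth \(x=(\d+),y=(\d+)\) is Alive Before Closing\. (\d+) - (\d+)"
)
ARC_CHECK_RE = re.compile(r"Checking Device (\d+) ARC Before Closing")
ARC_ALIVE_RE = re.compile(r"Device (\d+) ARC \(x=\d+,y=\d+\) is Alive Before Closing")

# A few lines taken from the log above.
SAMPLE_LOG = """\
Metal | INFO    | Device 11 Eth (x=9,y=0) is Alive Before Closing. 2882356432 - 2882356448
Metal | INFO    | Cluster Destructor: Checking Device 11 ARC Before Closing
Metal | INFO    | Device 11 ARC (x=0,y=10) is Alive Before Closing. 82
Metal | INFO    | Device 12 Eth (x=9,y=0) is Alive Before Closing. 2864436385 - 2864440973
Metal | INFO    | Cluster Destructor: Checking Device 12 ARC Before Closing
"""


def triage(log_text):
    """Return (stalled_eth, arc_pending).

    stalled_eth: (device, x, y) tuples whose two heartbeat samples are equal.
    arc_pending: devices whose ARC check started but logged no ARC reply.
    """
    stalled_eth = []
    arc_checked = []
    arc_alive = set()
    for line in log_text.splitlines():
        if m := ETH_RE.search(line):
            dev, x, y, first, second = map(int, m.groups())
            if first == second:
                stalled_eth.append((dev, x, y))
        elif m := ARC_CHECK_RE.search(line):
            arc_checked.append(int(m.group(1)))
        elif m := ARC_ALIVE_RE.search(line):
            arc_alive.add(int(m.group(1)))
    arc_pending = [d for d in arc_checked if d not in arc_alive]
    return stalled_eth, arc_pending


stalled, pending = triage(SAMPLE_LOG)
print("stalled eth cores:", stalled)
print("devices with a pending ARC read:", pending)
```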

ubcheema commented 3 months ago

@TTDRosen comments from slack:

To summarize… We confirmed that this is indeed a problem isolated to the ARC. (Parthiban was able to repro the hang even after removing all ARC noc accesses). And we are able to see that all of the eth routers are accessible, which should cover the entirety of the noc. We also observe that it doesn’t seem to matter if the go busy/idle messages are sent at all. Finally, there’s some indication that there is some voltage/frequency sensitivity, though we still have to figure out which it is.

Our next step is probably to disable all runtime arc features and see if it still repros, once the vf experiment finishes.

ubcheema commented 3 months ago

I can see the transaction in the ethernet queue on the chip with the hung ARC. The ethernet core has issued the NOC transaction to the ARC, but it has not completed.

In my test setup, I am reading from ARC address 0x880030074.

The NOC read is outstanding, as shown by NIU_MST_REQS_OUTSTANDING_ID:

(my-env) ucheema@aus-glx-13:~/syseng/src/t6ifc/t6py$ read-noc --addr 0xffb20240 --num_words 16 --eth_id 12 --chip_x 0 --chip_y 7 --rack_x 0 --rack_y 1 --interface pci:3        
00: 0xffb20240 => 0x00000000
01: 0xffb20244 => 0x00000000
02: 0xffb20248 => 0x00000001
03: 0xffb2024c => 0x00000000
04: 0xffb20250 => 0x00000000
05: 0xffb20254 => 0x00000000
06: 0xffb20258 => 0x00000000
07: 0xffb2025c => 0x00000000
08: 0xffb20260 => 0x00000000
09: 0xffb20264 => 0x00000000
10: 0xffb20268 => 0x00000000
11: 0xffb2026c => 0x00000000
12: 0xffb20270 => 0x00000000
13: 0xffb20274 => 0x00000000
14: 0xffb20278 => 0x00000000
15: 0xffb2027c => 0x00000000

The read is issued with txn id 2, which corresponds to 0xFFB20248 above.
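
The mapping from dump address to transaction ID can be sketched as follows. It assumes one 32-bit counter per transaction ID starting at 0xffb20240, which is inferred from the comment above rather than taken from register documentation. The helper name `outstanding_ids` is hypothetical.

```python
import re

# Matches read-noc output lines such as "02: 0xffb20248 => 0x00000001".
DUMP_RE = re.compile(r"^\s*\d+:\s+0x([0-9a-fA-F]+)\s+=>\s+0x([0-9a-fA-F]+)\s*$")

# First four words of the dump above.
SAMPLE_DUMP = """\
00: 0xffb20240 => 0x00000000
01: 0xffb20244 => 0x00000000
02: 0xffb20248 => 0x00000001
03: 0xffb2024c => 0x00000000
"""


def outstanding_ids(dump_text, base_addr=0xFFB20240, word_bytes=4):
    """Map transaction ID -> outstanding count for every non-zero word.

    Assumes (not confirmed here) that the word at base_addr + 4 * id
    holds the outstanding-request count for transaction ID `id`.
    """
    result = {}
    for line in dump_text.splitlines():
        m = DUMP_RE.match(line)
        if not m:
            continue  # skip prompts and anything that is not a dump line
        addr = int(m.group(1), 16)
        value = int(m.group(2), 16)
        if value:
            result[(addr - base_addr) // word_bytes] = value
    return result


print(outstanding_ids(SAMPLE_DUMP))
```

With the sample lines, the only non-zero word is at 0xffb20248, i.e. (0xffb20248 - 0xffb20240) / 4 = 2.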

This is the NOC read issued by the eth core on NOC cmd buf 3:

(my-env) ucheema@aus-glx-13:~/syseng/src/t6ifc/t6py$ read-noc --addr 0xffb20c00 --num_words 10 --eth_id 12 --chip_x 0 --chip_y 7 --rack_x 0 --rack_y 1 --interface pci:3        
00: 0xffb20c00 => 0x80030074
01: 0xffb20c04 => 0x00002808
02: 0xffb20c08 => 0x00000000
03: 0xffb20c0c => 0x00008d54
04: 0xffb20c10 => 0x00001870
05: 0xffb20c14 => 0x00000000
06: 0xffb20c18 => 0x00000800
07: 0xffb20c1c => 0x00003090
08: 0xffb20c20 => 0x00000004
09: 0xffb20c24 => 0x00000001
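
The first two words of this dump appear to encode the read target. The sketch below decodes them under an assumed layout: word 0 holds address bits 31:0, and word 1 holds address bits 35:32 followed by a 6-bit X and a 6-bit Y coordinate. The layout is an assumption, not taken from register documentation, but under it the words decode to the address 0x880030074 and the ARC coordinates X0Y10 mentioned earlier in this thread.

```python
ADDR_BITS = 36   # assumed width of the local address field
COORD_BITS = 6   # assumed width of each NOC coordinate field


def decode_target(lo_word, mid_word):
    """Decode (address, x, y) from the first two cmd buf words.

    Assumed packing, from least significant bit upward:
    36-bit address, 6-bit X, 6-bit Y.
    """
    packed = (mid_word << 32) | lo_word
    addr = packed & ((1 << ADDR_BITS) - 1)
    x = (packed >> ADDR_BITS) & ((1 << COORD_BITS) - 1)
    y = (packed >> (ADDR_BITS + COORD_BITS)) & ((1 << COORD_BITS) - 1)
    return addr, x, y


# Words 00 and 01 from the dump above.
addr, x, y = decode_target(0x80030074, 0x00002808)
print(f"addr=0x{addr:x} x={x} y={y}")  # addr=0x880030074 x=0 y=10
```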

TTDRosen commented 3 months ago

The latest round of testing confirmed a di/dt-like signature for this issue.

Further steps related to this:

  1. Find a CLK/voltage setting that passes.
  2. Determine whether raising the voltage at a failing frequency turns a fail into a pass (this will rule out a race).

ubcheema commented 3 months ago

The issue has been root-caused as di/dt induced. Adding a 50 mV voltage margin on Galaxy boards resolves the hangs. aus-glx-13 ran overnight for 1000+ cycles without a hang.

Syseng will release a Galaxy package consisting of galaxy_fw_7.14.B.0_2024-02-09-5c5970967eec7826.tar.gz plus the 50 mV voltage margin.

I have manually updated aus-glx-03, aus-glx-04, aus-glx-05, aus-glx-08, aus-glx-13.

ubcheema commented 3 months ago

Removing P0 since we have the voltage margin fix that resolves hangs.

ubcheema commented 2 months ago

Syseng has provided a new Galaxy package with the voltage margin fix: /mnt/motor/syseng/bin/tt-flash/wh/mobo/galaxy_fw_7.14.C.0_2024-07-04-00b2b9f743dc1abb.tar.gz

Created issue to get galaxy machines updated: https://github.com/tenstorrent-metal/metal-internal-workflows/issues/249