Closed caixunshiren closed 4 months ago
fyi @uaydonat
@jliangTT to reassign and/or reprioritize
Hey @caixunshiren can you try the workarounds mentioned in this hang issue here: https://github.com/tenstorrent-metal/tt-metal/issues/7066
So basically: try to set out_subblock_h=1 and out_subblock_w=1 for all matmuls. If that fixes your test, you'd need to see which matmuls were the problematic ones. In Falcon40b this was necessary to avoid hangs for the FF1, LM Head, and Output Projection matmuls.
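For illustration only, here is a rough sketch of what that workaround amounts to; the struct below is a hypothetical stand-in, not the actual tt-metal matmul program-config type, and only the out_subblock_h/out_subblock_w fields are taken from this thread:

#include <cstdint>

// Hypothetical stand-in for a matmul program config; field names other than
// out_subblock_h and out_subblock_w are illustrative.
struct MatmulProgramConfig {
    uint32_t in0_block_w;
    uint32_t out_subblock_h;
    uint32_t out_subblock_w;
};

// Workaround from issue #7066: force 1x1 output subblocks for every matmul,
// then re-enlarge them one matmul at a time to find the problematic ones.
MatmulProgramConfig apply_hang_workaround(MatmulProgramConfig cfg) {
    cfg.out_subblock_h = 1;
    cfg.out_subblock_w = 1;
    return cfg;
}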
Thanks for the info. We will give it a try!
Will downgrade to P1 to reflect priority for the runtime team - the team is focusing on resolving the issues blocking models near the point of readiness. --> https://docs.google.com/spreadsheets/d/1_11lY1B1L2BK-8dnKpCJHMYS_agVl5-3FubkpuHsRko/edit#gid=871704423 (internal link)
is this still a blocker @cglagovich ?
We released the demo in decode-only mode to workaround this hang. It's going to be a blocker soon when we want to release with prefill + decode. Maybe FD2 solves this
@cglagovich @caixunshiren we have been asked to try the Watcher and do the initial triaging for this hang. Does it look like a dispatcher hang or some other core is hanging? in which case FD2.0 will probably not fix the hang.
We've tried it with the watcher enabled, but the hang does not occur. It only occurs when we disable the watcher. It could be that the watcher changes the race condition.
@pgkeller, you mentioned there are ways to enable the watcher piecewise so it reduces the timing perturbation. Are there any instructions you can point the team to?
It looks like the getting started guide is up-to-date
https://tenstorrent.github.io/tt-metal/latest/tt-metalium/tools/watcher.html#enabling
Thanks, @uaydonat . We will continue investigating with the minimum set of watcher options
Used a minimal set of watcher features with 10-second updates to debug the hang.
export TT_METAL_WATCHER=10
export TT_METAL_WATCHER_DISABLE_ASSERT=1
export TT_METAL_WATCHER_DISABLE_PAUSE=1
export TT_METAL_WATCHER_DISABLE_RING_BUFFER=1
export TT_METAL_WATCHER_DISABLE_NOC_SANITIZE=1
export TT_METAL_WATCHER_DISABLE_STATUS=1
Ran pytest -svv models/demos/llama2_70b/demo/eval.py::test_LlamaModel_demo[wikitext-2k-greedy-tt-70b-T3000-80L]
on latest main, to run prefill for 32 users for 2k tokens with the full 80-layer model on sjc-snva-t3002.
watcher.log shows repeating:
k_id[0]: blank
k_id[1]: tt_metal/impl/dispatch/kernels/cq_prefetcher.cpp
k_id[2]: tt_metal/impl/dispatch/kernels/cq_dispatcher.cpp
k_id[3]: tt_metal/impl/dispatch/kernels/command_queue_bidirectional_tunneller.cpp
k_id[4]: tt_metal/impl/dispatch/kernels/cq_prefetcher.cpp
k_id[5]: tt_metal/impl/dispatch/kernels/cq_prefetcher.cpp
k_id[81]: tt_eager/tt_dnn/kernels/dataflow/reader_unary_interleaved_start_id.cpp
k_id[82]: tt_eager/tt_dnn/kernels/dataflow/writer_unary_interleaved_start_id.cpp
k_id[97]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_send_reader.cpp
k_id[98]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_send_writer.cpp
k_id[99]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_receive_reader.cpp
k_id[100]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_receive_writer.cpp
k_id[101]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_send_reader.cpp
k_id[102]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_send_writer.cpp
k_id[103]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_receive_reader.cpp
k_id[104]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_receive_writer.cpp
k_id[105]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_send_reader.cpp
k_id[106]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_send_writer.cpp
k_id[107]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_receive_reader.cpp
k_id[108]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_receive_writer.cpp
k_id[109]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_send_reader.cpp
k_id[110]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_send_writer.cpp
k_id[111]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_receive_reader.cpp
k_id[112]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_receive_writer.cpp
k_id[113]: tt_eager/tt_dnn/op_library/ccl/edm/erisc_datamover.cpp
k_id[114]: tt_eager/tt_dnn/op_library/ccl/edm/erisc_datamover.cpp
k_id[0]: blank
k_id[30]: tt_metal/impl/dispatch/kernels/command_queue_bidirectional_tunneller.cpp
k_id[31]: tt_metal/impl/dispatch/kernels/cq_prefetcher.cpp
k_id[32]: tt_metal/impl/dispatch/kernels/cq_dispatcher.cpp
k_id[93]: tt_eager/tt_dnn/kernels/dataflow/reader_unary_interleaved_start_id.cpp
k_id[94]: tt_eager/tt_dnn/kernels/dataflow/writer_unary_interleaved_start_id.cpp
Dump #64 completed at 1590.140s
hi @caixunshiren, can you help create a repro unit test that reproduces this on a single op?
can you add the per-core status to the post above? also, did you try running w/ DEBUG_STATUS still enabled? did that not repro?
Reproing with the watcher config, DEBUG_STATUS enabled.
export TT_METAL_WATCHER=10
export TT_METAL_WATCHER_DISABLE_ASSERT=1
export TT_METAL_WATCHER_DISABLE_PAUSE=1
export TT_METAL_WATCHER_DISABLE_RING_BUFFER=1
export TT_METAL_WATCHER_DISABLE_NOC_SANITIZE=1
@SeanNijjar
I took a look at the log and it looks like the allgather isn't fully dispatched - could that explain the hang?
For example, if you look at the log, you'll see that the allgather kernels are only running on devices 0, 6, and 7. Devices 1-5 aren't even running the allgather yet. They're just running the reader unary kernels:
k_id[0]: blank
k_id[19]: tt_metal/impl/dispatch/kernels/command_queue_bidirectional_tunneller.cpp
k_id[81]: tt_eager/tt_dnn/kernels/dataflow/reader_unary_interleaved_start_id.cpp
k_id[82]: tt_eager/tt_dnn/kernels/dataflow/writer_unary_interleaved_start_id.cpp
On a different note, I am not sure why these other kernels are running at the same time as the all gather kernels:
k_id[73]: tt_eager/tt_dnn/kernels/dataflow/reader_unary_interleaved_start_id.cpp
k_id[74]: tt_eager/tt_dnn/kernels/dataflow/writer_unary_interleaved_start_id.cpp
For example look at this dump on device 0. It looks like 2 kernels are running simultaneously on the same chip?
k_id[0]: blank
k_id[3]: tt_metal/impl/dispatch/kernels/command_queue_bidirectional_tunneller.cpp
k_id[73]: tt_eager/tt_dnn/kernels/dataflow/reader_unary_interleaved_start_id.cpp ## THIS IS NOT AN ALLGATHER KERNEL
k_id[74]: tt_eager/tt_dnn/kernels/dataflow/writer_unary_interleaved_start_id.cpp ## THIS IS NOT AN ALLGATHER KERNEL
k_id[89]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_send_reader.cpp
k_id[90]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_send_writer.cpp
k_id[91]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_receive_reader.cpp
k_id[92]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_receive_writer.cpp
k_id[93]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_send_reader.cpp
k_id[94]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_send_writer.cpp
k_id[95]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_receive_reader.cpp
k_id[96]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_receive_writer.cpp
k_id[97]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_send_reader.cpp
k_id[98]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_send_writer.cpp
k_id[99]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_receive_reader.cpp
k_id[100]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_receive_writer.cpp
k_id[101]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_send_reader.cpp
k_id[102]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_send_writer.cpp
k_id[103]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_receive_reader.cpp
k_id[104]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_receive_writer.cpp
k_id[105]: tt_eager/tt_dnn/op_library/ccl/edm/erisc_datamover.cpp
k_id[106]: tt_eager/tt_dnn/op_library/ccl/edm/erisc_datamover.cpp
The path to reader_unary_interleaved_start_id.cpp and a search for its references suggest it's not used in the dispatch/runtime datapath at all. So my understanding is that this is a dispatch bug, unless I've been misdirected in thinking that the execution model for metal is op-by-op per chip.
There are a few possibilities here:
1) Allgather finished on a couple of chips and the eltwise kernels were kicked off, but allgather hung on a couple of other chips.
2) Eltwise binary completed on some chips and allgather started getting kicked off, but eltwise hung.
3) A dispatcher (et al.) bug where it kicked off two ops on the same grid at the same time, and that is causing problems.
@cglagovich, do you have insight on the op before/after the allgather?
Also... I don't know where the dispatcher kernels went on some of these devices? They don't seem to be running? I was under the impression they're always running until finish?
Forgot to ping @aliuTT @pgkeller in the comment above. Do either of you guys have some insight into the watcher log above?
Is it ever expected to have multiple ops running on the same chip at the same time?
Does it ever make sense for cq_dispatcher to be missing on device 0? Is this expected or unexpected behaviour?
My understanding is unexpected but would like expert input!
There shouldn't be two ops active at once. It's possible that cq_dispatch sent op0 and, while it was still running, sent the next op1 for execution. Maybe some bug with the wait_for_workers part of cq_dispatch.
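As a conceptual sketch only (this is not the actual cq_dispatch code; the function name, mailbox pointer, and counts are assumptions), the dispatcher is expected to block on a workers-done count before sending the go signal for the next program, and a bug in that wait would let two ops overlap on the same grid, which is what the watcher dumps above appear to show:

#include <cstdint>

// Hypothetical illustration of a wait-for-workers barrier. A real dispatcher
// would poll a mailbox/semaphore that worker cores update when their kernels
// complete for the current program.
inline void wait_for_workers(volatile uint32_t* workers_done_counter,
                             uint32_t num_workers_expected) {
    // If this wait is skipped, or num_workers_expected is wrong, the next
    // program's go signal could be sent while the previous op is still
    // running on some cores.
    while (*workers_done_counter < num_workers_expected) {
        // spin until every worker from the previous op reports done
    }
}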
The data shown above isn't quite enough to conclusively say this is a dispatch bug, though it is looking that way. Cores that don't run at all could display the kernel ids from the prior op. The per-core status will show the core state, or maybe you know already that all cores are expected to run these 2 ops? If so, then yes, looks like a dispatch bug.
Thanks @pgkeller, @aliuTT for the clarification. It looks like the eltwise kernels are probably all done (at least the ones I checked). However, I am not seeing any of the all-gather kernels being run on most of the chips, and it looks like they are never kicked off, because on the cores where I would expect them to be listed, I instead see the (presumably older) eltwise op kernels.
For example a chip where the allgather is running:
k_id[0]: blank
k_id[3]: tt_metal/impl/dispatch/kernels/command_queue_bidirectional_tunneller.cpp
k_id[73]: tt_eager/tt_dnn/kernels/dataflow/reader_unary_interleaved_start_id.cpp
k_id[74]: tt_eager/tt_dnn/kernels/dataflow/writer_unary_interleaved_start_id.cpp
k_id[89]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_send_reader.cpp
k_id[90]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_send_writer.cpp
k_id[91]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_receive_reader.cpp
k_id[92]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_receive_writer.cpp
k_id[93]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_send_reader.cpp
k_id[94]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_send_writer.cpp
k_id[95]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_receive_reader.cpp
k_id[96]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_receive_writer.cpp
k_id[97]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_send_reader.cpp
k_id[98]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_send_writer.cpp
k_id[99]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_receive_reader.cpp
k_id[100]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_receive_writer.cpp
k_id[101]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_send_reader.cpp
k_id[102]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_send_writer.cpp
k_id[103]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_receive_reader.cpp
k_id[104]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_receive_writer.cpp
k_id[105]: tt_eager/tt_dnn/op_library/ccl/edm/erisc_datamover.cpp
k_id[106]: tt_eager/tt_dnn/op_library/ccl/edm/erisc_datamover.cpp
and then in the dumps above I see them listed where expected (y=2 row):
Device 0, Worker Core (x=0,y=0)[physical (x=1,y=2)]: CWFW, NSW, W, W, W rmsg:D1G|BNt smsg:GDDD k_ids:92|91|0
Device 0, Worker Core (x=1,y=0)[physical (x=2,y=2)]: CWFW, NSW, W, W, W rmsg:D1G|BNt smsg:GDDD k_ids:96|95|0
Device 0, Worker Core (x=2,y=0)[physical (x=3,y=2)]: CWFW, NSW, W, W, W rmsg:D1G|BNt smsg:GDDD k_ids:100|99|0
Device 0, Worker Core (x=3,y=0)[physical (x=4,y=2)]: CWFW, NSW, W, W, W rmsg:D1G|BNt smsg:GDDD k_ids:104|103|0
Device 0, Worker Core (x=4,y=0)[physical (x=6,y=2)]: CWFW,NSMW, W, W, W rmsg:D1G|BNt smsg:GDDD k_ids:90|89|0
Device 0, Worker Core (x=5,y=0)[physical (x=7,y=2)]: CWFW,NSMW, W, W, W rmsg:D1G|BNt smsg:GDDD k_ids:94|93|0
Device 0, Worker Core (x=6,y=0)[physical (x=8,y=2)]: CWFW,NSMW, W, W, W rmsg:D1G|BNt smsg:GDDD k_ids:98|97|0
Device 0, Worker Core (x=7,y=0)[physical (x=9,y=2)]: CWFW,NSMW, W, W, W rmsg:D1G|BNt smsg:GDDD k_ids:102|101|0
now on a different chip I see the following kernels (non-all-gather kernels)... also strangely, where's cq_dispatcher?:
k_id[0]: blank
k_id[19]: tt_metal/impl/dispatch/kernels/command_queue_bidirectional_tunneller.cpp
k_id[81]: tt_eager/tt_dnn/kernels/dataflow/reader_unary_interleaved_start_id.cpp
k_id[82]: tt_eager/tt_dnn/kernels/dataflow/writer_unary_interleaved_start_id.cpp
and the core dump above shows no all-gather kernels on the expected cores (instead showing the done unary interleaved kernels):
Device 2, Worker Core (x=0,y=0)[physical (x=1,y=1)]: GW, W, W, W, W rmsg:D1D|BNt smsg:DDDD k_ids:82|81|0
Device 2, Worker Core (x=1,y=0)[physical (x=2,y=1)]: GW, W, W, W, W rmsg:D1D|BNt smsg:DDDD k_ids:82|81|0
Device 2, Worker Core (x=2,y=0)[physical (x=3,y=1)]: GW, W, W, W, W rmsg:D1D|BNt smsg:DDDD k_ids:82|81|0
Device 2, Worker Core (x=3,y=0)[physical (x=4,y=1)]: GW, W, W, W, W rmsg:D1D|BNt smsg:DDDD k_ids:82|81|0
Device 2, Worker Core (x=4,y=0)[physical (x=6,y=1)]: GW, W, W, W, W rmsg:D1D|BNt smsg:DDDD k_ids:82|81|0
Device 2, Worker Core (x=5,y=0)[physical (x=7,y=1)]: GW, W, W, W, W rmsg:D1D|BNt smsg:DDDD k_ids:82|81|0
Device 2, Worker Core (x=6,y=0)[physical (x=8,y=1)]: GW, W, W, W, W rmsg:D1D|BNt smsg:DDDD k_ids:82|81|0
Device 2, Worker Core (x=7,y=0)[physical (x=9,y=1)]: GW, W, W, W, W rmsg:D1D|BNt smsg:DDDD k_ids:82|81|0
So based on the dumps above, it looks like the all-gather is not being fully deployed on all chips (and this is also consistent with hangs I've seen in other testing). For those reasons I think it's likely a dispatcher bug.
Also @cglagovich, I'm not sure how viable this is in terms of reproducibility (it will likely affect it), but if we add a DPRINT right after the call to erisc::datamover::eth_setup_handshake(handshake_addr, act_as_sender_in_handshake); in tt_eager/tt_dnn/op_library/ccl/edm/erisc_datamover.cpp:193, then we'll see if all the eriscs used by all-gather have completed handshaking. We should see 16 outputs printed, but again, I imagine this may affect reproducibility (especially since the kernel sizes will change with the addition of the dprint messages).
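A minimal sketch of that instrumentation, assuming the DPRINT stream API from tt-metal's kernel debug-print support; the include path is assumed, the handshake call is quoted from the comment above, and the surrounding kernel body is elided:

#include "debug/dprint.h"  // assumed location of the DPRINT/ENDL macros

// ... EDM argument parsing and channel setup elided ...
erisc::datamover::eth_setup_handshake(handshake_addr, act_as_sender_in_handshake);
// One line per erisc used by the all-gather; per the comment above we should
// see 16 of these in the dprint log if every erisc completes the handshake.
DPRINT << "EDM handshake complete" << ENDL();
// ... rest of the datamover main loop ...

Note that dprint output only shows up if the dprint server is enabled on the host (e.g. via the TT_METAL_DPRINT_CORES environment variable), which adds another timing perturbation on top of the watcher.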
First Run
Allgather kernels kicked off on all chips, but it looks like the EDM (erisc data mover) kernel is not entering kernel_main for one of the cores on one of the chips (post status in watcher log == 'R').
Now adding EDM DEBUG STATUS messages
Second Run:
Allgather EDM waiting to handshake with the other end of the link. The other end of the link is device 3, and device 3 doesn't dispatch its EDM (in fact, the prior op appears stuck?):
Device 3, Worker Core (x=0,y=4)[physical (x=1,y=5)]: NTW,CRBW,UABW,MWDD, R rmsg:D1G|bNT smsg:GGGG k_ids:0|95|96
...
Device 3, Ethnet Core (x=0,y=9)[physical (x=1,y=6)]: NWBD, X, X, X, X k_id:27
k_id[0]: blank
k_id[27]: tt_metal/impl/dispatch/kernels/command_queue_bidirectional_tunneller.cpp
k_id[95]: tt_eager/tt_dnn/op_library/softmax/kernels/dataflow/readed_unary_sharded_sm_causal_mask_hw_dims.cpp
k_id[96]: tt_eager/tt_dnn/op_library/softmax/kernels/compute/softmax_sharded.cpp
Device 4, Worker Core (x=0,y=0)[physical (x=1,y=1)]: CWFW, NSW, W, W, W rmsg:D1G|BNt smsg:GDDD k_ids:258|257|0
Third Run
- Allgather EDM waiting to handshake with the other end of the link. The other end of the link is device 3, and device 3 is showing EDM Done (based on the DEBUG_STATUS I added) while another kernel is running.
- It's possible that this EDM Done status was from the prior run and the second-iteration EDM kernel was never started.
I'm going to see if I can isolate which iteration/invocation the EDM Done signal belongs to. I suspect that the op isn't being fully dispatched (due to prior kernel running).
Small update - I took another look at the watcher log and I'm noticing similarity with the one in #7824 where it looks like all-gather is stuck waiting for handshake. I'm going to try my recommendation above to prove conclusively one way or another.
I think we can consider marking this as duplicate of #7824, I'm seeing the same issue where all-gather is not being dispatched on all chips. I've confirmed that no all-gather finished so we aren't seeing some chips failing to finish the op, rather it's never fully starting.
I was seeing allgather starting on 6 of the 8 chips. Devices 3 and 4 did not properly dispatch the op, those are left and right chips, respectively.
The only difference from #7824 is that I don't see any kernels running on those chips where all-gather isn't yet dispatched, whereas in the other issue, I was seeing other ops/kernels still running on a core.
Some other hangs do not happen anymore after upgrading to the new firmware running at 1GHz (made available in https://github.com/tenstorrent/tt-metal/commit/13193c1247a8713b9bd77e4856fe97cf14d89202)
Can you guys re-try this?
Yes we can try this
I was able to repro the hang with the new FW at 1GHz.
Next step is to repro with the watcher at 1GHz, using the same watcher options as last time.
There's an unfortunate side effect of reproducing the hang at 1GHz. Twice now, tt-smi reset has failed after the hang. Yesterday, this left 3 of 4 PCIe devices undetectable, requiring a power cycle and a card re-seat.
(venv) cglagovich@sjc-snva-t3002:~/tt-metal/scripts/install$ tt-smi -r 0,1,2,3
Starting pci link reset on WH devices at pci indices: 0, 1, 2, 3
Traceback (most recent call last):
File "/home/cglagovich/tt-metal/scripts/install/venv/bin/tt-smi", line 8, in <module>
sys.exit(main())
File "/home/cglagovich/tt-metal/scripts/install/venv/lib/python3.8/site-packages/tt_smi/tt_smi.py", line 733, in main
pci_board_reset(args.reset, reinit=True)
File "/home/cglagovich/tt-metal/scripts/install/venv/lib/python3.8/site-packages/tt_smi/tt_smi_backend.py", line 544, in pci_board_reset
reset_devices = WHChipReset().full_lds_reset(pci_interfaces=reset_wh_pci_idx)
File "/home/cglagovich/tt-metal/scripts/install/venv/lib/python3.8/site-packages/tt_tools_common/reset_common/wh_reset.py", line 83, in full_lds_reset
refclk_list.append(read_refclk_counter(chip))
File "/home/cglagovich/tt-metal/scripts/install/venv/lib/python3.8/site-packages/tt_tools_common/utils_common/tools_utils.py", line 189, in read_refclk_counter
high1 = chip.axi_read32(high_addr)
Exception: Read 0xffffffff from ARC scratch[6]: you should reset the board.
I recovered the machine by uninstalling the kmod, reinstalling the kmod, and rebooting the host.
This seems very reproducible with 1GHz clock - yikes. I'll do it once more with watcher enabled
1GHz prefill hang repro looks similar to last repro.
example kernels running on each device:
device 1: tt_eager/tt_dnn/kernels/dataflow/writer_unary_interleaved_start_id.cpp
device 6: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_receive_writer.cpp
device 2: tt_eager/tt_dnn/kernels/dataflow/writer_unary_interleaved_start_id.cpp
device 5: tt_eager/tt_dnn/kernels/dataflow/writer_unary_interleaved_start_id.cpp
device 3: tt_eager/tt_dnn/kernels/dataflow/writer_unary_interleaved_start_id.cpp
device 4: tt_eager/tt_dnn/kernels/dataflow/writer_unary_interleaved_start_id.cpp
device 0: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_receive_writer.cpp
device 7: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_receive_writer.cpp
Each core which is doing some interleaved writing is in GW mode (go wait). I assume this means they have received the next program and are waiting for some signal before progressing?
Each core which is doing allgather is waiting for something - CWFW, NSW, for example. Might that mean these cores are hung and are keeping other chips from progressing?
Hey @cglagovich, this is in line with what I was seeing. If you look at some of the devices in the log (e.g. device 4), you'll see that the allgather kernels are never dispatched. This means the allgather can't possibly complete.
I verified on my end a couple of days ago that allgather is never being programmed/executed on that device (and one other). It is not the case that allgather finished on device 4 and is failing to finish on the other devices.
This hang repro succeeded in bringing down another t3000 machine. tt-smi reset fails, and we need a power cycle to recover.
Whoa. Can you document the procedure to reproduce this? We need to flag this to syseng right away.
c.c. @ttmtrajkovic @davorchap @TT-billteng @tt-rkim @tapspatel (awareness that 1GHz + model hang can completely crash the system)
Working with Daniel Rosen on this problem in slack https://tenstorrent.slack.com/archives/C04CL97VAEA/p1714760345522079
Turns out that a power cycle does not always resolve the problem.
Repro procedure:
So far, this has been reproed on sjc-snva-t3002 and sjc-lab-t3004.
Shoot - is this not in CI, nor a private test that we gave to @tapspatel to run before verifying the 1GHz upgrade?
This is a hanging test which hung at 800MHz and which we just tried to repro at 1GHz. It's not in CI and not run privately by Taps, since it's known to hang.
@cglagovich, a host cold boot (power cycle) may be required if the NOC hung and traffic to PCIe from the host backs up. That isn't necessarily a syseng problem but a bug in software. Have you tried commenting out the busy message and running at a very low 500MHz clock? @taps may know how to do it, and if it still hangs then it definitely indicates a SW problem.
After a power cycle, both machines fail to run tt-smi:
(.tools_env) cglagovich@sjc-lab-t3004-special-cglagovich-for-reservation-2140484:~/tt-metal/scripts/install$ tt-smi
Detected Chips: 6
Error in detecting devices!
After a power cycle, running tt-topology -l mesh is necessary before tt-smi will work. These machines are back online.
@cglagovich let me know if you want to try 500mhz, I can set that up.
The hang at 1GHz seems very different from the original. Should we spin up another issue to debug this? Otherwise we might lose track of which issue we are debugging.
Daniel Rosen seems to believe that the failure to reset the cards is a tt-smi issue. Agreed it's strange that this failure to reset occurs at 1GHz but not 800MHz. I'd say let's keep the discussion alive in this thread while we repro with FD2.0.
Ran this hanging test on main with FD2.0 @ 1GHz and I was unable to repro the hang over the course of 2 hours. To compare, this test used to hang within the first 15 minutes of running. Possibly resolved with FD2, but I'll have to run again overnight for more confidence.
great, fingers crossed. would love for FD2 to resolve a P0 (or two or...)
@cglagovich @uaydonat @ttmtrajkovic , decent chance this is also di / dt
Data:
I'm unable to repro this hang on main. Closing, but feel free to reopen if you believe there are actions to take.
Description
We observe ND hangs with the Llama2-70B model during prefill for a 2K sequence length on 80 layers; the hang occurs after prefilling roughly 6 to 10 users. We notice this behavior on branch model-team/demo-perf, which was rebased to main yesterday.
To Reproduce
Steps to reproduce the behavior:
Device: T3000
Checkout model-team/demo-perf
Build tt-metal
Run the test