tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.
Apache License 2.0

Llama2-70B Prefill on 2K Sequence Length: Non-deterministic Hang #7135

Closed caixunshiren closed 4 months ago

caixunshiren commented 7 months ago

Description: We observe ND (non-deterministic) hangs with the Llama2-70B model during prefill at 2K sequence length on 80 layers, occurring after prefilling around 6 to 10 users. We notice the following behavior:

To Reproduce: on a T3000 device, check out the model-team/demo-perf branch, build tt-metal, and run:

pytest -svv models/demos/llama2_70b/demo/eval.py::test_LlamaModel_demo[wikitext-2k-greedy-tt-70b-T3000-80L] 2>&1 | tee perplexity_80L_watcher.log
cglagovich commented 7 months ago

fyi @uaydonat

cglagovichTT commented 7 months ago

@jliangTT to reassign and/or reprioritize

johanna-rock-tt commented 7 months ago

Hey @caixunshiren, can you try the workarounds mentioned in this hang issue: https://github.com/tenstorrent-metal/tt-metal/issues/7066

So basically: try setting out_subblock_h=1 and out_subblock_w=1 for all matmuls. If that fixes your test, you'd then need to narrow down which matmuls were the problematic ones. In Falcon40b this was necessary to avoid hangs for the FF1, LM Head, and Output Projection matmuls.
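
For illustration, a minimal sketch of what that workaround might look like on one matmul program config (hypothetical values; the actual config class, grid size, and block parameters used by the Llama2-70B matmuls should be taken from the model code):

# Hypothetical sketch of the out_subblock workaround; everything except
# out_subblock_h/out_subblock_w is a placeholder, not the model's real config.
import tt_lib as ttl

program_config = ttl.operations.primary.MatmulMultiCoreReuseMultiCastProgramConfig(
    compute_with_storage_grid_size=(8, 4),
    in0_block_w=4,
    out_subblock_h=1,  # workaround from #7066: force 1x1 output subblocks
    out_subblock_w=1,
    per_core_M=8,
    per_core_N=4,
    transpose_mcast=False,
    fused_activation=None,
)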

caixunshiren commented 7 months ago

#7066

Thanks for the info. We will give it a try!

jliangTT commented 6 months ago

Will downgrade to P1 to reflect priority for the runtime team. The team is focusing on resolving the issues blocking some models near the point of readiness: https://docs.google.com/spreadsheets/d/1_11lY1B1L2BK-8dnKpCJHMYS_agVl5-3FubkpuHsRko/edit#gid=871704423 (internal link)

uaydonat commented 6 months ago

Is this still a blocker, @cglagovich?

cglagovichTT commented 6 months ago

We released the demo in decode-only mode to work around this hang. It will become a blocker soon, when we want to release with prefill + decode. Maybe FD2 solves this.

uaydonat commented 6 months ago

@cglagovich @caixunshiren we have been asked to try the Watcher and do the initial triaging for this hang. Does it look like a dispatcher hang, or is some other core hanging? In the latter case, FD2.0 will probably not fix the hang.

caixunshiren commented 6 months ago

> @cglagovich @caixunshiren we have been asked to try the Watcher and do the initial triaging for this hang. Does it look like a dispatcher hang, or is some other core hanging? In the latter case, FD2.0 will probably not fix the hang.

We've tried it with the watcher enabled, but the hang does not occur. It only occurs when we disable the watcher. It could be that the watcher changes the race condition.

jliangTT commented 6 months ago

@pgkeller, you mentioned there are ways to enable the watcher piecewise so that it reduces the timing perturbation. Are there any instructions you can point the team to?

uaydonat commented 6 months ago

It looks like the getting started guide is up-to-date

https://tenstorrent.github.io/tt-metal/latest/tt-metalium/tools/watcher.html#enabling

cglagovichTT commented 6 months ago

Thanks, @uaydonat. We will continue investigating with the minimum set of watcher options.

kevinmiTT11 commented 6 months ago

Used the minimal watcher features with 10-second updates to debug the hang:

export TT_METAL_WATCHER=10
export TT_METAL_WATCHER_DISABLE_ASSERT=1
export TT_METAL_WATCHER_DISABLE_PAUSE=1
export TT_METAL_WATCHER_DISABLE_RING_BUFFER=1
export TT_METAL_WATCHER_DISABLE_NOC_SANITIZE=1
export TT_METAL_WATCHER_DISABLE_STATUS=1

Ran pytest -svv models/demos/llama2_70b/demo/eval.py::test_LlamaModel_demo[wikitext-2k-greedy-tt-70b-T3000-80L] on latest main, to run prefill for 32 users at 2k tokens with the full 80-layer model on sjc-snva-t3002.

  1. The first run hangs at the all-gather kernels for user 0.

watcher.log shows repeating:

k_id[0]: blank
k_id[1]: tt_metal/impl/dispatch/kernels/cq_prefetcher.cpp
k_id[2]: tt_metal/impl/dispatch/kernels/cq_dispatcher.cpp
k_id[3]: tt_metal/impl/dispatch/kernels/command_queue_bidirectional_tunneller.cpp
k_id[4]: tt_metal/impl/dispatch/kernels/cq_prefetcher.cpp
k_id[5]: tt_metal/impl/dispatch/kernels/cq_prefetcher.cpp
k_id[81]: tt_eager/tt_dnn/kernels/dataflow/reader_unary_interleaved_start_id.cpp
k_id[82]: tt_eager/tt_dnn/kernels/dataflow/writer_unary_interleaved_start_id.cpp
k_id[97]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_send_reader.cpp
k_id[98]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_send_writer.cpp
k_id[99]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_receive_reader.cpp
k_id[100]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_receive_writer.cpp
k_id[101]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_send_reader.cpp
k_id[102]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_send_writer.cpp
k_id[103]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_receive_reader.cpp
k_id[104]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_receive_writer.cpp
k_id[105]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_send_reader.cpp
k_id[106]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_send_writer.cpp
k_id[107]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_receive_reader.cpp
k_id[108]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_receive_writer.cpp
k_id[109]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_send_reader.cpp
k_id[110]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_send_writer.cpp
k_id[111]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_receive_reader.cpp
k_id[112]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_receive_writer.cpp
k_id[113]: tt_eager/tt_dnn/op_library/ccl/edm/erisc_datamover.cpp
k_id[114]: tt_eager/tt_dnn/op_library/ccl/edm/erisc_datamover.cpp
k_id[0]: blank
k_id[30]: tt_metal/impl/dispatch/kernels/command_queue_bidirectional_tunneller.cpp
k_id[31]: tt_metal/impl/dispatch/kernels/cq_prefetcher.cpp
k_id[32]: tt_metal/impl/dispatch/kernels/cq_dispatcher.cpp
k_id[93]: tt_eager/tt_dnn/kernels/dataflow/reader_unary_interleaved_start_id.cpp
k_id[94]: tt_eager/tt_dnn/kernels/dataflow/writer_unary_interleaved_start_id.cpp
Dump #64 completed at 1590.140s
  2. The second run also hangs at all_gather, at the 4th user.
  3. A third consecutive run finished prefill for all 32 users.
  4. After a night, a new run hangs at the first user at all_gather, showing the same watcher log.

jliangTT commented 6 months ago

Hi @caixunshiren, can you help create a repro unit test that reproduces this on a single op?
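
For reference, the kind of standalone repro being asked for would look roughly like the sketch below: loop a single all_gather across the T3000 devices until it hangs. This is hypothetical; the tensor shapes, layouts, and the exact all_gather signature should be copied from the existing all_gather unit tests in the repo.

# Hypothetical repro sketch only; mirror the existing all_gather tests for the
# real API. The shape, dim, and num_links values are placeholders.
import torch
import tt_lib as ttl

def run_all_gather_loop(devices, num_iters=1000, shape=(1, 1, 32, 8192), dim=3):
    for i in range(num_iters):
        # Put one random tile-layout tensor on each device in the ring.
        inputs = [
            ttl.tensor.Tensor(torch.randn(shape), ttl.tensor.DataType.BFLOAT16)
            .to(ttl.tensor.Layout.TILE)
            .to(dev)
            for dev in devices
        ]
        outputs = ttl.tensor.all_gather(inputs, dim, num_links=1)
        for out in outputs:
            out.cpu()  # force a read-back so a hang shows up as a stuck iteration
        print(f"all_gather iteration {i} completed")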

pgkeller commented 6 months ago

Can you add the per-core status to the post above? Also, did you try running with DEBUG_STATUS still enabled? Did that not repro?

cglagovichTT commented 6 months ago

Reproducing with the watcher config below, DEBUG_STATUS enabled:

export TT_METAL_WATCHER=10
export TT_METAL_WATCHER_DISABLE_ASSERT=1
export TT_METAL_WATCHER_DISABLE_PAUSE=1
export TT_METAL_WATCHER_DISABLE_RING_BUFFER=1
export TT_METAL_WATCHER_DISABLE_NOC_SANITIZE=1

watcher.log

cglagovichTT commented 6 months ago

@SeanNijjar

SeanNijjar commented 6 months ago

I took a look at the log, and it looks like the allgather isn't fully dispatched, which could explain the hang.

For example, if you look at the log, you'll see that the allgather kernels are only running on devices 0, 6, and 7. Devices 1-5 aren't even running the allgather yet. They're just running the reader unary kernels:

k_id[0]: blank
k_id[19]: tt_metal/impl/dispatch/kernels/command_queue_bidirectional_tunneller.cpp
k_id[81]: tt_eager/tt_dnn/kernels/dataflow/reader_unary_interleaved_start_id.cpp
k_id[82]: tt_eager/tt_dnn/kernels/dataflow/writer_unary_interleaved_start_id.cpp

On a different note, I am not sure why these other kernels are running at the same time as the all gather kernels:

k_id[73]: tt_eager/tt_dnn/kernels/dataflow/reader_unary_interleaved_start_id.cpp
k_id[74]: tt_eager/tt_dnn/kernels/dataflow/writer_unary_interleaved_start_id.cpp

For example, look at this dump on device 0. It looks like kernels from two different ops are running simultaneously on the same chip:

k_id[0]: blank
k_id[3]: tt_metal/impl/dispatch/kernels/command_queue_bidirectional_tunneller.cpp
k_id[73]: tt_eager/tt_dnn/kernels/dataflow/reader_unary_interleaved_start_id.cpp      ## THIS IS NOT AN ALLGATHER KERNEL
k_id[74]: tt_eager/tt_dnn/kernels/dataflow/writer_unary_interleaved_start_id.cpp      ## THIS IS NOT AN ALLGATHER KERNEL
k_id[89]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_send_reader.cpp
k_id[90]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_send_writer.cpp
k_id[91]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_receive_reader.cpp
k_id[92]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_receive_writer.cpp
k_id[93]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_send_reader.cpp
k_id[94]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_send_writer.cpp
k_id[95]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_receive_reader.cpp
k_id[96]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_receive_writer.cpp
k_id[97]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_send_reader.cpp
k_id[98]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_send_writer.cpp
k_id[99]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_receive_reader.cpp
k_id[100]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_receive_writer.cpp
k_id[101]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_send_reader.cpp
k_id[102]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_send_writer.cpp
k_id[103]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_receive_reader.cpp
k_id[104]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_receive_writer.cpp
k_id[105]: tt_eager/tt_dnn/op_library/ccl/edm/erisc_datamover.cpp
k_id[106]: tt_eager/tt_dnn/op_library/ccl/edm/erisc_datamover.cpp

The path to reader_unary_interleaved_start_id.cpp, and a search for its references, suggests it's not used in the dispatch/runtime datapath at all.

So my understanding is that this is a dispatch bug, unless I've been misdirected in thinking that the execution model for Metal is op-by-op per chip.

There are a few possibilities here:

  1. Allgather finished on a couple of chips and the eltwise kernels were kicked off, but allgather hung on a couple of other chips.
  2. Eltwise binary completed on some chips and allgather started getting kicked off, but eltwise hung.
  3. A dispatcher (et al.) bug where it kicked off two ops on the same grid at the same time, and that is causing problems.

@cglagovich, do you have insight on the op before/after the allgather?

Also, I don't know where the dispatcher kernels went on some of these devices; they don't seem to be running. I was under the impression they're always running until finish?

SeanNijjar commented 6 months ago

Forgot to ping @aliuTT @pgkeller in the comment above. Do either of you guys have some insight into the watcher log above?

Is it ever expected to have multiple ops running on the same chip at the same time?

Does it ever make sense for cq_dispatcher to be missing on device 0? Is this expected or unexpected behaviour?

My understanding is that it's unexpected, but I would like expert input!

aliuTT commented 6 months ago

There shouldn't be two ops active at once. It's possible that cq_dispatch sent op0 and, while it was still running, sent the next op (op1) for execution. Maybe there's some bug with the wait_for_workers part of cq_dispatch.

pgkeller commented 6 months ago

The data shown above isn't quite enough to conclusively say this is a dispatch bug, though it is looking that way. Cores that don't run at all could display the kernel ids from the prior op. The per-core status will show the core state, or maybe you know already that all cores are expected to run these 2 ops? If so, then yes, looks like a dispatch bug.

SeanNijjar commented 6 months ago

Thanks @pgkeller, @aliuTT for the clarification. It looks like the eltwise kernels are probably all done (at least the ones I checked). However, I am not seeing any of the all-gather kernels being run on most of the chips, and it looks like they are never kicked off: on the cores where I would expect them to be listed, I instead see the (presumably older) eltwise op kernels.

For example, here is a chip where the allgather is running:

k_id[0]: blank
k_id[3]: tt_metal/impl/dispatch/kernels/command_queue_bidirectional_tunneller.cpp
k_id[73]: tt_eager/tt_dnn/kernels/dataflow/reader_unary_interleaved_start_id.cpp
k_id[74]: tt_eager/tt_dnn/kernels/dataflow/writer_unary_interleaved_start_id.cpp
k_id[89]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_send_reader.cpp
k_id[90]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_send_writer.cpp
k_id[91]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_receive_reader.cpp
k_id[92]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_receive_writer.cpp
k_id[93]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_send_reader.cpp
k_id[94]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_send_writer.cpp
k_id[95]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_receive_reader.cpp
k_id[96]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_receive_writer.cpp
k_id[97]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_send_reader.cpp
k_id[98]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_send_writer.cpp
k_id[99]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_receive_reader.cpp
k_id[100]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_receive_writer.cpp
k_id[101]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_send_reader.cpp
k_id[102]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_send_writer.cpp
k_id[103]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_receive_reader.cpp
k_id[104]: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_receive_writer.cpp
k_id[105]: tt_eager/tt_dnn/op_library/ccl/edm/erisc_datamover.cpp
k_id[106]: tt_eager/tt_dnn/op_library/ccl/edm/erisc_datamover.cpp

and then in the dumps above I see them listed where expected (y=2 row):

Device 0, Worker Core (x=0,y=0)[physical (x=1,y=2)]:   CWFW, NSW,   W,   W,   W  rmsg:D1G|BNt smsg:GDDD k_ids:92|91|0
Device 0, Worker Core (x=1,y=0)[physical (x=2,y=2)]:   CWFW, NSW,   W,   W,   W  rmsg:D1G|BNt smsg:GDDD k_ids:96|95|0
Device 0, Worker Core (x=2,y=0)[physical (x=3,y=2)]:   CWFW, NSW,   W,   W,   W  rmsg:D1G|BNt smsg:GDDD k_ids:100|99|0
Device 0, Worker Core (x=3,y=0)[physical (x=4,y=2)]:   CWFW, NSW,   W,   W,   W  rmsg:D1G|BNt smsg:GDDD k_ids:104|103|0
Device 0, Worker Core (x=4,y=0)[physical (x=6,y=2)]:   CWFW,NSMW,   W,   W,   W  rmsg:D1G|BNt smsg:GDDD k_ids:90|89|0
Device 0, Worker Core (x=5,y=0)[physical (x=7,y=2)]:   CWFW,NSMW,   W,   W,   W  rmsg:D1G|BNt smsg:GDDD k_ids:94|93|0
Device 0, Worker Core (x=6,y=0)[physical (x=8,y=2)]:   CWFW,NSMW,   W,   W,   W  rmsg:D1G|BNt smsg:GDDD k_ids:98|97|0
Device 0, Worker Core (x=7,y=0)[physical (x=9,y=2)]:   CWFW,NSMW,   W,   W,   W  rmsg:D1G|BNt smsg:GDDD k_ids:102|101|0

Now, on a different chip, I see the following kernels (non-all-gather kernels). Also, strangely, where's cq_dispatcher?

k_id[0]: blank
k_id[19]: tt_metal/impl/dispatch/kernels/command_queue_bidirectional_tunneller.cpp
k_id[81]: tt_eager/tt_dnn/kernels/dataflow/reader_unary_interleaved_start_id.cpp
k_id[82]: tt_eager/tt_dnn/kernels/dataflow/writer_unary_interleaved_start_id.cpp

and the core dump above shows no all-gather kernels on the expected cores (instead showing the completed unary interleaved kernels):

Device 2, Worker Core (x=0,y=0)[physical (x=1,y=1)]:     GW,   W,   W,   W,   W  rmsg:D1D|BNt smsg:DDDD k_ids:82|81|0
Device 2, Worker Core (x=1,y=0)[physical (x=2,y=1)]:     GW,   W,   W,   W,   W  rmsg:D1D|BNt smsg:DDDD k_ids:82|81|0
Device 2, Worker Core (x=2,y=0)[physical (x=3,y=1)]:     GW,   W,   W,   W,   W  rmsg:D1D|BNt smsg:DDDD k_ids:82|81|0
Device 2, Worker Core (x=3,y=0)[physical (x=4,y=1)]:     GW,   W,   W,   W,   W  rmsg:D1D|BNt smsg:DDDD k_ids:82|81|0
Device 2, Worker Core (x=4,y=0)[physical (x=6,y=1)]:     GW,   W,   W,   W,   W  rmsg:D1D|BNt smsg:DDDD k_ids:82|81|0
Device 2, Worker Core (x=5,y=0)[physical (x=7,y=1)]:     GW,   W,   W,   W,   W  rmsg:D1D|BNt smsg:DDDD k_ids:82|81|0
Device 2, Worker Core (x=6,y=0)[physical (x=8,y=1)]:     GW,   W,   W,   W,   W  rmsg:D1D|BNt smsg:DDDD k_ids:82|81|0
Device 2, Worker Core (x=7,y=0)[physical (x=9,y=1)]:     GW,   W,   W,   W,   W  rmsg:D1D|BNt smsg:DDDD k_ids:82|81|0

So based on the dumps above, it looks like the all-gather is not being fully deployed on all chips (and this is also consistent with hangs I've seen in other testing). For those reasons I think it's likely a dispatcher bug.

SeanNijjar commented 6 months ago

Also @cglagovich, I'm not sure how viable this is in terms of reproducibility (likely it will affect it), but if we add a DPRINT right after the call to erisc::datamover::eth_setup_handshake(handshake_addr, act_as_sender_in_handshake); in tt_eager/tt_dnn/op_library/ccl/edm/erisc_datamover.cpp:193, then we'll see if all the eriscs used by all-gather have completed handshaking.

We should see 16 outputs printed, but again, I imagine this may affect reproducibility (especially since the kernel sizes will change with the addition of the DPRINT messages).

SeanNijjar commented 6 months ago

PLEASE IGNORE - ACCIDENTALLY POSTED HERE INSTEAD OF INTENDED ISSUE #7824

First Run

Allgather kernels were kicked off on all chips, but it looks like the EDM (erisc data mover) kernel is not entering kernel main for one of the cores on one of the chips (post status in the watcher log == 'R').

Now adding EDM DEBUG_STATUS messages.

Second Run:

Allgather EDM is waiting to handshake with the other end of the link. The other end of the link is device 3, and device 3 doesn't dispatch EDM (in fact, the prior op is stuck?):

Device 3, Worker Core (x=0,y=4)[physical (x=1,y=5)]:    NTW,CRBW,UABW,MWDD,   R  rmsg:D1G|bNT smsg:GGGG k_ids:0|95|96
...
Device 3, Ethnet Core (x=0,y=9)[physical (x=1,y=6)]:   NWBD,   X,   X,   X,   X  k_id:27
k_id[0]: blank
k_id[27]: tt_metal/impl/dispatch/kernels/command_queue_bidirectional_tunneller.cpp
k_id[95]: tt_eager/tt_dnn/op_library/softmax/kernels/dataflow/readed_unary_sharded_sm_causal_mask_hw_dims.cpp
k_id[96]: tt_eager/tt_dnn/op_library/softmax/kernels/compute/softmax_sharded.cpp
Device 4, Worker Core (x=0,y=0)[physical (x=1,y=1)]:   CWFW, NSW,   W,   W,   W  rmsg:D1G|BNt smsg:GDDD k_ids:258|257|0

Third Run

  • Allgather EDM is waiting to handshake with the other end of the link. The other end of the link is device 3, and device 3 is showing EDM Done (based on the DEBUG_STATUS I added); another kernel is running.
  • It's possible that this EDM Done status was from the prior run and the second iteration's EDM kernel was never started.

I'm going to see if I can isolate which iteration/invocation the EDM Done signal belongs to. I suspect that the op isn't being fully dispatched (due to prior kernel running).

SeanNijjar commented 6 months ago

Small update: I took another look at the watcher log, and I'm noticing similarity with the one in #7824, where it looks like the all-gather is stuck waiting for the handshake. I'm going to try my recommendation above to prove it conclusively one way or the other.

SeanNijjar commented 6 months ago

I think we can consider marking this as a duplicate of #7824; I'm seeing the same issue where the all-gather is not being dispatched on all chips. I've confirmed that no all-gather finished, so we aren't seeing some chips failing to finish the op; rather, it never fully starts.

I was seeing the allgather start on 6 of the 8 chips. Devices 3 and 4 did not properly dispatch the op; those are the left and right chips, respectively.

The only difference from #7824 is that I don't see any kernels running on those chips where all-gather isn't yet dispatched, whereas in the other issue, I was seeing other ops/kernels still running on a core.

uaydonat commented 6 months ago

Some other hangs no longer occur after upgrading to the new firmware running at 1GHz (made available in https://github.com/tenstorrent/tt-metal/commit/13193c1247a8713b9bd77e4856fe97cf14d89202).

Can you guys re-try this?

cglagovichTT commented 6 months ago

Yes we can try this

jliangTT commented 6 months ago

Next step:

cglagovichTT commented 6 months ago

I was able to repro the hang with the new FW at 1GHz.

Next step is to repro with watcher at 1GHz using the same watcher options as last time.

cglagovichTT commented 6 months ago

There's an unfortunate side effect of reproducing the hang at 1GHz. Twice now, tt-smi reset fails after hanging. Yesterday, this led to 3/4 PCIe devices becoming undetectable, requiring a power cycle and card re-seat.

(venv) cglagovich@sjc-snva-t3002:~/tt-metal/scripts/install$ tt-smi -r 0,1,2,3
 Starting pci link reset on WH devices at pci indices: 0, 1, 2, 3
Traceback (most recent call last):
  File "/home/cglagovich/tt-metal/scripts/install/venv/bin/tt-smi", line 8, in <module>
    sys.exit(main())
  File "/home/cglagovich/tt-metal/scripts/install/venv/lib/python3.8/site-packages/tt_smi/tt_smi.py", line 733, in main
    pci_board_reset(args.reset, reinit=True)
  File "/home/cglagovich/tt-metal/scripts/install/venv/lib/python3.8/site-packages/tt_smi/tt_smi_backend.py", line 544, in pci_board_reset
    reset_devices = WHChipReset().full_lds_reset(pci_interfaces=reset_wh_pci_idx)
  File "/home/cglagovich/tt-metal/scripts/install/venv/lib/python3.8/site-packages/tt_tools_common/reset_common/wh_reset.py", line 83, in full_lds_reset
    refclk_list.append(read_refclk_counter(chip))
  File "/home/cglagovich/tt-metal/scripts/install/venv/lib/python3.8/site-packages/tt_tools_common/utils_common/tools_utils.py", line 189, in read_refclk_counter
    high1 = chip.axi_read32(high_addr)
Exception: Read 0xffffffff from ARC scratch[6]: you should reset the board.
cglagovichTT commented 6 months ago

I recovered the machine by uninstalling the kmod, reinstalling the kmod, and rebooting the host.

This seems very reproducible with the 1GHz clock; yikes. I'll do it once more with the watcher enabled.

cglagovichTT commented 6 months ago

1GHz prefill hang repro looks similar to last repro.

example kernels running on each device:

device 1: tt_eager/tt_dnn/kernels/dataflow/writer_unary_interleaved_start_id.cpp
device 6: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_receive_writer.cpp
device 2: tt_eager/tt_dnn/kernels/dataflow/writer_unary_interleaved_start_id.cpp
device 5: tt_eager/tt_dnn/kernels/dataflow/writer_unary_interleaved_start_id.cpp
device 3: tt_eager/tt_dnn/kernels/dataflow/writer_unary_interleaved_start_id.cpp
device 4: tt_eager/tt_dnn/kernels/dataflow/writer_unary_interleaved_start_id.cpp
device 0: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_receive_writer.cpp
device 7: tt_eager/tt_dnn/op_library/all_gather/kernels/dataflow/worker_interleaved_ring_gather_receive_writer.cpp

Each core that is doing some interleaved writing is in GW (go wait) state. I assume this means those cores have received the next program and are waiting for some signal before progressing? Each core that is doing allgather is waiting on something (CWFW, NSW, for example). That might mean these cores are hung and are keeping other chips from progressing?

watcher_prefill_1ghz.log

SeanNijjar commented 6 months ago

Hey @cglagovich, this is in line with what I was seeing. If you look at some of the devices in the log (e.g. device 4), you'll see that the allgather kernels are never dispatched. This means the allgather can't possibly complete.

I verified on my end a couple days ago that the situation is that allgather is never being programmed/executed on that (and another) device.

To be clear, it is not the case that the allgather is done on device 4 and failing to finish on the other devices.

cglagovichTT commented 6 months ago

This hang repro succeeded in bringing down another t3000 machine. tt-smi reset fails, and we need a power cycle to recover.

jliangTT commented 6 months ago

Whoa. Can you document the procedure to reproduce this? We need to flag this to syseng right away.

cc @ttmtrajkovic @davorchap @TT-billteng @tt-rkim @tapspatel (for awareness that 1GHz + a model hang can completely crash the system)

cglagovichTT commented 6 months ago

Working with Daniel Rosen on this problem in Slack: https://tenstorrent.slack.com/archives/C04CL97VAEA/p1714760345522079

It turns out that a power cycle does not always resolve the problem.

Repro procedure:

  1. upgrade to 1GHz FW
  2. run the hanging test in this issue

So far, this has been reproed on sjc-snva-t3002 and sjc-lab-t3004

tt-rkim commented 6 months ago

Shoot. Is this not in CI, nor a private test that we gave to @tapspatel to run before verifying the 1GHz upgrade?

cglagovichTT commented 6 months ago

This is a hanging test which hung at 800 MHz and which we just tried to repro at 1GHz. It is not in CI and was not run privately by Taps, since it's known to hang.

ttmtrajkovic commented 6 months ago

@cglagovich, a host cold boot (power cycle) may be required if the NoC hung and traffic to PCIe from the host backs up. That isn't necessarily a syseng problem but a bug in software. Have you tried commenting out the busy message and running at a very low 500MHz clock? @taps may know how to do it, and if it still hangs then it definitely indicates a SW problem.

cglagovichTT commented 6 months ago

After a power cycle, both machines fail to run tt-smi:

(.tools_env) cglagovich@sjc-lab-t3004-special-cglagovich-for-reservation-2140484:~/tt-metal/scripts/install$ tt-smi
 Detected Chips: 6
 Error in detecting devices!
cglagovichTT commented 6 months ago

After a power cycle, running tt-topology -l mesh is necessary before tt-smi will work. These machines are back online.

tapspatel commented 6 months ago

@cglagovich, let me know if you want to try 500MHz; I can set that up.

jliangTT commented 6 months ago

The hang at 1GHz seems very different from the original. Should we spin up another issue to debug this? Otherwise we might lose track of which issue we are debugging.

cglagovichTT commented 6 months ago

Daniel Rosen seems to believe that the failure to reset the cards is a tt-smi issue. Agreed it's strange that this failure to reset occurs at 1GHz but not 800MHz. I'd say let's keep the discussion alive in this thread while we repro with FD2.0.

cglagovichTT commented 6 months ago

Ran this hanging test on main with FD2.0 @ 1GHz and I was unable to repro the hang over the course of 2 hours. To compare, this test used to hang within the first 15 minutes of running. Possibly resolved with FD2, but I'll have to run again overnight for more confidence.

pgkeller commented 6 months ago

great, fingers crossed. would love for FD2 to resolve a P0 (or two or...)

davorchap commented 6 months ago

@cglagovich @uaydonat @ttmtrajkovic, there's a decent chance this is also di/dt.

Data:

cglagovichTT commented 4 months ago

I'm unable to repro this hang on main. Closing, but feel free to reopen if you believe there are actions to take.