Closed AleksKnezevic closed 3 months ago
I was able to repro the hang with watcher enabled on tt-metal-ci-vm-32; the watcher log follows:
47117 Dump #574 at 580.439s
47118 Device 0, Core (x=1,y=1): NSW,NSW,R,R,R rmsg:D0G|BNT smsg:GGGG k_ids:2945|2944|2947
47119 Device 0, Core (x=2,y=1): NSW,NSW,R,R,R rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47120 Device 0, Core (x=3,y=1): NSW,NSW,R,R,R rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47121 Device 0, Core (x=4,y=1): NSW,NSW,R,R,R rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47122 Device 0, Core (x=6,y=1): NSW,NSW,R,R,R rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47123 Device 0, Core (x=7,y=1): NSW,NSW,R,R,R rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47124 Device 0, Core (x=8,y=1): NSW,NSW,R,R,R rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47125 Device 0, Core (x=9,y=1): CRBW,NSW,R,R,R rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47126 Device 0, Core (x=1,y=2): NSW,NSW,R,R,R rmsg:D0G|BNT smsg:GGGG k_ids:2945|2944|2947
47127 Device 0, Core (x=2,y=2): NSW,NSW,R,R,R rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47128 Device 0, Core (x=3,y=2): NSW,NSW,R,R,R rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47129 Device 0, Core (x=4,y=2): NSW,NSW,R,R,R rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47130 Device 0, Core (x=6,y=2): NSW,NSW,R,R,R rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47131 Device 0, Core (x=7,y=2): NSW,NSW,R,R,R rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47132 Device 0, Core (x=8,y=2): NSW,NSW,R,R,R rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47133 Device 0, Core (x=9,y=2): CRBW,NSW,R,R,R rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47134 Device 0, Core (x=1,y=3): NSW,NSW,R,R,R rmsg:D0G|BNT smsg:GGGG k_ids:2945|2944|2947
47135 Device 0, Core (x=2,y=3): NSW,NSW,R,R,R rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47136 Device 0, Core (x=3,y=3): NSW,NSW,R,R,R rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47137 Device 0, Core (x=4,y=3): NSW,NSW,R,R,R rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47138 Device 0, Core (x=6,y=3): NSW,NSW,R,R,R rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47139 Device 0, Core (x=7,y=3): NSW,NSW,R,R,R rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47140 Device 0, Core (x=8,y=3): NSW,NSW,R,R,R rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47141 Device 0, Core (x=9,y=3): CRBW,NSW,R,R,R rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47142 Device 0, Core (x=1,y=4): NSW,NSW,R,R,R rmsg:D0G|BNT smsg:GGGG k_ids:2945|2944|2947
47143 Device 0, Core (x=2,y=4): NSW,NSW,R,R,R rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47144 Device 0, Core (x=3,y=4): NSW,NSW,R,R,R rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47145 Device 0, Core (x=4,y=4): NSW,NSW,R,R,R rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47146 Device 0, Core (x=6,y=4): NSW,NSW,R,R,R rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47147 Device 0, Core (x=7,y=4): NSW,NSW,R,R,R rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47148 Device 0, Core (x=8,y=4): NSW,NSW,R,R,R rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47149 Device 0, Core (x=9,y=4): CRBW,NSW,R,R,R rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47150 Device 0, Core (x=1,y=5): NSW,CRBW,R,R,R rmsg:D0G|BNT smsg:GGGG k_ids:2945|2944|2947
47151 Device 0, Core (x=2,y=5): NSW,CRBW,R,R,R rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47152 Device 0, Core (x=3,y=5): NSW,CRBW,R,R,R rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47153 Device 0, Core (x=4,y=5): NSW,CRBW,R,R,R rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47154 Device 0, Core (x=6,y=5): NSW,CRBW,R,R,R rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47155 Device 0, Core (x=7,y=5): NSW,CRBW,R,R,R rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47156 Device 0, Core (x=8,y=5): NSW,CRBW,R,R,R rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47157 Device 0, Core (x=9,y=5): CRBW,CRBW,R,R,R rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47158 Device 0, Core (x=1,y=7): NSW,NSW,R,R,R rmsg:D0G|BNT smsg:GGGG k_ids:2945|2944|2947
47159 Device 0, Core (x=2,y=7): NSW,NSW,R,R,R rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47160 Device 0, Core (x=3,y=7): NSW,NSW,R,R,R rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47161 Device 0, Core (x=4,y=7): NSW,NSW,R,R,R rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47162 Device 0, Core (x=6,y=7): NSW,NSW,R,R,R rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47163 Device 0, Core (x=7,y=7): NSW,NSW,R,R,R rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47164 Device 0, Core (x=8,y=7): NSW,NSW,R,R,R rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47165 Device 0, Core (x=9,y=7): CRBW,NSW,R,R,R rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47166 Device 0, Core (x=1,y=8): NSW,NSW,R,R,R rmsg:D0G|BNT smsg:GGGG k_ids:2945|2944|2947
47167 Device 0, Core (x=2,y=8): NSW,NSW,R,R,R rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47168 Device 0, Core (x=3,y=8): NSW,NSW,R,R,R rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47169 Device 0, Core (x=4,y=8): NSW,NSW,R,R,R rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47170 Device 0, Core (x=6,y=8): NSW,NSW,R,R,R rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47171 Device 0, Core (x=7,y=8): NSW,NSW,R,R,R rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47172 Device 0, Core (x=8,y=8): NSW,NSW,R,R,R rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47173 Device 0, Core (x=9,y=8): CRBW,NSW,R,R,R rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47174 Device 0, Core (x=1,y=9): NSW,NSW,R,R,R rmsg:D0G|BNT smsg:GGGG k_ids:2945|2944|2947
47175 Device 0, Core (x=2,y=9): NSW,NSW,R,R,R rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47176 Device 0, Core (x=3,y=9): NSW,NSW,R,R,R rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47177 Device 0, Core (x=4,y=9): NSW,NSW,R,R,R rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47178 Device 0, Core (x=6,y=9): NSW,NSW,R,R,R rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47179 Device 0, Core (x=7,y=9): NSW,NSW,R,R,R rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47180 Device 0, Core (x=8,y=9): NSW,NSW,R,R,R rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47181 Device 0, Core (x=9,y=9): CRBW,NSW,R,R,R rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47182 Device 0, Core (x=1,y=10): QW,W,W,W,W rmsg:H0G|Bnt smsg:DDDD k_ids:2|0|0
47183 Device 0, Core (x=2,y=10): NRBD,W,W,W,W rmsg:H0G|Bnt smsg:DDDD k_ids:1|0|0
47184 Device 0, Core (x=3,y=10): GW,W,W,W,W rmsg:H0D|bnt smsg:DDDD k_ids:0|0|0
47185 Device 0, Core (x=4,y=10): GW,W,W,W,W rmsg:H0D|bnt smsg:DDDD k_ids:0|0|0
47186 Device 0, Core (x=6,y=10): GW,W,W,W,W rmsg:H0D|bnt smsg:DDDD k_ids:0|0|0
47187 Device 0, Core (x=7,y=10): GW,W,W,W,W rmsg:H0D|bnt smsg:DDDD k_ids:0|0|0
47188 Device 0, Core (x=8,y=10): GW,W,W,W,W rmsg:H0D|bnt smsg:DDDD k_ids:0|0|0
47189 Device 0, Core (x=9,y=10): GW,W,W,W,W rmsg:H0D|bnt smsg:DDDD k_ids:0|0|0
47190 k_id[0]: blank
47191 k_id[1]: tt_metal/impl/dispatch/kernels/cq_prefetcher.cpp
47192 k_id[2]: tt_metal/impl/dispatch/kernels/cq_dispatcher.cpp
47193 k_id[2944]: tt_eager/tt_dnn/op_library/bmm/kernels/dataflow/reader_bmm_tile_layout_in0_sender_receiver_padding_block_sharded.cpp
47194 k_id[2945]: tt_eager/tt_dnn/op_library/bmm/kernels/dataflow/reader_bmm_tile_layout_in1_sender_writer_padding.cpp
47195 k_id[2946]: tt_eager/tt_dnn/op_library/bmm/kernels/dataflow/reader_bmm_tile_layout_in1_receiver_writer_padding.cpp
47196 k_id[2947]: tt_eager/tt_dnn/op_library/bmm/kernels/compute/bmm_large_block_zm_fused_bias_activation.cpp
47197 Dump #574 completed at 580.444s
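To make a dump like the one above easier to scan, the per-core lines can be grouped by their state field. The following is only a sketch: the regular expression is inferred from the lines quoted in this thread rather than from any watcher specification, the state codes are treated as opaque strings, and the names `parse_core_lines` and `group_by_states` are hypothetical.

```python
import re

# Pattern inferred from the dump lines quoted in this thread (an assumption,
# not an official format description), e.g.
# "Device 0, Core (x=9,y=1): CRBW,NSW,R,R,R rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947"
CORE_LINE = re.compile(
    r"Device (?P<device>\d+), Core \(x=(?P<x>\d+),y=(?P<y>\d+)\):\s+"
    r"(?P<states>\S+)\s+rmsg:(?P<rmsg>\S+)\s+smsg:(?P<smsg>\S+)\s+k_ids:(?P<k_ids>\S+)"
)


def parse_core_lines(text):
    """Return one dict per core line; lines that do not match are skipped."""
    cores = []
    for line in text.splitlines():
        m = CORE_LINE.search(line)
        if m is None:
            continue
        cores.append({
            "device": int(m["device"]),
            "xy": (int(m["x"]), int(m["y"])),
            "states": tuple(m["states"].split(",")),  # opaque state codes
            "k_ids": tuple(int(k) for k in m["k_ids"].split("|")),
        })
    return cores


def group_by_states(cores):
    """Map each distinct state tuple to the list of core coordinates showing it."""
    groups = {}
    for core in cores:
        groups.setdefault(core["states"], []).append(core["xy"])
    return groups


# A few lines copied from the dump above.
SAMPLE = """\
47124 Device 0, Core (x=8,y=1): NSW,NSW,R,R,R rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47125 Device 0, Core (x=9,y=1): CRBW,NSW,R,R,R rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47150 Device 0, Core (x=1,y=5): NSW,CRBW,R,R,R rmsg:D0G|BNT smsg:GGGG k_ids:2945|2944|2947
47157 Device 0, Core (x=9,y=5): CRBW,CRBW,R,R,R rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47182 Device 0, Core (x=1,y=10): QW,W,W,W,W rmsg:H0G|Bnt smsg:DDDD k_ids:2|0|0
"""

for states, coords in group_by_states(parse_core_lines(SAMPLE)).items():
    print(",".join(states), coords)
```

Run against the full dump, this would show at a glance that the x=9 column and the y=5 row report a different first or second state than the rest of the matmul grid.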
Thanks for the help David!
This is a matmul hang. @TT-BrianLiu, can you please take a look? I still can't repro on my end with watcher, but @tt-dma can give you access to the machine he's using.
Check with @ttmchiou for access to the machine, not sure if I can give access
@TT-BrianLiu, send me your SSH key so I can get you access to the machine David used.
Given that watcher "fixes" the issue on one machine and doesn't on another, I suspect some sort of race/timing-related issue that is perturbed by the additional code/NoC transactions inserted by watcher.
Does it always error out on that machine with watcher enabled? Also, on other machines, does it always hang without watcher and always pass with watcher?
On other machines I've always seen it hang without watcher and pass with it.
Only ran once on that machine & saw the hang with watcher enabled. Let me know if you are not able to reproduce
Please file issue under global project boards
It seems the hang is related to TtLMSDiscreteScheduler. The demo script runs without a hang after I replaced it with LMSDiscreteScheduler from the diffusers package (commit).
@boris-drazic, have we brought up and tested the demo with the on-device scheduler? Is there a unit test we can try?
Hi @AleksKnezevic, as @mtatsumiTT pointed out, the hang is related to TtLMSDiscreteScheduler. To be more specific, the test hangs when the flow enters the following if condition in the TtLMSDiscreteScheduler's step function:
if derivative_tensor.shape[0] > 1:
    derivative_tensor = ttnn.permute(derivative_tensor, (3, 1, 2, 0))
    derivative_tensor = ttnn.sum(derivative_tensor, dim=-1)
    derivative_tensor = ttnn.permute(derivative_tensor, (3, 1, 2, 0))
The flow enters this condition and fails to perform the permute. Even after converting ttnn.permute to a PyTorch permute, the hang still occurs.
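For anyone building a smaller repro, it may help to spell out what the permute/sum/permute sequence computes. The order (3, 1, 2, 0) swaps the first and last axes, so the three steps appear to amount to a sum over axis 0. The sketch below checks that with plain Python dicts standing in for tensors; it calls no ttnn or PyTorch API, and it assumes the sum keeps the reduced dimension (which the second 4-axis permute seems to require). All helper names are hypothetical.

```python
from itertools import product

ORDER = (3, 1, 2, 0)  # same axis order as in the scheduler snippet


def permute(tensor, shape, order):
    """Reorder axes so that new axis i is old axis order[i]."""
    new_shape = tuple(shape[a] for a in order)
    new_tensor = {tuple(idx[a] for a in order): v for idx, v in tensor.items()}
    return new_tensor, new_shape


def sum_last_keepdim(tensor, shape):
    """Sum over the last axis, keeping it as a size-1 axis (assumed behaviour)."""
    out = {}
    for idx, v in tensor.items():
        key = idx[:-1] + (0,)
        out[key] = out.get(key, 0) + v
    return out, shape[:-1] + (1,)


def batch_sum_via_permute(tensor, shape):
    """Mirror the three steps from the snippet: permute, sum, permute."""
    t, s = permute(tensor, shape, ORDER)
    t, s = sum_last_keepdim(t, s)
    return permute(t, s, ORDER)


def batch_sum_direct(tensor, shape):
    """Sum over axis 0 directly, keeping it as a size-1 axis."""
    out = {}
    for idx, v in tensor.items():
        key = (0,) + idx[1:]
        out[key] = out.get(key, 0) + v
    return out, (1,) + shape[1:]


def make_tensor(shape):
    """Small integer-valued test tensor keyed by index tuple."""
    return {
        idx: 100 * idx[0] + 10 * idx[2] + idx[3]
        for idx in product(*(range(n) for n in shape))
    }


shape = (3, 1, 2, 4)  # illustrative shape with a leading dimension > 1
x = make_tensor(shape)
assert batch_sum_via_permute(x, shape) == batch_sum_direct(x, shape)
```

If that reading is right, a unit test for the scheduler would only need a tensor whose first dimension is greater than 1 to exercise this branch.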
@Sudharsan-V my guess is that we're seeing the hang there because we're trying to copy the tensor back from device to do a fallback permute and the hang occurs in a previous op. Do we have a unit test for the scheduler? Does it hang? Have you tried it with watcher or slow dispatch?
@AleksKnezevic, since it is a helper function, no unit test is available for the scheduler. No, I haven't tried with watcher or slow dispatch.
Latest data:
@AleksKnezevic, any update on this one? (The last update was to produce a smaller repro for debugging. Did it get handed off to someone to investigate and root-cause?)
Pardon the delayed response, we thought the issue went away, but it is still present on some machines. @mtatsumiTT, can you please rebase and try and repro on your end again?
@mtatsumiTT, do we have any update here?
I tried to repro the hang on different B0 machines (bgd-lab-10/bgd-lab-15/bgd-lab-11/yyz-lab-71), but the demo script passes on all devices. Also, the model-perf that runs the demo script passes (log:cnn-javelin) on the branch after rebasing, so I presume the hang is already addressed.
Let's downgrade until we see more sightings.
Is this issue stale? Can we close?
It's still hanging according to @tt-rkim but I don't think it's a runtime issue, most likely di/dt related. Closing.
The Stable Diffusion demo hangs without watcher but runs fine with it.
To repro, check out aknezevic/sd_demo_hang and run:
pytest -svv models/experimental/functional_stable_diffusion/demo/demo.py::test_demo_diffusiondb
If watcher is enabled, the test runs to completion:
TT_METAL_WATCHER=1 pytest -svv models/experimental/functional_stable_diffusion/demo/demo.py::test_demo_diffusiondb
@tt-dma, can you please take a look?
@boris-drazic and @tt-nshanker FYI.
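Since the behaviour differs between machines and between runs, a small harness that repeats the test with and without watcher and records the outcome could help answer the "does it always hang?" question. This is a rough sketch: the pytest command and the `TT_METAL_WATCHER=1` variable come from this issue, while the helper names, the run count, and the 30-minute timeout used to classify a run as a hang are assumptions.

```python
import os
import subprocess

# Command taken from the repro steps in this issue.
DEMO_CMD = [
    "pytest", "-svv",
    "models/experimental/functional_stable_diffusion/demo/demo.py::test_demo_diffusiondb",
]


def run_with_timeout(cmd, timeout_s, watcher):
    """Run cmd once; return 'pass', 'fail', or 'hang' (no exit within timeout_s)."""
    env = dict(os.environ)
    env.pop("TT_METAL_WATCHER", None)  # start from a known state
    if watcher:
        env["TT_METAL_WATCHER"] = "1"
    try:
        proc = subprocess.run(
            cmd, env=env, timeout=timeout_s,
            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising this exception.
        return "hang"
    return "pass" if proc.returncode == 0 else "fail"


def compare_watcher(cmd=DEMO_CMD, runs=3, timeout_s=1800):
    """Alternate runs without and with watcher; the defaults are arbitrary choices."""
    results = {"watcher_off": [], "watcher_on": []}
    for _ in range(runs):
        results["watcher_off"].append(run_with_timeout(cmd, timeout_s, watcher=False))
        results["watcher_on"].append(run_with_timeout(cmd, timeout_s, watcher=True))
    return results


# Example (run from the repo root on a machine with a device):
# print(compare_watcher(runs=5))
```

A hung device may need a reset before the next run; that step is left out here because it depends on the machine setup.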