tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.
Apache License 2.0

Hang of stable diffusion demo #6713

Closed AleksKnezevic closed 3 months ago

AleksKnezevic commented 7 months ago

Stable diffusion demo hangs without watcher but runs fine with it.

To repro, check out aknezevic/sd_demo_hang and run pytest -svv models/experimental/functional_stable_diffusion/demo/demo.py::test_demo_diffusiondb.

If watcher is enabled, the test runs to completion: TT_METAL_WATCHER=1 pytest -svv models/experimental/functional_stable_diffusion/demo/demo.py::test_demo_diffusiondb.
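
For anyone comparing the two configurations side by side, here is a minimal sketch of a wrapper that runs the same test with and without TT_METAL_WATCHER=1 and reports a hang as a timeout. The helper names (run_with_timeout, compare_watcher) and the 30-minute limit are assumptions for illustration, not part of tt-metal; pick a limit comfortably longer than a passing run on your machine.

```python
import os
import subprocess
import sys

# Test ID taken from the repro command above.
TEST_ID = (
    "models/experimental/functional_stable_diffusion/demo/demo.py"
    "::test_demo_diffusiondb"
)


def run_with_timeout(cmd, extra_env=None, timeout_s=1800):
    """Run cmd and return 'pass', 'fail', or 'hang' (timeout expired).

    extra_env maps variable names to values; a value of None unsets the
    variable so an inherited setting cannot leak into the run.
    timeout_s is a hypothetical limit chosen for this sketch.
    """
    env = dict(os.environ)
    for key, value in (extra_env or {}).items():
        if value is None:
            env.pop(key, None)
        else:
            env[key] = value
    try:
        result = subprocess.run(cmd, env=env, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising.
        return "hang"
    return "pass" if result.returncode == 0 else "fail"


def compare_watcher(timeout_s=1800):
    """Run the demo test once without and once with the watcher."""
    cmd = [sys.executable, "-m", "pytest", "-svv", TEST_ID]
    return {
        "watcher_off": run_with_timeout(cmd, {"TT_METAL_WATCHER": None}, timeout_s),
        "watcher_on": run_with_timeout(cmd, {"TT_METAL_WATCHER": "1"}, timeout_s),
    }


if __name__ == "__main__":
    print(compare_watcher())
```

Calling compare_watcher() several times gives a rough idea of how consistently each configuration hangs on a given machine.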

@tt-dma, can you please take a look?

@boris-drazic and @tt-nshanker FYI.

tt-dma commented 7 months ago

I was able to repro the hang with watcher enabled on tt-metal-ci-vm-32, watcher log as follows:

47117 Dump #574 at 580.439s
47118 Device 0, Core (x=1,y=1):    NSW,NSW,R,R,R  rmsg:D0G|BNT smsg:GGGG k_ids:2945|2944|2947
47119 Device 0, Core (x=2,y=1):    NSW,NSW,R,R,R  rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47120 Device 0, Core (x=3,y=1):    NSW,NSW,R,R,R  rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47121 Device 0, Core (x=4,y=1):    NSW,NSW,R,R,R  rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47122 Device 0, Core (x=6,y=1):    NSW,NSW,R,R,R  rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47123 Device 0, Core (x=7,y=1):    NSW,NSW,R,R,R  rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47124 Device 0, Core (x=8,y=1):    NSW,NSW,R,R,R  rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47125 Device 0, Core (x=9,y=1):    CRBW,NSW,R,R,R  rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47126 Device 0, Core (x=1,y=2):    NSW,NSW,R,R,R  rmsg:D0G|BNT smsg:GGGG k_ids:2945|2944|2947
47127 Device 0, Core (x=2,y=2):    NSW,NSW,R,R,R  rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47128 Device 0, Core (x=3,y=2):    NSW,NSW,R,R,R  rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47129 Device 0, Core (x=4,y=2):    NSW,NSW,R,R,R  rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47130 Device 0, Core (x=6,y=2):    NSW,NSW,R,R,R  rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47131 Device 0, Core (x=7,y=2):    NSW,NSW,R,R,R  rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47132 Device 0, Core (x=8,y=2):    NSW,NSW,R,R,R  rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47133 Device 0, Core (x=9,y=2):    CRBW,NSW,R,R,R  rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47134 Device 0, Core (x=1,y=3):    NSW,NSW,R,R,R  rmsg:D0G|BNT smsg:GGGG k_ids:2945|2944|2947
47135 Device 0, Core (x=2,y=3):    NSW,NSW,R,R,R  rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47136 Device 0, Core (x=3,y=3):    NSW,NSW,R,R,R  rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47137 Device 0, Core (x=4,y=3):    NSW,NSW,R,R,R  rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47138 Device 0, Core (x=6,y=3):    NSW,NSW,R,R,R  rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47139 Device 0, Core (x=7,y=3):    NSW,NSW,R,R,R  rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47140 Device 0, Core (x=8,y=3):    NSW,NSW,R,R,R  rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47141 Device 0, Core (x=9,y=3):    CRBW,NSW,R,R,R  rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47142 Device 0, Core (x=1,y=4):    NSW,NSW,R,R,R  rmsg:D0G|BNT smsg:GGGG k_ids:2945|2944|2947
47143 Device 0, Core (x=2,y=4):    NSW,NSW,R,R,R  rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47144 Device 0, Core (x=3,y=4):    NSW,NSW,R,R,R  rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47145 Device 0, Core (x=4,y=4):    NSW,NSW,R,R,R  rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47146 Device 0, Core (x=6,y=4):    NSW,NSW,R,R,R  rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47147 Device 0, Core (x=7,y=4):    NSW,NSW,R,R,R  rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47148 Device 0, Core (x=8,y=4):    NSW,NSW,R,R,R  rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47149 Device 0, Core (x=9,y=4):    CRBW,NSW,R,R,R  rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47150 Device 0, Core (x=1,y=5):    NSW,CRBW,R,R,R  rmsg:D0G|BNT smsg:GGGG k_ids:2945|2944|2947
47151 Device 0, Core (x=2,y=5):    NSW,CRBW,R,R,R  rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47152 Device 0, Core (x=3,y=5):    NSW,CRBW,R,R,R  rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47153 Device 0, Core (x=4,y=5):    NSW,CRBW,R,R,R  rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47154 Device 0, Core (x=6,y=5):    NSW,CRBW,R,R,R  rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47155 Device 0, Core (x=7,y=5):    NSW,CRBW,R,R,R  rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47156 Device 0, Core (x=8,y=5):    NSW,CRBW,R,R,R  rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47157 Device 0, Core (x=9,y=5):    CRBW,CRBW,R,R,R  rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47158 Device 0, Core (x=1,y=7):    NSW,NSW,R,R,R  rmsg:D0G|BNT smsg:GGGG k_ids:2945|2944|2947
47159 Device 0, Core (x=2,y=7):    NSW,NSW,R,R,R  rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47160 Device 0, Core (x=3,y=7):    NSW,NSW,R,R,R  rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47161 Device 0, Core (x=4,y=7):    NSW,NSW,R,R,R  rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47162 Device 0, Core (x=6,y=7):    NSW,NSW,R,R,R  rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47163 Device 0, Core (x=7,y=7):    NSW,NSW,R,R,R  rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47164 Device 0, Core (x=8,y=7):    NSW,NSW,R,R,R  rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47165 Device 0, Core (x=9,y=7):    CRBW,NSW,R,R,R  rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47166 Device 0, Core (x=1,y=8):    NSW,NSW,R,R,R  rmsg:D0G|BNT smsg:GGGG k_ids:2945|2944|2947
47167 Device 0, Core (x=2,y=8):    NSW,NSW,R,R,R  rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47168 Device 0, Core (x=3,y=8):    NSW,NSW,R,R,R  rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47169 Device 0, Core (x=4,y=8):    NSW,NSW,R,R,R  rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47170 Device 0, Core (x=6,y=8):    NSW,NSW,R,R,R  rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47171 Device 0, Core (x=7,y=8):    NSW,NSW,R,R,R  rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47172 Device 0, Core (x=8,y=8):    NSW,NSW,R,R,R  rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47173 Device 0, Core (x=9,y=8):    CRBW,NSW,R,R,R  rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47174 Device 0, Core (x=1,y=9):    NSW,NSW,R,R,R  rmsg:D0G|BNT smsg:GGGG k_ids:2945|2944|2947
47175 Device 0, Core (x=2,y=9):    NSW,NSW,R,R,R  rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47176 Device 0, Core (x=3,y=9):    NSW,NSW,R,R,R  rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47177 Device 0, Core (x=4,y=9):    NSW,NSW,R,R,R  rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47178 Device 0, Core (x=6,y=9):    NSW,NSW,R,R,R  rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47179 Device 0, Core (x=7,y=9):    NSW,NSW,R,R,R  rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47180 Device 0, Core (x=8,y=9):    NSW,NSW,R,R,R  rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47181 Device 0, Core (x=9,y=9):    CRBW,NSW,R,R,R  rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47182 Device 0, Core (x=1,y=10):   QW,W,W,W,W  rmsg:H0G|Bnt smsg:DDDD k_ids:2|0|0
47183 Device 0, Core (x=2,y=10):   NRBD,W,W,W,W  rmsg:H0G|Bnt smsg:DDDD k_ids:1|0|0
47184 Device 0, Core (x=3,y=10):   GW,W,W,W,W  rmsg:H0D|bnt smsg:DDDD k_ids:0|0|0
47185 Device 0, Core (x=4,y=10):   GW,W,W,W,W  rmsg:H0D|bnt smsg:DDDD k_ids:0|0|0
47186 Device 0, Core (x=6,y=10):   GW,W,W,W,W  rmsg:H0D|bnt smsg:DDDD k_ids:0|0|0
47187 Device 0, Core (x=7,y=10):   GW,W,W,W,W  rmsg:H0D|bnt smsg:DDDD k_ids:0|0|0
47188 Device 0, Core (x=8,y=10):   GW,W,W,W,W  rmsg:H0D|bnt smsg:DDDD k_ids:0|0|0
47189 Device 0, Core (x=9,y=10):   GW,W,W,W,W  rmsg:H0D|bnt smsg:DDDD k_ids:0|0|0
47190 k_id[0]: blank
47191 k_id[1]: tt_metal/impl/dispatch/kernels/cq_prefetcher.cpp
47192 k_id[2]: tt_metal/impl/dispatch/kernels/cq_dispatcher.cpp
47193 k_id[2944]: tt_eager/tt_dnn/op_library/bmm/kernels/dataflow/reader_bmm_tile_layout_in0_sender_receiver_padding_block_sharded.cpp
47194 k_id[2945]: tt_eager/tt_dnn/op_library/bmm/kernels/dataflow/reader_bmm_tile_layout_in1_sender_writer_padding.cpp
47195 k_id[2946]: tt_eager/tt_dnn/op_library/bmm/kernels/dataflow/reader_bmm_tile_layout_in1_receiver_writer_padding.cpp
47196 k_id[2947]: tt_eager/tt_dnn/op_library/bmm/kernels/compute/bmm_large_block_zm_fused_bias_activation.cpp
47197 Dump #574 completed at 580.444s
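
A dump this long is easier to read when cores are grouped by their status string and kernel IDs. The following is a rough sketch of a parser for the line format shown above; the regular expressions are inferred only from this one dump and may not match other watcher versions, the function name group_cores is hypothetical, and the status codes (NSW, CRBW, and so on) are deliberately left uninterpreted.

```python
import re
from collections import defaultdict

# Patterns inferred from the dump above; an assumption, not a documented format.
CORE_RE = re.compile(
    r"Device (?P<dev>\d+), Core \(x=(?P<x>\d+),y=(?P<y>\d+)\):\s+"
    r"(?P<status>\S+)\s+rmsg:(?P<rmsg>\S+)\s+smsg:(?P<smsg>\S+)\s+"
    r"k_ids:(?P<k_ids>\S+)"
)
KERNEL_RE = re.compile(r"k_id\[(\d+)\]:\s+(\S+)")


def group_cores(lines):
    """Return ({(status, k_ids): [(x, y), ...]}, {kernel_id: path}).

    search() is used instead of match() so the leading log line numbers
    are ignored.
    """
    groups = defaultdict(list)
    kernels = {}
    for line in lines:
        core = CORE_RE.search(line)
        if core:
            key = (core["status"], core["k_ids"])
            groups[key].append((int(core["x"]), int(core["y"])))
            continue
        kernel = KERNEL_RE.search(line)
        if kernel:
            kernels[int(kernel.group(1))] = kernel.group(2)
    return dict(groups), kernels


# A few lines copied from the dump above.
SAMPLE = """\
47118 Device 0, Core (x=1,y=1):    NSW,NSW,R,R,R  rmsg:D0G|BNT smsg:GGGG k_ids:2945|2944|2947
47125 Device 0, Core (x=9,y=1):    CRBW,NSW,R,R,R  rmsg:D0G|BNT smsg:GGGG k_ids:2946|2944|2947
47150 Device 0, Core (x=1,y=5):    NSW,CRBW,R,R,R  rmsg:D0G|BNT smsg:GGGG k_ids:2945|2944|2947
47194 k_id[2945]: tt_eager/tt_dnn/op_library/bmm/kernels/dataflow/reader_bmm_tile_layout_in1_sender_writer_padding.cpp
"""

if __name__ == "__main__":
    groups, kernels = group_cores(SAMPLE.splitlines())
    for (status, k_ids), cores in sorted(groups.items()):
        print(status, k_ids, cores)
```

Run over the full dump, the grouping makes the outliers (for example the x=9 column and the y=5 row) stand out from the majority state.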

AleksKnezevic commented 7 months ago

Thanks for the help David!

This is a matmul hang. @TT-BrianLiu, can you please take a look? I still can't repro on my end with watcher, but @tt-dma can give you access to the machine he's using.

tt-dma commented 7 months ago

Check with @ttmchiou for access to the machine, not sure if I can give access

ttmchiou commented 7 months ago

@TT-BrianLiu send me your SSH key so I can get you access to the machine David used

tt-dma commented 7 months ago

Given that watcher "fixes" the issue on one machine and doesn't on another, I suspect some sort of race/timing-related issue that is perturbed by the additional code/NoC transactions inserted by watcher.

TT-BrianLiu commented 7 months ago

Does it always error out on that machine with watcher enabled? Also, on other machines, does it always hang without watcher and always pass with watcher?

AleksKnezevic commented 7 months ago

On other machines I've always seen it hang without watcher and pass with it.

tt-dma commented 7 months ago

Only ran once on that machine & saw the hang with watcher enabled. Let me know if you are not able to reproduce

TT-BrianLiu commented 7 months ago

Please file issue under global project boards

mtatsumiTT commented 7 months ago

It seems the hang is related to TtLMSDiscreteScheduler. The demo script runs without a hang after I replaced it with LMSDiscreteScheduler from the diffusers package (commit).

AleksKnezevic commented 7 months ago

@boris-drazic, have we brought up and tested the demo with the on-device scheduler? Is there a unit test we can try?

Sudharsan-V commented 7 months ago

Hi @AleksKnezevic, as @mtatsumiTT pointed out, the hang is related to TtLMSDiscreteScheduler. To be more specific, the test hangs when the flow enters the following if condition in the TtLMSDiscreteScheduler's step function.

    if derivative_tensor.shape[0] > 1:
        derivative_tensor = ttnn.permute(derivative_tensor, (3, 1, 2, 0))
        derivative_tensor = ttnn.sum(derivative_tensor, dim=-1)
        derivative_tensor = ttnn.permute(derivative_tensor, (3, 1, 2, 0))

The flow enters this condition and fails to perform the permute. Even after converting the ttnn.permute to a PyTorch permute, the hang still occurs.
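
For reference, the three-op sequence above amounts to summing over the first axis while keeping it as a size-1 axis. The sketch below checks that on the host with a tiny dictionary-backed tensor in plain Python, so it needs neither ttnn nor a device. It assumes ttnn.sum keeps the reduced axis (inferred from the second permute taking four axes) and that ttnn.permute follows the torch.permute convention; all helper names are hypothetical.

```python
from itertools import product


def make_tensor(shape, fn):
    """Dense tensor stored as {index_tuple: value} plus its shape."""
    shape = tuple(shape)
    indices = product(*(range(n) for n in shape))
    return {"shape": shape, "data": {idx: fn(*idx) for idx in indices}}


def permute(t, dims):
    """Output axis i takes input axis dims[i] (torch.permute convention)."""
    shape = tuple(t["shape"][d] for d in dims)
    data = {tuple(idx[d] for d in dims): v for idx, v in t["data"].items()}
    return {"shape": shape, "data": data}


def sum_keepdim(t, axis):
    """Sum over one axis, keeping it as a size-1 axis (assumed ttnn.sum behaviour)."""
    axis %= len(t["shape"])
    shape = t["shape"][:axis] + (1,) + t["shape"][axis + 1:]
    data = {}
    for idx, v in t["data"].items():
        key = idx[:axis] + (0,) + idx[axis + 1:]
        data[key] = data.get(key, 0) + v
    return {"shape": shape, "data": data}


def reduce_like_scheduler(t):
    """Host-side mirror of the permute / sum / permute branch quoted above."""
    if t["shape"][0] > 1:
        t = permute(t, (3, 1, 2, 0))
        t = sum_keepdim(t, -1)
        t = permute(t, (3, 1, 2, 0))
    return t


# Integer-valued 4D tensor so the comparison is exact.
x = make_tensor((3, 2, 2, 4), lambda n, c, h, w: 1000 * n + 100 * c + 10 * h + w)
out = reduce_like_scheduler(x)
assert out == sum_keepdim(x, 0)
print(out["shape"])
```

If the assumptions hold, a host-side reference like this can be compared against the device result of each op in turn to narrow down which one misbehaves.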

AleksKnezevic commented 7 months ago

@Sudharsan-V my guess is that we're seeing the hang there because we're trying to copy the tensor back from device to do a fallback permute and the hang occurs in a previous op. Do we have a unit test for the scheduler? Does it hang? Have you tried it with watcher or slow dispatch?

Sudharsan-V commented 7 months ago

@AleksKnezevic, since it is a helper function, no unit test is available for the scheduler. No, I didn't try with watcher or slow dispatch.

jliangTT commented 7 months ago

Latest data:

jliangTT commented 7 months ago

@AleksKnezevic, any update on this one? (The last update was to produce a smaller repro for debugging; did it get handed off to someone to investigate and root-cause?)

AleksKnezevic commented 7 months ago

Pardon the delayed response, we thought the issue went away, but it is still present on some machines. @mtatsumiTT, can you please rebase and try and repro on your end again?

jliangTT commented 6 months ago

@mtatsumiTT , do we have any update here?

mtatsumiTT commented 6 months ago

I tried to repro the hang on different B0 machines (bgd-lab-10/bgd-lab-15/bgd-lab-11/yyz-lab-71), but the demo script passes on all devices. Also, the model-perf that runs the demo script passes (log:cnn-javelin) on the branch after rebasing, so I presume the hang is already addressed.

jliangTT commented 6 months ago

Let's downgrade until we see more sightings.

pgkeller commented 3 months ago

is this issue stale? can we close?

AleksKnezevic commented 3 months ago

It's still hanging according to @tt-rkim but I don't think it's a runtime issue, most likely di/dt related. Closing.