skhorasganiTT opened 6 days ago
The hang starts occurring with commit d9df440. @jbaumanTT Would you be able to take a look?
The hang also occurs non-deterministically when running the demo without tracing (by replacing `trace_mode_on` with `trace_mode_off` in the command above).
@pgkeller FYI the async ringbuffer patch caused this problem for some reason. I'm currently investigating.
I had to run with `export TT_METAL_WATCHER_DISABLE_NOC_SANITIZE=1` for the hang to reproduce with the watcher enabled, but I'm then seeing this error:
```
Always | WARNING | Watcher detected stack usage within 10% of max on Device 0 Core (x=1,y=1): trisc0! Kernel ttnn/cpp/ttnn/operations/transformer/sdpa_decode/device/kernels/compute/sdpa_flash_decode.cpp uses 292/320 of the stack.
Always | FATAL | Watcher detected stack overflow on Device 0 Core (x=1,y=1): trisc1! Kernel ttnn/cpp/ttnn/operations/transformer/sdpa_decode/device/kernels/compute/sdpa_flash_decode.cpp uses 256/256 of the stack.
```
A stack overflow could be a sign of a problem, since it means the stack may overwrite the globals; however, if there are few globals, the overflow could be benign. You can tweak the stack size in dev_mem_map and see whether the stack overflow goes away or the kernel fails to compile (since a larger stack reduces the space available for globals).
Yeah, I bumped up the stack space and that fixed the warning, but the hang still occurs (I filed https://github.com/tenstorrent/tt-metal/issues/15066 for the stack overflow problem). I'm currently trying to figure out whether I can dump state for devices other than device 0. Since I only see device 0 hanging in all_gather, it's likely that some other device is hanging and never reaching the collective.
On main (a5d9979), the llama70b demo is hanging (with trace) after several decode iterations. The last known working commit was 81033ff. The hang also occurs when running with 10 layers (instead of the full 80), but has not been observed when running with 1 layer.
Command:
cc @cglagovichTT @uaydonat