tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.

Llama70b trace demo hangs after several decode iterations #15018

Open · skhorasganiTT opened 6 days ago

skhorasganiTT commented 6 days ago

On main (a5d9979), the llama70b demo is hanging (with trace) after several decode iterations. The last known working commit was 81033ff. The hang also occurs when running with 10 layers (instead of the full 80), but has not been observed when running with 1 layer.

Command:

WH_ARCH_YAML=wormhole_b0_80_arch_eth_dispatch.yaml pytest -svv models/demos/t3000/llama3_70b/demo/demo.py::test_LlamaModel_demo[wormhole_b0-True-device_params0-short_context-check_disabled-sampling-tt-70b-T3000-10L-decode_only-trace_mode_on-text_completion-llama3]

cc @cglagovichTT @uaydonat

skhorasganiTT commented 6 days ago

The hang starts occurring with commit d9df440. @jbaumanTT Would you be able to take a look?

skhorasganiTT commented 5 days ago

The hang also occurs non-deterministically when running the demo without tracing (by replacing trace_mode_on with trace_mode_off in the command above).

jbaumanTT commented 5 days ago

@pgkeller FYI the async ringbuffer patch caused this problem for some reason. I'm currently investigating.

jbaumanTT commented 5 days ago

I had to run with export TT_METAL_WATCHER_DISABLE_NOC_SANITIZE=1 for the hang to reproduce with the watcher enabled, and when it does I see this error:

                 Always | WARNING  | Watcher detected stack usage within 10% of max on Device 0 Core (x=1,y=1): trisc0! Kernel ttnn/cpp/ttnn/operations/transformer/sdpa_decode/device/kernels/compute/sdpa_flash_decode.cpp uses 292/320 of the stack.
                 Always | FATAL    | Watcher detected stack overflow on Device 0 Core (x=1,y=1): trisc1! Kernel ttnn/cpp/ttnn/operations/transformer/sdpa_decode/device/kernels/compute/sdpa_flash_decode.cpp uses 256/256 of the stack.
pgkeller commented 4 days ago

Stack overflow could be a sign of a problem, since it means the stack may overwrite the globals; however, if there are few globals, a stack overflow could be benign. You can tweak the stack size in dev_mem_map and see whether the stack overflow goes away or the kernel fails to compile (since a larger stack reduces the space available for globals).
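
For reference, a minimal sketch of the kind of tweak described above, assuming the per-RISC stack sizes are compile-time constants in dev_mem_map.h with names along the lines of MEM_TRISC1_STACK_SIZE (the exact file path and macro names are an assumption, inferred from the watcher output above):

    // dev_mem_map.h (sketch; path and macro names are assumed, not verified)
    // The watcher reports trisc1 at 256/256 bytes, so bump its stack and rebuild.
    // Note: a larger stack leaves less room for globals, so the kernel may instead
    // fail to compile/link, which is itself a useful signal per the comment above.
    #define MEM_TRISC0_STACK_SIZE 320   // trisc0 was at 292/320; unchanged here
    #define MEM_TRISC1_STACK_SIZE 512   // was 256 (overflowed); doubled as a test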

jbaumanTT commented 4 days ago

Yeah, I bumped up the stack space and that fixed the warning, but the hang still occurs (I filed https://github.com/tenstorrent/tt-metal/issues/15066 for the stack overflow problem). I'm currently trying to figure out if I can dump state for devices other than device 0. Since I only see device 0 hanging in all_gather, it's likely some other device is hanging and not meeting up at the right time.