avoraTT opened 3 weeks ago
cc @cglagovich
This hang is due to one of the ERISC data mover (EDM) cores receiving an incorrect/semi-corrupted kernel binary during an all-gather op. As a result, the EDM kernel never starts and the all-gather op hangs during trace execution.

The binary gets corrupted because `prefetch_d` writes additional commands (a multicast `dispatch_write_packed` for initializing worker semaphores for the program after the all-gather) into `dispatch_d`'s command buffer. These commands are incorrectly sent to an active page in the command buffer, one that still contains data for the EDM kernel binary, so the binary is overwritten before it is sent to the ERISC core.

This appears to be an issue with accounting for the dispatch CB write pointer on the `prefetch_d` -> `dispatch_d` path.
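The failure mode can be sketched with a toy model of a paged command buffer. This is illustrative only, not the actual dispatch firmware API; `CommandBuffer`, `NUM_PAGES`, and the credit-style accounting here are hypothetical stand-ins:

```python
# Toy model of the prefetch_d -> dispatch_d command buffer accounting bug.
# All names here are illustrative; this is not the real dispatch firmware code.

NUM_PAGES = 4

class CommandBuffer:
    def __init__(self):
        self.pages = [None] * NUM_PAGES   # payload currently stored in each page
        self.wptr = 0                     # producer (prefetch_d) write pointer
        self.rptr = 0                     # consumer (dispatch_d) read pointer

    def pages_in_flight(self):
        # Pages written by the producer but not yet consumed.
        return self.wptr - self.rptr

    def write(self, payload, check_credits=True):
        # With correct accounting, the producer stalls while all pages
        # are still in flight instead of overwriting them.
        if check_credits and self.pages_in_flight() >= NUM_PAGES:
            raise RuntimeError("producer must wait: all pages still in flight")
        self.pages[self.wptr % NUM_PAGES] = payload
        self.wptr += 1

    def read(self):
        payload = self.pages[self.rptr % NUM_PAGES]
        self.rptr += 1
        return payload

cb = CommandBuffer()
for i in range(NUM_PAGES):
    cb.write(f"edm_kernel_binary_chunk_{i}")

# Bug analogue: a semaphore-init command is written without accounting
# for the still-active page, clobbering part of the EDM kernel binary.
cb.write("dispatch_write_packed_semaphores", check_credits=False)

# The consumer now reads the semaphore command where binary chunk 0 used to be.
print(cb.read())
```

With correct write-pointer accounting, the final `write` would have stalled until `dispatch_d` drained a page; skipping that check is what lets the in-flight kernel binary get overwritten.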
fyi @ubcheema @pgkeller
Changes on `asaigal/llama_trace_hang` resolve the issue. I ran LLAMA3 80L for 500 iterations and it passes consistently with trace.
The caveat here is that we needed to revert some host side changes required for high-perf in order to bypass the hang, so I can't mainline these changes yet. Will investigate the new hang before merging to main, but this branch should unblock you from using trace at the TTNN level.
Changes to resolve the hang have been pushed to main at a5a94b4. I ran LLAMA3 80L with trace for over 1000 loops and it seems stable.
Repro Steps
Machine:
sjc-svna-t3002
Branch: `model-team/demo-trace`
Command: `pytest models/demos/t3000/llama3_70b/demo/demo.py::test_LlamaModel_demo[wormhole_b0-True-device_params0-check_enabled-greedy-tt-70b-T3000-30L-decode_only-text_completion-llama3]`
Summary
To measure the perf gains provided by trace for llama decode, the decode script was adjusted to run in a loop for 50 iterations on the same token. Perf improvements are observed for lower layer counts; however, from 30 layers onwards, trace execution results in a hang.
With watcher enabled (env variable `TT_METAL_WATCHER=120`) for 30 layers, the loop runs for 6 iterations before hanging deterministically on the same kernel. Watcher logs:

Notes on adjustments made to the existing llama demo script to measure trace perf:

- `models/demos/t3000/llama3_70b/demo/demo.py` was adjusted to run decode in a loop for 50 iterations on the same token
- The `model.forward` call in `models/demos/t3000/llama2_70b/demo/demo.py` was adjusted to take in an arbitrary token position (currently set to 126)
- After the `model.forward` call, a `break` is triggered to exit early from the `decode_forward` function found in `models/demos/t3000/llama2_70b/tt/llama_generation.py`
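The adjusted measurement loop can be pictured roughly as follows. This is a hedged sketch, not the actual demo code: `decode_forward`, token position 126, and the 50-iteration loop come from the notes above, while `run_decode_loop` and the dummy return value are hypothetical:

```python
# Sketch of the trace-perf measurement loop described in the notes above.
# Illustrative only; not the actual llama demo script.
import time

FIXED_TOKEN_POS = 126   # arbitrary token position mentioned in the notes
NUM_ITERATIONS = 50     # loop count used to measure trace perf

def decode_forward(tokens, token_pos):
    # Stand-in for the real decode_forward in
    # models/demos/t3000/llama2_70b/tt/llama_generation.py;
    # here it just returns a dummy logits placeholder.
    return {"token_pos": token_pos, "logits": [0.0] * 8}

def run_decode_loop(tokens):
    latencies = []
    for _ in range(NUM_ITERATIONS):
        start = time.perf_counter()
        # Decode the same token position every iteration, so each loop
        # replays an identical program (the shape trace capture expects).
        decode_forward(tokens, FIXED_TOKEN_POS)
        latencies.append(time.perf_counter() - start)
    # Break out after the loop instead of generating a full completion.
    return sum(latencies) / len(latencies)

avg_latency = run_decode_loop(tokens=[1, 2, 3])
print(f"avg decode latency over {NUM_ITERATIONS} iters: {avg_latency:.6f}s")
```

Running a fixed token position in a tight loop is what makes each iteration bit-identical, which is why a deterministic hang on the same kernel shows up at a consistent iteration count.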