avoraTT opened 3 weeks ago
cc @cglagovich
This hang is due to one of the ERISC data mover (EDM) cores receiving an incorrect/semi-corrupted kernel binary during an all-gather op. As a result, the EDM kernel never starts and the all-gather op hangs during trace execution.

The binary gets corrupted because `prefetch_d` writes additional commands (a multicast `dispatch_write_packed` for initializing worker semaphores for the program after the all-gather) into `dispatch_d`'s command buffer. These commands are incorrectly sent to an active page in the command buffer, one that still contains data for the EDM kernel binary, so the binary is overwritten before it is sent to the ERISC core.

This appears to be an issue with accounting for the dispatch CB write pointer on the `prefetch_d` -> `dispatch_d` path.
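The failure mode can be sketched with a toy model of a paged command buffer. This is illustrative only, not the actual dispatch firmware API; `CommandBuffer`, `NUM_PAGES`, and the credit-style accounting here are hypothetical stand-ins:

```python
# Toy model of the prefetch_d -> dispatch_d command buffer accounting bug.
# All names here are illustrative; this is not the real dispatch firmware code.

NUM_PAGES = 4

class CommandBuffer:
    def __init__(self):
        self.pages = [None] * NUM_PAGES   # payload currently stored in each page
        self.wptr = 0                     # producer (prefetch_d) write pointer
        self.rptr = 0                     # consumer (dispatch_d) read pointer

    def pages_in_flight(self):
        # Pages written by the producer but not yet consumed.
        return self.wptr - self.rptr

    def write(self, payload, check_credits=True):
        # With correct accounting, the producer stalls while all pages
        # are still in flight instead of overwriting them.
        if check_credits and self.pages_in_flight() >= NUM_PAGES:
            raise RuntimeError("producer must wait: all pages still in flight")
        self.pages[self.wptr % NUM_PAGES] = payload
        self.wptr += 1

    def read(self):
        payload = self.pages[self.rptr % NUM_PAGES]
        self.rptr += 1
        return payload

cb = CommandBuffer()
for i in range(NUM_PAGES):
    cb.write(f"edm_kernel_binary_chunk_{i}")

# Bug analogue: a semaphore-init command is written without accounting
# for the still-active page, clobbering part of the EDM kernel binary.
cb.write("dispatch_write_packed_semaphores", check_credits=False)

# The consumer now reads the semaphore command where binary chunk 0 used to be.
print(cb.read())
```

With correct write-pointer accounting, the final `write` would have stalled until `dispatch_d` drained a page; skipping that check is what lets the in-flight kernel binary get overwritten.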
fyi @ubcheema @pgkeller
Changes on `asaigal/llama_trace_hang` resolve the issue. I ran LLAMA3 80L for 500 iterations and it passes consistently with trace.
The caveat here is that we needed to revert some host side changes required for high-perf in order to bypass the hang, so I can't mainline these changes yet. Will investigate the new hang before merging to main, but this branch should unblock you from using trace at the TTNN level.
Changes to resolve the hang have been pushed to main at a5a94b4. I ran LLAMA3 80L with trace for over 1000 loops and it seems stable.
Repro Steps
Machine:
sjc-svna-t3002
Branch: `model-team/demo-trace`
Command: `pytest models/demos/t3000/llama3_70b/demo/demo.py::test_LlamaModel_demo[wormhole_b0-True-device_params0-check_enabled-greedy-tt-70b-T3000-30L-decode_only-text_completion-llama3]`
Summary
To measure the perf gains provided by trace for llama decode, the decode script was adjusted to run in a loop for 50 iterations on the same token. Perf improvements are observed for lower layer counts; however, from 30 layers onwards, trace execution results in a hang.
With watcher enabled (env variable `TT_METAL_WATCHER=120`) for 30 layers, the loop runs for 6 iterations before hanging deterministically on the same kernel. Watcher logs:

Notes on adjustments made to the existing llama demo script to measure trace perf:

- `models/demos/t3000/llama3_70b/demo/demo.py` was adjusted to run decode in a loop for 50 iterations on the same token
- The `model.forward` call in `models/demos/t3000/llama2_70b/demo/demo.py` was adjusted to take in an arbitrary token position (currently set to 126)
- After the `model.forward` call, a `break` is triggered to exit early from the `decode_forward` function found in `models/demos/t3000/llama2_70b/tt/llama_generation.py`
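The adjusted measurement loop can be pictured roughly as follows. This is a hedged sketch, not the actual demo code: `decode_forward`, token position 126, and the 50-iteration loop come from the notes above, while `run_decode_loop` and the dummy return value are hypothetical:

```python
# Sketch of the trace-perf measurement loop described in the notes above.
# Illustrative only; not the actual llama demo script.
import time

FIXED_TOKEN_POS = 126   # arbitrary token position mentioned in the notes
NUM_ITERATIONS = 50     # loop count used to measure trace perf

def decode_forward(tokens, token_pos):
    # Stand-in for the real decode_forward in
    # models/demos/t3000/llama2_70b/tt/llama_generation.py;
    # here it just returns a dummy logits placeholder.
    return {"token_pos": token_pos, "logits": [0.0] * 8}

def run_decode_loop(tokens):
    latencies = []
    for _ in range(NUM_ITERATIONS):
        start = time.perf_counter()
        # Decode the same token position every iteration, so each loop
        # replays an identical program (the shape trace capture expects).
        decode_forward(tokens, FIXED_TOKEN_POS)
        latencies.append(time.perf_counter() - start)
    # Break out after the loop instead of generating a full completion.
    return sum(latencies) / len(latencies)

avg_latency = run_decode_loop(tokens=[1, 2, 3])
print(f"avg decode latency over {NUM_ITERATIONS} iters: {avg_latency:.6f}s")
```

Running a fixed token position in a tight loop is what makes each iteration bit-identical, which is why a deterministic hang on the same kernel shows up at a consistent iteration count.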