Open skhorasganiTT opened 3 months ago
Note that we don't run performance models on an X1 machine, only X2.
The hang/ND outputs have never been observed on nebula x2 after 100 runs of the demo (note: the experiment is run on a single device of t3000), except when forcing an 8x8 core grid using WH_ARCH_YAML=wormhole_b0_80_arch_eth_dispatch.yaml, making the 8x8 grid size a potential culprit (unless running fast dispatch on idle ethernet cores is causing other issues).
This is interesting: 8x8 fails on both n300 with an ethernet idle core (X2) and n150 (X1), while 7x8 passes on n300.
Are we able to isolate which op this is with a repro?
The 8x8 failure on n300 occurs with the same ops listed above for n150, specifically RotaryEmbedding and the lm-head Matmul (although it is non-deterministic).
@uaydonat Strange observation/breakthrough: If this line is removed (i.e. input tokens are not updated at the end of the iteration), I no longer see the hang. Obviously we can't remove that line, but I will try to investigate why it is making a difference.
Additional finding: On X1, if the grid size for all matmuls (1d, dram interleaved) is set to 8x7, I no longer see the hang. Based on this and the other observations above, the issue seems to be a combination of grid size and timing.
Investigation update:
Commit: 925c9b04ef9e31b844eadf11b37ce3b537e4a249
Command: pytest --disable-warnings -q -s --input-method=cli --cli-input="Tell me a joke." models/demos/falcon7b/demo/demo.py
I think we should have @TT-BrianLiu start taking a look at the matmul behavior. Triaging to the op_cat: mm queue.
I am in the loop with this issue already. @skhorasganiTT Do you have next steps to try?
@cfjchu mentioned this may be related to another hang #6917 (1d mm + l1 packing seems sus)
Disabling l1 accumulation on all matmuls makes the hang/nd-outputs significantly (> 10x) less frequent
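For reference, packer L1 accumulation is controlled per-op through the compute kernel config; a sketch (not the exact change tried here) of what disabling it on a matmul might look like with ttnn's Wormhole compute kernel config, with all other field values chosen as placeholders:

```python
import ttnn

# Sketch only: disable the packer's L1 accumulation path for a matmul.
# Field values other than packer_l1_acc are placeholder choices.
compute_config = ttnn.WormholeComputeKernelConfig(
    math_fidelity=ttnn.MathFidelity.HiFi4,
    math_approx_mode=False,
    fp32_dest_acc_en=False,
    packer_l1_acc=False,  # accumulate in dest instead of packing partials to L1
)

# out = ttnn.matmul(a, b, compute_kernel_config=compute_config)
```

Since disabling it only made the hang rarer, not gone, this points at timing rather than the accumulation path itself.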
Probably a timing issue. L1 accum is supposed to be faster?
@s-jovic mentioned that this might be related to the block matmul hang from #4968
If you suspect #4968, it might be worth just looping that matmul with different clocks to confirm.
Added workaround to main to set falcon7b matmuls to 8x7 (e766e2d) (does not noticeably affect end-to-end perf). Lowering priority to P1 until related matmul nd-output/hang issues are investigated.
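The workaround amounts to capping the matmul compute grid one row short of the full 8x8. A hypothetical sketch using ttnn's 1D mcast matmul program config (the blocking parameters below are placeholders and depend on the actual tensor shapes):

```python
import ttnn

# Sketch: pin the matmul compute grid to 8x7 instead of the full 8x8 grid.
# Blocking parameters are illustrative placeholders, not Falcon7b's values.
program_config = ttnn.MatmulMultiCoreReuseMultiCast1DProgramConfig(
    compute_with_storage_grid_size=(8, 7),  # (x, y): avoid the 8x8 grid
    in0_block_w=2,
    out_subblock_h=1,
    out_subblock_w=4,
    per_core_M=1,
    per_core_N=4,
    fuse_batch=True,
    fused_activation=None,
    mcast_in0=True,
)

# out = ttnn.matmul(a, b, program_config=program_config)
```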
@skhorasganiTT we should try without the work-around using FD2.
I re-tried the demo on x1 (aus-glx-06) without the workaround and with the firmware update on c3bbc07 (after FD2). It worked 16 times and hung on the 17th, so it seems like it's still an issue, although much less frequent (before it would hang after a couple runs).
Command:
pytest --disable-warnings -q -s --input-method=json --input-path='models/demos/falcon7b/demo/input_data.json' models/demos/wormhole/falcon7b/demo_wormhole.py
Based on the data, reducing the MM grid or slowing the MM down makes the hang go away or become less frequent.
@uaydonat this smells like di/dt fyi @ttmtrajkovic @rtawfik01
@skhorasganiTT which machine did you see the hang on?
Can you try it on a different machine to assess board-dependence?
It was on aus-glx-06, I will try another one as well.
Can you try a T3000, a single N150, or a single N300?
aus-glx-06 is configured as a single n150; I can try it on n300 with WH_ARCH_YAML=wormhole_b0_80_arch_eth_dispatch.yaml as well.
Tried the same test (same commit and firmware) on sjc-nmrk-t3001 without the workaround, hung on the 9th run of the demo. With the workaround, passed 50 times.
Command:
WH_ARCH_YAML=wormhole_b0_80_arch_eth_dispatch.yaml pytest --disable-warnings -q -s --input-method=json --input-path='models/demos/falcon7b/demo/input_data.json' models/demos/wormhole/falcon7b/demo_wormhole.py
The Falcon7b demo randomly hangs during different invocations of the model forward pass (both compile and inference, and both prefill and decode, but usually decode inference). Additionally, the model usually produces non-deterministic and incorrect output before hanging. The hangs / incorrect outputs become more likely as the number of output tokens increases (i.e. more forward passes). The frequency of the hang is machine dependent, but it can occur as often as every 1-4 runs of the demo.
Additional information:
- `WH_ARCH_YAML=wormhole_b0_80_arch_eth_dispatch.yaml`, making 8x8 grid size a potential culprit (unless running fast dispatch on idle ethernet cores is causing other issues)
- `TT_METAL_SLOW_DISPATCH_MODE=1`
- `HWCommandQueue::enqueue_command`
- `TT_METAL_WATCHER=1`, making timing a potential culprit
- `TT_METAL_LOGGER_TYPES=Op TT_METAL_LOGGER_LEVEL=DEBUG`
Instructions to stress-test demo:
Commit: b5fe44ddf7631e3d59cb953c238666113a76913d
bash models/demos/falcon7b/tests/run_demo_test.sh
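Since the hang is non-deterministic and shows up only after many runs, a small harness (hypothetical, not part of the repo) can automate the stress loop: it invokes the demo command repeatedly under a wall-clock timeout, so a hang is detected and counted instead of blocking the loop forever. `stress_run`, its arguments, and the timeout value are all placeholders.

```python
import subprocess

def stress_run(cmd, runs=50, timeout_s=600):
    """Run `cmd` (an argv list) up to `runs` times and classify each run
    as pass, fail, or hang (wall-clock timeout exceeded)."""
    results = {"pass": 0, "fail": 0, "hang": 0}
    for i in range(1, runs + 1):
        try:
            proc = subprocess.run(cmd, timeout=timeout_s, capture_output=True)
            outcome = "pass" if proc.returncode == 0 else "fail"
        except subprocess.TimeoutExpired:
            outcome = "hang"
        results[outcome] += 1
        print(f"run {i}: {outcome}")
        if outcome == "hang":
            break  # stop on the first hang so device state can be inspected
    return results

if __name__ == "__main__":
    # Substitute the real demo invocation from this thread, e.g.:
    # stress_run(["pytest", "--disable-warnings", "-q", "-s", ...], runs=50)
    pass
```

Picking a timeout a few times longer than a healthy run keeps false "hang" classifications unlikely while still bounding each iteration.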