tenstorrent / tt-metal

:metal: TT-NN operator library and TT-Metalium low-level kernel programming model.
Apache License 2.0

Falcon7b (tt-lib) non-deterministic demo hang on nebula x1 #6795

Open skhorasganiTT opened 3 months ago

skhorasganiTT commented 3 months ago

The Falcon7b demo randomly hangs during different invocations of the model forward pass (both compile and inference, and both prefill and decode, but usually decode inference). Additionally, the model usually produces non-deterministic and incorrect output before hanging. The hangs / incorrect outputs become more likely as the number of output tokens increases (i.e. more forward passes). The frequency of the hang is machine dependent, but it can occur as often as every 1-4 runs of the demo.

Additional information:

Instructions to stress-test demo:

Commit: b5fe44ddf7631e3d59cb953c238666113a76913d
Command: bash models/demos/falcon7b/tests/run_demo_test.sh
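For repeated stress runs, a wrapper loop like the following can flag the first hang automatically (this script is not part of the repo — it is a sketch that assumes `timeout` from GNU coreutils and uses the repro command quoted above; the 600-second limit is an arbitrary choice):

```shell
# Hypothetical stress-test helper: run a command repeatedly and stop at the
# first run that fails or exceeds the time limit (treated as a hang).
stress_demo() {
    local cmd=$1 limit=$2 runs=$3
    for i in $(seq 1 "$runs"); do
        # `timeout` exits non-zero (124) if the command is still running
        # after $limit seconds, which is how we detect a hang.
        if ! timeout "$limit" bash -c "$cmd"; then
            echo "run $i failed or hung"
            return 1
        fi
    done
    echo "all $runs runs passed"
}

# Example, using the repro command from this issue:
# stress_demo "bash models/demos/falcon7b/tests/run_demo_test.sh" 600 100
```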

tt-rkim commented 3 months ago

Note that we don't run performance models on an X1 machine, only X2.

jliangTT commented 3 months ago

The hang/ND outputs have never been observed on nebula x2 after 100 runs of the demo (note: the experiment was run on a single device of a t3000), except when forcing an 8x8 core grid using WH_ARCH_YAML=wormhole_b0_80_arch_eth_dispatch.yaml. This makes the 8x8 grid size a potential culprit (unless running fast dispatch on idle ethernet cores is causing other issues).

This is interesting: 8x8 fails on both n300 with an ethernet idle core (X2) and n150 (X1), while 7x8 passes on n300.

Are we able to isolate to which op this is with a repro?

skhorasganiTT commented 3 months ago

The 8x8 failure on n300 occurs with the same ops listed above for n150, specifically RotaryEmbedding and the lm-head Matmul (although it is non-deterministic).

skhorasganiTT commented 3 months ago

@uaydonat Strange observation/breakthrough: if this line is removed (i.e. input tokens are not updated at the end of the iteration), we no longer see the hang. Obviously we can't remove that line, but I will try to investigate why it makes a difference.
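For context, the token update in question follows the generic autoregressive pattern: each forward pass's output token becomes the next pass's input, so removing the update decouples successive forward passes (every iteration then re-runs on stale input). A minimal Python sketch of that pattern (names are purely illustrative, not the actual demo code):

```python
# Illustrative autoregressive decode loop. `model.forward` and the greedy
# `argmax` sampling stand in for the real demo; only the feedback of the
# sampled token into the next iteration's input is the point here.
def decode_loop(model, input_tokens, num_tokens):
    generated = []
    for _ in range(num_tokens):
        logits = model.forward(input_tokens)  # device forward pass
        next_token = logits.argmax()          # greedy sampling
        generated.append(next_token)
        input_tokens = next_token             # the update whose removal
                                              # avoided the hang
    return generated
```

With the last assignment removed, every forward pass would see identical input, which changes the data (and likely the timing) presented to the device on each iteration — consistent with the hang being timing-sensitive.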

skhorasganiTT commented 3 months ago

Additional finding: On X1, if the grid size for all matmuls (1d, dram interleaved) is set to 8x7, I no longer see the hang. Based on this and the other observations above, the issue seems to be a combination of grid size and timing.

skhorasganiTT commented 3 months ago

Investigation update:

Commit: 925c9b04ef9e31b844eadf11b37ce3b537e4a249
Command: pytest --disable-warnings -q -s --input-method=cli --cli-input="Tell me a joke." models/demos/falcon7b/demo/demo.py

jliangTT commented 3 months ago

I think we should have @TT-BrianLiu start taking a look at the matmul behavior. Triaging to the op_cat: mm queue.

TT-BrianLiu commented 3 months ago

I am in the loop with this issue already. @skhorasganiTT Do you have next steps to try?

jliangTT commented 3 months ago

@cfjchu mentioned this may be related to another hang #6917 (1d mm + l1 packing seems sus)

skhorasganiTT commented 3 months ago

Disabling L1 accumulation on all matmuls makes the hang/ND outputs significantly (> 10x) less frequent.
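For readers unfamiliar with the option: L1 accumulation here refers to keeping the running partial sums of a blocked matmul resident in local L1 memory across inner-dimension blocks instead of spilling and re-reading them. A pure-Python sketch of that accumulation pattern (purely illustrative — this is the numerical pattern, not the kernel code):

```python
# Blocked matmul over the inner (K) dimension: C accumulates partial
# products block by block, mimicking a kernel that keeps the running sum
# in local memory ("L1 accumulate") between K-blocks.
def blocked_matmul(A, B, block):
    M, K, N = len(A), len(A[0]), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    for k0 in range(0, K, block):
        for i in range(M):
            for j in range(N):
                acc = C[i][j]  # reload running partial sum
                for k in range(k0, min(k0 + block, K)):
                    acc += A[i][k] * B[k][j]
                C[i][j] = acc  # write back accumulated partial
    return C
```

Disabling the feature changes when results are written out and read back, which shifts kernel timing — consistent with the observation that it changes the hang frequency rather than eliminating it.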

TT-BrianLiu commented 3 months ago

Probably a timing issue. L1 accum is supposed to be faster?

skhorasganiTT commented 3 months ago

@s-jovic mentioned that this might be related to the block matmul hang from #4968

TT-BrianLiu commented 3 months ago

If you suspect #4968, it might be worth just looping that matmul with different clks to confirm.

skhorasganiTT commented 2 months ago

Added workaround to main to set falcon7b matmuls to 8x7 (e766e2d) (does not noticeably affect end-to-end perf). Lowering priority to P1 until related matmul nd-output/hang issues are investigated.

uaydonat commented 1 month ago

@skhorasganiTT we should try without the work-around using FD2.

skhorasganiTT commented 1 month ago

I re-tried the demo on x1 (aus-glx-06) without the workaround and with the firmware update on c3bbc07 (after FD2). It worked 16 times and hung on the 17th, so it seems like it's still an issue, although much less frequent (before it would hang after a couple runs).

Command: pytest --disable-warnings -q -s --input-method=json --input-path='models/demos/falcon7b/demo/input_data.json' models/demos/wormhole/falcon7b/demo_wormhole.py

davorchap commented 1 month ago

I re-tried the demo on x1 without the workaround and with the firmware update on c3bbc07 (after FD2). It worked 16 times and hung on the 17th, so it seems like it's still an issue, although much less frequent (before it would hang after a couple runs).

Based on the data, reducing the MM grid or slowing the MM makes the hang go away or become less frequent.

@uaydonat this smells like di/dt fyi @ttmtrajkovic @rtawfik01

uaydonat commented 1 month ago

@skhorasganiTT which machine did you see the hang on?

Can you try it on a different machine to assess board-dependence?

skhorasganiTT commented 1 month ago

It was on aus-glx-06, I will try another one as well.

uaydonat commented 1 month ago

can you try a T3000 or a single N150, or a single N300?

skhorasganiTT commented 1 month ago

aus-glx-06 is configured as a single n150; I can try it on n300 with WH_ARCH_YAML=wormhole_b0_80_arch_eth_dispatch.yaml as well.

skhorasganiTT commented 1 month ago

Tried the same test (same commit and firmware) on sjc-nmrk-t3001 without the workaround; it hung on the 9th run of the demo. With the workaround, it passed 50 times.

Command: WH_ARCH_YAML=wormhole_b0_80_arch_eth_dispatch.yaml pytest --disable-warnings -q -s --input-method=json --input-path='models/demos/falcon7b/demo/input_data.json' models/demos/wormhole/falcon7b/demo_wormhole.py