Open skhorasganiTT opened 3 months ago
Note that we don't run performance models on an X1 machine, only X2.
The hang/ND outputs have never been observed on nebula x2 after 100 runs of the demo (note: the experiment is run on a single device of t3000), except when forcing an 8x8 core grid using WH_ARCH_YAML=wormhole_b0_80_arch_eth_dispatch.yaml, making the 8x8 grid size a potential culprit (unless running fast dispatch on idle ethernet cores is causing other issues).
This is interesting: 8x8 fails on both n300 with an ethernet idle core (X2) and n150 (X1), while 7x8 passes on n300.
Are we able to isolate which op this is with a repro?
The 8x8 failure on n300 occurs with the same ops listed above for n150, specifically RotaryEmbedding and the lm-head Matmul (although it is non-deterministic).
@uaydonat Strange observation/breakthrough: If this line is removed (i.e. input tokens are not updated at the end of the iteration), I no longer see the hang. Obviously we can't remove that line, but I will try to investigate why it is making a difference.
Additional finding: On X1, if the grid size for all matmuls (1d, dram interleaved) is set to 8x7, I no longer see the hang. Based on this and the other observations above, the issue seems to be a combination of grid size and timing.
Investigation update:
Commit: 925c9b04ef9e31b844eadf11b37ce3b537e4a249
Command: pytest --disable-warnings -q -s --input-method=cli --cli-input="Tell me a joke." models/demos/falcon7b/demo/demo.py
I think we should have @TT-BrianLiu start taking a look at the matmul behavior. Triaging to the op_cat: mm queue.
I am in the loop with this issue already. @skhorasganiTT Do you have next steps to try?
@cfjchu mentioned this may be related to another hang #6917 (1d mm + l1 packing seems sus)
Disabling l1 accumulation on all matmuls makes the hang/nd-outputs significantly (> 10x) less frequent
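For reference, packer L1 accumulation is controlled per-op through the compute kernel config; a sketch (not the exact change tried here) of what disabling it on a matmul might look like with ttnn's Wormhole compute kernel config, with all other field values chosen as placeholders:

```python
import ttnn

# Sketch only: disable the packer's L1 accumulation path for a matmul.
# Field values other than packer_l1_acc are placeholder choices.
compute_config = ttnn.WormholeComputeKernelConfig(
    math_fidelity=ttnn.MathFidelity.HiFi4,
    math_approx_mode=False,
    fp32_dest_acc_en=False,
    packer_l1_acc=False,  # accumulate in dest instead of packing partials to L1
)

# out = ttnn.matmul(a, b, compute_kernel_config=compute_config)
```

Since disabling it only made the hang rarer, not gone, this points at timing rather than the accumulation path itself.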
Probably a timing issue. L1 accum is supposed to be faster?
@s-jovic mentioned that this might be related to the block matmul hang from #4968
If you suspect #4968, it might be worth just looping that matmul with different clocks to confirm.
Added workaround to main to set falcon7b matmuls to 8x7 (e766e2d) (does not noticeably affect end-to-end perf). Lowering priority to P1 until related matmul nd-output/hang issues are investigated.
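The workaround amounts to capping the matmul compute grid one row short of the full 8x8. A hypothetical sketch using ttnn's 1D mcast matmul program config (the blocking parameters below are placeholders and depend on the actual tensor shapes):

```python
import ttnn

# Sketch: pin the matmul compute grid to 8x7 instead of the full 8x8 grid.
# Blocking parameters are illustrative placeholders, not Falcon7b's values.
program_config = ttnn.MatmulMultiCoreReuseMultiCast1DProgramConfig(
    compute_with_storage_grid_size=(8, 7),  # (x, y): avoid the 8x8 grid
    in0_block_w=2,
    out_subblock_h=1,
    out_subblock_w=4,
    per_core_M=1,
    per_core_N=4,
    fuse_batch=True,
    fused_activation=None,
    mcast_in0=True,
)

# out = ttnn.matmul(a, b, program_config=program_config)
```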
@skhorasganiTT we should try without the work-around using FD2.
I re-tried the demo on x1 (aus-glx-06) without the workaround and with the firmware update on c3bbc07 (after FD2). It worked 16 times and hung on the 17th, so it seems like it's still an issue, although much less frequent (before it would hang after a couple runs).
Command:
pytest --disable-warnings -q -s --input-method=json --input-path='models/demos/falcon7b/demo/input_data.json' models/demos/wormhole/falcon7b/demo_wormhole.py
Based on the data, reducing the MM grid or slowing the MM down makes the hang go away or become less frequent.
@uaydonat this smells like di/dt fyi @ttmtrajkovic @rtawfik01
@skhorasganiTT which machine did you see the hang on?
Can you try it on a different machine to assess board-dependence?
It was on aus-glx-06, I will try another one as well.
Can you try a T3000, a single N150, or a single N300?
aus-glx-06 is configured as a single n150; I can try it on n300 with WH_ARCH_YAML=wormhole_b0_80_arch_eth_dispatch.yaml as well.
Tried the same test (same commit and firmware) on sjc-nmrk-t3001 without the workaround, hung on the 9th run of the demo. With the workaround, passed 50 times.
Command:
WH_ARCH_YAML=wormhole_b0_80_arch_eth_dispatch.yaml pytest --disable-warnings -q -s --input-method=json --input-path='models/demos/falcon7b/demo/input_data.json' models/demos/wormhole/falcon7b/demo_wormhole.py
The Falcon7b demo randomly hangs during different invocations of the model forward pass (both compile and inference, and both prefill and decode, but usually decode inference). Additionally, the model usually produces non-deterministic and incorrect output before hanging. The hangs / incorrect outputs become more likely as the number of output tokens increases (i.e. more forward passes). The frequency of the hang is machine dependent, but it can occur as often as every 1-4 runs of the demo.
Additional information:
- `WH_ARCH_YAML=wormhole_b0_80_arch_eth_dispatch.yaml`, making 8x8 grid size a potential culprit (unless running fast dispatch on idle ethernet cores is causing other issues)
- `TT_METAL_SLOW_DISPATCH_MODE=1`
- `HWCommandQueue::enqueue_command`
- `TT_METAL_WATCHER=1`, making timing a potential culprit
- `TT_METAL_LOGGER_TYPES=Op TT_METAL_LOGGER_LEVEL=DEBUG`
Instructions to stress-test demo:
Commit: b5fe44ddf7631e3d59cb953c238666113a76913d
bash models/demos/falcon7b/tests/run_demo_test.sh
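Since the hang is non-deterministic and shows up only after many runs, a small harness (hypothetical, not part of the repo) can automate the stress loop: it invokes the demo command repeatedly under a wall-clock timeout, so a hang is detected and counted instead of blocking the loop forever. `stress_run`, its arguments, and the timeout value are all placeholders.

```python
import subprocess

def stress_run(cmd, runs=50, timeout_s=600):
    """Run `cmd` (an argv list) up to `runs` times and classify each run
    as pass, fail, or hang (wall-clock timeout exceeded)."""
    results = {"pass": 0, "fail": 0, "hang": 0}
    for i in range(1, runs + 1):
        try:
            proc = subprocess.run(cmd, timeout=timeout_s, capture_output=True)
            outcome = "pass" if proc.returncode == 0 else "fail"
        except subprocess.TimeoutExpired:
            outcome = "hang"
        results[outcome] += 1
        print(f"run {i}: {outcome}")
        if outcome == "hang":
            break  # stop on the first hang so device state can be inspected
    return results

if __name__ == "__main__":
    # Substitute the real demo invocation from this thread, e.g.:
    # stress_run(["pytest", "--disable-warnings", "-q", "-s", ...], runs=50)
    pass
```

Picking a timeout a few times longer than a healthy run keeps false "hang" classifications unlikely while still bounding each iteration.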