tenstorrent / tt-metal

TT-NN operator library, and TT-Metalium low level kernel programming model.
https://docs.tenstorrent.com/ttnn/latest/index.html
Apache License 2.0

Falcon7b t3k/single-card demos hang non-deterministically #15059

Open skhorasganiTT opened 2 weeks ago

skhorasganiTT commented 2 weeks ago

The Falcon7b t3k demo is hanging non-deterministically on CI and locally. The hang has been observed for both the 1024 and 2048 sequence length tests, and in both the prefill and decode stages. It is unclear when the issue started, because recent CI runs (including this one for the commit below) have been passing. In addition, the error in the picture below sometimes occurs instead of a hang.

Commit: 16123a1
Command: WH_ARCH_YAML=wormhole_b0_80_arch_eth_dispatch.yaml pytest --disable-warnings -q -s --input-method=json --input-path='models/demos/t3000/falcon7b/input_data_t3000.json' models/demos/t3000/falcon7b/demo_t3000.py::test_demo_multichip[wormhole_b0-True-user_input0-8-True-perf_mode_2048_stochastic]

Example failing CI run: https://github.com/tenstorrent/tt-metal/actions/runs/11827306784
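Because the hang is intermittent, one way to estimate how often it reproduces is to rerun the command above several times under a wall-clock timeout. The following is a minimal sketch, not part of the tt-metal tooling: the helper names `run_once` and `repeat`, the run count, and the timeout value are all assumptions chosen for illustration. It also does not reset the device between runs, which may be needed after a hang.

```python
import os
import subprocess
import sys
from collections import Counter

# Command and environment copied from the issue description.
FALCON_CMD = [
    "pytest", "--disable-warnings", "-q", "-s",
    "--input-method=json",
    "--input-path=models/demos/t3000/falcon7b/input_data_t3000.json",
    "models/demos/t3000/falcon7b/demo_t3000.py::test_demo_multichip"
    "[wormhole_b0-True-user_input0-8-True-perf_mode_2048_stochastic]",
]
FALCON_ENV = {"WH_ARCH_YAML": "wormhole_b0_80_arch_eth_dispatch.yaml"}


def run_once(cmd, timeout_s, extra_env=None):
    """Run cmd once and classify the outcome (hypothetical helper).

    Returns "pass" for exit code 0, "fail" for any other exit code,
    and "hang" if the process did not finish within timeout_s seconds.
    """
    env = dict(os.environ, **(extra_env or {}))
    try:
        # subprocess.run kills the child process when the timeout expires.
        proc = subprocess.run(cmd, env=env, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return "hang"
    return "pass" if proc.returncode == 0 else "fail"


def repeat(cmd, runs, timeout_s, extra_env=None):
    """Run cmd `runs` times and count each outcome (hypothetical helper)."""
    return Counter(run_once(cmd, timeout_s, extra_env) for _ in range(runs))


# Example invocation (the timeout of 3600 s is an assumed upper bound,
# not a value taken from the issue):
#   print(repeat(FALCON_CMD, runs=10, timeout_s=3600, extra_env=FALCON_ENV))
```

Treating a timeout as a hang only works if the timeout is comfortably longer than a normal passing run, so the value should be set from observed passing run times.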

[Image: screenshot of the error that sometimes occurs instead of a hang]

skhorasganiTT commented 2 weeks ago

@uaydonat possibly di/dt related

skhorasganiTT commented 2 weeks ago

The single-card falcon7b functionality demo is also hanging non-deterministically on n300. For example:
Single-card falcon7b functionality demo passing (16123a1): https://github.com/tenstorrent/tt-metal/actions/runs/11811548922/job/32905366114
Single-card falcon7b functionality demo hanging (a5d9979): https://github.com/tenstorrent/tt-metal/actions/runs/11816808517/job/32921118321

Note that 16123a1 was already hanging for t3k as stated in the issue description.