Open skhorasganiTT opened 2 weeks ago
@uaydonat possibly di/dt related
The single-card falcon7b functionality demo is also hanging non-deterministically on n300. E.g.:
Single-card falcon7b functionality demo passing (16123a1): https://github.com/tenstorrent/tt-metal/actions/runs/11811548922/job/32905366114
Single-card falcon7b functionality demo hanging (a5d9979): https://github.com/tenstorrent/tt-metal/actions/runs/11816808517/job/32921118321
Note that 16123a1 was already hanging for t3k as stated in the issue description.
The Falcon7b t3k demo is hanging non-deterministically (observed for both the 1024 and 2048 sequence length tests, and in both the prefill and decode stages) on CI and locally. It is unclear when the issue started happening as recent CI runs (including this one for the commit below) have been passing. In addition, sometimes the error in the picture below occurs instead of a hang.
Commit: 16123a1
Command:
WH_ARCH_YAML=wormhole_b0_80_arch_eth_dispatch.yaml pytest --disable-warnings -q -s --input-method=json --input-path='models/demos/t3000/falcon7b/input_data_t3000.json' models/demos/t3000/falcon7b/demo_t3000.py::test_demo_multichip[wormhole_b0-True-user_input0-8-True-perf_mode_2048_stochastic]
Example failing CI run: https://github.com/tenstorrent/tt-metal/actions/runs/11827306784
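Since the hang is non-deterministic, it may help to rerun the command above in a loop with a per-run timeout and count how often it passes, fails, or hangs. The following is a minimal sketch and not part of the repo: the helper names `run_with_timeout` and `repeat`, the run count, and the timeout value are all hypothetical and should be tuned to the expected demo runtime.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s, env=None):
    """Run cmd once; return 'pass', 'fail', or 'hang' (timeout exceeded)."""
    try:
        # subprocess.run kills the child process when the timeout expires,
        # then raises TimeoutExpired.
        result = subprocess.run(cmd, timeout=timeout_s, env=env)
    except subprocess.TimeoutExpired:
        return "hang"
    return "pass" if result.returncode == 0 else "fail"


def repeat(cmd, runs, timeout_s, env=None):
    """Run cmd `runs` times and tally the outcomes."""
    counts = {"pass": 0, "fail": 0, "hang": 0}
    for _ in range(runs):
        counts[run_with_timeout(cmd, timeout_s, env)] += 1
    return counts


# Hypothetical usage with the command from this issue (values are assumptions):
#
#   import os
#   env = dict(os.environ, WH_ARCH_YAML="wormhole_b0_80_arch_eth_dispatch.yaml")
#   cmd = [
#       "pytest", "--disable-warnings", "-q", "-s", "--input-method=json",
#       "--input-path=models/demos/t3000/falcon7b/input_data_t3000.json",
#       "models/demos/t3000/falcon7b/demo_t3000.py::test_demo_multichip"
#       "[wormhole_b0-True-user_input0-8-True-perf_mode_2048_stochastic]",
#   ]
#   print(repeat(cmd, runs=20, timeout_s=1800, env=env))
```

Note that the timeout only kills the top-level pytest process, and the device may need to be reset after a hang before the next run is meaningful; that step is not shown here.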