Closed: TT-billteng closed this 4 months ago
fyi @tt-nshanker
@nsmithtt @tt-nshanker @AleksKnezevic, this is a flaky conv test that needs some attention. It looks to be disabled and needs to be investigated. Should this be P0 if we don't have coverage on the conv for SD?
@jliangTT, I think it's OK to leave as P1. SD demo does not exhibit the same behavior so the hang is likely caused by some specific usage of the unit test. I agree that we need to have coverage, but I don't think we should drop our current work items for this.
@mtatsumiTT Are you working on this issue? If not, I can start looking at it.
No I haven't started, that would be great if you can take it 🙂
btw.. this is still disabled in main. @shwetankTT, are you still working on this?
@jliangTT I was not able to reproduce this. I have raised a PR to re-enable this test. I will try to merge it. PR
Since it is flagged as flaky, the failure might not happen deterministically, so you might want to execute more runs to double-check. @TT-billteng
Yeah I did a repeat test ~50 times locally on WH machines and ~5 times over the CI and couldn't reproduce it.
you looped that one configuration 50 times?
Ran it multiple times as a single test and with pytest-repeat. Tried with something like this: `pytest --count=20 tests/ttnn/unit_tests/operations/test_conv2d.py`
so that loops over all the configurations, and sometimes only one specific configuration is bad
can you try looping over the one config I listed at the top?
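A minimal sketch of what "looping over the one config" could look like: select the single flaky case by its full bracketed pytest id instead of repeating the whole file. The id below is abbreviated with `...` for illustration (the full id appears later in this thread), and the command is echoed as a dry run rather than executed:

```shell
# Hypothetical sketch: target only the one flaky parametrized configuration.
# The test id is abbreviated here; substitute the full bracketed id.
TEST_ID='tests/ttnn/unit_tests/operations/test_conv2d.py::test_sd_conv_wh[...]'
# Quote the id so the shell does not glob the brackets; echoed as a dry run:
CMD="pytest --count=50 $TEST_ID"
echo "$CMD"
```

With pytest-repeat installed, `--count=50` repeats just that one case rather than every configuration collected from the file.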
@mywoodstock you re-enabled this test https://github.com/tenstorrent/tt-metal/commit/122a4e1f2ccaeae0532d85e3c8f9e504c78c1c1b . Is this intentional?
I see it hang here on main https://github.com/tenstorrent/tt-metal/actions/runs/8916840605/job/24489288723
Yes intentional. It's been working fine for me and we don't have proper testing coverage enabled for convs on WH. We should try to debug the hang instead.
We can disable only this particular instance of the test instead of the whole set.
Btw, I just tried looping over this test and was able to repro after 35 iters. To disable this one: https://github.com/tenstorrent/tt-metal/pull/8038
just for documenting: watcher shows this, where core (9,5) looks sus, running kernels 8, 9, 10.
@shwetankTT is currently working on #8131. Will get to this one once 8131 is done. Feel free to re-assign if resources are available.
The code is getting stuck at core (9,5) in the tile_regs_acquire call. Still debugging the issue. Hang location: https://github.com/tenstorrent/tt-metal/blob/c3bbc071feec3d733f17528d6d1c80b5b6285d77/tt_eager/tt_dnn/op_library/conv/kernels/conv_bmm_tilize_col_major_out_blocks.cpp#L331
Thanks @shwetankTT!
For documentation purposes: I tried running this test case at 800MHz for ~10000 iterations without any hang. The issue can be reproduced at 1000MHz.
If it only happens at 1GHz, it might be related to the di/dt issue? @jliangTT
ok. labeling it as di/dt so we can look at it together.
Hi @shwetankTT @mywoodstock @TT-billteng, when you reproduced the original issue above, do you know which firmware version you were on? I tried running the above test in a bash loop for 1000 iterations:
```shell
for ((i=1; i<=1000; i++))
do
  echo "============ Running command for iteration $i ============"
  if ! timeout 400s pytest -svv tests/ttnn/unit_tests/operations/test_conv2d.py::test_sd_conv_wh[enable_auto_formatting=True-math_fidelity=MathFidelity.LoFi-fp32_accum=False-activations_dtype=DataType.BFLOAT16-weights_dtype=DataType.BFLOAT8_B-batch_size=2-output_channels=320-input_channels=16-input_height=64-input_width=64-filter_height=3-filter_width=3-stride_h=1-stride_w=1-pad_h=1-pad_w=1-use_1d_systolic_array=True-config_override=None-device_l1_small_size=16384]; then
    break
  fi
done
```
and this passes locally for me on my N150, with firmware version 8.12, which uses 1GHz Aiclk + voltage margin bump. Can you let me know if you are able to reproduce on the new firmware?
you should try looping with pytest-repeat:

```shell
pip install pytest-repeat
pytest --count=1000 ....
```
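One caveat with pytest-repeat: the whole loop becomes a single pytest invocation, so the per-iteration `timeout 400s` from the bash loop above no longer applies. A minimal sketch of the behavior of coreutils `timeout`, which can still wrap the whole run and reports a hang as exit status 124 (here `sleep` stands in for a hung test; the pytest line in the comment is an illustration, not a command from this thread):

```shell
# coreutils `timeout` kills a command that runs past the limit and exits 124.
status=0
timeout 2s sleep 5 || status=$?
echo "exit status: $status"   # prints: exit status: 124
# Illustrative real usage (assumption, not verified in this thread):
#   timeout 3600s pytest --count=1000 tests/ttnn/unit_tests/operations/test_conv2d.py
```

A non-zero status then distinguishes "hung and killed" from "all repeats passed".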
@rtawfik01 I am able to reproduce this issue on the aus-glx-03 machine. The firmware versions are mentioned below.
@shwetankTT, @TT-billteng, @tt-rkim, @tapspatel
I've checked the aus-glx-03 machine and the firmware versions running on all PCI interfaces visible in the system: 0, 1, 2, 3.
None of them have the latest firmware 8.12 (with FMAX = 1000MHz and VOLTAGE_MARGIN = 50mV), so this test should be repeated with updated firmware.
Could someone also explain how these nebula_x1_galaxy systems are configured? I can see 4 TT PCIe devices, so are those 4 separate n300 / n150 cards, or an n150 card connected to a galaxy? If it's the latter, where do the other PCIe devices come from?
That also raises the question of how these machines are maintained: PCIe device 0 has an older firmware, 8.11 (reduced clock), and devices 1-3 have even older firmware with Fmax = 1GHz.
Thanks.
Milos
hey @ttmtrajkovic
We’ve put the onus on devs to make sure their machine is flashed with the correct FW version via the scripts here: https://github.com/tenstorrent/tt-metal/tree/main/scripts/install. Everything aside from galaxy. For IRD machines, a ticket needs to be filed with DevInfra. For non-IRD, you can do it yourself. These tools all use open-source repos, and I haven’t yet received the di/dt patch fix for galaxy (only WH). There is an issue tracking that here: https://github.com/tenstorrent/tt-firmware/issues/6. As of right now, only 2 folks are using galaxy, and the galaxies provisioned to them are still running the 800MHz version (until I get the patch).
In IRD, machines aus-glx-01 to aus-glx-13 are all TG systems, i.e. 4x n150s connected to a galaxy. However, we were running low on n150 cards for developers, so instead of letting the galaxies stay idle, I’ve disabled the nebula <-> galaxy connections, which allows each individual user to use an n150 card standalone. As work on galaxy ramps up, I’ll enable all these machines again.
You can use IRD to reserve a single n150 card or all n150 cards via `ird reserve --timeout 600 wormhole_b0 --model x1_galaxy --machine aus-glx-03 --num-pcie-chips <1 or 4>`
Individual n150 firmware maintenance is on devs when they use the board. Galaxy firmware maintenance is on me until the patch comes along, after which I'll provide helper scripts for devs to do it themselves. All other maintenance is on DevInfra.
I've flashed the 8.12 firmware on aus-glx-03, and I no longer see the hang issue. I'm currently flashing the firmware on a couple of other machines to check that the behavior is consistent across all of them.
The issue has been resolved. I am keeping it open in case the issue appears again on the latest FW.
closing, please re-open a new case if the issue reappears.
https://github.com/tenstorrent-metal/tt-metal/actions/runs/8574499018/job/23501770228
Hangs in