tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.
Apache License 2.0

flaky behavior with `test_conv2d.py::test_sd_conv_wh` #7179

Closed: TT-billteng closed this issue 4 months ago

TT-billteng commented 6 months ago

https://github.com/tenstorrent-metal/tt-metal/actions/runs/8574499018/job/23501770228

Hangs in

tests/ttnn/unit_tests/operations/test_conv2d.py::test_sd_conv_wh[enable_auto_formatting=False-math_fidelity=MathFidelity.LoFi-fp32_accum=False-activations_dtype=DataType.BFLOAT16-weights_dtype=DataType.BFLOAT8_B-batch_size=2-output_channels=320-input_channels=16-input_height=64-input_width=64-filter_height=3-filter_width=3-stride_h=1-stride_w=1-pad_h=1-pad_w=1-use_1d_systolic_array=True-config_override=None]
                  Metal | INFO     | Initializing device 0
                  Metal | INFO     | AI CLK for device 0 is:   800 MHz
mtatsumiTT commented 6 months ago

fyi @tt-nshanker

jliangTT commented 6 months ago

@nsmithtt @tt-nshanker @AleksKnezevic, this is a flaky conv test that needs some attention. It looks to be disabled and needs to be investigated. Should this be P0 if we don't have coverage on conv for SD?

nsmithtt commented 6 months ago

@jliangTT, I think it's OK to leave as P1. SD demo does not exhibit the same behavior so the hang is likely caused by some specific usage of the unit test. I agree that we need to have coverage, but I don't think we should drop our current work items for this.

shwetankTT commented 6 months ago

@mtatsumiTT Are you working on this issue? If not, I can start looking at it.

mtatsumiTT commented 6 months ago

No, I haven't started; it would be great if you could take it 🙂

jliangTT commented 5 months ago

Btw, this is still disabled in main. @shwetankTT, are you still working on this?

shwetankTT commented 5 months ago

@jliangTT I was not able to reproduce this. I have raised a PR to re-enable this test and will try to merge it.

jliangTT commented 5 months ago

Since it is flagged as flaky, the failure might not happen deterministically; you might want to execute more runs to double-check. @TT-billteng

shwetankTT commented 5 months ago

Yeah, I did a repeat test ~50 times locally on WH machines and ~5 times over CI and couldn't reproduce it.

TT-billteng commented 5 months ago

you looped that one configuration 50 times?

shwetankTT commented 5 months ago

Ran it multiple times as a single test and with pytest-repeat. Tried something like this: pytest --count=20 tests/ttnn/unit_tests/operations/test_conv2d.py

TT-billteng commented 5 months ago

So that loops over all the configurations, and sometimes only one specific configuration is bad.

can you try looping over the one config I listed at the top?
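
For reference, looping just that one parametrization could look something like this (node ID copied from the hang log at the top; the --count value is arbitrary):

pip install pytest-repeat
pytest --count=50 "tests/ttnn/unit_tests/operations/test_conv2d.py::test_sd_conv_wh[enable_auto_formatting=False-math_fidelity=MathFidelity.LoFi-fp32_accum=False-activations_dtype=DataType.BFLOAT16-weights_dtype=DataType.BFLOAT8_B-batch_size=2-output_channels=320-input_channels=16-input_height=64-input_width=64-filter_height=3-filter_width=3-stride_h=1-stride_w=1-pad_h=1-pad_w=1-use_1d_systolic_array=True-config_override=None]"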

TT-billteng commented 5 months ago

@mywoodstock you re-enabled this test https://github.com/tenstorrent/tt-metal/commit/122a4e1f2ccaeae0532d85e3c8f9e504c78c1c1b . Is this intentional?

I see it hang here on main https://github.com/tenstorrent/tt-metal/actions/runs/8916840605/job/24489288723

mywoodstock commented 5 months ago

Yes intentional. It's been working fine for me and we don't have proper testing coverage enabled for convs on WH. We should try to debug the hang instead.

mywoodstock commented 5 months ago

We can disable only this particular instance of the test instead of the whole set.

mywoodstock commented 5 months ago

Btw, I just tried looping over this test and was able to repro after 35 iters. PR to disable this one: https://github.com/tenstorrent/tt-metal/pull/8038

mywoodstock commented 5 months ago

Just for documentation: watcher shows this, where core (9,5) looks sus, running kernels 8, 9, 10.

[Screenshots: watcher output, 2024-05-01 21:28:35 and 21:28:47]
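
For anyone re-running this with the watcher attached, enabling it before the loop should look roughly like the following (assuming the TT_METAL_WATCHER environment variable from the tt-metal watcher docs; the value is a polling interval in seconds and may need adjusting for your setup):

export TT_METAL_WATCHER=10   # assumption: watcher polling interval in seconds
pytest --count=50 tests/ttnn/unit_tests/operations/test_conv2d.py::test_sd_conv_wh
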
jliangTT commented 5 months ago

@shwetankTT is currently working on #8131. Will get to this one once #8131 is done. Feel free to re-assign if resources are available.

shwetankTT commented 5 months ago

The code is getting stuck at core (9,5) in the tile_regs_acquire call. Still debugging. Hang location: https://github.com/tenstorrent/tt-metal/blob/c3bbc071feec3d733f17528d6d1c80b5b6285d77/tt_eager/tt_dnn/op_library/conv/kernels/conv_bmm_tilize_col_major_out_blocks.cpp#L331

mywoodstock commented 5 months ago

Thanks @shwetankTT!

shwetankTT commented 5 months ago

For documentation purposes: I ran this test case at 800 MHz for ~10000 iterations without any hang. The issue can be reproduced at 1000 MHz.

mywoodstock commented 5 months ago

If it only happens at 1GHz, it might be related to the di/dt issue? @jliangTT

jliangTT commented 5 months ago

ok. labeling it as di/dt so we can look at it together.

rtawfik01 commented 4 months ago

Hi @shwetankTT @mywoodstock @TT-billteng, when you reproduced the original issue above, do you know which firmware version you were on? I tried running the above test in a bash loop for 1000 iterations:

for ((i=1; i<=1000; i++))
do
    echo "============ Running command for iteration $i ============"
    if ! timeout 400s pytest -svv tests/ttnn/unit_tests/operations/test_conv2d.py::test_sd_conv_wh[enable_auto_formatting=True-math_fidelity=MathFidelity.LoFi-fp32_accum=False-activations_dtype=DataType.BFLOAT16-weights_dtype=DataType.BFLOAT8_B-batch_size=2-output_channels=320-input_channels=16-input_height=64-input_width=64-filter_height=3-filter_width=3-stride_h=1-stride_w=1-pad_h=1-pad_w=1-use_1d_systolic_array=True-config_override=None-device_l1_small_size=16384]; then
        break
    fi
done

and this passes locally for me on my N150 with firmware version 8.12, which uses a 1 GHz AICLK + voltage margin bump. Can you let me know if you are able to reproduce on the new firmware?

TT-billteng commented 4 months ago

you should try looping with pytest-repeat

pip install pytest-repeat
pytest --count=1000 ....
shwetankTT commented 4 months ago

@rtawfik01 I am able to reproduce this issue on the aus-glx-03 machine. The firmware versions are shown below.

[Screenshot: firmware versions on aus-glx-03]
ttmtrajkovic commented 4 months ago

@shwetankTT, @TT-billteng, @tt-rkim, @tapspatel

I've checked the aus-glx-03 machine and the firmware versions running on all PCIe interfaces visible in the system: 0, 1, 2, 3. None of them has the latest firmware 8.12 (with FMAX = 1000 MHz and VOLTAGE_MARGIN = 50 mV), so this test should be repeated with updated firmware.

Could someone also explain how these nebula_x1_galaxy systems are configured? I can see 4 TT PCIe devices, so are those 4 separate n300 / n150 cards, or an n150 card connected to a galaxy? If it's the latter, where do the other PCIe devices come from? That also raises the question of how these machines are maintained: PCIe device 0 has an older firmware, 8.11 (reduced clock), and devices 1-3 have even older firmware, Fmax = 1 GHz.

Thanks.

Milos

tapspatel commented 4 months ago

hey @ttmtrajkovic

We've put the onus on devs to make sure their machine is flashed with the correct firmware version via the scripts here: https://github.com/tenstorrent/tt-metal/tree/main/scripts/install. That covers everything aside from galaxy. For IRD machines, a ticket needs to be filed with Devinfra; for non-IRD machines, you can do it yourself. These tools all use open-source repos, and I haven't yet received the di/dt patch fix for galaxy (only WH). There is an issue tracking that here: https://github.com/tenstorrent/tt-firmware/issues/6. As of right now, only 2 folks are using galaxy, and the galaxies provisioned to them are still running the 800 MHz version (until I get the patch).

In IRD, machines aus-glx-01 to aus-glx-13 are all TG systems, i.e. 4x n150s connected to a galaxy. However, we were running low on n150 cards for developers, so instead of letting the galaxies stay idle, I've just disabled the nebula <-> galaxy connections, which allows each individual user to use an n150 card standalone. As work on galaxy ramps up, I'll enable all these machines again.

You can use IRD to reserve a single n150 card or all n150 cards via ird reserve --timeout 600 wormhole_b0 --model x1_galaxy --machine aus-glx-03 --num-pcie-chips <1 or 4>

Individual n150 firmware maintenance is on devs when they use the board. Galaxy firmware maintenance is on me until the patch comes along, after which I'll provide helper scripts for devs to do it themselves. All other maintenance is on Devinfra.

shwetankTT commented 4 months ago

I've flashed the 8.12 firmware on aus-glx-03, and I no longer see the hang issue. I'm currently flashing the firmware on a couple of other machines to ensure that the behavior is consistent across all machines.

shwetankTT commented 4 months ago

The issue has been resolved. I am keeping it open in case it appears again on the latest FW.

davorchap commented 4 months ago

Closing; please open a new case if the issue reappears.