tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.

ND hang of SD unit tests on N300 device #7560

Closed mtatsumiTT closed 2 weeks ago

mtatsumiTT commented 4 months ago

Running the SD unit tests with WH_ARCH_YAML on N300 devices hangs non-deterministically.

To repro the issue, switch to the main branch and run the following on an N300 device:

WH_ARCH_YAML=wormhole_b0_80_arch_eth_dispatch.yaml pytest tests/ttnn/integration_tests/stable_diffusion

EDIT: Running the same test with the watcher enabled in the fast-dispatch CI raises the std::runtime_error below on tests/ttnn/integration_tests/stable_diffusion/test_cross_attn_up_block_2d.py (full log):

terminate called after throwing an instance of 'std::runtime_error'
  what():  Read 0xffffffff from ARC scratch[6]: auto-reset succeeded.
Fatal Python error: Aborted
Thread 0x00007f3744ff9700 (most recent call first):
  File "/home/ubuntu/actions-runner/_work/_tool/Python/3.8.18/x64/lib/python3.8/threading.py", line 306 in wait
  File "/home/ubuntu/actions-runner/_work/_tool/Python/3.8.18/x64/lib/python3.8/threading.py", line 558 in wait
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/tqdm/_monitor.py", line 60 in run
  File "/home/ubuntu/actions-runner/_work/_tool/Python/3.8.18/x64/lib/python3.8/threading.py", line 932 in _bootstrap_inner
  File "/home/ubuntu/actions-runner/_work/_tool/Python/3.8.18/x64/lib/python3.8/threading.py", line 890 in _bootstrap
Thread 0x00007f38db2c1740 (most recent call first):
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/ttnn/ttnn/decorators.py", line 410 in call_wrapper
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/ttnn/ttnn/decorators.py", line 616 in call_wrapper
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/ttnn/ttnn/decorators.py", line 693 in __call__
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/models/experimental/functional_stable_diffusion/tt2/ttnn_functional_cross_attention.py", line 306 in time_sharded_attention
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/models/experimental/functional_stable_diffusion/tt2/ttnn_functional_cross_attention.py", line 471 in get_attention_scores_opt
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/models/experimental/functional_stable_diffusion/tt2/ttnn_functional_cross_attention.py", line 706 in __call__
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/models/experimental/functional_stable_diffusion/tt2/ttnn_functional_basic_transformer_block.py", line 90 in __call__
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/models/experimental/functional_stable_diffusion/tt2/ttnn_functional_transformer_2d.py", line 298 in __call__
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/models/experimental/functional_stable_diffusion/tt2/ttnn_functional_cross_attn_upblock.py", line 153 in __call__
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/tests/ttnn/integration_tests/stable_diffusion/test_cross_attn_up_block_2d.py", line 321 in test_cross_attn_up_block_2d_512x512
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/_pytest/python.py", line 195 in pytest_pyfunc_call
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_callers.py", line 102 in _multicall
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_manager.py", line 119 in _hookexec
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_hooks.py", line 501 in __call__
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/_pytest/python.py", line 1789 in runtest
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/_pytest/runner.py", line 167 in pytest_runtest_call
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_callers.py", line 102 in _multicall
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_manager.py", line 119 in _hookexec
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_hooks.py", line 501 in __call__
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/_pytest/runner.py", line 260 in <lambda>
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/_pytest/runner.py", line 339 in from_call
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/_pytest/runner.py", line 259 in call_runtest_hook
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/_pytest/runner.py", line 220 in call_and_report
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/_pytest/runner.py", line 131 in runtestprotocol
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/_pytest/runner.py", line 112 in pytest_runtest_protocol
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_callers.py", line 102 in _multicall
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_manager.py", line 119 in _hookexec
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_hooks.py", line 501 in __call__
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/_pytest/main.py", line 349 in pytest_runtestloop
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_callers.py", line 102 in _multicall
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_manager.py", line 119 in _hookexec
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_hooks.py", line 501 in __call__
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/_pytest/main.py", line 324 in _main
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/_pytest/main.py", line 270 in wrap_session
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/_pytest/main.py", line 317 in pytest_cmdline_main
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_callers.py", line 102 in _multicall
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_manager.py", line 119 in _hookexec
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_hooks.py", line 501 in __call__
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/_pytest/config/__init__.py", line 167 in main
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/_pytest/config/__init__.py", line 190 in console_main
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/bin/pytest", line 8 in <module>

fyi @AleksKnezevic @vtangTT @TT-billteng

TT-billteng commented 4 months ago

hanging on N150 in main post-commit

https://github.com/tenstorrent/tt-metal/actions/runs/8728371259/job/23948244160 https://github.com/tenstorrent/tt-metal/actions/runs/8739120961/job/23980015228

seems to be specifically this test

tests/ttnn/unit_tests/test_sd_e2e.py::test_unet_2d_condition_model_512x512[batch_size=2-in_channels=4-input_height=64-input_width=64]

jliangTT commented 4 months ago

some discussions are happening over here - https://tenstorrent.slack.com/archives/C055REZR6Q3/p1713992657339109

mtatsumiTT commented 4 months ago

quick update: some of the unit tests were raising OOM and allocator errors, but the e2e test was passing. I'll skip the tests with OOM errors and launch a pipeline after rebasing to the latest main to double-check that all SD unit tests pass

jliangTT commented 4 months ago

Next step: Please try to repro it on the latest FD2/main branch.

jliangTT commented 4 months ago

Next step:

AleksKnezevic commented 4 months ago

I have been able to reproduce the hang three times running watcher without NOC sanitization: twice on the same op, once on a different one. I also ran without hangs ~5 times, so it is still ND. The different op is in the same submodule (the one @mtatsumiTT identified as problematic), so it is probably related.

AleksKnezevic commented 4 months ago

To repro, on aknezevic/hang_debug run:

WH_ARCH_YAML=wormhole_b0_80_arch_eth_dispatch.yaml TT_METAL_WATCHER=1 TT_METAL_WATCHER_DISABLE_NOC_SANITIZE=1 pytest --count=100 -svv tests/ttnn/integration_tests/stable_diffusion/test_unet_2d_condition_model.py -k 512

AleksKnezevic commented 4 months ago

hang_debug.txt

jvasilje commented 4 months ago

Sounds like the next step is to try to repro on the submodule. Then we will have a smaller test, with fewer ops, to debug in detail.

AleksKnezevic commented 3 months ago

I have found a way to reliably repro on the submodule, trying to further isolate.

AleksKnezevic commented 3 months ago

I've narrowed the hang down to matmul. If I reduce the subblock to 1, the hang is gone. As @TT-BrianLiu pointed out, this also changes timing, so it's still unclear whether this is an MM issue or a runtime issue. This happens in several locations in Stable Diffusion.
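
(Purely as an illustration of the workaround above: a minimal Python sketch with a hypothetical helper, not tt-metal code; the 8-destination-tile limit is an assumption about the usual matmul subblock constraint.)

def pick_out_subblocks(per_core_M: int, per_core_N: int, slow_matmuls: bool):
    # Hypothetical helper, not the ttnn API: forcing 1x1 output subblocks
    # (the "reduce the subblock to 1" workaround) adds compute loops and
    # slows the matmul down, which avoids the hang described here.
    if slow_matmuls:
        return 1, 1
    # Otherwise pick the largest subblock height/width pair that divides the
    # per-core tile counts, assuming at most 8 destination tiles per subblock.
    for h in range(min(per_core_M, 8), 0, -1):
        for w in range(min(per_core_N, 8), 0, -1):
            if per_core_M % h == 0 and per_core_N % w == 0 and h * w <= 8:
                return h, w
    return 1, 1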

I have a repro with two matmuls and an eltwise:

On aknezevic/repro_MM_hang, first clear the build cache with rm -rf built, then run:

WH_ARCH_YAML=wormhole_b0_80_arch_eth_dispatch.yaml TT_METAL_WATCHER=1 TT_METAL_WATCHER_DISABLE_NOC_SANITIZE=1 pytest -svv tests/ttnn/integration_tests/stable_diffusion/test_geglu.py::test_geglu_512x512[N=1-C=2-H=256-W=1280-index=1-model_name=CompVis/stable-diffusion-v1-4-device_l1_small_size=32768]

This will loop over the test 3000 times. For me, the hang usually occurs around iter 1500.

AleksKnezevic commented 3 months ago

The watcher log of this test is different from the one we saw earlier - perhaps a different style of hang? The earlier one can be reproed using a 5-6 op unit test on the same branch:

WH_ARCH_YAML=wormhole_b0_80_arch_eth_dispatch.yaml TT_METAL_WATCHER=1 TT_METAL_WATCHER_DISABLE_NOC_SANITIZE=1 pytest tests/ttnn/integration_tests/stable_diffusion/test_sharded_attention.py::test_time_sharded_attnention -k 4096

@jliangTT, can you please coordinate?

jliangTT commented 3 months ago

This issue is being discussed in the daily hang standup. The collective decision is to let the FD2 merge/stabilization take its course today and assign resources right afterward.

jliangTT commented 3 months ago

FD2 merged over the weekend - can we rebase and re-test against main to get the latest baseline result?

AleksKnezevic commented 3 months ago

The hang is still present on both tests. I rebased and pushed aknezevic/repro_MM_hang.

tt-aho commented 3 months ago

@AleksKnezevic have you ever seen a seg fault on WH_ARCH_YAML=wormhole_b0_80_arch_eth_dispatch.yaml TT_METAL_WATCHER=1 TT_METAL_WATCHER_DISABLE_NOC_SANITIZE=1 pytest -svv tests/ttnn/integration_tests/stable_diffusion/test_geglu.py::test_geglu_512x512[N=1-C=2-H=256-W=1280-index=1-model_name=CompVis/stable-diffusion-v1-4-device_l1_small_size=32768]?

I seem to be getting it around ~150 iterations (note this was on a T3K system; will try to repro on a VM next).

AleksKnezevic commented 3 months ago

I have on occasion seen a seg fault on VM but not on BM. I've been using an N300. @tt-aho, are you clearing the built directory before running the test?

jliangTT commented 3 months ago

High bw debugging happening in slack thread (internal) - https://tenstorrent.slack.com/archives/C055REZR6Q3/p1715102452304479

aliuTT commented 3 months ago

Catching up on the debug log - @jliangTT, can you add me to the Slack thread you linked?

ttmtrajkovic commented 3 months ago

hey @AleksKnezevic,

Could you please summarize the tests used to reproduce the problem: both a small one and the complex one? Also, is it reproducible on main?

AleksKnezevic commented 3 months ago

The small test is two matmuls and an elementwise multiply; the larger test is self-attention (matmul, softmax, matmul). The tests are currently on branch aknezevic/repro_MM_hang.

pavlepopovic commented 3 months ago

Fyi: https://github.com/tenstorrent/tt-metal/issues/8644 In falcon7b prefill, we've also had a problem with matmul->softmax->matmul. We needed to set subblock_h/w to 1 on both of these matmuls and furthermore to reduce number of cores allocated for matmuls to 57 to avoid di/dt issues.

TT-BrianLiu commented 3 months ago

> Fyi: #8644 In falcon7b prefill, we've also had a problem with matmul->softmax->matmul. We needed to set subblock_h/w to 1 on both of these matmuls and furthermore to reduce number of cores allocated for matmuls to 57 to avoid di/dt issues.

Subblock specs shouldn't affect how many cores are used for matmul. It increases the number of loops in compute, which essentially slows it down.

pavlepopovic commented 3 months ago

> Fyi: #8644 In falcon7b prefill, we've also had a problem with matmul->softmax->matmul. We needed to set subblock_h/w to 1 on both of these matmuls and furthermore to reduce number of cores allocated for matmuls to 57 to avoid di/dt issues.

> Subblock specs shouldn't affect how many cores are used for matmul. It increases the number of loops in compute, which essentially slows it down.

Yup, those are the 2 changes we needed to apply to get rid of di/dt (subblocks and grid-size reduction). (Just realised that ‘furthermore’ was a bad choice of words for what I was trying to say :D)

mywoodstock commented 1 month ago

Hello, was there any resolution/fix or workaround for di/dt here? We have an SD hang on CI N300. Also, I see that this was downgraded to P2 a couple of months ago; is that still applicable given the WH launch readiness?

tt-rkim commented 1 month ago

By the way, doing some bisecting, it looks like https://github.com/tenstorrent/tt-metal/commit/db25a356488a5e4b10752cdb87189c0e33aa238d is what causes the recent additional hang (on top of the di/dt hang).

evidence: https://github.com/tenstorrent/tt-metal/actions/runs/9967178513/job/27540588230

tt-aho commented 1 month ago

@tt-rkim how sure are you that this commit is the issue? Is it deterministically reproducible and how often have you seen it on this commit vs previous commits?

tt-rkim commented 1 month ago

One thing to note: I had to extend the timeout on the branch because, after Vincent's pytest changes, 300s was too short; SD was initially just timing out because it normally takes 10+ minutes.

However, after extending it, I bisected near the region of commits where it started failing on N300, and it always hung on your commit or after it.

To be fair, what I said is heuristic evidence. I've told @mywoodstock that he could try lowering FMAX to 500MHz on a WH X2 VM and try the test in that region of commits with an extended timeout (I recommend 900s to be sure).

s-jovic commented 1 month ago

I was trying to reproduce a hang running the SD demo yesterday on an N300 machine. On main it hangs regardless of the workarounds (SLOW_MATMUL env var), so I checked out the commit before the one @tt-rkim mentioned, and I wasn't able to get a hang in 16 iterations of the full demo. I also wasn't able to reproduce a hang running the integration tests without the workarounds.

Do you plan to revert/fix the changes from the problematic commit, so that we can try to reproduce the di/dt hang with more recent code?

tt-aho commented 1 month ago

To clarify, are you saying commit 3e7273c45b196881d39c9fe5693bc2c70706bcd2 passes but db25a35 does not? Or did you also revert 3e7273c45b196881d39c9fe5693bc2c70706bcd2?

s-jovic commented 1 month ago

I am saying https://github.com/tenstorrent/tt-metal/commit/e7a43ccf0852eedfda983441abe6853892148143 passes (the commit before https://github.com/tenstorrent/tt-metal/commit/3e7273c45b196881d39c9fe5693bc2c70706bcd2). I assumed the two commits (https://github.com/tenstorrent/tt-metal/commit/3e7273c45b196881d39c9fe5693bc2c70706bcd2 and https://github.com/tenstorrent/tt-metal/commit/db25a356488a5e4b10752cdb87189c0e33aa238d) were part of the same change; I wasn't clear.

mywoodstock commented 1 month ago

Yes, so even after lowering FMAX to 500MHz, the runs still consistently hang on main. I was unable to get the commit before https://github.com/tenstorrent/tt-metal/commit/3e7273c45b196881d39c9fe5693bc2c70706bcd2 working -- I get some weird Python errors :(

But yes, the conclusion is that the di/dt issue is fixed with the new FW (no hangs), and the commit https://github.com/tenstorrent/tt-metal/commit/3e7273c45b196881d39c9fe5693bc2c70706bcd2 looks like the issue. Correct?

tt-aho commented 1 month ago

I am taking a look into why this commit would cause issues specifically for SD, unless it is causing timing issues for a different bug/problem or SD is hitting some kind of very specific corner case. Currently it is not ideal to try to revert these changes.

mywoodstock commented 1 month ago

Yeah, it would most likely be a timing issue -- the thing is, I see it hang in between iterations, usually either between 3-4 or 4-5.

s-jovic commented 1 month ago

> I was unable to get the commit before https://github.com/tenstorrent/tt-metal/commit/3e7273c45b196881d39c9fe5693bc2c70706bcd2 working -- I get some weird Python errors :(

There was an assert that was commented out later on in https://github.com/tenstorrent/tt-metal/commit/d46d435e9a3cbe5d802fc5d6bb02fb261c52838cm; I just commented it out, and it worked.

> the conclusion is that the di/dt issue is fixed with the new FW (no hangs)

Overall, the new FW helps a lot: most of the repro examples we had are fixed, but we still have a couple of remaining issues that hang, albeit less frequently.

Regarding SD specifically, the small repro examples we had from this thread are fixed; however, I was unable to reproduce a hang running the full demo even with the old firmware (80.8.12) on the https://github.com/tenstorrent/tt-metal/commit/e7a43ccf0852eedfda983441abe6853892148143 commit. So I can't really say the ND hangs from the demo are resolved with the new firmware when I couldn't catch the ND hang in the old setup either.

tt-rkim commented 1 month ago

Is the previous di/dt issue exposed more by host constraints? For example, maybe the machine @s-jovic was using is too powerful? We have less powerful CI VMs we could try running this on.

And how many iterations did you try? @s-jovic

s-jovic commented 1 month ago

> Is the previous di/dt issue exposed more by host constraints?

Shouldn't be, but it does depend on the chips - N300 and N150 chips aren't identical; some expose hangs more often than others. I was using a chip that has proven to expose a lot of the hang repros we already have, but that doesn't mean it can expose all di/dt hangs that can happen.

> And how many iterations did you try?

I ran 15 iterations; the demo takes a bit longer.

mywoodstock commented 1 month ago

Might be good to test the full demo with 50 iters?

tt-aho commented 1 month ago

I've isolated the specific change that is causing the hang, though it's still unknown why it causes it.

This commit changed which FD core is used for what: previously, eth core (0, 4) was used for the dispatcher and (0, 5) for the prefetcher; it was changed so that (0, 4) is now the prefetcher and (0, 5) is now the dispatcher. This should cause no real change in functionality, and other tests/models are functional, so something weird is happening as a result of this (potentially some timing/race issue).
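
(Restating the isolated change as data, purely for illustration -- this is not the actual tt-metal dispatch-core code, just the mapping described above.)

# Fast-dispatch ethernet core roles before and after the commit in question.
FD_ETH_CORE_ROLES_BEFORE = {(0, 4): "dispatcher", (0, 5): "prefetcher"}
FD_ETH_CORE_ROLES_AFTER = {(0, 4): "prefetcher", (0, 5): "dispatcher"}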

tt-aho commented 1 month ago

I have a fix for this in PR #10911. I didn't enable the test in CI though. Will you enable it after doing di/dt testing on latest main, @s-jovic?

s-jovic commented 1 month ago

After the fix is merged, I guess it should be checked whether the ND hangs still exist with the latest FW. However, we won't be doing these checks and removing workarounds for all models; model owners will need to check if they can remove the workarounds from their models. We will let everybody know once the software fix and FW testing are done, so that model owners can address this.

tt-aho commented 4 weeks ago

The fix is on main, so retesting/re-enabling of the SD tests can be done.

tt-rkim commented 3 weeks ago

Should we re-enable the unstable tests?

esmalTT commented 3 weeks ago

Assigning myself to this since I'm taking ownership of SD for now. I'll test out @tt-aho's fix to see if we can re-enable these tests on CI.

esmalTT commented 2 weeks ago

Tests were re-enabled in 9492740 and no longer hang on N300 thanks to @tt-aho's fix. From my testing, we still require SLOW_MATMULS=1 to avoid hanging.
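
(For context, a hedged sketch of how a workaround like this is typically gated on an environment variable; SLOW_MATMULS is the flag named in this thread, but the helper below is hypothetical and not the actual SD model code.)

import os

def slow_matmuls_enabled() -> bool:
    # SLOW_MATMULS=1 is the workaround flag referenced in this thread; the
    # real SD code may consume it differently.
    return os.environ.get("SLOW_MATMULS", "0") == "1"

When this returns True, the model would fall back to conservative matmul configs (e.g. the 1x1 subblocks sketched earlier in this thread).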

Should we close this or keep it open until the di/dt issues are completely resolved?

There is another spurious failure (see here) but I will track that in a separate issue.

tt-rkim commented 2 weeks ago

Unless others have objections, I think we can close this specific issue.

Should we make a follow-up to eventually get rid of SLOW_MATMULS from the stack once we no longer support WH?

esmalTT commented 2 weeks ago

> Unless others have objections, I think we can close this specific issue.

> Should we make a follow-up to eventually get rid of SLOW_MATMULS from the stack once we no longer support WH?

Yes, I'll create an issue to track it 👍 Unless someone says otherwise, I will close this once I create the 2 follow-on issues.