tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.

Allgather Test Suite Occasionally Sees Very Long Test Cases (Non-deterministic, Time Out) #8603

Open SeanNijjar opened 4 months ago

SeanNijjar commented 4 months ago

The all-gather test suite will non-deterministically hang after several successful post-commit runs. (Update: not actually a hang - just a very slow operation that occasionally pops up and causes the test to time out "early" -- see later comments. I think this also means this is likely not an allgather issue.)

For example, I saw the following failures on the 3rd post commit run after 2 successful ones.

ERROR tests/tt_eager/python_api_testing/unit_testing/misc/test_all_gather.py::test_all_gather_on_t3000_post_commit_looping[True-100-mem_config1-input_dtype1-8-1-input_shape6-3-layout6] - Failed: Timeout >2400.0s
FAILED tests/tt_eager/python_api_testing/unit_testing/misc/test_all_gather.py::test_all_gather_on_t3000_post_commit_looping[True-100-mem_config1-input_dtype1-4-2-input_shape0-0-layout0] - Failed: Timeout >2400.0s
FAILED tests/tt_eager/python_api_testing/unit_testing/misc/test_all_gather.py::test_all_gather_on_t3000_post_commit_looping[False-100-mem_config0-input_dtype1-8-1-input_shape1-0-layout1] - Failed: Timeout >2400.0s
FAILED tests/tt_eager/python_api_testing/unit_testing/misc/test_all_gather.py::test_all_gather_on_t3000_post_commit_looping[False-100-mem_config1-input_dtype0-8-1-input_shape2-3-layout2] - Failed: Timeout >2400.0s
FAILED tests/tt_eager/python_api_testing/unit_testing/misc/test_all_gather.py::test_all_gather_on_t3000_post_commit_looping[False-100-mem_config1-input_dtype1-4-2-input_shape0-0-layout0] - Failed: Timeout >2400.0s

For reference, here are the pytest parametrizations for post_commit_looping since the post commit on main will likely look different:

@pytest.mark.parametrize(
    "num_devices, num_links, input_shape, dim, layout",
    [
        (4, 2, [4, 1, 256, 32], 0, ttl.tensor.Layout.TILE),
        (8, 1, [8, 1, 256, 32], 0, ttl.tensor.Layout.TILE),
        (8, 1, [1, 1, 32, 16384], 3, ttl.tensor.Layout.TILE),
        (4, 2, [1, 1, 32, 32768], 3, ttl.tensor.Layout.TILE),
        (4, 2, [4, 1, 256, 32], 0, ttl.tensor.Layout.ROW_MAJOR),
        (8, 1, [8, 1, 256, 32], 0, ttl.tensor.Layout.ROW_MAJOR),
        (8, 1, [1, 1, 32, 16384], 3, ttl.tensor.Layout.ROW_MAJOR),
        (4, 2, [1, 1, 32, 32768], 3, ttl.tensor.Layout.ROW_MAJOR),
    ],
)
@pytest.mark.parametrize(
    "input_dtype",
    [
        ttl.tensor.DataType.BFLOAT16,
        ttl.tensor.DataType.BFLOAT8_B,
    ],
)
@pytest.mark.parametrize(
    "mem_config",
    [
        ttl.tensor.MemoryConfig(buffer_type=ttl.tensor.BufferType.DRAM),
        ttl.tensor.MemoryConfig(buffer_type=ttl.tensor.BufferType.L1),
    ],
)
@pytest.mark.parametrize("num_iters", [100])  # TODO: restore to 500
@pytest.mark.parametrize("enable_async", [True, False])
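
For orientation, the decorated test receives the full cross-product of these parameters. Below is a minimal sketch of the shape of the test body; the all_devices fixture and run_all_gather_impl helper are stand-ins, not the actual implementation in test_all_gather.py:

def test_all_gather_on_t3000_post_commit_looping(
    all_devices,  # stand-in for the real device fixture
    num_devices, num_links, input_shape, dim, layout,
    input_dtype, mem_config, num_iters, enable_async,
):
    # Each parametrized case repeats the same config num_iters times.
    for _ in range(num_iters):
        # Hypothetical helper: builds the input, runs all-gather across
        # num_devices over num_links, and checks the gathered output.
        run_all_gather_impl(
            all_devices, num_devices, num_links, input_shape, dim, layout,
            input_dtype, mem_config, enable_async=enable_async,
        )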

There doesn't seem to be a pattern between allgather config (shape, datatype, async mode, mem config) and a hang presenting. At this time I have no indication about the source of the hang (op vs infra vs something else). Interestingly, I have successfully run 1.5M iterations of a single allgather config (8, 1, [1, 1, 32, 32768], L1, fp16), but at 800MHz.

SeanNijjar commented 4 months ago

Was so far unable to reproduce the hang with a more isolated test list. Things I've tried so far:

1) Run the above configs for 100k iterations each -> No hangs detected
2) Various subsets of post_commit_looping tests run in a loop (invoke https://pypi.org/project/pytest-repeat/ with --count=20; see the sketch after this list) -> No hangs detected
3) Run the post_commit_looping test in a loop (invoke https://pypi.org/project/pytest-repeat/ with --count=20) -> No hangs detected
4) Run the post_commit_looping test in a loop (invoke https://pypi.org/project/pytest-repeat/ with --count=20), but where each test only has num_iters=1 -> No hangs detected
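
For concreteness, here is a sketch of the repeat invocation used in attempts 2-4, expressed via pytest.main (equivalent to the command line); the -k expression is an assumption about how the tests were selected:

# Requires pytest-repeat (pip install pytest-repeat); --count repeats each collected test.
import pytest

pytest.main([
    "tests/tt_eager/python_api_testing/unit_testing/misc/test_all_gather.py",
    "-k", "test_all_gather_on_t3000_post_commit_looping",
    "--count=20",
])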

Given that individual allgather configs can easily run 100k iterations without hangs (I've also had multiple successful 1M+ runs in past days, but at 800MHz), I think this hang may have something to do with running different configurations back to back. Maybe there just aren't enough configs in post_commit_looping to expose whatever this bug is.

I'll try again when I've got some machine downtime (i.e. when I'm doing dev work as opposed to something like active debug).

tapspatel commented 4 months ago

Setup stress test pipeline

branch: t3000-stress-pipeline
test_file: tt-metal/tests/scripts/t3000/run_t3000_stress_tests.sh
pipeline: https://github.com/tenstorrent/tt-metal/actions/workflows/t3000-stress-tests.yaml

I added a tt-smi-metal -r 0,1,2,3 that resets the boards between tests. You can use it to lazily submit many jobs with thousands of iterations and reset between them to ensure good board state.
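
For local runs outside the pipeline, here is a minimal conftest.py sketch that shells out to the same reset after every test (this is an assumption about local wiring, not what run_t3000_stress_tests.sh actually does):

# Reset boards 0-3 after each test so every stress iteration starts from a clean board state.
import subprocess

import pytest

@pytest.fixture(autouse=True)
def reset_boards_between_tests():
    yield  # let the test run first
    subprocess.run(["tt-smi-metal", "-r", "0,1,2,3"], check=True)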

SeanNijjar commented 4 months ago

I ran the allgather post-commit suite overnight in a loop with a really long timeout so I could capture the machine in a hang state without any device or dispatcher teardown, and found something really unexpected.

The post commit tests are still running (no hangs). This is suspicious because I was able to reproduce a hang after a couple hours with other attempts.

I found something interesting. I don't think we have a real hang here; instead, some pathological behaviour is causing extremely slow execution. Here are a couple of snapshots from my log that show a multi-hour delay between adjacent test cases:

Screenshot 1 (log excerpt image)

Screenshot 2 (log excerpt image)

Screenshot 3 (log excerpt image)

TL;DR: Not a real hang!? Instead some pathological behaviour that causes ridiculous slowdown in some part of what looks like readback?
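
For what it's worth, these gaps are easy to flag mechanically. Here is a sketch that scans a timestamped run log and reports adjacent lines more than an hour apart; the timestamp format and file name are assumptions about the log:

# Flag multi-hour gaps between adjacent timestamped lines in a run log.
import re
from datetime import datetime

TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")

def find_gaps(log_path, threshold_s=3600):
    prev_ts, prev_line = None, None
    with open(log_path) as f:
        for line in f:
            m = TS_RE.match(line)
            if not m:
                continue
            ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
            if prev_ts is not None and (ts - prev_ts).total_seconds() > threshold_s:
                print(f"{ts - prev_ts} gap between:\n  {prev_line.rstrip()}\n  {line.rstrip()}")
            prev_ts, prev_line = ts, line

find_gaps("allgather_overnight.log")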

This reminds me of an issue (#6212) I was seeing a little while ago, in that I couldn't reliably reproduce the pathological behaviour and sometimes only saw it after a couple of runs. I wonder if that issue also popped up for the smaller shapes (like I see in a couple of the screenshots above), but I just never noticed because the extra memory never ate into swap. The two may be related.

(Update on above: I realized I was running in debug mode. However I tried again in release mode and saw similar behaviour -> see next comment.)

SeanNijjar commented 4 months ago

From my release build run, I'm seeing the same thing (albeit in different places):

Image 1 (log excerpt image)

Image 2 (log excerpt image)

The run is still in progress, but this is pretty reassuring that we're not actually seeing a hang at all, just some really slow operation somewhere.
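
One cheap way to surface these cases earlier might be to report per-test durations instead of waiting for the 2400s timeout. A conftest.py hook sketch follows (the 600s threshold is an arbitrary choice); pytest's built-in --durations=N flag gives a similar end-of-run summary:

# Warn whenever a test's call phase runs longer than 10 minutes.
import warnings

def pytest_runtest_logreport(report):
    if report.when == "call" and report.duration > 600:
        warnings.warn(f"slow test: {report.nodeid} took {report.duration:.0f}s")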

SeanNijjar commented 4 months ago

FYI @cfjchu, @tt-aho, @tt-asaigal, since this has the potential to be related to something in the runtime, and I know you have been dealing with pytorch-related difficulties recently. Putting it on your radar in case you have any ideas or see things in the future that could be related.