pytorch / test-infra

This repository hosts code that supports the testing infrastructure for the main PyTorch repo. For example, this repo hosts the logic to track disabled tests and slow tests, as well as our continuous integration jobs HUD/dashboard.
https://hud.pytorch.org/

[Dr CI] Wrong classification for flaky jobs #5540

Open clee2000 opened 1 month ago

clee2000 commented 1 month ago

The similar failure links back to the experimental split build job, which is marked as unstable on this PR.

:link: Helpful Links

:test_tube: See artifacts and rendered test results at hud.pytorch.org/pr/132118

Note: Links to docs will display an error until the docs builds have been completed.

:white_check_mark: You can merge normally! (3 Unrelated Failures)

As of commit 8c9ec980fcffa6a50ca475e5b7cb372819803961 with merge base 91df66ee74004fe5bf050aad30c8915f81c4e870:

UNSTABLE - The following jobs failed but were likely due to flakiness present on trunk and has been marked as unstable:

* [pull / linux-focal-cuda11.8-py3.10-gcc9 / test (distributed, 1, 3, linux.8xlarge.nvidia.gpu)](https://hud.pytorch.org/pr/pytorch/pytorch/132118#28418151450) ([gh](https://github.com/pytorch/pytorch/actions/runs/10270036974/job/28418151450)) ([related job](https://hud.pytorch.org/pytorch/pytorch/commit/8c9ec980fcffa6a50ca475e5b7cb372819803961#28418234882)) `distributed/checkpoint/test_file_system_checkpoint_cpu.py::TestDistributedReshardOnLoad::test_load_rowwise_to_colwise_thread_count_1`
* [pull / linux-jammy-py3.8-gcc11 / test (distributed, 1, 2, amz2023.linux.2xlarge)](https://hud.pytorch.org/pr/pytorch/pytorch/132118#28417668459) ([gh](https://github.com/pytorch/pytorch/actions/runs/10270036974/job/28417668459)) ([related job](https://hud.pytorch.org/pytorch/pytorch/commit/8c9ec980fcffa6a50ca475e5b7cb372819803961#28418234882)) `distributed/checkpoint/test_file_system_checkpoint_cpu.py::TestDistributedReshardOnLoad::test_load_rowwise_to_colwise_thread_count_1`
* [trunk / linux-focal-cuda11.8-py3.10-gcc9-experimental-split-build-test / test (distributed, 3, 3, linux.8xlarge.nvidia.gpu)](https://hud.pytorch.org/pr/pytorch/pytorch/132118#28418234882) ([gh](https://github.com/pytorch/pytorch/actions/runs/10270038730/job/28418234882)) ([#129539](https://hud.pytorch.org/pytorch/pytorch/issues/129539)) `distributed/checkpoint/test_file_system_checkpoint_cpu.py::TestDistributedReshardOnLoad::test_load_rowwise_to_colwise_thread_count_1`

This comment was automatically generated by Dr. CI and updates every 15 minutes.

clee2000 commented 1 month ago

Another case here

The `test_fully_shard_init` failure is real, and the related job links back to the experimental split build job on the same commit.

:link: Helpful Links

:test_tube: See artifacts and rendered test results at hud.pytorch.org/pr/132934

Note: Links to docs will display an error until the docs builds have been completed.

:x: 3 New Failures, 2 Unrelated Failures

As of commit 6b383fcc1295e19a4aa44bc3f7a68065227bf4f8 with merge base e16276b9bf9e7c5cfcfd8242d336b26eb7dd182f:

NEW FAILURES - The following jobs have failed:

* [pull / linux-focal-py3.12-clang10 / test (dynamo, 3, 3, linux.2xlarge)](https://hud.pytorch.org/pr/pytorch/pytorch/132934#28499826858) ([gh](https://github.com/pytorch/pytorch/actions/runs/10296939299/job/28499826858)) `test_jit.py::TestScript::test_function_overloading_isinstance`
* [pull / linux-focal-py3.12-clang10-experimental-split-build / test (dynamo, 3, 3, linux.2xlarge)](https://hud.pytorch.org/pr/pytorch/pytorch/132934#28499659973) ([gh](https://github.com/pytorch/pytorch/actions/runs/10296939299/job/28499659973)) `test_jit.py::TestScript::test_function_overloading_isinstance`
* [trunk / win-vs2019-cpu-py3 / test (default, 1, 3, windows.4xlarge.nonephemeral)](https://hud.pytorch.org/pr/pytorch/pytorch/132934#28500379454) ([gh](https://github.com/pytorch/pytorch/actions/runs/10296940162/job/28500379454)) `functorch/test_aotdispatch 1/5 failed!`

UNSTABLE - The following jobs failed but were likely due to flakiness present on trunk and has been marked as unstable:

* [pull / linux-focal-cuda11.8-py3.10-gcc9 / test (distributed, 1, 3, linux.8xlarge.nvidia.gpu)](https://hud.pytorch.org/pr/pytorch/pytorch/132934#28500239869) ([gh](https://github.com/pytorch/pytorch/actions/runs/10296939299/job/28500239869)) ([related job](https://hud.pytorch.org/pytorch/pytorch/commit/6b383fcc1295e19a4aa44bc3f7a68065227bf4f8#28500034829)) `distributed/_composable/fsdp/test_fully_shard_init.py::TestFullyShardShardedParameterTensor::test_raise_scalar_parameter`
* [trunk / linux-focal-cuda11.8-py3.10-gcc9-experimental-split-build-test / test (distributed, 1, 3, linux.8xlarge.nvidia.gpu)](https://hud.pytorch.org/pr/pytorch/pytorch/132934#28500034829) ([gh](https://github.com/pytorch/pytorch/actions/runs/10296940162/job/28500034829)) ([#129539](https://hud.pytorch.org/pytorch/pytorch/issues/129539)) `distributed/_composable/fsdp/test_fully_shard_init.py::TestFullyShardShardedParameterTensor::test_raise_scalar_parameter`

This comment was automatically generated by Dr. CI and updates every 15 minutes.
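
To make the misclassification in both cases concrete, here is a minimal sketch of one guard that might avoid it. It is not Dr. CI's actual implementation: the `Job` type, its fields, and `is_valid_flaky_evidence` are hypothetical names chosen for illustration, and the rule itself (do not treat a job on the same commit, or a job already marked unstable, as evidence of flakiness on trunk) is an assumption about what a fix could look like.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    """Hypothetical, simplified view of a CI job (not Dr. CI's real schema)."""
    name: str
    commit: str
    failed_test: str
    is_unstable: bool  # assumed flag: job is tracked by an open "unstable" issue


def is_valid_flaky_evidence(failed: Job, related: Job) -> bool:
    """Decide whether `related` may justify marking `failed` as flaky.

    Assumed rule for illustration: the related job must show the same
    failing test, must come from a different commit, and must not itself
    be an unstable job.
    """
    if failed.failed_test != related.failed_test:
        return False
    # A failure on the same commit may be caused by the PR itself.
    if failed.commit == related.commit:
        return False
    # An unstable job's failure is ignored, not proven to be flaky.
    if related.is_unstable:
        return False
    return True


# The second case from this issue: both jobs run on the PR's own commit.
TEST = (
    "distributed/_composable/fsdp/test_fully_shard_init.py::"
    "TestFullyShardShardedParameterTensor::test_raise_scalar_parameter"
)
PR_COMMIT = "6b383fcc1295e19a4aa44bc3f7a68065227bf4f8"

pull_job = Job(
    name="pull / linux-focal-cuda11.8-py3.10-gcc9 / test (distributed, 1, 3)",
    commit=PR_COMMIT,
    failed_test=TEST,
    is_unstable=False,
)
split_build_job = Job(
    name="trunk / linux-focal-cuda11.8-py3.10-gcc9-experimental-split-build-test",
    commit=PR_COMMIT,
    failed_test=TEST,
    is_unstable=True,
)

print(is_valid_flaky_evidence(pull_job, split_build_job))  # False
```

Under this assumed rule, the experimental split build job would no longer be accepted as the related job, so the `pull` failure would stay in the new failures list instead of being marked unstable.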