pytorch/ignite

Retry tests #3229

Closed leej3 closed 1 month ago

leej3 commented 3 months ago

@vfdev-5

This is an attempt to rerun some of the CI tests using the nick-fields retry GitHub Action. The objective is to rerun steps in the GitHub workflows that have failed due to flaky tests or tests that hang. Using the pytest argument --last-failed helps here, as tests that already passed are not re-executed across reruns.
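For context, this is a minimal sketch of what wrapping a test step in the retry action can look like (the action version, timings, and the command are illustrative, not the exact workflow changes in this PR):

```yaml
# Hypothetical workflow step: retry the pytest invocation when it fails or hangs.
- name: Run tests (with retry)
  uses: nick-fields/retry@v3
  with:
    max_attempts: 3        # rerun a failed/hung attempt up to two more times
    timeout_minutes: 25    # kill a hung attempt before the overall job timeout
    shell: bash
    command: |
      pytest --last-failed tests/   # later attempts only rerun recorded failures
```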

An edge case that is trickier to handle is when the first pytest invocation passes and a subsequent one fails. In some circumstances the retry then re-executes all of the first invocation's tests, since --last-failed has no failures recorded for it. The pytest argument "--last-failed-no-failures none" can help here, though I have observed undesirable behaviour in some cases when using it (see reproducer below). Sometimes it skips all tests when I would expect it (or at least hope for it) to run some tests. Additionally, if all tests are deselected, pytest exits with code 5. For now I catch and ignore that exit code for the first invocation.

Bash script to set up reproducer
 
```bash
#!/bin/bash
# Create the directory if it doesn't exist
pytest --cache-clear || true
rm -rf lf-behaviour
mkdir -p lf-behaviour

# Create the Python test file
cat << EOF > lf-behaviour/test_example.py
#!/usr/bin/env python3
import pytest

@pytest.fixture
def cache(pytestconfig):
    return pytestconfig.cache

def test_common_pass():
    assert True

def test_common_flaky_fail(cache):
    retrieved_cache = cache.get("common_flaky_fail", 0)
    if not retrieved_cache > 2:
        cache.set("common_flaky_fail", retrieved_cache + 1)
        assert False
    # Passes after the first failure
    assert True

def test_unique1_pass(cache):
    assert True

def test_unique1_flaky_fail(cache):
    retrieved_cache = cache.get("unique1_flaky_fail", 0)
    if not retrieved_cache > 2:
        cache.set("unique1_flaky_fail", retrieved_cache + 1)
        assert False
    # Passes after the first failure
    assert True

def test_unique2_pass():
    assert True

def test_unique2_fail():
    assert False
EOF

# Create the Python test file unique to the first run
cat << EOF > lf-behaviour/test_example_unique3.py
#!/usr/bin/env python3
import pytest

@pytest.fixture
def cache(pytestconfig):
    return pytestconfig.cache

def test_separately_unique3_pass(cache):
    assert True

def test_separately_unique3_fail(cache):
    retrieved_cache = cache.get("unique3_flaky_fail", 0)
    if not retrieved_cache > 2:
        cache.set("unique3_flaky_fail", retrieved_cache + 1)
        assert False
    # Passes after the first failure
    assert True
EOF

# Create the Python test file unique to the second run
cat << EOF > lf-behaviour/test_example_unique4.py
#!/usr/bin/env python3
import pytest

def test_separately_unique4_pass():
    assert True

def test_separately_unique4_fail():
    assert False
EOF

# Create the shell script
cat << EOF > lf-behaviour/run_unique_individually.sh
#!/bin/bash
set -ex
pytest \${EXTRA_PYTEST_ARGS} -k "unique1" test_example.py || { exit_code=\$?; if [ "\$exit_code" -eq 5 ]; then echo "All tests deselected"; else exit \$exit_code; fi; }
pytest \${EXTRA_PYTEST_ARGS} test_example_unique3.py || { exit_code=\$?; if [ "\$exit_code" -eq 5 ]; then echo "All tests deselected"; else exit \$exit_code; fi; }
pytest \${EXTRA_PYTEST_ARGS} -k "unique2" test_example.py || { exit_code=\$?; if [ "\$exit_code" -eq 5 ]; then echo "All tests deselected"; else exit \$exit_code; fi; }
pytest \${EXTRA_PYTEST_ARGS} test_example_unique4.py || { exit_code=\$?; if [ "\$exit_code" -eq 5 ]; then echo "All tests deselected"; else exit \$exit_code; fi; }
EOF

# Create the shell script
cat << EOF > lf-behaviour/run_tests.sh
#!/bin/bash
set -ex
pytest \${EXTRA_PYTEST_ARGS} -k "test_common or unique1 or unique3" . || { exit_code=\$?; if [ "\$exit_code" -eq 5 ]; then echo "All tests deselected"; else exit \$exit_code; fi; }
pytest \${EXTRA_PYTEST_ARGS} -k "test_common or unique2 or unique4" . || { exit_code=\$?; if [ "\$exit_code" -eq 5 ]; then echo "All tests deselected"; else exit \$exit_code; fi; }
EOF

cd lf-behaviour
bash run_unique_individually.sh

# Echo the command to run the tests
echo "To run the tests, use the following command (you can also add pytest args --cache-clear and --last-failed-no-failures [all/none]):"
echo cd lf-behaviour
echo EXTRA_PYTEST_ARGS="--last-failed --last-failed-no-failures none" bash run_unique_individually.sh
```

That leaves one remaining problem before this approach is usable (i.e. before this PR could be merged): currently some of the pytest invocations hang. When that happens, the tests that were never executed are automatically treated as having passed, which can result in most of the test suite being skipped completely on the retry. I haven't thought of a way to address this yet.

The simpler solution of removing these extra pytest arguments would work, but the timeout for the full job would need to be increased well beyond its current 45 minutes.

EDIT: opting for the simpler solution for now. Perhaps using pytest-timeout could help with test hanging. Previous work here.

EDIT 2: using separate pytest cache directories for different runs/executions of pytest on a particular CI machine allows the --last-failed functionality to be used effectively.
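A minimal sketch of the EDIT 2 idea (paths and the environment variable are illustrative, not the exact scripts in this PR):

```bash
# Give each logical test run its own pytest cache directory so --last-failed
# only replays failures recorded by that run on this machine.
CACHE_DIR="/tmp/pytest-cache-${GITHUB_RUN_ID:-local}-cpu"

# First attempt: run the suite and record failures in the dedicated cache.
# (pytest-timeout's --timeout flag could additionally cap individual hanging tests.)
pytest -o cache_dir="${CACHE_DIR}" tests/ || true

# Retry: rerun only the recorded failures; exit code 5 means nothing was
# collected, i.e. there were no failures left to rerun.
pytest -o cache_dir="${CACHE_DIR}" --last-failed --last-failed-no-failures none tests/ \
  || { ec=$?; if [ "$ec" -eq 5 ]; then echo "No failed tests to rerun"; else exit "$ec"; fi; }
```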

leej3 commented 2 months ago

This is working reasonably well now.

Also, I abandoned "--dist=loadgroup" because of the various things hanging. I can't figure out what's causing the hangs, but if we could turn that option on for the Linux tests it would reduce their execution time considerably.

vfdev-5 commented 2 months ago

Do you know why this one is still failing: https://github.com/pytorch/ignite/actions/runs/8878484399/job/24374218672?pr=3229 ?

> one potential issue that I see locally is that tests that cause an error don't get rerun.

Can you elaborate on this one?

leej3 commented 2 months ago

> Do you know why this one is still failing

After all the tests completed, a "shutting down" message from Neptune was displayed and the job hung there for another 10 minutes or so.

Currently, when tests hang they get restarted by the GitHub action at the 25-minute mark. That means that if a particular machine hangs twice, the job is cancelled due to the 45-minute timeout. We could adjust the retry timeout down to 20 minutes to improve the chances of success, but figuring out the causes of the hanging is the ideal.

Regarding the errors: if an invocation of pytest has both failures (invalid assertions, etc.) and errors (failed fixture execution, etc.), it seems that pytest reruns the failures but ignores the errors. That is likely a deal breaker... I haven't thought of a workaround yet. (EDIT: I think I was wrong about this; a more thorough investigation shows that errors do get rerun.)
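For reference, this is the failure/error distinction being discussed (a generic pytest example, not code from this PR or from the reproducer above):

```python
# test_failure_vs_error.py
import pytest

@pytest.fixture
def broken_fixture():
    raise RuntimeError("fixture setup blew up")  # fixture exceptions are reported as errors

def test_reported_as_failure():
    assert 1 == 2  # a failing assertion is reported as a failure

def test_reported_as_error(broken_fixture):
    assert True  # never reached; the error happens during fixture setup
```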

leej3 commented 2 months ago

I think this is ready to go now. Overall it:

Test failures were too frequent to avoid using the "--last-failed" functionality of pytest. This introduced some extra details, such as needing to specify the pytest cache directory and trapping an exit code of 5 when a previous pytest invocation had no failures. To keep things maintainable I moved these details into a function shared across all the scripts.
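A sketch of what such a shared helper might look like (the function name and paths are illustrative; this is not the exact function added in the PR):

```bash
# Shared wrapper: pin the pytest cache directory, rerun only previous failures,
# and treat exit code 5 ("no tests collected") as success.
run_tests_with_retry_cache() {
    local cache_dir="$1"; shift
    local ec=0
    pytest -o cache_dir="${cache_dir}" --last-failed --last-failed-no-failures none "$@" || ec=$?
    if [ "${ec}" -eq 5 ]; then
        echo "All tests deselected; nothing left to rerun"
        return 0
    fi
    return "${ec}"
}

# Example usage from a CI script (path is illustrative):
# run_tests_with_retry_cache /tmp/pytest-cache-cpu -vvv tests/
```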

leej3 commented 1 month ago

Success!

A significant issue with retrying the test suite was that the retry action sends a SIGTERM, which pytest completely ignores. The result was overlapping pytest sessions that made a bit of a mess. I abandoned a failed strategy of trying to kill the pytest process (SIGINT and then SIGKILL) upon retrying. It is easier to modify how pytest interprets the SIGTERM (though it's worth noting that this behaviour is modified, in case it causes other undesirable side effects).
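A minimal sketch of the kind of change involved (not the exact code merged here), assuming a conftest.py hook that converts SIGTERM into pytest's normal interrupt path:

```python
# conftest.py (sketch): let a SIGTERM from the retry action stop the session
# cleanly instead of being ignored, which previously left overlapping sessions.
import signal

def pytest_configure(config):
    def _handle_sigterm(signum, frame):
        # Re-raise as KeyboardInterrupt so pytest runs its usual interrupt/teardown logic.
        raise KeyboardInterrupt(f"terminated by signal {signum}")

    signal.signal(signal.SIGTERM, _handle_sigterm)
```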

Regardless of how the test session is interrupted, one tricky detail was including "unrun" tests in subsequent retries. These are tests that did not run because the session hung and was interrupted: pytest appropriately classifies them as not failed, but for our purposes we do in fact want to include them in the next --last-failed run. My solution was to add an option "--treat-unrun-as-failed" that classifies such tests as failed for subsequent runs.
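Conceptually, such an option can be built from standard pytest hooks and the cache that --last-failed reads. This is a simplified sketch of the idea, not the exact implementation in the PR:

```python
# conftest.py (sketch): record collected-but-never-run tests as "failed" in the
# cache used by --last-failed, so an interrupted or hung session does not
# silently drop them from the next retry.
import pytest

_collected = set()
_reported = set()

def pytest_addoption(parser):
    parser.addoption(
        "--treat-unrun-as-failed",
        action="store_true",
        default=False,
        help="Mark tests that never ran as failed for the next --last-failed run.",
    )

def pytest_collection_modifyitems(items):
    _collected.update(item.nodeid for item in items)

def pytest_runtest_logreport(report):
    # Any setup/call/teardown report means the test at least started running.
    _reported.add(report.nodeid)

@pytest.hookimpl(trylast=True)
def pytest_sessionfinish(session, exitstatus):
    # Run after pytest's own last-failed bookkeeping (best effort).
    if not session.config.getoption("--treat-unrun-as-failed"):
        return
    unrun = _collected - _reported
    if unrun:
        lastfailed = session.config.cache.get("cache/lastfailed", {})
        lastfailed.update({nodeid: True for nodeid in unrun})
        session.config.cache.set("cache/lastfailed", lastfailed)
```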

The overall PR summary: