Add CI backstop timeout to limit hanging jobs

tdyas commented 1 week ago

This job failed after six hours or so. CI should have a backstop timeout in place so hanging jobs fail much earlier than that. Say one hour or so.

jsirois commented 1 week ago

Since the jobs can take up to an hour under normal circumstances (pypy IT shards), one hour would be too short; I'd want the timeout to be at least 2 hours.

What I really don't want is to band-aid over good signal. Before considering this, I'd like to examine the logs from the job you've linked (mac-it-1-of-2.log.txt) to see whether the problem is a bad test or even bad production code, and then actually fix the real problem. I've observed hung mac shards over the past few months, so I'd like to spend more time ruling myself in or out as the problem before throwing up my hands and blaming GH hosted macs. I firmly suspect PEBKAC here.
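For illustration, the backstop idea can be sketched as a small wrapper that kills a command once it exceeds a ceiling. This is only a sketch: the `run_with_backstop` name, the 2 hour value and the 124 exit code are assumptions made for the example, not anything in the Pex CI setup, and a job-level setting such as GitHub Actions' `timeout-minutes` would normally do this work instead.

```python
import subprocess
import sys
from typing import Sequence

# Assumption: a 2 hour ceiling, taken from the discussion above, not from any real CI config.
BACKSTOP_SECONDS = 2 * 60 * 60


def run_with_backstop(cmd: Sequence[str], timeout: float = BACKSTOP_SECONDS) -> int:
    """Run cmd and return its exit code, or 124 if it runs longer than timeout seconds."""
    try:
        return subprocess.run(cmd, timeout=timeout).returncode
    except subprocess.TimeoutExpired:
        # subprocess.run kills the direct child and waits for it before raising;
        # processes spawned by that child are not cleaned up here.
        print(f"Timed out after {timeout} seconds: {' '.join(cmd)}", file=sys.stderr)
        return 124  # The exit code GNU timeout uses; purely a convention in this sketch.
```

A hung command then fails fast with a distinct exit code rather than sitting until the CI provider's own limit; a real hang in a test that forks helpers would likely need process-group handling on top of this.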

jsirois commented 6 days ago

Ok, I ran this on the mac-it-1-of-2.log.txt linked above:

#!/usr/bin/env python3

import os
import re
import sys
from pathlib import Path
from typing import Any

def analyze(log: Path) -> Any:
    tests: dict[str, bool] = {}
    with log.open() as fp:
        for line in fp:
            # E.G.: 2024-11-13T06:29:33.3456360Z tests/integration/test_issue_1018.py::test_execute_module_alter_sys[ep-function-zipapp-VENV]
            # The test id stops at the first "[" so that a parameter list containing spaces
            # (like the "-c <code>-VENV" case below) is captured whole by the optional group.
            match = re.match(r"^.*\d+Z (?P<test>tests/[^\s\[]+(?:\[[^\]]+\])?).*", line)
            if match:
                test = match.group("test")
                if test not in tests:
                    tests[test] = False
                continue

            # E.G.: 2024-11-13T06:29:33.3478200Z [gw3] PASSED tests/integration/venv_ITs/test_issue_1745.py::test_interpreter_mode_python_options[-c <code>-VENV]
            match = re.match(r"^.*\d+Z \[gw\d+\] [A-Z]+ (?P<test>tests/[^\s\[]+(?:\[[^\]]+\])?).*", line)
            if match:
                tests[match.group("test")] = True
                continue

    hung_tests = sorted(test for test, complete in tests.items() if not complete)
    if hung_tests:
        return f"The following tests never finished:\n{os.linesep.join(hung_tests)}"

def main() -> Any:
    if len(sys.argv) != 2:
        return f"Usage: {sys.argv[0]} <CI log file>"

    log = Path(sys.argv[1])
    if not log.exists():
        # Report the log path, not the script name in sys.argv[0].
        return f"The log specified at {log} does not exist."

    return analyze(log)

if __name__ == "__main__":
    sys.exit(main())

And it yielded:

:; ./detect-hung.py mac-it-1-of-2.log.txt
The following tests never finished:
tests/integration/test_issue_2186.py::test_incompatible_resolve_error

jsirois commented 6 days ago

Ok, that's just 1 data point. I'd like to get a few 6 hour timeouts on mac shards analyzed to see if it's the same test hanging (looking at that test, it seems super unlikely that it in particular is hangy, but still). I'll get this script checked in.
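To compare several timed-out logs, the per-log check can be tallied across runs. The following is a rough sketch, not the script that was checked in: `hung_tests` and `tally_hung_tests` are hypothetical names, and the patterns are simplified stand-ins that assume the same log line shapes shown in the script's comments.

```python
import re
from collections import Counter
from typing import Iterable

# Assumed line shapes, mirroring the examples in the script above:
#   <timestamp>Z tests/...::test_name[params]                  (test started)
#   <timestamp>Z [gw3] PASSED tests/...::test_name[params]     (test finished)
TEST_ID = r"(?P<test>tests/[^\s\[]+(?:\[[^\]]*\])?)"
STARTED = re.compile(r"^\S+Z " + TEST_ID)
FINISHED = re.compile(r"^\S+Z \[gw\d+\] [A-Z]+ " + TEST_ID)


def hung_tests(lines: Iterable[str]) -> set[str]:
    """Return the tests that have a start line but no result line."""
    started: set[str] = set()
    finished: set[str] = set()
    for line in lines:
        match = FINISHED.match(line)
        if match:
            finished.add(match.group("test"))
            continue
        match = STARTED.match(line)
        if match:
            started.add(match.group("test"))
    return started - finished


def tally_hung_tests(logs: Iterable[Iterable[str]]) -> Counter:
    """Count, per test, how many logs show it as never finishing."""
    counts: Counter = Counter()
    for lines in logs:
        counts.update(hung_tests(lines))
    return counts
```

Each log can be passed as an open file object, since those iterate line by line. A test that tops `tally_hung_tests(...).most_common()` across several timed-out runs is a much stronger suspect than one that shows up once.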

jsirois commented 6 days ago

Ok, available on main here: https://github.com/pex-tool/pex/commit/7e3c248b700732a4259122be46c01d5017115b89

jsirois commented 4 days ago

Ok, and the 1 guess I had as to where a hang could be produced in Pex code has a fix in on main at 30c2ec8e. I won't close this, but I'll remove the in-progress label. If there are no similar issues in a month or so, I'll close. If there are, I'll use the new script to find which test hung and see if it matches the data point above.

pex-tool/pex #2593