python-trio / trio

Trio – a friendly Python library for async concurrency and I/O
https://trio.readthedocs.io

Tracking issue: intermittent test failures #200

Open njsmith opened 7 years ago

njsmith commented 7 years ago

It's the nature of I/O libraries like Trio that their test suites are prone to weird intermittent failures. But they're often hard to track down, and usually the way you encounter them is that you're trying to land some other unrelated feature and the CI randomly fails, so the temptation is to click "re-run build" and worry about it later.

This temptation must be resisted. If left unchecked, you eventually end up with tests that fail all the time for unknown reasons and no-one trusts them and it's this constant drag on development. Flaky tests must be eradicated.

But to make things extra fun, there's another problem: CI services genuinely are a bit flaky, so when you see a weird failure or lock-up in the tests then it's often unclear whether this is a bug in our code, or just some cloud provider having indigestion. And you don't want to waste hours trying to reproduce indigestion. Which means we need to compare notes across multiple failures. Which is tricky when I see one failure, and you see another, and neither of us realizes that we're seeing the same thing. Hence: this issue.

What to do if you see a weird test failure that makes no sense:

Issues we're monitoring currently

njsmith commented 7 years ago

Regarding the weird pypy nightly freeze in test_local: I downloaded the pypy-c-jit-91601-609a3cdf9cf7-linux64 nightly, and have let it loop running the trio test suite for the last few hours on my laptop, and I haven't been able to reproduce the problem so far. (Though I did get two failures in test_ki.py, both on this line, which is getting "did not raise KeyboardInterrupt". The test sessions otherwise finished normally.)

njsmith commented 7 years ago

#119 is now fixed, I think / hope. Demoting it to "on the radar".

njsmith commented 7 years ago

Got fed up and fixed #140 :-)

njsmith commented 7 years ago

#140 came back. This makes no sense at all.

njsmith commented 7 years ago

Freeze in test_local.py::test_run_local_simultaneous_runs on pypy3 5.8 – maybe just travis being flaky, maybe something more.

njsmith commented 7 years ago

Here's a weird one: https://travis-ci.org/python-trio/trio/jobs/298164123

It looks like in our test run for CPython 3.6.2 on macOS, one of our calls to the synchronous stdlib function SSLSocket.unwrap() raised an SSLWantWriteError. Which should be impossible for a synchronous call, I think? Maybe this is some weird intermittent bug in the stdlib ssl module?

njsmith commented 6 years ago

Here's a new one, I guess somehow introduced by #358: a timeout test failing on windows because a 1 second sleep is measured to take just a tinnnnny bit less than 1 second. https://ci.appveyor.com/project/njsmith/trio/build/1.0.768/job/3lbdyxl63q3h9s21

#361 attempts a diagnosis and fix.
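The suspected mechanism (spelled out in the check_takes_about comment quoted later in this thread) is that the run loop's clock adds a constant offset to both timestamps, and the two subtractions then pick up slightly different rounding errors. Here's a tiny self-contained illustration, with made-up numbers rather than values from the actual failure:

    # Made-up numbers, purely to illustrate the rounding effect: the same
    # interval computed from raw timestamps vs. from timestamps that both had
    # a large constant offset added to them can differ by a few ULPs.
    offset = 71234.5          # hypothetical run-loop clock offset
    t0, t1 = 0.1, 1.1         # hypothetical perf_counter readings
    direct = t1 - t0
    shifted = (t1 + offset) - (t0 + offset)
    print(direct, shifted, direct == shifted)  # can come out slightly different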

njsmith commented 6 years ago

The weird SSL failure happened again: https://travis-ci.org/python-trio/trio/jobs/311618077 Again on CPython 3.6 on MacOS.

Filed upstream as bpo-32219. Possibly for now we should ignore SSLWantWriteError there as a workaround.

Edit: #365 is the PR for ignoring it.
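A minimal sketch of the kind of workaround meant here (illustrative only, not the actual diff in #365; the helper name is made up): catch the spurious SSLWantWriteError around the synchronous unwrap() call.

    import ssl

    def unwrap_ignoring_want_write(ssl_sock: ssl.SSLSocket):
        # Illustrative workaround sketch, not the real #365 change.
        try:
            return ssl_sock.unwrap()
        except ssl.SSLWantWriteError:
            # bpo-32219: occasionally raised on CPython 3.6 / macOS even
            # though the socket is blocking, where it should be impossible.
            # Treat it as "close-notify not fully exchanged" and carry on.
            return None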

njsmith commented 6 years ago

Another freeze on PyPy nightly in tests/test_local.py::test_run_local_simultaneous_runs: https://travis-ci.org/python-trio/trio/jobs/319497598

Same thing happened on Sept. 7, above: https://github.com/python-trio/trio/issues/200#issuecomment-327711926 And back in June: https://github.com/python-trio/trio/issues/200#issuecomment-309187983

Filed a bug: #379

njsmith commented 6 years ago

Sigh: https://github.com/python-trio/trio/issues/447

njsmith commented 6 years ago

#508

njsmith commented 6 years ago

There was a mysterious appveyor build failure here: https://github.com/python-trio/trio/pull/535

pquentin commented 6 years ago

Strange PyPy nightly failure: https://travis-ci.org/python-trio/trio/jobs/391945139

Since it happened on master, I can't close/reopen a PR, but restarting the job produced the same effects.

(I think someone restarted the job above and it finally worked: the job is now green.)

Fuyukai commented 6 years ago

Jenkins keeps eating pull requests with a segfault (but only sometimes). Looks like a bug in the immutables library, but I can't reproduce it locally, and I don't know how to get the core dump.

njsmith commented 6 years ago

Here's a log with the segfault on Jenkins: https://ci.cryptography.io/blue/rest/organizations/jenkins/pipelines/python-trio/pipelines/trio/branches/PR-575/runs/2/nodes/6/steps/33/log/?start=0

The crash-handler traceback shows it as happening on line 27 of contextvars/__init__.py, which seems to be:

        self._data = immutables.Map()

And immutables.Map is a complex data structure implemented in C, so agreed that this smells like a bug in that code.

Filed a bug upstream here: https://github.com/MagicStack/immutables/issues/7

njsmith commented 6 years ago

The Jenkins thing has been worked around by #583... but now that we're running MacOS tests on Travis, we get #584 instead.

njsmith commented 6 years ago

A segfault in pypy 3.6 nightly, apparently related to the faulthandler timeout firing in trio/tests/test_ssl.py::test_renegotiation_randomized: https://travis-ci.org/python-trio/trio/jobs/433409577

Reported it on the #pypy IRC channel anyway, though there's not much to go on yet.

njsmith commented 6 years ago

Another strange pypy 3.6 nightly faulthandler traceback: https://travis-ci.org/python-trio/trio/jobs/436962955

I don't really understand this one at all.

njsmith commented 5 years ago

An appveyor failure, in test_open_tcp_listeners_backlog: https://ci.appveyor.com/project/njsmith/trio/builds/21430330/job/orp078dryaqrninr

Apparently we set the nominal backlog to 1 and the measured backlog was 15, while we assert that it's <= 10.

Edit (2019-02-05): another failure in the backlog test, this time on macOS: https://dev.azure.com/python-trio/trio/_build/results?buildId=150&view=logs

We asked for 11, but got 5.

njsmith commented 5 years ago

A failure in test_subprocess.test_signals on macOS: #851

njsmith commented 5 years ago

A one-second sleep took 1.39 seconds on Azure Pipelines' macOS py35 build: https://dev.azure.com/python-trio/trio/_build/results?buildId=598&view=logs

I guess we need to bump up that threshold...

njsmith commented 5 years ago

Azure Pipelines macOS 3.5 strikes again: https://dev.azure.com/python-trio/trio/_build/results?buildId=806&view=logs&jobId=93056758-5bfb-5750-f113-e720ddefdb4c

Here we tried sleeping for 1 second, and the process actually slept for >2 seconds. [Edit: Actually we tried sleeping for 0.1 second, and the process actually slept for >0.2 seconds.] So our automatic retry process kicked in... and it happened 4 times in a row, so the test still failed.

njsmith commented 5 years ago

Maybe we can make those sleep tests more robust by not just retrying, but first sleeping a bit before retrying, in order to give any noisy neighbors a chance to settle down?
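Something along these lines, perhaps (just a sketch; the helper name and numbers are made up, and this is not the retry logic the tests currently use):

    import time

    def retry_with_settle_time(check, attempts=4, pause=1.0):
        # Re-run a timing-sensitive check, sleeping between attempts so that
        # noisy CI neighbors get a chance to settle down.
        for attempt in range(attempts):
            try:
                return check()
            except AssertionError:
                if attempt == attempts - 1:
                    raise
                time.sleep(pause * (attempt + 1))  # back off a bit more each time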

njsmith commented 5 years ago

And again: https://dev.azure.com/python-trio/trio/_build/results?buildId=841&view=logs&jobId=1448839f-898c-5b46-2db8-b1e2e0090bd8

pquentin commented 4 years ago

I've seen https://github.com/python-trio/trio/issues/1170 again.

pquentin commented 4 years ago

https://dev.azure.com/python-trio/trio/_build/results?buildId=1498&view=logs&jobId=4bc7ef66-8a59-503f-fa7a-b9533d6853d5&j=4bc7ef66-8a59-503f-fa7a-b9533d6853d5&t=541e27d3-d681-5791-1ce2-84a4e949a5a6 failed:

+ mkdir empty
mkdir: cannot create directory 'empty': File exists

I see this all the time locally when running ci.sh multiple times, but the empty directory is not supposed to exist in our continuous integration. Maybe there's some sort of cache?

njsmith commented 4 years ago

mkdir: cannot create directory 'empty': File exists

Uh... there isn't supposed to be any kind of cache... azure does have a caching feature, but you have to enable it explicitly and tell it to cache specific things; it shouldn't be caching random files inside the source checkout. This looks super broken.

I guess we could attempt to debug it by adding some ls calls and waiting to see if it happens again. But if this is also causing trouble for you locally, maybe the pragmatic thing would be to make it rm -rf empty; mkdir empty and not worry about it?

njsmith commented 4 years ago

mkdir: cannot create directory 'empty': File exists strikes again: https://github.com/python-trio/trio/pull/1396#issuecomment-588316930

wgwz commented 4 years ago

What if we change it to mkdir -p?

       -p, --parents
              no error if existing, make parent directories as needed
njsmith commented 4 years ago

Sure, that might work too. We're just flailing in the dark here, so it's probably too optimistic to hope we can make rational fine-grained choices between similar options :-)

njsmith commented 4 years ago

Recently hit on Github Actions, Windows (3.8, x64, with non-IFS LSP):

2020-04-29T19:02:16.5149922Z ================================== FAILURES ===================================
2020-04-29T19:02:16.5150729Z ____________________ test_WaitForMultipleObjects_sync_slow ____________________
2020-04-29T19:02:16.5150934Z 
2020-04-29T19:02:16.5151963Z     @slow
2020-04-29T19:02:16.5152264Z     async def test_WaitForMultipleObjects_sync_slow():
2020-04-29T19:02:16.5152519Z         # This does a series of test in which the main thread sync-waits for
2020-04-29T19:02:16.5153005Z         # handles, while we spawn a thread to set the handles after a short while.
2020-04-29T19:02:16.5153264Z     
2020-04-29T19:02:16.5153831Z         TIMEOUT = 0.3
2020-04-29T19:02:16.5154042Z     
2020-04-29T19:02:16.5154267Z         # One handle
2020-04-29T19:02:16.5154463Z         handle1 = kernel32.CreateEventA(ffi.NULL, True, False, ffi.NULL)
2020-04-29T19:02:16.5154710Z         t0 = _core.current_time()
2020-04-29T19:02:16.5154947Z         async with _core.open_nursery() as nursery:
2020-04-29T19:02:16.5155174Z             nursery.start_soon(
2020-04-29T19:02:16.5155411Z                 trio.to_thread.run_sync, WaitForMultipleObjects_sync, handle1
2020-04-29T19:02:16.5155659Z             )
2020-04-29T19:02:16.5156204Z             await _timeouts.sleep(TIMEOUT)
2020-04-29T19:02:16.5156565Z             # If we would comment the line below, the above thread will be stuck,
2020-04-29T19:02:16.5156788Z             # and Trio won't exit this scope
2020-04-29T19:02:16.5157024Z             kernel32.SetEvent(handle1)
2020-04-29T19:02:16.5157228Z         t1 = _core.current_time()
2020-04-29T19:02:16.5157443Z         assert TIMEOUT <= (t1 - t0) < 2.0 * TIMEOUT
2020-04-29T19:02:16.5157672Z         kernel32.CloseHandle(handle1)
2020-04-29T19:02:16.5157902Z         print('test_WaitForMultipleObjects_sync_slow one OK')
2020-04-29T19:02:16.5158129Z     
2020-04-29T19:02:16.5158349Z         # Two handles, signal first
2020-04-29T19:02:16.5158722Z         handle1 = kernel32.CreateEventA(ffi.NULL, True, False, ffi.NULL)
2020-04-29T19:02:16.5159122Z         handle2 = kernel32.CreateEventA(ffi.NULL, True, False, ffi.NULL)
2020-04-29T19:02:16.5159386Z         t0 = _core.current_time()
2020-04-29T19:02:16.5159658Z         async with _core.open_nursery() as nursery:
2020-04-29T19:02:16.5159891Z             nursery.start_soon(
2020-04-29T19:02:16.5160368Z                 trio.to_thread.run_sync, WaitForMultipleObjects_sync, handle1,
2020-04-29T19:02:16.5160603Z                 handle2
2020-04-29T19:02:16.5160837Z             )
2020-04-29T19:02:16.5161061Z             await _timeouts.sleep(TIMEOUT)
2020-04-29T19:02:16.5161320Z             kernel32.SetEvent(handle1)
2020-04-29T19:02:16.5161549Z         t1 = _core.current_time()
2020-04-29T19:02:16.5161867Z >       assert TIMEOUT <= (t1 - t0) < 2.0 * TIMEOUT
2020-04-29T19:02:16.5162633Z E       assert (70981.67862983143 - 70980.60035523142) < (2.0 * 0.3)
2020-04-29T19:02:16.5162845Z 
2020-04-29T19:02:16.5163600Z C:\hostedtoolcache\windows\Python\3.8.2\x64\Lib\site-packages\trio\tests\test_wait_for_object.py:106: AssertionError
2020-04-29T19:02:16.5164689Z ---------------------------- Captured stdout call -----------------------------
2020-04-29T19:02:16.5164985Z test_WaitForMultipleObjects_sync_slow one OK
2020-04-29T19:02:16.5173571Z ============================== warnings summary ===============================
njsmith commented 4 years ago

Here's an interesting example of a flaky test: test_run_in_worker_thread_abandoned wants to make sure that a background thread didn't spew any traceback on the console, but it gets a false positive when random ResourceWarnings are spewed on the console at just the wrong time:

        # Make sure we don't have a "Exception in thread ..." dump to the console:
        out, err = capfd.readouterr()
>       assert not out and not err
E       AssertionError: assert (not '' and not 'Exception ignored in: <socket.socket fd=65, family=AddressFamily.AF_UNIX, type=SocketKind.SOCK_STREAM, proto=0>\nReso...nResourceWarning: unclosed <socket.socket fd=80, family=AddressFamily.AF_UNIX, type=SocketKind.SOCK_STREAM, proto=0>\n')

https://github.com/python-trio/trio/pull/1551/checks?check_run_id=720114354
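One possible way to make that assertion less brittle (just a sketch, not what the test currently does; the helper name is made up): filter out interpreter ResourceWarning chatter before asserting that nothing was dumped to the console.

    def assert_no_thread_traceback(capfd):
        # Illustrative helper: ignore "Exception ignored in: ..." /
        # ResourceWarning noise from the GC and only fail on real thread dumps.
        out, err = capfd.readouterr()
        assert not out
        noise_markers = ("Exception ignored in:", "ResourceWarning")
        real_lines = [
            line for line in err.splitlines()
            if line.strip() and not any(marker in line for marker in noise_markers)
        ]
        assert not real_lines, real_lines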

njsmith commented 4 years ago

Mysterious hang in test_warn_on_failed_cancel_terminate, on the Fedora build: https://travis-ci.org/github/python-trio/trio/jobs/693376980

There aren't really any useful details in the log. I think the only place where we could hang is at the end of the nursery block waiting for the task to exit, and we do call proc.terminate(), so either the terminate failed somehow or the detect-when-subprocess-exited code failed somehow? This build is using the pidfd subprocess management code so it could potentially indicate some flakiness there? Unclear.

pquentin commented 4 years ago

I think I've seen this one a few times already.

https://github.com/python-trio/trio/pull/1574/checks?check_run_id=739572585

___________________ test_warn_deprecated_no_instead_or_issue ___________________

recwarn_always = WarningsRecorder(record=True)

    def test_warn_deprecated_no_instead_or_issue(recwarn_always):
        # Explicitly no instead or issue
        warn_deprecated("water", "1.3", issue=None, instead=None)
>       assert len(recwarn_always) == 1
E       assert 5 == 1
E         +5
E         -1

/Users/runner/hostedtoolcache/Python/3.7.7/x64/lib/python3.7/site-packages/trio/tests/test_deprecate.py:46: AssertionError
njsmith commented 4 years ago

Pytest has better warnings tools these days; we could probably fix that test pretty easily using some combination of pytest.warns and (IIRC) the .pop method that takes a type of warning to search for.
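Roughly something like this, maybe (a sketch that assumes trio's internal warn_deprecated helper and the exported TrioDeprecationWarning; see the next comment for what was actually merged):

    import pytest
    from trio import TrioDeprecationWarning
    from trio._deprecate import warn_deprecated  # internal helper the test exercises

    def test_warn_deprecated_no_instead_or_issue():
        with pytest.warns(TrioDeprecationWarning) as record:
            warn_deprecated("water", "1.3", issue=None, instead=None)
        # .pop(cls) searches the recorded warnings for one of the given type,
        # so stray ResourceWarnings can no longer break an exact length check.
        warning = record.pop(TrioDeprecationWarning)
        assert "water" in str(warning.message)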


pquentin commented 4 years ago

@njsmith We're using recwarn + .pop already, which is identical to pytest.warns + .pop. I think @oremanj has found the correct fix in #1851, which I just merged.

pquentin commented 4 years ago

test_open_tcp_listeners_backlog failed again on macOS 3.8: https://github.com/python-trio/trio/pull/1575/checks?check_run_id=738751745

pquentin commented 4 years ago

New failure in the thread cache tests: https://github.com/python-trio/trio/issues/1604

njsmith commented 4 years ago

Transient failure in test_pipes on macOS, maybe just a too-aggressive timeout: https://github.com/python-trio/trio/pull/1701#issuecomment-681589841

pquentin commented 4 years ago

test_signals (as reported in #1170, it seems) failed again in #1705

pquentin commented 4 years ago

test_pipes failed again on macOS 3.8: https://github.com/python-trio/trio/pull/1713#issuecomment-686909595

njsmith commented 4 years ago

test_signals (as reported in #1170, it seems) failed again in #1705

This failure is super weird, and different from the issue in #1170. The test essentially does:

        with move_on_after(1.0) as scope:
            async with await open_process(SLEEP(3600)) as proc:
                proc.terminate()
        assert not scope.cancelled_caught
        assert proc.returncode == -SIGTERM

...and then the test fails on the last line because proc.returncode is -9 == -SIGKILL, when we were expecting -15 == -SIGTERM.

So it seems like somehow, the child process is dying from a SIGKILL. How could that be? The __aexit__ from the async with proc: block can send a SIGKILL to the process, but only after await proc.wait() either returns normally or is cancelled, and only after verifying that proc._proc.returncode is None:

        try:
            await self.wait()
        finally:
            if self._proc.returncode is None:
                self.kill()
                with trio.CancelScope(shield=True):
                    await self.wait()

I don't think wait could be cancelled here, because the test code does assert not scope.cancelled_caught to confirm that the timeout isn't firing. So that suggests that proc.wait must have returned after the SIGTERM, so proc._proc.returncode should have already been set to -SIGTERM. In fact, the code for wait even says assert self._proc.returncode is not None. So the finally: block in Process.aclose shouldn't have sent a SIGKILL.

So this seems to be one of those "that can't happen" errors... I don't know where this SIGKILL could be coming from.

test_pipes failed again on macOS 3.8: #1713 (comment)

I think this one is just a too-aggressive timeout: #1715

pquentin commented 4 years ago

test_sleep failure in https://github.com/python-trio/trio/pull/1722

2020-09-12T11:46:46.7648568Z 
2020-09-12T11:46:46.7648713Z     @slow
2020-09-12T11:46:46.7648917Z     async def test_sleep():
2020-09-12T11:46:46.7649137Z         async def sleep_1():
2020-09-12T11:46:46.7649425Z             await sleep_until(_core.current_time() + TARGET)
2020-09-12T11:46:46.7649860Z     
2020-09-12T11:46:46.7650124Z         await check_takes_about(sleep_1, TARGET)
2020-09-12T11:46:46.7650376Z     
2020-09-12T11:46:46.7650584Z         async def sleep_2():
2020-09-12T11:46:46.7650818Z             await sleep(TARGET)
2020-09-12T11:46:46.7651032Z     
2020-09-12T11:46:46.7651302Z >       await check_takes_about(sleep_2, TARGET)
2020-09-12T11:46:46.7651518Z 
2020-09-12T11:46:46.7652356Z C:\hostedtoolcache\windows\Python\3.8.5\x64\lib\site-packages\trio\tests\test_timeouts.py:54: 
2020-09-12T11:46:46.7652875Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2020-09-12T11:46:46.7653051Z 
2020-09-12T11:46:46.7653400Z f = <function test_sleep.<locals>.sleep_2 at 0x000001BD6DA138B0>
2020-09-12T11:46:46.7653775Z expected_dur = 1.0
2020-09-12T11:46:46.7653996Z 
2020-09-12T11:46:46.7654303Z     async def check_takes_about(f, expected_dur):
2020-09-12T11:46:46.7654688Z         start = time.perf_counter()
2020-09-12T11:46:46.7655086Z         result = await outcome.acapture(f)
2020-09-12T11:46:46.7655492Z         dur = time.perf_counter() - start
2020-09-12T11:46:46.7655816Z         print(dur / expected_dur)
2020-09-12T11:46:46.7656243Z         # 1.5 is an arbitrary fudge factor because there's always some delay
2020-09-12T11:46:46.7656782Z         # between when we become eligible to wake up and when we actually do. We
2020-09-12T11:46:46.7657330Z         # used to sleep for 0.05, and regularly observed overruns of 1.6x on
2020-09-12T11:46:46.7657881Z         # Appveyor, and then started seeing overruns of 2.3x on Travis's macOS, so
2020-09-12T11:46:46.7658424Z         # now we bumped up the sleep to 1 second, marked the tests as slow, and
2020-09-12T11:46:46.7658950Z         # hopefully now the proportional error will be less huge.
2020-09-12T11:46:46.7659304Z         #
2020-09-12T11:46:46.7659687Z         # We also allow for durations that are a hair shorter than expected. For
2020-09-12T11:46:46.7660219Z         # example, here's a run on Windows where a 1.0 second sleep was measured
2020-09-12T11:46:46.7660635Z         # to take 0.9999999999999858 seconds:
2020-09-12T11:46:46.7661220Z         #   https://ci.appveyor.com/project/njsmith/trio/build/1.0.768/job/3lbdyxl63q3h9s21
2020-09-12T11:46:46.7661934Z         # I believe that what happened here is that Windows's low clock resolution
2020-09-12T11:46:46.7662524Z         # meant that our calls to time.monotonic() returned exactly the same
2020-09-12T11:46:46.7663101Z         # values as the calls inside the actual run loop, but the two subtractions
2020-09-12T11:46:46.7663758Z         # returned slightly different values because the run loop's clock adds a
2020-09-12T11:46:46.7664297Z         # random floating point offset to both times, which should cancel out, but
2020-09-12T11:46:46.7664799Z         # lol floating point we got slightly different rounding errors. (That
2020-09-12T11:46:46.7665276Z         # value above is exactly 128 ULPs below 1.0, which would make sense if it
2020-09-12T11:46:46.7665694Z         # started as a 1 ULP error at a different dynamic range.)
2020-09-12T11:46:46.7666054Z >       assert (1 - 1e-8) <= (dur / expected_dur) < 1.5
2020-09-12T11:46:46.7666318Z E       assert (1.5026676000000023 / 1.0) < 1.5
2020-09-12T11:46:46.7666477Z 
2020-09-12T11:46:46.7667579Z C:\hostedtoolcache\windows\Python\3.8.5\x64\lib\site-packages\trio\tests\test_timeouts.py:35: AssertionError
2020-09-12T11:46:46.7668130Z ---------------------------- Captured stdout call -----------------------------
2020-09-12T11:46:46.7668367Z 1.0003469999999979
2020-09-12T11:46:46.7668537Z 1.5026676000000023
pquentin commented 4 years ago

macOS 3.8 test_interactive failure in https://github.com/python-trio/trio/pull/1726#issuecomment-692479386

pquentin commented 4 years ago

Another macOS 3.8 test_interactive failure in https://github.com/python-trio/trio/pull/1732/checks?check_run_id=1143057289

pquentin commented 4 years ago

Yet another macOS 3.8 test_interactive failure in https://github.com/python-trio/trio/pull/1733/checks?check_run_id=1175835741.

I think Python 3.8 changed subprocess handling for macOS (and Linux), but I don't know the details nor why those changes would affect Trio.

pquentin commented 4 years ago

macOS 3.7 interactive failure: https://github.com/python-trio/trio/pull/1747/checks?check_run_id=1208098695

So at least we know that this isn't specific to 3.8

belm0 commented 4 years ago

Timeouts on Windows builds are occurring often again:

https://github.com/python-trio/trio/issues/200#issuecomment-621442601

belm0 commented 3 years ago

"Windows (3.8, x64, with IFS LSP)" test is often running endlessly with this loop:

+ netsh winsock show catalog
+ grep 'Layered Chain Entry'
+ sleep 1
+ netsh winsock show catalog
+ grep 'Layered Chain Entry'
+ sleep 1