njsmith opened this issue 7 years ago
Regarding the weird pypy nightly freeze in test_local: I downloaded the pypy-c-jit-91601-609a3cdf9cf7-linux64 nightly and have let it loop running the trio testsuite for the last few hours on my laptop, and I haven't been able to reproduce the problem so far. (Though I did get two failures in test_ki.py, both on this line, which is getting "did not raise KeyboardInterrupt". The test sessions otherwise finished normally.)
Got fed up and fixed #140 :-)
Freeze in test_local.py::test_run_local_simultaneous_runs on pypy3 5.8 – maybe just travis being flaky, maybe something more.
Here's a weird one: https://travis-ci.org/python-trio/trio/jobs/298164123
It looks like in our test run for CPython 3.6.2 on MacOS, one of our calls to the synchronous, stdlib function SSLSocket.unwrap() raised an SSLWantWriteError. Which should be impossible for a synchronous call, I think? Maybe this is some weird intermittent bug in the stdlib ssl module?
Here's a new one, I guess somehow introduced by #358: a timeout test failing on windows because a 1 second sleep is measured to take just a tinnnnny bit less than 1 second. https://ci.appveyor.com/project/njsmith/trio/build/1.0.768/job/3lbdyxl63q3h9s21
The weird SSL failure happened again: https://travis-ci.org/python-trio/trio/jobs/311618077 Again on CPython 3.6 on MacOS.
Filed upstream as bpo-32219. Possibly for now we should ignore SSLWantWriteError there as a workaround.
Edit: #365 is the PR for ignoring it.
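For the record, here's a minimal sketch of what "ignore SSLWantWriteError there" could look like around the synchronous unwrap() call. The helper name is made up for illustration; #365 has the actual workaround.

import ssl

def unwrap_ignoring_want_write(ssl_sock):
    # Hypothetical helper: a blocking unwrap() shouldn't be able to raise
    # SSLWantWriteError, but CPython 3.6 on MacOS occasionally does anyway
    # (bpo-32219), so treat it as a no-op.
    try:
        return ssl_sock.unwrap()
    except ssl.SSLWantWriteError:
        return None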
Another freeze on PyPy nightly in tests/test_local.py::test_run_local_simultaneous_runs: https://travis-ci.org/python-trio/trio/jobs/319497598
Same thing happened on Sept. 7, above: https://github.com/python-trio/trio/issues/200#issuecomment-327711926 And back in June: https://github.com/python-trio/trio/issues/200#issuecomment-309187983
Filed a bug: #379
There was a mysterious appveyor build failure here: https://github.com/python-trio/trio/pull/535
Strange PyPy nightly failure: https://travis-ci.org/python-trio/trio/jobs/391945139
Since it happened on master, I can't close/reopen a PR, but restarting the job produced the same effects.
(I think someone restarted the job above and it finally worked: the job is now green.)
Jenkins keeps eating pull requests with a segfault (but only sometimes). Looks like a bug in the immutables library - but I can't reproduce it locally, and I don't know how to get the core dump.
Here's a log with the segfault on Jenkins: https://ci.cryptography.io/blue/rest/organizations/jenkins/pipelines/python-trio/pipelines/trio/branches/PR-575/runs/2/nodes/6/steps/33/log/?start=0
The crash-handler traceback shows it as happening on line 27 of contextvars/__init__.py, which seems to be:
self._data = immutables.Map()
And immutables.Map is a complex data structure implemented in C, so agreed that this smells like a bug in that code.
Filed a bug upstream here: https://github.com/MagicStack/immutables/issues/7
The Jenkins thing has been worked around by #583... but now that we're running MacOS tests on Travis, we get #584 instead.
a segfault in pypy 3.6 nightly, apparently related to the faulthandler timeout firing in trio/tests/test_ssl.py::test_renegotiation_randomized: https://travis-ci.org/python-trio/trio/jobs/433409577
Reported it on the #pypy irc channel anyway, though there's not much to go on yet
Another strange pypy 3.6 nightly faulthandler traceback: https://travis-ci.org/python-trio/trio/jobs/436962955
I don't really understand this one at all.
An appveyor failure, in test_open_tcp_listeners_backlog: https://ci.appveyor.com/project/njsmith/trio/builds/21430330/job/orp078dryaqrninr
Apparently we set the nominal backlog to 1 and the measured backlog was 15, while we assert that it's <= 10.
Edit (2019-02-05): another failure in the backlog test, this time on macOS: https://dev.azure.com/python-trio/trio/_build/results?buildId=150&view=logs
We asked for 11, but got 5.
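For context, here's the rough shape of such a backlog check, as a standalone sketch using the plain socket module rather than Trio's actual test: listen with a small backlog, never call accept(), and count how many client connections the kernel still queues. Kernels are free to round the nominal backlog up (or otherwise reinterpret it), which is exactly where the flakiness comes from.

import socket

def measure_backlog(nominal_backlog=1, limit=100):
    # Listener that never accepts, so completed connections pile up in the
    # kernel's accept queue until it refuses (or stops answering) new ones.
    listener = socket.socket()
    listener.bind(("127.0.0.1", 0))
    listener.listen(nominal_backlog)
    clients, queued = [], 0
    try:
        for _ in range(limit):
            c = socket.socket()
            c.settimeout(0.5)
            try:
                c.connect(listener.getsockname())
            except OSError:  # refused or timed out: the queue is full
                c.close()
                break
            clients.append(c)
            queued += 1
    finally:
        for c in clients:
            c.close()
        listener.close()
    return queued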
A failure in test_subprocess.test_signals on macOS: #851
A one second sleep took 1.39 seconds on azure pipelines's macos py35: https://dev.azure.com/python-trio/trio/_build/results?buildId=598&view=logs
I guess we need to bump up that threshold...
Azure Pipelines macOS 3.5 strikes again: https://dev.azure.com/python-trio/trio/_build/results?buildId=806&view=logs&jobId=93056758-5bfb-5750-f113-e720ddefdb4c
Here we tried sleeping for 1 second, and the process actually slept for >2 seconds. [Edit: Actually we tried sleeping for 0.1 second, and the process actually slept for >0.2 seconds.] So our automatic retry process kicked in... and it happened 4 times in a row, so the test still failed.
Maybe we can make those sleep tests more robust by not just retrying, but first sleeping a bit before retrying, in order to give any noisy neighbors a chance to settle down?
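Something along these lines, perhaps. This is only a sketch: check, the attempt count, and the pause length are placeholders, not names from Trio's test helpers.

import time

def retry_timing_check(check, attempts=4, pause=1.0):
    # Run a timing-sensitive check, sleeping between attempts so a temporarily
    # overloaded CI host gets a chance to settle down before we try again.
    for attempt in range(attempts):
        try:
            check()
            return
        except AssertionError:
            if attempt == attempts - 1:
                raise
            time.sleep(pause)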
I've seen https://github.com/python-trio/trio/issues/1170 again.
+ mkdir empty
mkdir: cannot create directory 'empty': File exists
I see this all the time locally when running ci.sh multiple times, but the empty directory is not supposed to exist in our continuous integration. Maybe there's some sort of cache?
Uh... there isn't supposed to be any kind of cache... azure does have a caching feature, but you have to enable it explicitly and tell it to cache specific things; it shouldn't be caching random files inside the source checkout. This looks super broken.
I guess we could attempt to debug it by adding some ls calls and waiting to see if it happens again. But if this is also causing trouble for you locally, maybe the pragmatic thing would be to make it rm -rf empty; mkdir empty and not worry about it?
"mkdir: cannot create directory 'empty': File exists" strikes again: https://github.com/python-trio/trio/pull/1396#issuecomment-588316930
What if we change it to mkdir -p?
-p, --parents
no error if existing, make parent directories as needed
Sure, that might work too. We're just flailing in the dark here, so it's probably too optimistic to hope we can make rational fine-grained choices between similar options :-)
Recently hit on Github Actions, Windows (3.8, x64, with non-IFS LSP):
================================== FAILURES ===================================
____________________ test_WaitForMultipleObjects_sync_slow ____________________

    @slow
    async def test_WaitForMultipleObjects_sync_slow():
        # This does a series of test in which the main thread sync-waits for
        # handles, while we spawn a thread to set the handles after a short while.

        TIMEOUT = 0.3

        # One handle
        handle1 = kernel32.CreateEventA(ffi.NULL, True, False, ffi.NULL)
        t0 = _core.current_time()
        async with _core.open_nursery() as nursery:
            nursery.start_soon(
                trio.to_thread.run_sync, WaitForMultipleObjects_sync, handle1
            )
            await _timeouts.sleep(TIMEOUT)
            # If we would comment the line below, the above thread will be stuck,
            # and Trio won't exit this scope
            kernel32.SetEvent(handle1)
        t1 = _core.current_time()
        assert TIMEOUT <= (t1 - t0) < 2.0 * TIMEOUT
        kernel32.CloseHandle(handle1)
        print('test_WaitForMultipleObjects_sync_slow one OK')

        # Two handles, signal first
        handle1 = kernel32.CreateEventA(ffi.NULL, True, False, ffi.NULL)
        handle2 = kernel32.CreateEventA(ffi.NULL, True, False, ffi.NULL)
        t0 = _core.current_time()
        async with _core.open_nursery() as nursery:
            nursery.start_soon(
                trio.to_thread.run_sync, WaitForMultipleObjects_sync, handle1,
                handle2
            )
            await _timeouts.sleep(TIMEOUT)
            kernel32.SetEvent(handle1)
        t1 = _core.current_time()
>       assert TIMEOUT <= (t1 - t0) < 2.0 * TIMEOUT
E       assert (70981.67862983143 - 70980.60035523142) < (2.0 * 0.3)

C:\hostedtoolcache\windows\Python\3.8.2\x64\Lib\site-packages\trio\tests\test_wait_for_object.py:106: AssertionError
---------------------------- Captured stdout call -----------------------------
test_WaitForMultipleObjects_sync_slow one OK
============================== warnings summary ===============================
Here's an interesting example of a flaky test: test_run_in_worker_thread_abandoned wants to make sure that a background thread didn't spew any traceback on the console. But it gets a false positive because of random ResourceWarnings being spewed on the console at just the wrong time:
        # Make sure we don't have a "Exception in thread ..." dump to the console:
        out, err = capfd.readouterr()
>       assert not out and not err
E       AssertionError: assert (not '' and not 'Exception ignored in: <socket.socket fd=65, family=AddressFamily.AF_UNIX, type=SocketKind.SOCK_STREAM, proto=0>\nReso...nResourceWarning: unclosed <socket.socket fd=80, family=AddressFamily.AF_UNIX, type=SocketKind.SOCK_STREAM, proto=0>\n')
https://github.com/python-trio/trio/pull/1551/checks?check_run_id=720114354
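One possible way to harden this (illustrative only, not necessarily what was done in Trio): strip out lines that belong to a ResourceWarning dump before asserting that the captured output is empty.

def without_resource_warning_noise(text):
    # Drop the interpreter's "Exception ignored in: ..." / ResourceWarning
    # chatter, which can appear at unpredictable times, and keep everything else.
    return "\n".join(
        line
        for line in text.splitlines()
        if "ResourceWarning" not in line
        and not line.startswith("Exception ignored in")
    )

# out, err = capfd.readouterr()
# assert not without_resource_warning_noise(out)
# assert not without_resource_warning_noise(err)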
Mysterious hang in test_warn_on_failed_cancel_terminate, on the Fedora build: https://travis-ci.org/github/python-trio/trio/jobs/693376980
There aren't really any useful details in the log. I think the only place where we could hang is at the end of the nursery block waiting for the task to exit, and we do call proc.terminate(), so either the terminate failed somehow or the detect-when-subprocess-exited code failed somehow? This build is using the pidfd subprocess management code, so it could potentially indicate some flakiness there? Unclear.
I think I've seen this one a few times already.
https://github.com/python-trio/trio/pull/1574/checks?check_run_id=739572585
___________________ test_warn_deprecated_no_instead_or_issue ___________________

recwarn_always = WarningsRecorder(record=True)

    def test_warn_deprecated_no_instead_or_issue(recwarn_always):
        # Explicitly no instead or issue
        warn_deprecated("water", "1.3", issue=None, instead=None)
>       assert len(recwarn_always) == 1
E       assert 5 == 1
E         +5
E         -1

/Users/runner/hostedtoolcache/Python/3.7.7/x64/lib/python3.7/site-packages/trio/tests/test_deprecate.py:46: AssertionError
Pytest has better warnings tools these days; we could probably fix that test pretty easily using some combination of pytest.warns and (IIRC) the .pop method that takes a type of warning to search for.
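Roughly what that would look like, as a hedged sketch: it assumes warn_deprecated lives in trio._deprecate and emits TrioDeprecationWarning, and the final assertion is only illustrative.

import pytest
from trio import TrioDeprecationWarning
from trio._deprecate import warn_deprecated

def test_warn_deprecated_no_instead_or_issue():
    with pytest.warns(TrioDeprecationWarning) as record:
        warn_deprecated("water", "1.3", issue=None, instead=None)
    # pop() finds the first recorded warning of the given class, so unrelated
    # warnings (ResourceWarning etc.) captured at the same time can't break
    # the length check the old test relied on.
    w = record.pop(TrioDeprecationWarning)
    assert "water" in str(w.message)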
@njsmith We're using recwarn + .pop already, which is identical to pytest.warns + .pop. I think @oremanj has found the correct fix in #1851, which I just merged.
test_open_tcp_listeners_backlog failed again on macOS 3.8: https://github.com/python-trio/trio/pull/1575/checks?check_run_id=738751745
New failure in the thread cache tests: https://github.com/python-trio/trio/issues/1604
Transient failure in test_pipes on macOS, maybe just a too-aggressive timeout: https://github.com/python-trio/trio/pull/1701#issuecomment-681589841
test_signals (as reported in #1170, it seems) failed again in #1705
test_pipes failed again on macOS 3.8: https://github.com/python-trio/trio/pull/1713#issuecomment-686909595
test_signals (as reported in #1170, it seems) failed again in #1705
This failure is super weird, and different from the issue in #1170. The test essentially does:
with move_on_after(1.0) as scope:
    async with await open_process(SLEEP(3600)) as proc:
        proc.terminate()

assert not scope.cancelled_caught
assert proc.returncode == -SIGTERM
...and then the test fails on the last line, because proc.returncode is -9 == -SIGKILL when we were expecting -15 == -SIGTERM.
So it seems like somehow, the child process is dying from a SIGKILL. How could that be? The __aexit__ from the async with proc: block can send a SIGKILL to the process, but only after await proc.wait() either returns normally or is cancelled, and then only after verifying that proc._proc.returncode is None:
try:
    await self.wait()
finally:
    if self._proc.returncode is None:
        self.kill()
        with trio.CancelScope(shield=True):
            await self.wait()
I don't think wait could be cancelled here, because the test code does assert not scope.cancelled_caught to confirm that the timeout isn't firing. So that suggests that proc.wait must have returned after the SIGTERM, so proc._proc.returncode should have already been set to -SIGTERM. In fact, the code for wait even says assert self._proc.returncode is not None. So the finally: block in Process.aclose shouldn't have sent a SIGKILL.
So this seems to be one of those "that can't happen" errors... I don't know where this SIGKILL could be coming from.
test_pipes failed again on macOS 3.8: #1713 (comment)
I think this one is just a too-aggressive timeout: #1715
test_sleep failure in https://github.com/python-trio/trio/pull/1722
    @slow
    async def test_sleep():
        async def sleep_1():
            await sleep_until(_core.current_time() + TARGET)

        await check_takes_about(sleep_1, TARGET)

        async def sleep_2():
            await sleep(TARGET)

>       await check_takes_about(sleep_2, TARGET)

C:\hostedtoolcache\windows\Python\3.8.5\x64\lib\site-packages\trio\tests\test_timeouts.py:54:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

f = <function test_sleep.<locals>.sleep_2 at 0x000001BD6DA138B0>
expected_dur = 1.0

    async def check_takes_about(f, expected_dur):
        start = time.perf_counter()
        result = await outcome.acapture(f)
        dur = time.perf_counter() - start
        print(dur / expected_dur)
        # 1.5 is an arbitrary fudge factor because there's always some delay
        # between when we become eligible to wake up and when we actually do. We
        # used to sleep for 0.05, and regularly observed overruns of 1.6x on
        # Appveyor, and then started seeing overruns of 2.3x on Travis's macOS, so
        # now we bumped up the sleep to 1 second, marked the tests as slow, and
        # hopefully now the proportional error will be less huge.
        #
        # We also allow for durations that are a hair shorter than expected. For
        # example, here's a run on Windows where a 1.0 second sleep was measured
        # to take 0.9999999999999858 seconds:
        # https://ci.appveyor.com/project/njsmith/trio/build/1.0.768/job/3lbdyxl63q3h9s21
        # I believe that what happened here is that Windows's low clock resolution
        # meant that our calls to time.monotonic() returned exactly the same
        # values as the calls inside the actual run loop, but the two subtractions
        # returned slightly different values because the run loop's clock adds a
        # random floating point offset to both times, which should cancel out, but
        # lol floating point we got slightly different rounding errors. (That
        # value above is exactly 128 ULPs below 1.0, which would make sense if it
        # started as a 1 ULP error at a different dynamic range.)
>       assert (1 - 1e-8) <= (dur / expected_dur) < 1.5
E       assert (1.5026676000000023 / 1.0) < 1.5

C:\hostedtoolcache\windows\Python\3.8.5\x64\lib\site-packages\trio\tests\test_timeouts.py:35: AssertionError
---------------------------- Captured stdout call -----------------------------
1.0003469999999979
1.5026676000000023
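As an aside, the "128 ULPs below 1.0" remark in that comment checks out: floats just below 1.0 are spaced 2**-53 apart, and 128 of those steps down from 1.0 lands exactly on the value Appveyor reported. A quick check, purely illustrative:

print(1.0 - 128 * 2.0 ** -53)  # 0.9999999999999858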
macOS 3.8 test_interactive failure in https://github.com/python-trio/trio/pull/1726#issuecomment-692479386
Another macOS 3.8 test_interactive failure in https://github.com/python-trio/trio/pull/1732/checks?check_run_id=1143057289
Yet another macOS 3.8 test_interactive failure in https://github.com/python-trio/trio/pull/1733/checks?check_run_id=1175835741.
I think Python 3.8 changed subprocess handling for macOS (and Linux), but I don't know the details nor why those changes would affect Trio.
macOS 3.7 interactive failure: https://github.com/python-trio/trio/pull/1747/checks?check_run_id=1208098695
So at least we know that this isn't specific to 3.8
timeout on Windows builds occurring often again: https://github.com/python-trio/trio/issues/200#issuecomment-621442601
"Windows (3.8, x64, with IFS LSP)" test is often running endlessly with this loop:
+ netsh winsock show catalog
+ grep 'Layered Chain Entry'
+ sleep 1
+ netsh winsock show catalog
+ grep 'Layered Chain Entry'
+ sleep 1
It's the nature of I/O libraries like Trio that their test suites are prone to weird intermittent failures. But they're often hard to track down, and usually the way you encounter them is that you're trying to land some other unrelated feature and the CI randomly fails, so the temptation is to click "re-run build" and worry about it later.
This temptation must be resisted. If left unchecked, you eventually end up with tests that fail all the time for unknown reasons and no-one trusts them and it's this constant drag on development. Flaky tests must be eradicated.
But to make things extra fun, there's another problem: CI services genuinely are a bit flaky, so when you see a weird failure or lock-up in the tests then it's often unclear whether this is a bug in our code, or just some cloud provider having indigestion. And you don't want to waste hours trying to reproduce indigestion. Which means we need to compare notes across multiple failures. Which is tricky when I see one failure, and you see another, and neither of us realizes that we're seeing the same thing. Hence: this issue.
What to do if you see a weird test failure that makes no sense:
1. Visit this bug; it's #200 so it's hopefully easy to remember.
2. Check to see if anyone else has reported the same failure.
3. Either way, add a note recording what you saw. Make sure to link to the failed test log.
Special notes for specific CI services:
Issues we're monitoring currently
- test_open_tcp_listeners_backlog: https://github.com/python-trio/trio/issues/200#issuecomment-451885495 (last seen: all the time, should be fixed by #1601)
- #1277 (last seen: October 2019)
- #851 (last seen: January 2019)