Open hroncok opened 1 month ago
I wanted to cc @CendioOssman in the report but forgot.
What the test tries to do is to fill the kernel socket buffers so that select()
stops signalling that a send()
is possible.
What we do to achieve this is set an explicit send and receive buffer size, to force the kernel to stop dynamically resizing them. We then ask the kernel what the real size is (since it will do funky stuff like double the requested size), and then write that much data in the buffer. The assertNotEqual()
then checks that we've been forced to do some local buffering, as the kernel is full. If this hasn't happened, the test would otherwise give a false positive.
No idea why this isn't working in Fedora's CI. And it's difficult to debug if it only happens there. :/
Perhaps you do some creative fiddling with the network settings that forces dynamic buffer scaling to stay on?
Or is there some virtualization/containerization that results in the system lying to us?
To debug, could you add some output of getsockopt(SO_SNDBUF)
, getsockopt(SO_RCVBUF)
, ioctl(SIOCINQ)
, and ioctl(SIOCOUTQ)
?
Thank you! The system is 100% virtualized/containerized. It runs on the Tetsing Farm and I don't know much about it, but perhaps @thrix might know what is happening or how to easily reproduce this locally.
To debug, could you add some output of
getsockopt(SO_SNDBUF)
,getsockopt(SO_RCVBUF)
,ioctl(SIOCINQ)
, andioctl(SIOCOUTQ)
?
Is this C? My Python has no socket.getsockopt, even thou it is documented at https://docs.python.org/3/library/socket.html#socket.socket.getsockopt
Oh, I have getsockopt on socket objects, but it takes 2 arguments (level and option).
So I can do something like his and run it on the CI (outputs from my machine):
>>> import socket
>>> s = socket.socket()
>>> s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
16384
>>> s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
131072
But for ioctl, I am out of my depth. I've tried something like fcntl.ioctl(s, termios.TIOCOUTQ)
(as TIOCOUTQ is SIOCOUTQ in my /usr/include/linux/sockios.h
), but I get OSError: [Errno 14] Bad address
.
We seem to have exactly the same here when packaging 3.13.0rc1 on openSUSE/Tumbleweed, build log. Is it possible that the chrooted environment (which is what both us and Fedora have) misses something expected by this test?
I'm also seeing this sometimes when testing on Gentoo amd64, in systemd-nspawn container. However, the test is only flaky here — generally it fails when the system is busy, but passes when Python's test suite has all the CPU to itself.
This now started failing in the Fedora build system as well.
This now started failing in the Fedora build system as well.
Presumably when our builders were kernel-updated to 6.10.4 and/or 6.10.5.
Indeed. I just rebooted from kernel 6.8.9 to 6.10.6 and I can reproduce the failure locally on Fedora Linux 39:
$ uname -a
Linux carbon 6.10.6-100.fc39.x86_64 #1 SMP PREEMPT_DYNAMIC Mon Aug 19 14:35:32 UTC 2024 x86_64 GNU/Linux
$ python3.13 -m test test_asyncio -wW -j8
Using random seed: 1988926982
0:00:00 load avg: 0.41 Run 31 tests in parallel using 8 worker processes
0:00:00 load avg: 0.41 [ 1/31] test_asyncio.test_buffered_proto passed
0:00:00 load avg: 0.41 [ 2/31] test_asyncio.test_futures2 passed
0:00:00 load avg: 0.41 [ 3/31] test_asyncio.test_context passed
0:00:00 load avg: 0.41 [ 4/31] test_asyncio.test_pep492 passed
0:00:00 load avg: 0.41 [ 5/31] test_asyncio.test_protocols passed
0:00:00 load avg: 0.41 [ 6/31] test_asyncio.test_proactor_events passed
0:00:00 load avg: 1.02 [ 7/31] test_asyncio.test_selector_events passed
0:00:00 load avg: 1.02 [ 8/31] test_asyncio.test_queues passed
0:00:00 load avg: 1.02 [ 9/31] test_asyncio.test_eager_task_factory passed
0:00:00 load avg: 1.02 [10/31] test_asyncio.test_runners passed
0:00:01 load avg: 1.02 [11/31] test_asyncio.test_locks passed
0:00:01 load avg: 1.02 [12/31] test_asyncio.test_base_events passed
0:00:01 load avg: 1.02 [13/31/1] test_asyncio.test_server failed (1 failure)
test_start_server_1 (test.test_asyncio.test_server.ProactorStartServerTests.test_start_server_1) ... skipped 'Windows only'
test_start_server_1 (test.test_asyncio.test_server.SelectorStartServerTests.test_start_server_1) ... ok
test_start_unix_server_1 (test.test_asyncio.test_server.SelectorStartServerTests.test_start_unix_server_1) ... ok
test_abort_clients (test.test_asyncio.test_server.TestServer2.test_abort_clients) ... FAIL
test_close_clients (test.test_asyncio.test_server.TestServer2.test_close_clients) ... ok
test_wait_closed_basic (test.test_asyncio.test_server.TestServer2.test_wait_closed_basic) ... ok
test_wait_closed_race (test.test_asyncio.test_server.TestServer2.test_wait_closed_race) ... ok
test_unix_server_addr_cleanup (test.test_asyncio.test_server.UnixServerCleanupTests.test_unix_server_addr_cleanup) ... ok
test_unix_server_cleanup_gone (test.test_asyncio.test_server.UnixServerCleanupTests.test_unix_server_cleanup_gone) ... ok
test_unix_server_cleanup_prevented (test.test_asyncio.test_server.UnixServerCleanupTests.test_unix_server_cleanup_prevented) ... ok
test_unix_server_cleanup_replaced (test.test_asyncio.test_server.UnixServerCleanupTests.test_unix_server_cleanup_replaced) ... ok
test_unix_server_sock_cleanup (test.test_asyncio.test_server.UnixServerCleanupTests.test_unix_server_sock_cleanup) ... ok
======================================================================
FAIL: test_abort_clients (test.test_asyncio.test_server.TestServer2.test_abort_clients)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/usr/lib64/python3.13/unittest/async_case.py", line 93, in _callTestMethod
if self._callMaybeAsync(method) is not None:
~~~~~~~~~~~~~~~~~~~~^^^^^^^^
File "/usr/lib64/python3.13/unittest/async_case.py", line 115, in _callMaybeAsync
return self._asyncioRunner.run(
~~~~~~~~~~~~~~~~~~~~~~~^
func(*args, **kwargs),
^^^^^^^^^^^^^^^^^^^^^^
context=self._asyncioTestContext,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
)
^
File "/usr/lib64/python3.13/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^
File "/usr/lib64/python3.13/asyncio/base_events.py", line 721, in run_until_complete
return future.result()
~~~~~~~~~~~~~^^
File "/usr/lib64/python3.13/test/test_asyncio/test_server.py", line 249, in test_abort_clients
self.assertNotEqual(s_wr.transport.get_write_buffer_size(), 0)
~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: 0 == 0
----------------------------------------------------------------------
Ran 12 tests in 0.244s
FAILED (failures=1, skipped=1)
test test_asyncio.test_server failed
0:00:01 load avg: 1.02 [14/31/1] test_asyncio.test_futures passed
0:00:01 load avg: 1.02 [15/31/1] test_asyncio.test_staggered passed
0:00:01 load avg: 1.02 [16/31/1] test_asyncio.test_sendfile passed
0:00:02 load avg: 1.02 [17/31/1] test_asyncio.test_streams passed
0:00:02 load avg: 1.02 [18/31/1] test_asyncio.test_threads passed
0:00:02 load avg: 1.02 [19/31/1] test_asyncio.test_timeouts passed
0:00:02 load avg: 1.02 [20/31/1] test_asyncio.test_transports passed
0:00:02 load avg: 1.02 [21/31/1] test_asyncio.test_sock_lowlevel passed
0:00:03 load avg: 1.02 [22/31/1] test_asyncio.test_unix_events passed
0:00:03 load avg: 1.02 [23/31/1] test_asyncio.test_windows_events skipped
test_asyncio.test_windows_events skipped -- Windows only
0:00:03 load avg: 1.02 [24/31/1] test_asyncio.test_windows_utils skipped
test_asyncio.test_windows_utils skipped -- Windows only
0:00:03 load avg: 1.02 [25/31/1] test_asyncio.test_waitfor passed
0:00:05 load avg: 1.02 [26/31/1] test_asyncio.test_sslproto passed
0:00:06 load avg: 1.10 [27/31/1] test_asyncio.test_events passed
0:00:07 load avg: 1.10 [28/31/1] test_asyncio.test_taskgroups passed
0:00:08 load avg: 1.10 [29/31/1] test_asyncio.test_ssl passed
0:00:16 load avg: 2.75 [30/31/1] test_asyncio.test_subprocess passed
0:00:18 load avg: 2.75 [31/31/1] test_asyncio.test_tasks passed
== Tests result: FAILURE ==
2 tests skipped:
test_asyncio.test_windows_events test_asyncio.test_windows_utils
1 test failed:
test_asyncio.test_server
28 tests OK.
0:00:18 load avg: 2.75 Re-running 1 failed tests in verbose mode in subprocesses
0:00:18 load avg: 2.75 Run 1 test in parallel using 1 worker process
0:00:18 load avg: 2.75 [1/1/1] test_asyncio.test_server failed (1 failure)
Re-running test_asyncio.test_server in verbose mode (matching: test_abort_clients)
test_abort_clients (test.test_asyncio.test_server.TestServer2.test_abort_clients) ... FAIL
======================================================================
FAIL: test_abort_clients (test.test_asyncio.test_server.TestServer2.test_abort_clients)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/usr/lib64/python3.13/unittest/async_case.py", line 93, in _callTestMethod
if self._callMaybeAsync(method) is not None:
~~~~~~~~~~~~~~~~~~~~^^^^^^^^
File "/usr/lib64/python3.13/unittest/async_case.py", line 115, in _callMaybeAsync
return self._asyncioRunner.run(
~~~~~~~~~~~~~~~~~~~~~~~^
func(*args, **kwargs),
^^^^^^^^^^^^^^^^^^^^^^
context=self._asyncioTestContext,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
)
^
File "/usr/lib64/python3.13/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^
File "/usr/lib64/python3.13/asyncio/base_events.py", line 721, in run_until_complete
return future.result()
~~~~~~~~~~~~~^^
File "/usr/lib64/python3.13/test/test_asyncio/test_server.py", line 249, in test_abort_clients
self.assertNotEqual(s_wr.transport.get_write_buffer_size(), 0)
~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: 0 == 0
----------------------------------------------------------------------
Ran 1 test in 0.017s
FAILED (failures=1)
test test_asyncio.test_server failed
1 test failed again:
test_asyncio.test_server
== Tests result: FAILURE then FAILURE ==
2 tests skipped:
test_asyncio.test_windows_events test_asyncio.test_windows_utils
1 re-run test:
test_asyncio.test_server
1 test failed:
test_asyncio.test_server
28 tests OK.
Total duration: 18.5 sec
Total tests: run=2,561 failures=2 skipped=18
Total test files: run=32/31 failed=1 skipped=2 rerun=1
Result: FAILURE then FAILURE
(I could not reproduce the failure with Linux 6.8.9.)
From man socket
:
SO_RCVBUF Sets or gets the maximum socket receive buffer in bytes. The kernel doubles this value (to allow space for bookkeeping overhead) when it is set using setsockopt(2), and this doubled value is returned by getsockopt(2). The default value is set by the /proc/sys/net/core/rmem_default file, and the maximum allowed value is set by the /proc/sys/net/core/rmem_max file. The minimum (doubled) value for this option is 256.
So, the buffer size reported by getsockopt
is twice the size requested by setsockopt
, but the actual buffer size is less (for the “bookkeeping overhead”).
Testing with the Python socket
module:
import os
import socket
print(os.uname().sysname, os.uname().release)
HOST = socket.gethostname()
PORT = 12345
l_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
l_sock.bind((HOST, PORT))
l_sock.listen(5)
c_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
c_sock.connect((HOST, PORT))
s_sock, c_addr = l_sock.accept()
BUFSIZE_REQUEST = 65536
s_sock.setblocking(False)
c_sock.setblocking(False)
s_sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, BUFSIZE_REQUEST)
c_sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, BUFSIZE_REQUEST)
bufsize_s = s_sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
bufsize_c = c_sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
send_size = min(bufsize_s, bufsize_c)
print(f'{BUFSIZE_REQUEST=} {bufsize_s=} {bufsize_c=}')
for i in range(50):
try:
sent = c_sock.send(b'a' * send_size)
except BlockingIOError:
sent = None
print(f'iteration {i}: {sent=}/{send_size} bytes')
if sent is None:
break
c_sock.close()
s_sock.close()
l_sock.close()
This has different behaviour between the kernel versions.
With kernel 6.5.x, the buffer fills up with a little more than the buffer size reported by getsockopt
, as one would expect from what the manpage says:
Linux 6.5.6-300.fc39.x86_64
BUFSIZE_REQUEST=65536 bufsize_s=131072 bufsize_c=131072
iteration 0: sent=130964/131072 bytes
iteration 1: sent=32741/131072 bytes
iteration 2: sent=None/131072 bytes
But with 6.10.6, the buffers can hold almost five times the reported buffer size:
Linux 6.10.6-200.fc40.x86_64
BUFSIZE_REQUEST=65536 bufsize_s=131072 bufsize_c=131072
iteration 0: sent=131072/131072 bytes
iteration 1: sent=131072/131072 bytes
iteration 2: sent=131072/131072 bytes
iteration 3: sent=131072/131072 bytes
iteration 4: sent=130964/131072 bytes
iteration 5: sent=None/131072 bytes
The new kernel is being quite generous! I didn't find documentation/commit for this change.
I guess the way to reliably fill the buffers is to keep sending until the kernel doesn't accept any more data, rather than rely on the numbers...
Now the buildbots are fixed, but I'd like to try to find some more info about what's happening so I'll keep the issue open.
Actually, I see suddenly all openSUSE builds on PPC64LE failing because of test_sendfile_close_peer_in_the_middle_of_receiving
in three classes of test.test_asyncio.test_sendfile.py
. Given that it has a long comment about failing on Solaris because of problems with socket buffers, I wonder, whether it is not related.
Complete build log of Python 3.13.0rc1
Or perhaps it is another revelation of https://github.com/python/cpython/issues/85848 (aka https://github.com/python/cpython/issues/110325)?
Actually, I see suddenly all openSUSE builds on PPC64LE failing because of
test_sendfile_close_peer_in_the_middle_of_receiving
in three classes oftest.test_asyncio.test_sendfile.py
...
Python 3.12 is not affected: the test was added to Python 3.13.
Bug report
Bug description:
Hello, we run the testsuite of the optimized and debug builds of Python in Fedora CI. Since the addition in https://github.com/python/cpython/commit/415964417771946dcb7a163951913adf84644b6d the test has constantly failed like this on Fedora Rawhide / Fedora Linux 41 (the development version). It passes on Fedora 40 and 39.
I was unable to reproduce this outside of Fedora CI. Perhaps this has to do with how the network is configured, no idea.
This is the output of
python3.13 -m test.pythoninfo
:We invoke the installed tests like this:
I'd like to debug this and see if something is wrong with the test or perhaps in Fedora 41. But I don't know where to start.
CPython versions tested on:
3.13
Operating systems tested on:
Linux
Linked PRs