python / cpython

The Python programming language
https://www.python.org
Other
63.54k stars 30.45k forks source link

Tests crash on macos Sonoma #109981

Closed sobolevn closed 11 months ago

sobolevn commented 1 year ago

Crash report

I've experienced a crash on main branch. I just used ./python.exe -m test

This is the all output I have for now. Cannot reproduce with smaller set of tests.

0:13:53 load avg: 2.12 [371/463/4] test_super
0:13:53 load avg: 2.12 [372/463/4] test_support
[1]    12160 killed     ./python.exe -m test

Linked PRs

sobolevn commented 1 year ago

With more info:

test_detect_api_mismatch (test.test_support.TestSupport.test_detect_api_mismatch) ... ok
test_detect_api_mismatch__ignore (test.test_support.TestSupport.test_detect_api_mismatch__ignore) ... ok
test_fd_count (test.test_support.TestSupport.test_fd_count) ... [1]    16728 killed     PYTHONFAULTHANDLER=1 ./python.exe -m test -v

However, this is dependent on some other test, for now I don't know which one.

vstinner commented 1 year ago

I don't know which one.

Try to run:

./python.exe -m test.bisect_cmd -o bisect test_support

Then look into bisect file result.

sobolevn commented 1 year ago

I wrote this simple script (because bisect's output didn't reproduce this problem):

set -e tests=$(cat << EOF test.test_asyncio.test_base_events test.test_asyncio.test_buffered_proto test.test_asyncio.test_context test.test_asyncio.test_eager_task_factory test.test_asyncio.test_events test.test_asyncio.test_futures test.test_asyncio.test_futures2 test.test_asyncio.test_locks test.test_asyncio.test_pep492 test.test_asyncio.test_proactor_events test.test_asyncio.test_protocols test.test_asyncio.test_queues test.test_asyncio.test_runners test.test_asyncio.test_selector_events test.test_asyncio.test_sendfile test.test_asyncio.test_server test.test_asyncio.test_sock_lowlevel test.test_asyncio.test_ssl test.test_asyncio.test_sslproto test.test_asyncio.test_streams test.test_asyncio.test_subprocess test.test_asyncio.test_taskgroups test.test_asyncio.test_tasks test.test_asyncio.test_threads test.test_asyncio.test_timeouts test.test_asyncio.test_transports test.test_asyncio.test_unix_events test.test_asyncio.test_waitfor test.test_asyncio.test_windows_events test.test_asyncio.test_windows_utils test.test_concurrent_futures.test_as_completed test.test_concurrent_futures.test_deadlock test.test_concurrent_futures.test_future test.test_concurrent_futures.test_init test.test_concurrent_futures.test_process_pool test.test_concurrent_futures.test_shutdown test.test_concurrent_futures.test_thread_pool test.test_concurrent_futures.test_wait test.test_future_stmt.test_future test.test_future_stmt.test_future_flags test.test_future_stmt.test_future_multiple_features test.test_future_stmt.test_future_multiple_imports test.test_future_stmt.test_future_single_import test.test_multiprocessing_fork.test_manager test.test_multiprocessing_fork.test_misc test.test_multiprocessing_fork.test_processes test.test_multiprocessing_fork.test_threads test.test_multiprocessing_forkserver.test_manager test.test_multiprocessing_forkserver.test_misc test.test_multiprocessing_forkserver.test_processes test.test_multiprocessing_forkserver.test_threads test.test_multiprocessing_spawn.test_manager test.test_multiprocessing_spawn.test_misc test.test_multiprocessing_spawn.test_processes test.test_multiprocessing_spawn.test_threads test___all__ test__locale test__opcode test__osx_support test__xxinterpchannels test__xxsubinterpreters test_abc test_abstract_numbers test_argparse test_array test_asdl_parser test_ast test_asyncgen test_atexit test_audit test_augassign test_base64 test_baseexception test_bdb test_bigaddrspace test_bigmem test_binascii test_binop test_bisect test_bool test_buffer test_bufio test_builtin test_bytes test_bz2 test_calendar test_call test_capi test_charmapcodec test_class test_clinic test_cmath test_cmd test_cmd_line test_cmd_line_script test_code test_code_module test_codeccallbacks test_codecencodings_cn test_codecencodings_hk test_codecencodings_iso2022 test_codecencodings_jp test_codecencodings_kr test_codecencodings_tw test_codecmaps_cn test_codecmaps_hk test_codecmaps_jp test_codecmaps_kr test_codecmaps_tw test_codecs test_codeop test_collections test_colorsys test_compare test_compile test_compileall test_compiler_assemble test_compiler_codegen test_complex test_configparser test_contains test_context test_contextlib test_contextlib_async test_copy test_copyreg test_coroutines test_cppext test_cprofile test_csv test_ctypes test_curses test_dataclasses test_datetime test_dbm test_dbm_dumb test_dbm_gnu test_dbm_ndbm test_decimal test_decorators test_defaultdict test_deque test_descr test_descrtut test_devpoll test_dict test_dict_version test_dictcomps test_dictviews test_difflib test_dis test_doctest test_doctest2 test_docxmlrpc test_dtrace test_dynamic test_dynamicclassattribute test_eintr test_email test_embed test_ensurepip test_enum test_enumerate test_eof test_epoll test_errno test_except_star test_exception_group test_exception_hierarchy test_exception_variations test_exceptions test_extcall test_faulthandler test_fcntl test_file test_file_eintr test_filecmp test_fileinput test_fileio test_fileutils test_finalization test_float test_flufl test_fnmatch test_fork1 test_format test_fractions test_frame test_frozen test_fstring test_ftplib test_funcattrs test_functools test_gc test_gdb test_generated_cases test_generator_stop test_generators test_genericalias test_genericclass test_genericpath test_genexps test_getopt test_getpass test_getpath test_gettext test_glob test_global test_grammar test_graphlib test_grp test_gzip test_hash test_hashlib test_heapq test_hmac test_html test_htmlparser test_http_cookiejar test_http_cookies test_httplib test_httpservers test_idle test_imaplib test_import test_importlib test_index test_inspect test_int test_int_literal test_interpreters test_io test_ioctl test_ipaddress test_isinstance test_iter test_iterlen test_itertools test_json test_keyword test_keywordonlyarg test_kqueue test_largefile test_launcher test_linecache test_list test_listcomps test_lltrace test_locale test_logging test_long test_longexp test_lzma test_mailbox test_marshal test_math test_math_property test_memoryio test_memoryview test_metaclass test_mimetypes test_minidom test_mmap test_module test_modulefinder test_monitoring test_msvcrt test_multibytecodec test_multiprocessing_main_handling test_named_expressions test_netrc test_ntpath test_numeric_tower test_opcache test_opcodes test_openpty test_operator test_optparse test_ordered_dict test_os test_osx_env test_pathlib test_patma test_pdb test_peepholer test_peg_generator test_pep646_syntax test_perf_profiler test_perfmaps test_pickle test_picklebuffer test_pickletools test_pkg test_pkgutil test_platform test_plistlib test_poll test_popen test_poplib test_positional_only_arg test_posix test_posixpath test_pow test_pprint test_print test_profile test_property test_pstats test_pty test_pulldom test_pwd test_py_compile test_pyclbr test_pydoc test_pyexpat test_queue test_quopri test_raise test_random test_range test_re test_readline test_regrtest test_repl test_reprlib test_resource test_richcmp test_rlcompleter test_robotparser test_runpy test_sax test_sched test_scope test_script_helper test_secrets test_select test_selectors test_set test_setcomps test_shelve test_shlex test_signal test_site test_slice test_smtplib test_smtpnet test_socket test_socketserver test_sort test_source_encoding test_sqlite3 test_ssl test_stable_abi_ctypes test_startfile test_stat test_statistics test_str test_strftime test_string test_string_literals test_stringprep test_strptime test_strtod test_struct test_structseq test_subclassinit test_subprocess test_sundry test_super EOF ) stringarray=($tests) for i in "${stringarray[@]}"; do ./python.exe -m test $i test_support done

And tests still pass. So, I think that we need several tests, not just two.

sobolevn commented 1 year ago

@shamsherfairfest this is unrelated, please open a new issue with the minimal reproduction. We would be happy to help you in that case :)

shamsherfairfest commented 1 year ago

@shamsherfairfest this is unrelated, please open a new issue with the minimal reproduction. We would be happy to help you in that case :)

ok sure thanks

vstinner commented 1 year ago

I wrote this simple script (because bisect's output didn't reproduce this problem)

Did you run test.bisect_cmd on test_support, or on the whole Python test suite? What do you mean that it doesn't reproduce this problem? Which behavior do you get when you run ./python.exe -m test test_support --matchfile=bisect (or ./python.exe -m test --matchfile=bisect if you ran it in the whole test suite)?

sobolevn commented 1 year ago

I've reproduced it with only three test files: ./python.exe -m test test_sax test_socket test_support

Now, I need to find exact tests that are related. It might be related to https://github.com/python/cpython/pull/110013 since it is also about test_socket

vstinner commented 1 year ago

I've reproduced it with only three test files: ./python.exe -m test test_sax test_socket test_support

Wow, that's very helpful, thanks!

Now, I need to find exact tests that are related.

You may give a new try to bisect_cmd:

./python.exe -m test.bisect_cmd -o bisect test_sax test_socket test_support

And then:

./python.exe -m test --matchfile=bisect test_sax test_socket test_support

It should help to reduce the quantity of code that you have to review, by skipping many tests, and help you to find the minimum set of tests which trigger the bug.

vstinner commented 1 year ago

By the way, it would be interesting to have a "files" mode in bisect_cmd to filter test files, rather than test cases.

sobolevn commented 1 year ago

I found all three offending tests 🎉

test.test_sax.ExpatReaderTest.test_expat_external_dtd_enabled
test.test_socket.GeneralModuleTests.testHostnameRes
test.test_support.TestSupport.test_fd_count
sobolevn commented 1 year ago

Reproducer:

from xml.sax.xmlreader import InputSource
from xml.sax.expatreader import create_parser
from xml.sax.handler import feature_external_ges

class TestEntityRecorder:
    def __init__(self):
        self.entities = []

    def resolveEntity(self, publicId, systemId):
        self.entities.append((publicId, systemId))
        source = InputSource()
        source.setPublicId(publicId)
        source.setSystemId(systemId)
        return source

parser = create_parser()
parser.setFeature(feature_external_ges, True)
resolver = TestEntityRecorder()
parser.setEntityResolver(resolver)

try:
    parser.feed(
        '<!DOCTYPE external SYSTEM "unsupported://non-existing">\n'
    )
except Exception:
    pass

import socket
hostname = socket.gethostname()
ip = socket.gethostbyname(hostname)

from test.support import os_helper
os_helper.fd_count()
vstinner commented 1 year ago
from xml.sax.xmlreader import InputSource
from xml.sax.expatreader import create_parser
from xml.sax.handler import feature_external_ges

I'm now curious about the interaction between SAX parser, gethostname(), gethostbyname() and fd_count().

IMO it's an OS bug, you might need attempt to go further and have like no "import", and even write a C reproducer. Tell me if you need help for that, so you can report it to Apple. We have a few core devs who are in touch with Apple.

cc @ned-deily @ronaldoussoren

ned-deily commented 1 year ago

Is there a macOS crash report with stack trace? Check Console.app or look in ~/Library/Logs/DiagnosticReports.

sobolevn commented 1 year ago

Minimizing reproducer even more:

from xml.sax.expatreader import create_parser
from xml.sax.handler import feature_external_ges

parser = create_parser()
parser.setFeature(feature_external_ges, True)

try:
    parser.feed(
        '<!DOCTYPE external SYSTEM "unsupported://non-existing">\n'
    )
except Exception:
    pass

import socket
hostname = socket.gethostname()
ip = socket.gethostbyname(hostname)

import os
for fd in range(os.sysconf("SC_OPEN_MAX")):
    fd2 = os.dup(fd)

# or
# import fcntl
# for fd in range(os.sysconf("SC_OPEN_MAX")):
#     fcntl.fcntl(fd, fcntl.F_DUPFD_CLOEXEC, 0)

It crashes on this line: https://github.com/python/cpython/blob/b987fdb19b981ef6e7f71b41790b5ed4e2064646/Python/fileutils.c#L2594

@ned-deily I don't see a stacktrace :(

sobolevn commented 1 year ago

So, the crash goes away if I just change https://github.com/python/cpython/blob/b987fdb19b981ef6e7f71b41790b5ed4e2064646/Lib/xml/sax/saxutils.py#L365 to be just f = b''. If I do that it switches to OSError: [Errno 24] Too many open files

Not sure what role socket call plays in all of this.

ned-deily commented 1 year ago

You may not get a crash dump file if the executable is not running as an app bundle. Using the python.org 3.12.0 for macOS to run the minimal reproducer above, I get a crash file with the following relevant info:

...
Crashed Thread:        0  Dispatch queue: com.apple.main-thread

Exception Type:        EXC_GUARD (SIGKILL)
Exception Codes:       GUARD_TYPE_FD
Exception Codes:       0x0000000000000003, 0xc7a50e7608548068

Termination Reason:    Namespace GUARD, Code 4611686027017322499 

Thread 0 Crashed::  Dispatch queue: com.apple.main-thread
0   libsystem_kernel.dylib                 0x1828cf41c __fcntl + 8
1   libsystem_kernel.dylib                 0x1828cf3e8 fcntl + 88
2   Python                                 0x105c1bbb4 _Py_dup + 120
3   Python                                 0x105c2c65c os_dup + 28
4   Python                                 0x105b7f074 _PyEval_EvalFrameDefault + 50916
...

According to some Apple developer documentation:

EXC_GUARD indicates the process violated a guarded resource protection. Although there are multiple types of guarded system resources, most guarded resource crashes are from guarded file descriptors, which have the GUARD_TYPE_FD value in the Exception Subtype field. The operating system marks a file descriptor as guarded so that normal file descriptor APIs can’t modify them. [...] DUP. The process attempted to invoke dup(), dup2(), or fcntl() with the F_DUPFD or F_DUPFD_CLOEXEC commands on a guarded file descriptor.

So it would seem this is a test case error with consequnces on macOS. Perhaps the first part of the test (using xml.sax) is leaving file descriptors open?

ronaldoussoren commented 11 months ago

If I run @sobolevn's script above I also get a crash, and get the following information on open files using lsof(1):

python.ex 33790 ronald    0r      CHR               16,3      0t451      1229 /dev/ttys003
python.ex 33790 ronald    1w      CHR               16,3      0t535      1229 /dev/ttys003
python.ex 33790 ronald    2w      CHR               16,3      0t187      1229 /dev/ttys003
python.ex 33790 ronald    3   NPOLICY                                         
python.ex 33790 ronald    4u    systm 0x375ff3a427341a21        0t0           [ctl com.apple.netsrc id 7 unit 44]
python.ex 33790 ronald    5u     unix 0x375ff3a8f44362c1        0t0           ->0x375ff3a8f4436771
python.ex 33790 ronald    6r      CHR               16,3      0t451      1229 /dev/ttys003
python.ex 33790 ronald    7w      CHR               16,3      0t535      1229 /dev/ttys003
python.ex 33790 ronald    8w      CHR               16,3      0t187      1229 /dev/ttys003

The crash happens due to calling dup on file descriptor 3:

(lldb) thread backtrace
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_GUARD (code=4611686027017322499, subcode=0x4f84efa691046cd)
  * frame #0: 0x0000000189b3541c libsystem_kernel.dylib`__fcntl + 8
    frame #1: 0x0000000189b353e8 libsystem_kernel.dylib`fcntl + 88
    frame #2: 0x0000000100270228 python.exe`_Py_dup(fd=3) at fileutils.c:2599:10 [opt]
    frame #3: 0x0000000100280d60 python.exe`os_dup [inlined] os_dup_impl(module=<unavailable>, fd=3) at posixmodule.c:10590:12 [opt]
    frame #4: 0x0000000100280d58 python.exe`os_dup(module=<unavailable>, arg=<unavailable>) at posixmodule.c.h:6758:21 [opt]

BTW. The earlier script using os_helper.fd_count doesn't crash for me.

Running the problematic test cases gives the same crash:

(lldb) r -m test test_sax test_socket test_support
Process 33933 launched: '/Users/ronald/Projects/Forks/cpython-upstream/build/python.exe' (arm64)
Using random seed: 1511475695
Raised RLIMIT_NOFILE: 256 -> 1024
0:00:00 load avg: 3.26 Run 3 tests sequentially
0:00:00 load avg: 3.26 [1/3] test_sax
0:00:00 load avg: 3.24 [2/3] test_socket
2023-12-06 11:17:05.606880+0100 python.exe[33933:11421359] [si_destination_compare] send failed: Invalid argument
2023-12-06 11:17:05.606894+0100 python.exe[33933:11421359] [si_destination_compare] send failed: Undefined error: 0
2023-12-06 11:17:05.606897+0100 python.exe[33933:11421359] [si_destination_compare] send failed: Invalid argument
/Users/ronald/Projects/Forks/cpython-upstream/Lib/test/test_socket.py:2934: RuntimeWarning: received malformed or improperly-truncated ancillary data
  result = sock.recvmsg_into([buf], *args)
/Users/ronald/Projects/Forks/cpython-upstream/Lib/test/test_socket.py:2843: RuntimeWarning: received malformed or improperly-truncated ancillary data
  result = sock.recvmsg(bufsize, *args)
0:00:29 load avg: 3.15 [3/3] test_support
Process 33933 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_GUARD (code=4611686027017322499, subcode=0x55c498140be65944)
    frame #0: 0x0000000189b3541c libsystem_kernel.dylib`__fcntl + 8
libsystem_kernel.dylib`:
->  0x189b3541c <+8>:  b.lo   0x189b3543c               ; <+40>
    0x189b35420 <+12>: pacibsp 
    0x189b35424 <+16>: stp    x29, x30, [sp, #-0x10]!
    0x189b35428 <+20>: mov    x29, sp
Target 0: (python.exe) stopped.
(lldb) thread backtrace -c 10
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_GUARD (code=4611686027017322499, subcode=0x55c498140be65944)
  * frame #0: 0x0000000189b3541c libsystem_kernel.dylib`__fcntl + 8
    frame #1: 0x0000000189b353e8 libsystem_kernel.dylib`fcntl + 88
    frame #2: 0x0000000100270228 python.exe`_Py_dup(fd=3) at fileutils.c:2599:10 [opt]
    frame #3: 0x0000000100280d60 python.exe`os_dup [inlined] os_dup_impl(module=<unavailable>, fd=3) at posixmodule.c:10590:12 [opt]
    frame #4: 0x0000000100280d58 python.exe`os_dup(module=<unavailable>, arg=<unavailable>) at posixmodule.c.h:6758:21 [opt]
    frame #5: 0x00000001001bac48 python.exe`_PyEval_EvalFrameDefault(tstate=<unavailable>, frame=0x0000000100985130, throwflag=0) at generated_cases.c.h:1134:19 [opt]
    frame #6: 0x0000000100095f24 python.exe`_PyObject_VectorcallTstate(tstate=0x00000001005ce780, callable=0x0000000100fbdf10, args=0x000000016fdfd380, nargsf=2, kwnames=0x0000000000000000) at pycore_call.h:168:11 [opt]
    frame #7: 0x0000000100095144 python.exe`method_vectorcall(method=<unavailable>, args=0x0000000101b2d0f8, nargsf=<unavailable>, kwnames=0x0000000000000000) at classobject.c:92:18 [opt]

From the limited research I've done it is possible to detect if a file descriptor is guarded using the undocumented proc_pidfdinfo, although I haven't tried to use this yet.

That said, the easier fix on (recent?) versions of macOS is to use /dev/fd to check for open files in test.support.os_helper.fd_count, just as /proc/self/fd is used on Linux and FreeBSD.

ronaldoussoren commented 11 months ago

With the patch below ./python.exe -m test test_sax test_socket test_support -v no longer crashes. Working on a PR.

diff --git a/Lib/test/support/os_helper.py b/Lib/test/support/os_helper.py
index 46ae53aa11..ca0b321f9e 100644
--- a/Lib/test/support/os_helper.py
+++ b/Lib/test/support/os_helper.py
@@ -600,6 +600,15 @@ def fd_count():
         except FileNotFoundError:
             pass

+    if sys.platform == "darwin":
+        try:
+            names = os.listdir("/dev/fd")
+            # Subtract one because listdir() internally opens a file
+            # descriptor to list the content of the /proc/self/fd/ directory.
+            return len(names) - 1
+        except FileNotFoundError:
+            pass
+
     MAXFD = 256
     if hasattr(os, 'sysconf'):
         try: