python / cpython

The Python programming language
https://www.python.org

`ntpath.splitroot` raises `UnicodeDecodeError` when given `bytes` on Windows #122143

Open tjensen opened 3 months ago

tjensen commented 3 months ago

Bug report

Bug description:

The ntpath.splitroot function appears to have changed in Python 3.13 such that it now raises a UnicodeDecodeError when the given pathname is a bytes object containing bytes that are not valid UTF-8, but only when running on Windows:

Python 3.13.0b4 (tags/v3.13.0b4:567c38b, Jul 18 2024, 10:14:53) [MSC v.1940 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import ntpath
>>> ntpath.splitroot(b"foo\x88")
Traceback (most recent call last):
  File "<python-input-1>", line 1, in <module>
    ntpath.splitroot(b"foo\x88")
    ~~~~~~~~~~~~~~~~^^^^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x88 in position 3: invalid start byte

The same code works without raising on Windows when using Python 3.12:

Python 3.12.4 (tags/v3.12.4:8e8a4ba, Jun  6 2024, 19:30:16) [MSC v.1940 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import ntpath
>>> ntpath.splitroot(b"foo\x88")
(b'', b'', b'foo\x88')

The same code also works without raising on Linux when using Python 3.13 or 3.12:

Python 3.13.0b4 (main, Jul 22 2024, 17:26:46) [GCC 13.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import ntpath
>>> ntpath.splitroot(b"foo\x88")
(b'', b'', b'foo\x88')

Python 3.12.4 (main, Jun 15 2024, 10:31:39) [GCC 13.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import ntpath
>>> ntpath.splitroot(b"foo\x88")
(b'', b'', b'foo\x88')

CPython versions tested on:

3.12, 3.13

Operating systems tested on:

Linux, Windows

eryksun commented 3 months ago

This limitation exists on Windows because the error handler of the filesystem encoding is required to be "surrogatepass" instead of "surrogateescape". In principle, builtin nt._path_splitroot_ex() could handle UnicodeDecodeError, or any other ValueError, by using the C API to call ntpath._splitroot_fallback(). This would require enabling the suppress_value_error option of the path_t argument converter.
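
The difference between the two error handlers can be seen with plain `bytes.decode`, independent of `ntpath`. The following is a minimal sketch using only the standard codecs; the variable names are illustrative.

```python
raw = b"foo\x88"  # 0x88 is not a valid UTF-8 start byte

# "surrogateescape" maps each undecodable byte to a lone surrogate in the
# range U+DC80..U+DCFF, so decoding never fails and the bytes round-trip.
escaped = raw.decode("utf-8", "surrogateescape")
round_trip = escaped.encode("utf-8", "surrogateescape")

# "surrogatepass" only allows UTF-8-encoded surrogate code points through;
# an arbitrary invalid byte such as 0x88 still raises UnicodeDecodeError.
try:
    raw.decode("utf-8", "surrogatepass")
    surrogatepass_failed = False
except UnicodeDecodeError:
    surrogatepass_failed = True

# What "surrogatepass" does permit: encoding a lone surrogate.
lone = "\ud800".encode("utf-8", "surrogatepass")

print(repr(escaped), round_trip == raw, surrogatepass_failed, lone)
```

The failing `surrogatepass` decode is the same failure mode as the traceback in the report: the byte 0x88 at position 3 is rejected as an invalid start byte.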

zooba commented 3 months ago

> the error handler of the filesystem encoding is required to be "surrogatepass" instead of "surrogateescape".

Why have we never noticed this before? We can just fix that, I believe - the filesystem encoding on Windows is just a compatibility hack to support POSIX developers (I'm pretty sure I wrote something to that effect in PEP 528 or 529 or whichever one it was).
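
To check which handler a given interpreter actually uses, the `sys` module exposes both settings. This is a small sketch for inspection only; as described above, the handler is expected to be "surrogatepass" on Windows and is typically "surrogateescape" on POSIX, but the values depend on platform and configuration, so no output is assumed here.

```python
import sys

# Encoding and error handler used to convert between str and bytes filenames.
encoding = sys.getfilesystemencoding()
errors = sys.getfilesystemencodeerrors()

print(f"filesystem encoding: {encoding}")
print(f"filesystem error handler: {errors}")
```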