Closed barneygale closed 3 weeks ago
I'd recommend making this a skiproot
API rather than split, and just have it return the first index (or a pointer) after the root. This is very easy to convert into a split, but it also composes better with other APIs (e.g. _Py_normpath
needs the index and not the text).
Sounds good. We can use _Py_skiproot()
in os__path_splitroot_impl()
in "Modules/posixmodule.c".
Note that existing use of nt._path_splitroot()
in importlib
will have to be updated to handle the 3-way split (drive, root, rest)
instead of (anchor, rest)
.
I'm going to attempt to implement this, though my C is rather basic so it will take me some time.
Rough plan:
nt._path_splitroot()
work like ntpath.splitroot()
, i.e. return a 3-tuple. Adjust importlib
. Use it in ntpath
where available.posix._path_splitroot()
. Use it in posixpath
where available._Py_skiproot()
. Call it from _Py_normpath()
and _path_splitroot()
Let me know if that doesn't sound right!
Here's an overview of what I did to experiment with this idea. I started by adding a new _Py_skiproot()
internal API function in "Python/fileutils.c":
PyAPI_FUNC(wchar_t *) _Py_skiproot(wchar_t *path, Py_ssize_t size,
wchar_t **root);
I moved the existing root-skipping implementation from _Py_normpath()
into _Py_skiproot()
and modified it to implement returning rest
and root
. I also added support for extended "UNC" paths. I updated _Py_normpath()
to use _Py_skiproot()
. Once test_ntpath
and test_posixpath
were successful, I moved on to implement os._path_splitroot_ex
in "Modules/posixmodule.c", which I used to implement ntpath.splitroot()
and posixpath.splitroot()
. I again verified that test_ntpath
and test_posixpath
were successful.
I retained the old os._path_splitroot
implementation that's used by importlib
. I plan to modify importlib
to use the new implementation, but only if benchmarking demonstrates that making splitroot()
a builtin function is worthwhile in general. If so, I'll remove the old implementation and rename the new one to os._path_splitroot
.
My initial comparison shows that, when splitting "//server/share/spam/eggs", the new builtin _path_splitroot_ex()
is about 4 times faster than the fallback implementation on Windows, and it's about 1.7 times faster than the fallback implementation on Linux.
My C isn't good enough for this, so someone else please feel free to take a stab at it!
Reopening and assigning to myself because I'm holding up the backport until we have confidence in the change. My note from the PR:
I want to hold this for a bit until we're confident in the change in later versions. Virtually every time we touch this kind of code, we introduce new failures or vulnerabilities, so let's keep it restricted to prerelease versions for now.
I'd say after 3.13.0b3 is released we'll reconsider, but that's because I'm hopeful we'll get a decent amount of people trying out those betas. If it seems like nobody is using them, we'll push this further.
FWIW, we're seeing this affect pip's test suite starting 3.13.0b2 -- based on the changelog, https://github.com/python/cpython/issues/118263 was the thing that switched stuff over to using the C implementation, but I'm making a comment here since it seems relevant to the C implementation being adopted overall (IIUC).
The specific behaviour change we're seeing is that UNC paths are normalised differently.
def path_to_url(path: str) -> str:
"""
Convert a path to a file: URL. The path will be made absolute and have
quoted path parts.
"""
path = os.path.normpath(os.path.abspath(path))
url = urllib.parse.urljoin("file:", urllib.request.pathname2url(path))
return url
On Windows with 3.13.0b2, we're seeing failures like the following in CI:
> assert path_to_url(r"\\unc\as\path") == "file://unc/as/path"
E AssertionError: assert 'file:////unc/as/path' == 'file://unc/as/path'
E
E - file://unc/as/path
E + file:////unc/as/path
E ? ++
This passes on 3.13.0b1, at least based on the last run yesterday (I don't have a windows dev machine handy). This might be a difference in behaviour between the Python implementation and C implementation.
This pull request wasn't back ported to 3.12, but the bug also occurs there:
Python 3.12.4 (tags/v3.12.4:8e8a4ba, Jun 6 2024, 19:30:16) [MSC v.1940 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from os.path import abspath, normpath
>>> from urllib.parse import urljoin
>>> from urllib.request import pathname2url
>>> path_to_url = lambda path: urljoin("file:", pathname2url(normpath(abspath(path))))
>>> path_to_url(r"\\unc\as\path")
'file:////unc/as/path'
@pradyunsg, it's caused by #113563:
import urllib.parse
from urllib.parse import _coerce_args, urljoin, uses_netloc
def urlunsplit(components):
scheme, netloc, url, query, fragment, _coerce_result = (_coerce_args(*components))
if netloc or (scheme and scheme in uses_netloc and url[:2] != '//'):
if url and url[:1] != '/': url = '/' + url
url = '//' + (netloc or '') + url
if scheme:
url = scheme + ':' + url
if query:
url = url + '?' + query
if fragment:
url = url + '#' + fragment
return _coerce_result(url)
print(urljoin("file:", "////unc/as/path")) # file:////unc/as/path
urllib.parse.urlunsplit = urlunsplit
print(urljoin("file:", "////unc/as/path")) # file://unc/as/path
Yes, in #113563, urllib.parse
was switched from using 2-slash UNC paths to 4-slash UNC paths. Both forms are valid. The implementation now supports a roundtrip split/unsplit correctly for both forms. For example:
>>> r = urllib.parse.urlsplit('file:////unc/as/path')
>>> urllib.parse.urlunsplit(r)
'file:////unc/as/path'
>>> r = urllib.parse.urlsplit('file://unc/as/path')
>>> urllib.parse.urlunsplit(r)
'file://unc/as/path'
Previously, this didn't work correctly for 4-slash paths:
>>> r = urllib.parse.urlsplit('file:////unc/as/path')
>>> urllib.parse.urlunsplit(r)
'file://unc/as/path'
>>> r = urllib.parse.urlsplit('file://unc/as/path')
>>> urllib.parse.urlunsplit(r)
'file://unc/as/path'
Eryk, is nturl2path.pathname2url()
supposed to handle forward slashes?
>>> from nturl2path import pathname2url
>>> pathname2url('//unc/as/path')
'//unc/as/path'
>>> pathname2url(r'\\unc\as\path')
'////unc/as/path'
@nineteendo, I couldn't find an issue related to this. The standard library doesn't use urllib.request.pathname2url()
, so it isn't currently an internal issue. However, the documentation claims to support "the local syntax for a path", which would include using forward slash as the path separator. The Windows implementation of pathname2url()
is tested in "Lib/test/test_urllib.py", but AFAICT only with backslash as the path separator. I think the input path should be normalized via normpath()
and then split via splitroot()
to process it sanely.
Also, currently nothing else in the standard library uses the nturl2path
module, and its existence isn't documented. The two functions could be defined directly in urllib.request
. The way some of the tests are implemented would have to be updated as well.
Closing the issue as completed, as the OP feature request seems to have been implemented.
Technically there's still #119394.
Feature or enhancement
Speed up
os.path.splitroot()
by implementing it in C.Pitch
Per @eryksun:
Previous discussion
Linked PRs