python / cpython

The Python programming language
https://www.python.org
Other
63.37k stars 30.33k forks source link

Add C implementation of os.path.splitroot() #102511

Closed barneygale closed 3 weeks ago

barneygale commented 1 year ago

Feature or enhancement

Speed up os.path.splitroot() by implementing it in C.

Pitch

Per @eryksun:

I think splitroot() warrants a C implementation since it's a required step in our basic path handling on Windows -- join(), split(), relpath(), and commonpath(). Speeding it up gives a little boost across the board. Also, it would be less confusing if nt._path_splitroot() actually implemented ntpath.splitroot().

If implemented, using _Py_splitroot() in _Py_normpath() would help to ensure consistency. Currently, for example, ntpath.normpath('//?/UNC/server/share/../..') is correct on POSIX but wrong on Windows because _Py_normpath() incorrectly handles "//?/UNC/" as the root instead of "//?/UNC/server/share/".

Previous discussion

Linked PRs

zooba commented 1 year ago

I'd recommend making this a skiproot API rather than split, and just have it return the first index (or a pointer) after the root. This is very easy to convert into a split, but it also composes better with other APIs (e.g. _Py_normpath needs the index and not the text).

eryksun commented 1 year ago

Sounds good. We can use _Py_skiproot() in os__path_splitroot_impl() in "Modules/posixmodule.c".

Note that existing use of nt._path_splitroot() in importlib will have to be updated to handle the 3-way split (drive, root, rest) instead of (anchor, rest).

barneygale commented 1 year ago

I'm going to attempt to implement this, though my C is rather basic so it will take me some time.

Rough plan:

  1. Make nt._path_splitroot() work like ntpath.splitroot(), i.e. return a 3-tuple. Adjust importlib. Use it in ntpath where available.
  2. Add posix._path_splitroot(). Use it in posixpath where available.
  3. Add _Py_skiproot(). Call it from _Py_normpath() and _path_splitroot()

Let me know if that doesn't sound right!

eryksun commented 1 year ago

Here's an overview of what I did to experiment with this idea. I started by adding a new _Py_skiproot() internal API function in "Python/fileutils.c":

PyAPI_FUNC(wchar_t *) _Py_skiproot(wchar_t *path, Py_ssize_t size,
                                   wchar_t **root);

I moved the existing root-skipping implementation from _Py_normpath() into _Py_skiproot() and modified it to implement returning rest and root. I also added support for extended "UNC" paths. I updated _Py_normpath() to use _Py_skiproot(). Once test_ntpath and test_posixpath were successful, I moved on to implement os._path_splitroot_ex in "Modules/posixmodule.c", which I used to implement ntpath.splitroot() and posixpath.splitroot(). I again verified that test_ntpath and test_posixpath were successful.

I retained the old os._path_splitroot implementation that's used by importlib. I plan to modify importlib to use the new implementation, but only if benchmarking demonstrates that making splitroot() a builtin function is worthwhile in general. If so, I'll remove the old implementation and rename the new one to os._path_splitroot.

My initial comparison shows that, when splitting "//server/share/spam/eggs", the new builtin _path_splitroot_ex() is about 4 times faster than the fallback implementation on Windows, and it's about 1.7 times faster than the fallback implementation on Linux.

barneygale commented 1 year ago

My C isn't good enough for this, so someone else please feel free to take a stab at it!

zooba commented 5 months ago

Reopening and assigning to myself because I'm holding up the backport until we have confidence in the change. My note from the PR:

I want to hold this for a bit until we're confident in the change in later versions. Virtually every time we touch this kind of code, we introduce new failures or vulnerabilities, so let's keep it restricted to prerelease versions for now.

I'd say after 3.13.0b3 is released we'll reconsider, but that's because I'm hopeful we'll get a decent amount of people trying out those betas. If it seems like nobody is using them, we'll push this further.

pradyunsg commented 4 months ago

FWIW, we're seeing this affect pip's test suite starting 3.13.0b2 -- based on the changelog, https://github.com/python/cpython/issues/118263 was the thing that switched stuff over to using the C implementation, but I'm making a comment here since it seems relevant to the C implementation being adopted overall (IIUC).

The specific behaviour change we're seeing is that UNC paths are normalised differently.

def path_to_url(path: str) -> str:
    """
    Convert a path to a file: URL.  The path will be made absolute and have
    quoted path parts.
    """
    path = os.path.normpath(os.path.abspath(path))
    url = urllib.parse.urljoin("file:", urllib.request.pathname2url(path))
    return url

On Windows with 3.13.0b2, we're seeing failures like the following in CI:

 >       assert path_to_url(r"\\unc\as\path") == "file://unc/as/path"
E       AssertionError: assert 'file:////unc/as/path' == 'file://unc/as/path'
E         
E         - file://unc/as/path
E         + file:////unc/as/path
E         ?      ++

This passes on 3.13.0b1, at least based on the last run yesterday (I don't have a windows dev machine handy). This might be a difference in behaviour between the Python implementation and C implementation.

nineteendo commented 4 months ago

This pull request wasn't back ported to 3.12, but the bug also occurs there:

Python 3.12.4 (tags/v3.12.4:8e8a4ba, Jun  6 2024, 19:30:16) [MSC v.1940 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from os.path import abspath, normpath
>>> from urllib.parse import urljoin
>>> from urllib.request import pathname2url
>>> path_to_url = lambda path: urljoin("file:", pathname2url(normpath(abspath(path))))
>>> path_to_url(r"\\unc\as\path")
'file:////unc/as/path'
nineteendo commented 4 months ago

@pradyunsg, it's caused by #113563:

import urllib.parse
from urllib.parse import _coerce_args, urljoin, uses_netloc

def urlunsplit(components):
    scheme, netloc, url, query, fragment, _coerce_result = (_coerce_args(*components))
    if netloc or (scheme and scheme in uses_netloc and url[:2] != '//'):
        if url and url[:1] != '/': url = '/' + url
        url = '//' + (netloc or '') + url
    if scheme:
        url = scheme + ':' + url
    if query:
        url = url + '?' + query
    if fragment:
        url = url + '#' + fragment
    return _coerce_result(url)

print(urljoin("file:", "////unc/as/path")) # file:////unc/as/path
urllib.parse.urlunsplit = urlunsplit
print(urljoin("file:", "////unc/as/path")) # file://unc/as/path
eryksun commented 4 months ago

Yes, in #113563, urllib.parse was switched from using 2-slash UNC paths to 4-slash UNC paths. Both forms are valid. The implementation now supports a roundtrip split/unsplit correctly for both forms. For example:

>>> r = urllib.parse.urlsplit('file:////unc/as/path')
>>> urllib.parse.urlunsplit(r)
'file:////unc/as/path'
>>> r = urllib.parse.urlsplit('file://unc/as/path')
>>> urllib.parse.urlunsplit(r)
'file://unc/as/path'

Previously, this didn't work correctly for 4-slash paths:

>>> r = urllib.parse.urlsplit('file:////unc/as/path')
>>> urllib.parse.urlunsplit(r)
'file://unc/as/path'
>>> r = urllib.parse.urlsplit('file://unc/as/path')
>>> urllib.parse.urlunsplit(r)
'file://unc/as/path'
nineteendo commented 4 months ago

Eryk, is nturl2path.pathname2url() supposed to handle forward slashes?

>>> from nturl2path import pathname2url
>>> pathname2url('//unc/as/path')
'//unc/as/path'
>>> pathname2url(r'\\unc\as\path')
'////unc/as/path'
eryksun commented 4 months ago

@nineteendo, I couldn't find an issue related to this. The standard library doesn't use urllib.request.pathname2url(), so it isn't currently an internal issue. However, the documentation claims to support "the local syntax for a path", which would include using forward slash as the path separator. The Windows implementation of pathname2url() is tested in "Lib/test/test_urllib.py", but AFAICT only with backslash as the path separator. I think the input path should be normalized via normpath() and then split via splitroot() to process it sanely.

Also, currently nothing else in the standard library uses the nturl2path module, and its existence isn't documented. The two functions could be defined directly in urllib.request. The way some of the tests are implemented would have to be updated as well.

erlend-aasland commented 3 weeks ago

Closing the issue as completed, as the OP feature request seems to have been implemented.

nineteendo commented 3 weeks ago

Technically there's still #119394.