python / cpython

The Python programming language
https://www.python.org
Other
63.27k stars 30.29k forks source link

Revisit adding lexical normalization support to `pathlib` #124825

Open ncoghlan opened 4 weeks ago

ncoghlan commented 4 weeks ago

Feature or enhancement

Proposal:

I'd like to add a resolve_lexical method to concrete pathlib objects:

def resolve_lexical(self, /, strict=False):
    """Make the path absolute, and also normalize it, *without* resolving symlinks."""

As with resolve(), if strict is True and any segment of the given path doesn't exist, FileNotFoundError is raised (so /some/dir/_nonexistent_/.. would fail). If strict is False (the default), all path segments are processed without checking whether they exist (so /some/dir/_nonexistent_/.. lexically resolves to /some/dir).

While theoretically this could be added to PurePath without the strict option, I don't see any significant benefit to that (whereas I do see benefits to paralleling the Path.resolve() API as closely as possible).

As a minor note, adding this method would give a more direct way of checking if a path contains any symlinks at any level: path.resolve() == path.resolve_lexical() (vs the current path.is_symlink() or any(segment.is_symlink() for segment in path.parents)).

Chaining the two resolution methods would also be valid (path.resolve().resolve_lexical()), with symlinks then being resolved in the segments that actually exist, and the rest of the path, if any, being resolved lexically)


On my current project, I recently ran into a pair of subtle symlink-and-relative-path-handling bugs.

The resolution that handled both situations ended up being to use os.path to perform lexical normalisation (via os.path.abspath):

import os
import os.path

from pathlib import Path

def as_normalized_path(path:str|os.PathLike[str], /) -> Path:
    """Normalize given path and make it absolute, *without* resolving symlinks

    Expands user directory references, but *not* environment variable references.
    """
    # Ensure user directory references are handled as absolute paths
    expanded_path = os.path.expanduser(path)
    return Path(os.path.abspath(expanded_path))

Having to drop down to the lower level API to request "resolve /../ relative to the path as given" instead of the default "resolve /../ relative to symlink targets" feels like an unnecessary gap in the abstraction layer.

Has this already been discussed elsewhere?

This is a minor feature, which does not need previous discussion elsewhere

Links to previous discussion of this feature:

Previously suggested here: https://github.com/python/cpython/issues/83105

When it comes to symlink security vulnerabilities arising from parent directory traversal, they mostly relate to using symlink resolution to reach an unexpected location:

As a result, the rationale for rejection doesn't feel strong to me (since the intuitive behaviour is unavailable in the high level API, and instead only the subtle system state dependent behaviour is offered)

Also noting that Java does offer lexical normalisation on its Path abstraction: https://docs.oracle.com/en/java/javase/23/docs/api/java.base/java/nio/file/Path.html#normalize()

serhiy-storchaka commented 4 weeks ago

What is the use case for strict=True in context of lexical normalization?

ncoghlan commented 4 weeks ago

If you want to rule out cases like /some/dir/_nonexistent_/.., that's hard to do without checking the filesystem as you go.

I don't mind leaving out strict=True though.

I only included it in my suggested API because Path.resolve() defines it (making it a smaller code transformation to switch between the two resolution methods), there is a sensible meaning for it (even in lexical resolution), and I don't think it would substantially add to the maintenance burden for the feature.

Edit: even if we leave out strict=True, I think the implicit Path.absolute() invocation still means this method should only exist on the concrete Path classes, and not on PurePath.

barneygale commented 4 weeks ago

As a result, the rationale for rejection doesn't feel strong to me (since the intuitive behaviour is unavailable in the high level API, and instead only the subtle system state dependent behaviour is offered)

Opinions differ on this point! e.g. PEP 428 says:

Sane behaviour

Little of the functionality from os.path is reused. Many os.path functions are tied by backwards compatibility to confusing or plain wrong behaviour (for example, the fact that os.path.abspath() simplifies ".." path components without resolving symlinks first).

ncoghlan commented 4 weeks ago

Ah, I knew I had seen a written objection to lexical normalisation somewhere, but didn't think to check the original PEP!

The context where this came up for me is symlinked virtual environments: fully resolving sys.executable inadvertently breaks you out of the venv, whereas lexical resolution respects the facade being presented to the application.

I think (but haven't confirmed), there are also some symlink games happening in the python-build-standalone macOS builds.

barneygale commented 3 weeks ago

I think I'd be OK with adding this as a PurePath method that throws ValueError when given a non-absolute path, which is similar to PurePath.as_uri().

Users of Path objects could do path.absolute().normalized(), which is more wordy, but I think that's fine as they almost always want resolve() or absolute() rather than normalized().

I don't like a name involving "resolve" because I think it could imply symlink resolution, which isn't what we're doing. Nor are we using the POSIX algorithm for pathname resolution. Adding "lexical" helps, but I think there's still a small potential for confusion. Maybe.

Names involving "normalize" are a tiny bit problematic because pathlib already performs path normalization (specifically, removal of redundant slashes and . segments).