python / cpython

The Python programming language
https://www.python.org
Other
63.48k stars 30.4k forks source link

os.walk always follows Windows junctions #67596

Open 7e6c45ab-cd55-40b9-9429-49bb488ca339 opened 9 years ago

7e6c45ab-cd55-40b9-9429-49bb488ca339 commented 9 years ago
BPO 23407
Nosy @tjguk, @zware, @eryksun, @zooba, @jamercee, @efahl
Dependencies
  • bpo-29248: os.readlink fails on Windows
  • Files
  • issue23407.patch
  • issue23407-2.patch
  • issue23407-3.patch
  • issue23407-4.patch
  • issue23407-5.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields: ```python assignee = None closed_at = None created_at = labels = ['3.8', '3.9', '3.10', 'type-feature', 'library', 'OS-windows'] title = 'os.walk always follows Windows junctions' updated_at = user = 'https://bugs.python.org/craigh' ``` bugs.python.org fields: ```python activity = actor = 'eryksun' assignee = 'none' closed = False closed_date = None closer = None components = ['Library (Lib)', 'Windows'] creation = creator = 'craigh' dependencies = ['29248'] files = ['44823', '44824', '46289', '46291', '46718'] hgrepos = [] issue_num = 23407 keywords = ['patch'] message_count = 16.0 messages = ['235531', '235533', '277385', '277389', '277393', '285273', '285301', '285351', '285355', '285458', '285486', '285488', '285499', '356677', '356882', '392674'] nosy_count = 7.0 nosy_names = ['tim.golden', 'craigh', 'zach.ware', 'eryksun', 'steve.dower', 'jamercee', 'eric.fahlgren'] pr_nums = [] priority = 'normal' resolution = None stage = None status = 'open' superseder = None type = 'enhancement' url = 'https://bugs.python.org/issue23407' versions = ['Python 3.8', 'Python 3.9', 'Python 3.10'] ```

    7e6c45ab-cd55-40b9-9429-49bb488ca339 commented 9 years ago

    os.walk follows Windows junctions even if followlinks is False:

    >>> import os
    >>> appdata = os.environ['LOCALAPPDATA']
    >>> for root, dirs, files in os.walk(appdata, followlinks=False):
    ... print(root)

    C:\Users\Test\AppData\Local C:\Users\Test\AppData\Local\Apple C:\Users\Test\AppData\Local\Apple\Apple Software Update C:\Users\Test\AppData\Local\Apple Computer C:\Users\Test\AppData\Local\Apple Computer\iTunes C:\Users\Test\AppData\Local\Application Data C:\Users\Test\AppData\Local\Application Data\Apple C:\Users\Test\AppData\Local\Application Data\Apple\Apple Software Update C:\Users\Test\AppData\Local\Application Data\Apple Computer C:\Users\Test\AppData\Local\Application Data\Apple Computer\iTunes C:\Users\Test\AppData\Local\Application Data\Application Data C:\Users\Test\AppData\Local\Application Data\Application Data\Apple C:\Users\Test\AppData\Local\Application Data\Application Data\Apple\Apple Software Update C:\Users\Test\AppData\Local\Application Data\Application Data\Apple Computer C:\Users\Test\AppData\Local\Application Data\Application Data\Apple Computer\iTunes C:\Users\Test\AppData\Local\Application Data\Application Data\Application Data C:\Users\Test\AppData\Local\Application Data\Application Data\Application Data\Apple C:\Users\Test\AppData\Local\Application Data\Application Data\Application Data\Apple\Apple Software Update C:\Users\Test\AppData\Local\Application Data\Application Data\Application Data\Apple Computer C:\Users\Test\AppData\Local\Application Data\Application Data\Application Data\Apple Computer\iTunes C:\Users\Test\AppData\Local\Application Data\Application Data\Application Data\Application Data [...]

    For directory symbolic links, os.walk seems to have the correct behavior. However, Windows 7 (at least) employs junctions instead of symlinks in situations like the default user profile layout, i.e. the "Application Data" junction shown above.

    I also noticed that, for junctions, os.path.islink returns False but os.stat and os.lstat return different results.

    eryksun commented 9 years ago

    To check for a link on Windows, os.walk calls ntpath.islink, which calls os.lstat. Currently the os.lstat implementation only sets S_IFLNK for symbolic links. attribute_data_to_stat could also check for junctions (IO_REPARSE_TAG_MOUNT_POINT). For consistency, os.readlink should also read junctions (rdb->MountPointReparseBuffer).

    islink https://hg.python.org/cpython/file/7b493dbf944b/Lib/ntpath.py#l239

    attribute_data_to_stat https://hg.python.org/cpython/file/7b493dbf944b/Modules/posixmodule.c#l1515

    win_readlink https://hg.python.org/cpython/file/7b493dbf944b/Modules/posixmodule.c#l10056

    REPARSE_DATA_BUFFER https://hg.python.org/cpython/file/7b493dbf944b/Modules/winreparse.h#l11

    7e6c45ab-cd55-40b9-9429-49bb488ca339 commented 8 years ago

    The attached patch changes _Py_attribute_data_to_stat to set S_IFLNK for both symlinks and junctions, and changes win_readlink to return the target path for junctions (IO_REPARSE_TAG_MOUNT_POINT) as well as symlinks.

    I'm not sure what to do as far as adding a test--either Python needs a way to create junctions or the test needs to rely on the ones Windows creates by default.

    Incidentally, the existing win_readlink doesn't always work correctly with symbolic links, either (this is from 3.5.2):

    >>> import os
    >>> os.readlink(r'C:\Users\All Users')
    '\x00\x00f\x00\u0201\x00\x02\x00\x00\x00f\x00\x00\x00'

    The problem is that PrintNameOffset is an offset in bytes, so it needs to be divided by sizeof(WCHAR) if you're going to add it to a WCHAR pointer (https://msdn.microsoft.com/en-us/library/windows/hardware/ff552012(v=vs.85).aspx). Some links still seem to work correctly because PrintNameOffset is 0. The attached patch fixes this problem also--I wasn't sure if I should open a separate issue for it.

    7e6c45ab-cd55-40b9-9429-49bb488ca339 commented 8 years ago

    Actually, it looks like there is already a way to create junctions and a test for them in test_os. However, it includes this line:

    # Junctions are not recognized as links. self.assertFalse(os.path.islink(self.junction))

    That suggests the old behavior is intentional--does anyone know why?

    7e6c45ab-cd55-40b9-9429-49bb488ca339 commented 8 years ago

    Updated patch with changes to Win32JunctionTests.

    d129e438-bf4f-4062-bdc5-351b2252f998 commented 7 years ago

    Junctions are not recognized as links. self.assertFalse(os.path.islink(self.junction))

    If the above comment is intended as a statement of fact, then it's inconsistent with the implementation of Py_DeleteFileW ( https://hg.python.org/cpython/file/v3.6.0/Modules/posixmodule.c#l4178 ).

    eryksun commented 7 years ago

    I opened bpo-29248 for the os.readlink bug and bpo-29250 for the inconsistency between os.path.islink and os.stat.

    Handling junctions as links is new behavior, so I've changed this issue to be an enhancement for 3.7.

    If the notion of a link is generalized to junctions, then maybe it should be further generalized to include all name-surrogate reparse tags 1. Currently for Microsoft tags this includes

    IO_REPARSE_TAG_MOUNT_POINT (junctions)
    IO_REPARSE_TAG_SYMLINK
    IO_REPARSE_TAG_IIS_CACHE

    For non-Microsoft tags it includes

    IO_REPARSE_TAG_SOLUTIONSOFT
    IO_REPARSE_TAG_OSR_SAMPLE
    IO_REPARSE_TAG_QI_TECH_HSM
    IO_REPARSE_TAG_MAXISCALE_HSM

    The last two are outliers. HSM isn't the kind of immediate, fast access that one would expect from a symbolic link. All other HSM tags aren't categorized as name surrogates.

    7e6c45ab-cd55-40b9-9429-49bb488ca339 commented 7 years ago

    Can you point me toward any documentation on the additional tags you want to support? Searching for IO_REPARSE_TAG_IIS_CACHE mostly seems to yield header files that define it (and nothing at all on MSDN), and the non-Microsoft tags just yield a few results each.

    (For comparison, the junction and symbolic link tags yield 10K+ results each.)

    Junctions are created with each user's home directory so they exist on every Windows system, even if the user never explicitly creates them. The additional tags seem like they're far less common and much less well-documented.

    eryksun commented 7 years ago

    I simply listed the tags that have the name-surrogate bit set out of those defined in km\ntifs.h.

    To keeps things simple it might be better to only include Microsoft tags (i.e. bit 31 is set). That way we don't have to deal with REPARSE_GUID_DATA_BUFFER struct that's used from non-Microsoft reparse points.

    7e6c45ab-cd55-40b9-9429-49bb488ca339 commented 7 years ago

    FWIW, the only name-surrogate tags in the user-mode SDK headers (specifically winnt.h) are IO_REPARSE_TAG_MOUNT_POINT and IO_REPARSE_TAG_SYMLINK, as of at least the Windows 8.1 SDK.

    7e6c45ab-cd55-40b9-9429-49bb488ca339 commented 7 years ago

    Here's a new patch: now, _Py_attribute_data_to_stat and Py_DeleteFileW will just use the IsReparseTagNameSurrogate macro to determine if the file is a link, so os.walk etc. will know not to follow them. os.readlink, however, will only work with junctions and symbolic links; otherwise it will raise ValueError with "unsupported reparse tag".

    This way, there's a basic level of support for all name-surrogate tags, but os.readlink only works with the ones whose internal structure is (semi-) documented.

    7e6c45ab-cd55-40b9-9429-49bb488ca339 commented 7 years ago

    New patch with spaces instead of tabs

    eryksun commented 7 years ago

    Craig, can you add a patch for bpo-29248, including a test based on the "All Users" link?

    ce401e07-99a9-4a03-a2dd-aee869fc1c76 commented 5 years ago

    I can confirm the os.walk() behavior still exists on 3.8. Just curious on the status of the patch?

    zooba commented 4 years ago

    At a minimum, it needs to be turned into a GitHub PR.

    We've made some significant changes in this area in 3.8, so possibly the best available code is now in shutil.rmtree (or shutil.copytree) rather than the older patch files.

    eryksun commented 3 years ago

    Windows implements filesystem symlinks and mountpoints as name-surrogate reparse points. Python 3.8 introduced behavior changes to how reparse points are supported, but the stat st_mode value still sets S_IFLNK only for actual symlinks, not for mountpoints. This ensures that if os.path.islink() is true, it's safe to read its target and copy it via os.readlink() and os.symlink().

    A mountpoint is not equivalent to a symlink in a few cases, so it shouldn't always be handled the same or copied as a symlink. The major difference is that mountpoints in a remote path are evaluated by the server, whereas symlinks in a remote path are evaluated by the client. Also, during path parsing, the target of a symlink replaces the opened path, but mountpoints are retained in the opened path (except if the target path contains a symlink, but that's broken in remote paths and should be avoided). This means that relative ".." components and rooted paths in a relative symlink target will traverse a mountpoint as if it's just a directory in the opened path. That's an important distinction, but in practice I'd steer someone away from relying on it, especially if a filesystem is mounted in multiple locations (e.g. on both a DOS drive and a directory), else resolution of the symlink will depend on which mountpoint is used.

    It's best to handle mountpoints as if they're symlinks when deleting a tree because the way they're implemented as reparse points doesn't prevent loops. However, when walking a tree, you may or may not want to traverse a mountpoint. If it's traversed, a seen set() can be used to remember previously traversed directories, in order to prevent loops. As Steve mentioned, look to the implementation of shutil.rmtree() as an example.

    However, don't look to shutil.copytree() since it's wrong. The is_symlink() method of a scandir() entry is only true for an actual symlink, not a mountpoint, so the extra check that copytree() does is redundant. I think it was left in by mistake when the plan was to handle mountpoints as symlinks. It would be nice if we could copy a mountpoint instead of traversing it in copytree(), but the private implementation of _winapi.CreateJunction() isn't well-behaved and tested enough to be promoted into the standard library as something like os.mount().

    Chaoses-Ib commented 2 years ago

    How about adding a tip to the document of os.path.islink() and os.walk()? This unexpected behavior could lead to serious problems. For example, BleachBit, a disk space cleaner written in Python, wound follow junctions in the Recycle Bin by calling os.walk() and delete all the files it found, causing some users to lose data until the developers found the problem and override os.path.islink() (and there was an unlucky user using the old version lost data yesterday). Symlinks and junctions probably shouldn't be treated the same way, but it wouldn't hurt to add a tip for that.

    serhiy-storchaka commented 11 months ago

    The better workaround when use os.walk() with topdown=True (by default) is to checks the dirs list yielded by os.walk() and remove junction points from it.

    But this case looks important enough to support it in the stdlib automatically. Should we add the follow_junctions option in os.walk()? What should be its default value, True for compatibility with old code or False for safety?

    zooba commented 11 months ago

    it wouldn't hurt to add a tip for that

    True, though it's very likely that readers would not find it. As a rule, symlinks != junctions, and because os is based on POSIX, it always means symlinks, and we probably say that somewhere. But we don't list next to every possible use of symlinks that it doesn't include junctions (or even that it may use symlinks, for that matter), and I don't think we should start.

    Using os.path.isjunction when it exists is a good recommendation. And using os.walk directly is about the only approach.

    Should we add the follow_junctions option in os.walk()? What should be its default value, True for compatibility with old code or False for safety?

    I'm -0 on adding the argument,[^1] but if we do, I'm +100 on compatibility by default.

    [^1]: It's impossible to detect new arguments with code, so users have to add version checks if their code spans multiple releases. And I don't like version checks - code should be able to be version-naive with respect to the runtime.

    doctorpangloss commented 9 months ago

    So what is the example snippet that will correctly not walk through junctions?

    zooba commented 9 months ago

    Off the top of my head:

    for root, dirs, files in os.walk(SEARCH_ROOT):
        dirs[:] = [d for d in dirs if not os.path.isjunction(os.path.join(root, d))]
        for f in files:
            # whatever normal processing you would do

    (Removing names from the dirs value skips recursing into them later.)