Open 7e6c45ab-cd55-40b9-9429-49bb488ca339 opened 9 years ago
os.walk follows Windows junctions even if followlinks is False:
>>> import os
>>> appdata = os.environ['LOCALAPPDATA']
>>> for root, dirs, files in os.walk(appdata, followlinks=False):
... print(root)
C:\Users\Test\AppData\Local C:\Users\Test\AppData\Local\Apple C:\Users\Test\AppData\Local\Apple\Apple Software Update C:\Users\Test\AppData\Local\Apple Computer C:\Users\Test\AppData\Local\Apple Computer\iTunes C:\Users\Test\AppData\Local\Application Data C:\Users\Test\AppData\Local\Application Data\Apple C:\Users\Test\AppData\Local\Application Data\Apple\Apple Software Update C:\Users\Test\AppData\Local\Application Data\Apple Computer C:\Users\Test\AppData\Local\Application Data\Apple Computer\iTunes C:\Users\Test\AppData\Local\Application Data\Application Data C:\Users\Test\AppData\Local\Application Data\Application Data\Apple C:\Users\Test\AppData\Local\Application Data\Application Data\Apple\Apple Software Update C:\Users\Test\AppData\Local\Application Data\Application Data\Apple Computer C:\Users\Test\AppData\Local\Application Data\Application Data\Apple Computer\iTunes C:\Users\Test\AppData\Local\Application Data\Application Data\Application Data C:\Users\Test\AppData\Local\Application Data\Application Data\Application Data\Apple C:\Users\Test\AppData\Local\Application Data\Application Data\Application Data\Apple\Apple Software Update C:\Users\Test\AppData\Local\Application Data\Application Data\Application Data\Apple Computer C:\Users\Test\AppData\Local\Application Data\Application Data\Application Data\Apple Computer\iTunes C:\Users\Test\AppData\Local\Application Data\Application Data\Application Data\Application Data [...]
For directory symbolic links, os.walk seems to have the correct behavior. However, Windows 7 (at least) employs junctions instead of symlinks in situations like the default user profile layout, i.e. the "Application Data" junction shown above.
I also noticed that, for junctions, os.path.islink returns False but os.stat and os.lstat return different results.
To check for a link on Windows, os.walk calls ntpath.islink, which calls os.lstat. Currently the os.lstat implementation only sets S_IFLNK for symbolic links. attribute_data_to_stat could also check for junctions (IO_REPARSE_TAG_MOUNT_POINT). For consistency, os.readlink should also read junctions (rdb->MountPointReparseBuffer).
islink https://hg.python.org/cpython/file/7b493dbf944b/Lib/ntpath.py#l239
attribute_data_to_stat https://hg.python.org/cpython/file/7b493dbf944b/Modules/posixmodule.c#l1515
win_readlink https://hg.python.org/cpython/file/7b493dbf944b/Modules/posixmodule.c#l10056
REPARSE_DATA_BUFFER https://hg.python.org/cpython/file/7b493dbf944b/Modules/winreparse.h#l11
The attached patch changes _Py_attribute_data_to_stat to set S_IFLNK for both symlinks and junctions, and changes win_readlink to return the target path for junctions (IO_REPARSE_TAG_MOUNT_POINT) as well as symlinks.
I'm not sure what to do as far as adding a test--either Python needs a way to create junctions or the test needs to rely on the ones Windows creates by default.
Incidentally, the existing win_readlink doesn't always work correctly with symbolic links, either (this is from 3.5.2):
>>> import os
>>> os.readlink(r'C:\Users\All Users')
'\x00\x00f\x00\u0201\x00\x02\x00\x00\x00f\x00\x00\x00'
The problem is that PrintNameOffset is an offset in bytes, so it needs to be divided by sizeof(WCHAR) if you're going to add it to a WCHAR pointer (https://msdn.microsoft.com/en-us/library/windows/hardware/ff552012(v=vs.85).aspx). Some links still seem to work correctly because PrintNameOffset is 0. The attached patch fixes this problem also--I wasn't sure if I should open a separate issue for it.
Actually, it looks like there is already a way to create junctions and a test for them in test_os. However, it includes this line:
# Junctions are not recognized as links. self.assertFalse(os.path.islink(self.junction))
That suggests the old behavior is intentional--does anyone know why?
Updated patch with changes to Win32JunctionTests.
Junctions are not recognized as links. self.assertFalse(os.path.islink(self.junction))
If the above comment is intended as a statement of fact, then it's inconsistent with the implementation of Py_DeleteFileW ( https://hg.python.org/cpython/file/v3.6.0/Modules/posixmodule.c#l4178 ).
I opened bpo-29248 for the os.readlink bug and bpo-29250 for the inconsistency between os.path.islink and os.stat.
Handling junctions as links is new behavior, so I've changed this issue to be an enhancement for 3.7.
If the notion of a link is generalized to junctions, then maybe it should be further generalized to include all name-surrogate reparse tags 1. Currently for Microsoft tags this includes
IO_REPARSE_TAG_MOUNT_POINT (junctions)
IO_REPARSE_TAG_SYMLINK
IO_REPARSE_TAG_IIS_CACHE
For non-Microsoft tags it includes
IO_REPARSE_TAG_SOLUTIONSOFT
IO_REPARSE_TAG_OSR_SAMPLE
IO_REPARSE_TAG_QI_TECH_HSM
IO_REPARSE_TAG_MAXISCALE_HSM
The last two are outliers. HSM isn't the kind of immediate, fast access that one would expect from a symbolic link. All other HSM tags aren't categorized as name surrogates.
Can you point me toward any documentation on the additional tags you want to support? Searching for IO_REPARSE_TAG_IIS_CACHE mostly seems to yield header files that define it (and nothing at all on MSDN), and the non-Microsoft tags just yield a few results each.
(For comparison, the junction and symbolic link tags yield 10K+ results each.)
Junctions are created with each user's home directory so they exist on every Windows system, even if the user never explicitly creates them. The additional tags seem like they're far less common and much less well-documented.
I simply listed the tags that have the name-surrogate bit set out of those defined in km\ntifs.h.
To keeps things simple it might be better to only include Microsoft tags (i.e. bit 31 is set). That way we don't have to deal with REPARSE_GUID_DATA_BUFFER struct that's used from non-Microsoft reparse points.
FWIW, the only name-surrogate tags in the user-mode SDK headers (specifically winnt.h) are IO_REPARSE_TAG_MOUNT_POINT and IO_REPARSE_TAG_SYMLINK, as of at least the Windows 8.1 SDK.
Here's a new patch: now, _Py_attribute_data_to_stat and Py_DeleteFileW will just use the IsReparseTagNameSurrogate macro to determine if the file is a link, so os.walk etc. will know not to follow them. os.readlink, however, will only work with junctions and symbolic links; otherwise it will raise ValueError with "unsupported reparse tag".
This way, there's a basic level of support for all name-surrogate tags, but os.readlink only works with the ones whose internal structure is (semi-) documented.
New patch with spaces instead of tabs
Craig, can you add a patch for bpo-29248, including a test based on the "All Users" link?
I can confirm the os.walk() behavior still exists on 3.8. Just curious on the status of the patch?
At a minimum, it needs to be turned into a GitHub PR.
We've made some significant changes in this area in 3.8, so possibly the best available code is now in shutil.rmtree (or shutil.copytree) rather than the older patch files.
Windows implements filesystem symlinks and mountpoints as name-surrogate reparse points. Python 3.8 introduced behavior changes to how reparse points are supported, but the stat st_mode value still sets S_IFLNK only for actual symlinks, not for mountpoints. This ensures that if os.path.islink() is true, it's safe to read its target and copy it via os.readlink() and os.symlink().
A mountpoint is not equivalent to a symlink in a few cases, so it shouldn't always be handled the same or copied as a symlink. The major difference is that mountpoints in a remote path are evaluated by the server, whereas symlinks in a remote path are evaluated by the client. Also, during path parsing, the target of a symlink replaces the opened path, but mountpoints are retained in the opened path (except if the target path contains a symlink, but that's broken in remote paths and should be avoided). This means that relative ".." components and rooted paths in a relative symlink target will traverse a mountpoint as if it's just a directory in the opened path. That's an important distinction, but in practice I'd steer someone away from relying on it, especially if a filesystem is mounted in multiple locations (e.g. on both a DOS drive and a directory), else resolution of the symlink will depend on which mountpoint is used.
It's best to handle mountpoints as if they're symlinks when deleting a tree because the way they're implemented as reparse points doesn't prevent loops. However, when walking a tree, you may or may not want to traverse a mountpoint. If it's traversed, a seen set() can be used to remember previously traversed directories, in order to prevent loops. As Steve mentioned, look to the implementation of shutil.rmtree() as an example.
However, don't look to shutil.copytree() since it's wrong. The is_symlink() method of a scandir() entry is only true for an actual symlink, not a mountpoint, so the extra check that copytree() does is redundant. I think it was left in by mistake when the plan was to handle mountpoints as symlinks. It would be nice if we could copy a mountpoint instead of traversing it in copytree(), but the private implementation of _winapi.CreateJunction() isn't well-behaved and tested enough to be promoted into the standard library as something like os.mount().
How about adding a tip to the document of os.path.islink()
and os.walk()
? This unexpected behavior could lead to serious problems. For example, BleachBit, a disk space cleaner written in Python, wound follow junctions in the Recycle Bin by calling os.walk()
and delete all the files it found, causing some users to lose data until the developers found the problem and override os.path.islink()
(and there was an unlucky user using the old version lost data yesterday). Symlinks and junctions probably shouldn't be treated the same way, but it wouldn't hurt to add a tip for that.
The better workaround when use os.walk()
with topdown=True
(by default) is to checks the dirs
list yielded by os.walk()
and remove junction points from it.
But this case looks important enough to support it in the stdlib automatically. Should we add the follow_junctions
option in os.walk()
? What should be its default value, True
for compatibility with old code or False
for safety?
it wouldn't hurt to add a tip for that
True, though it's very likely that readers would not find it. As a rule, symlinks != junctions, and because os
is based on POSIX, it always means symlinks, and we probably say that somewhere. But we don't list next to every possible use of symlinks that it doesn't include junctions (or even that it may use symlinks, for that matter), and I don't think we should start.
Using os.path.isjunction
when it exists is a good recommendation. And using os.walk
directly is about the only approach.
Should we add the
follow_junctions
option inos.walk()
? What should be its default value,True
for compatibility with old code orFalse
for safety?
I'm -0 on adding the argument,[^1] but if we do, I'm +100 on compatibility by default.
[^1]: It's impossible to detect new arguments with code, so users have to add version checks if their code spans multiple releases. And I don't like version checks - code should be able to be version-naive with respect to the runtime.
So what is the example snippet that will correctly not walk through junctions?
Off the top of my head:
for root, dirs, files in os.walk(SEARCH_ROOT):
dirs[:] = [d for d in dirs if not os.path.isjunction(os.path.join(root, d))]
for f in files:
# whatever normal processing you would do
(Removing names from the dirs
value skips recursing into them later.)
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields: ```python assignee = None closed_at = None created_at =
labels = ['3.8', '3.9', '3.10', 'type-feature', 'library', 'OS-windows']
title = 'os.walk always follows Windows junctions'
updated_at =
user = 'https://bugs.python.org/craigh'
```
bugs.python.org fields:
```python
activity =
actor = 'eryksun'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Library (Lib)', 'Windows']
creation =
creator = 'craigh'
dependencies = ['29248']
files = ['44823', '44824', '46289', '46291', '46718']
hgrepos = []
issue_num = 23407
keywords = ['patch']
message_count = 16.0
messages = ['235531', '235533', '277385', '277389', '277393', '285273', '285301', '285351', '285355', '285458', '285486', '285488', '285499', '356677', '356882', '392674']
nosy_count = 7.0
nosy_names = ['tim.golden', 'craigh', 'zach.ware', 'eryksun', 'steve.dower', 'jamercee', 'eric.fahlgren']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue23407'
versions = ['Python 3.8', 'Python 3.9', 'Python 3.10']
```