Change shutil.rmtree and os.walk to support very deep hierarchies

ad83914d-c8e2-4b82-8a39-fccff147f12c commented 3 years ago

BPO	45564

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields: ```python assignee = None closed_at = None created_at = labels = ['library', '3.9', 'type-crash'] title = 'shutil.rmtree and os.walk are implemented using recursion, fail on deep hierarchies' updated_at = user = 'https://bugs.python.org/AlexanderPatrakov' ``` bugs.python.org fields: ```python activity = actor = 'Alexander.Patrakov' assignee = 'none' closed = False closed_date = None closer = None components = ['Library (Lib)'] creation = creator = 'Alexander.Patrakov' dependencies = [] files = [] hgrepos = [] issue_num = 45564 keywords = [] message_count = 1.0 messages = ['404687'] nosy_count = 1.0 nosy_names = ['Alexander.Patrakov'] pr_nums = [] priority = 'normal' resolution = None stage = None status = 'open' superseder = None type = 'crash' url = 'https://bugs.python.org/issue45564' versions = ['Python 3.9'] ```

Linked PRs

gh-99803
gh-100282
gh-100347
gh-100671
gh-100703
gh-103164
gh-103234
gh-103566
gh-119634
gh-119638
gh-119748
gh-119749
gh-119764
gh-119765
gh-119766
gh-119767
gh-119768
gh-119808
gh-119918
gh-119919

ad83914d-c8e2-4b82-8a39-fccff147f12c commented 3 years ago

It is possible to create deep directory hierarchies that cannot be removed via shutil.rmtree or walked via os.walk, because these functions exceed the interpreter recursion limit. This may have security implications for web services (e.g. various webdisks) that have to clean up user-created mess or walk through it.

[aep@aep-haswell ~]$ mkdir /tmp/badstuff
[aep@aep-haswell ~]$ cd /tmp/badstuff
[aep@aep-haswell badstuff]$ for x in `seq 2048` ; do mkdir $x ; cd $x ; done
[aep@aep-haswell 103]$ cd
[aep@aep-haswell ~]$ python
Python 3.9.7 (default, Oct 10 2021, 15:13:22) 
[GCC 11.1.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import shutil
>>> shutil.rmtree('/tmp/badstuff')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.9/shutil.py", line 726, in rmtree
    _rmtree_safe_fd(fd, path, onerror)
  File "/usr/lib/python3.9/shutil.py", line 663, in _rmtree_safe_fd
    _rmtree_safe_fd(dirfd, fullname, onerror)
  File "/usr/lib/python3.9/shutil.py", line 663, in _rmtree_safe_fd
    _rmtree_safe_fd(dirfd, fullname, onerror)
  File "/usr/lib/python3.9/shutil.py", line 663, in _rmtree_safe_fd
    _rmtree_safe_fd(dirfd, fullname, onerror)
  [Previous line repeated 992 more times]
  File "/usr/lib/python3.9/shutil.py", line 642, in _rmtree_safe_fd
    fullname = os.path.join(path, entry.name)
  File "/usr/lib/python3.9/posixpath.py", line 77, in join
    sep = _get_sep(a)
  File "/usr/lib/python3.9/posixpath.py", line 42, in _get_sep
    if isinstance(path, bytes):
RecursionError: maximum recursion depth exceeded while calling a Python object
>>> import os
>>> list(os.walk('/tmp/badstuff'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.9/os.py", line 418, in _walk
    yield from _walk(new_path, topdown, onerror, followlinks)
  File "/usr/lib/python3.9/os.py", line 418, in _walk
    yield from _walk(new_path, topdown, onerror, followlinks)
  File "/usr/lib/python3.9/os.py", line 418, in _walk
    yield from _walk(new_path, topdown, onerror, followlinks)
  [Previous line repeated 993 more times]
  File "/usr/lib/python3.9/os.py", line 412, in _walk
    new_path = join(top, dirname)
  File "/usr/lib/python3.9/posixpath.py", line 77, in join
    sep = _get_sep(a)
  File "/usr/lib/python3.9/posixpath.py", line 42, in _get_sep
    if isinstance(path, bytes):
RecursionError: maximum recursion depth exceeded while calling a Python object
>>>

barneygale commented 2 years ago

This also affects pathlib.Path.walk()

merwok commented 1 year ago

I am unsure on the bug vs feature classification here. On one side it seems like a clear problem and the fixes don’t change function signatures or the way they are used, but on the other side there is no guarantee that extremely deep hierachies are supported, and I wonder if there could be negative effects from the ocde changes (for example, can the stack list/deque grow extremely large during traversal?).

(Adding PR reviewers: @carljm @brettcannon @serhiy-storchaka)

carljm commented 1 year ago

I don't have strong feelings about calling it a bug vs a feature; I would tend to call it a bug because the function just fails in a subset of cases that aren't obviously outside its domain. I agree that it's a bug that we could just document as a known limitation, though I don't see any reason to do that when the fix is not that difficult. I guess the only real impact of how it's classified is whether the fix would be backported?

I wonder if there could be negative effects from the ocde changes (for example, can the stack list/deque grow extremely large during traversal?)

Sure, but in the current code the Python stack would instead grow very large in the same scenario. And the Python stack frames (plus everything they reference) are almost certainly larger than the stack elements tracked in the iterative version, so if anything I would expect the iterative version to also save memory in the case of a very deep traversal.

merwok commented 1 year ago

Yes exactly, the impact of my question is backporting or not. Thanks for the reply about the memory concern!

I take it you feel ok about backporting the fix; let’s wait to see what a core dev thinks.

brettcannon commented 1 year ago

I would classify this as a feature request. There are plenty of things in CPython for which you can you run out of stack space for and it isn't considered a bug in those instances (and that's just part of general limitations that CPython has).

carljm commented 1 year ago

(To be clear, I don't think there's a strong case for backporting this, so reasoning backwards from the conclusion I'm quite happy to call it a feature :P )

barneygale commented 1 year ago

Are glob.glob(recursive=True) and pathlib.Path.rglob() also affected?

jonburdo commented 1 year ago

Are glob.glob(recursive=True) and pathlib.Path.rglob() also affected?

Yes, it looks like it. _rlistdir in Lib/os.py and _RecursiveWildcardSelector in Lib/pathlib.py could be changed to fix it

barneygale commented 1 year ago

Some previous discussion in #70732.

barneygale commented 1 year ago

Am I right in thinking that os.walk(followlinks=True) can now consume all available memory if a symlink loop is encountered? Previously it would raise RecursionError.

barneygale commented 1 year ago

I'm not convinced we should implement this feature request.

A well-designed portable application shouldn't create very deep directory trees. For one thing, you'll hit MAX_PATH shenanigans on Windows long before you hit the Python recursion limit.

The remaining use case is where something has gone wrong, e.g. buggy code has created nonsense trees, or a malicious archive has been extracted. That seems sufficiently niche that calling sys.setrecursionlimit() before shutil.rmtree() is a reasonable solution.

jonburdo commented 1 year ago

I'm not convinced we should implement this feature request.

I understand the hesitation, but I think it's worth it and have some thoughts in addition to comments in support in this discussion thread

A well-designed portable application shouldn't create very deep directory trees. For one thing, you'll hit MAX_PATH shenanigans on Windows long before you hit the Python recursion limit.

Applications are not the only things that create directory trees. For many applications, portability is not a design goal, and many applications that are used are not well-designed. I'd also argue that your point highlights that shutil.rmtree is not currently portable itself. By default, it can delete basically any normal directory tree on some systems but not on others.

The remaining use case is where something has gone wrong, e.g. buggy code has created nonsense trees, or a malicious archive has been extracted. That seems sufficiently niche that calling sys.setrecursionlimit() before shutil.rmtree() is a reasonable solution.

I think deep directory trees are less common, but I wouldn't say wanting to delete them when they occur is niche.

A few other thoughts:

In any situation where the call stack is already close to the recursion limit, you don't need a deep directory tree in order to hit the recursion limit. A function might recursively traverses a graph and then call os.walk or shutil.rmtree.
Recursion is an implementation detail which is not exposed to the user in the documentation for all of these functions, but limits the user in unexpected ways. For shutil.rmtree in particular, there's no implication that recursion is even used, and I was surprised to find that it was. Python users may easily have no idea that a recursion limit exists.
I think the speed improvements alone are significant enough to warrant these changes (over 6-7% in my os.walk benchmarks)

In my view the cost in terms of complexity is very low, and the benefit is significant. Any significant downsides I might be missing?

JelleZijlstra commented 1 year ago

Am I right in thinking that os.walk(followlinks=True) can now consume all available memory if a symlink loop is encountered? Previously it would raise RecursionError.

This seems like an important concern, @Ovsyanka83 could you comment on this situation?

zmievsa commented 1 year ago

I am currently gathering info on this issue. Will reply within ~2 hours with a short report on OS X, Windows, and Linux.

zmievsa commented 1 year ago

I have created a symlink loop on 3 OSes and tested how they act. I am obviously no expert and my results are obviously not covering the entire playing field but they might help us understand just how bad the situation is.

On all three OSes an OSError gets raised and caught on the line of code attached.

On manjaro linux (kernel: 6.0.11-1, filesystem: ext4), I get an OSError(40, 'Too many levels of symbolic links') after walk yields 41 times.
On windows 10, I get an OSError(22, 'The name of the file cannot be resolved by the system' after walk yields 32 times. (I believe the main reason is the maximum path length and that we could get more yields if I started from C: but that does not affect our results).
On OS X Monterey, I get an OSError(62, 'Too many levels of symbolic links') after walk yields 33 times.

Seems like each OS is preventing us from shooting ourselves in the knee so I would not worry too much. However, that only covers some platforms where python is used. What about others?

Essentially my point is: if a platform does not have any protections against denial-of-service by symlink loops, then we are in trouble. However, I do not know of such platforms and all resources I visited never mentioned such platforms.

barneygale commented 1 year ago

Thank you @jonburdo and @Ovsyanka83, you've put my mind at ease :-)

pochmann commented 1 year ago

I think you changed os.walk's behavior. The recursive one supported late modifications of subdirs lists, due to being lazier. The iterative one doesn't (if I'm not mistaken... I can't test).

With topdown, you can modify a directory's subdirs, for example remove one and it won't be visited. Usually you'd do that while your walk is on that parent directory. But with the recursive one, it was possible to do it later. With the iterative one, I don't think that works anymore.

Here's a demo where I remove the second subdir after having walked onto the first:

import os

# Create demo dir with three subdirs
os.makedirs('demo/a')
os.makedirs('demo/b')
os.makedirs('demo/c')

# Walk onto "demo"
walk = os.walk('demo')
demo = next(walk)
print(demo)

# Walk onto first subdir
first = next(walk)
print(first)

# Remove the second subdir
del demo[1][1]

# Walk onto the remaining subdirs
for x in walk:
    print(x)

Output (Try it online!), note that the second subdir wasn't walked:

('demo', ['c', 'a', 'b'], [])
('demo/c', [], [])
('demo/b', [], [])

Excerpt from the recursive one:

        for dirname in dirs:
            new_path = join(top, dirname)
            yield from _walk(new_path, topdown, onerror, followlinks)

The iteration of dirs and the walking of the subdirs are intertwined. After walking a subdir, the paused iteration of dirs resumes. That allows the late modifications of dirs to have an effect.

Excerpt from the iterative one:

            for dirname in reversed(dirs):
                new_path = join(top, dirname)
                stack.append(new_path)

This eagerly puts all subdirs onto the stack, before they're getting walked, and then dirs is never used again. So modifying dirs during the subdirs walking doesn't have an effect anymore.

barneygale commented 1 year ago

Is that used in practice? The documentation seems to imply that dirnames should only be modified before iteration resumes (my emphasis):

When topdown is True, the caller can modify the dirnames list in-place (perhaps using del or slice assignment), and walk() will only recurse into the subdirectories whose names remain in dirnames; this can be used to prune the search, impose a specific order of visiting, or even to inform walk() about directories the caller creates or renames before it resumes walk() again.

pochmann commented 1 year ago

I don't know whether it's used. But someone might, and I think justifiably, since:

arguments, click to see

- The way that doc part is written, I think the part you emphasized only applies to *"to inform walk() about directories the caller creates or renames"*. - The first half of that doc part looks like the "specification" of the feature, and doesn't say the modifications need to be done before continuing. - The second half of that doc part looks like *examples*, listing a few things the feature "can" be used for. It doesn't say that it *can't* be used for anything else. One other thing that's not listed but that I'd expect to work is to *duplicate* entries so they get walked multiple times.

That said, I agree it's unusual and wouldn't be surprised if nobody did that.

If we do want to restore the old behavior, I think we could instead put an iterator (forward, not reversed) on the stack:

            stack.append((
                new_path
                for dirname in dirs
                for new_path in [join(top, dirname)]
                if followlinks or not islink(new_path)
            ))

And when taking an element from the stack and it's an iterator, then get its next element. The loop in the bottom-up case could then be replaced with just stack.append(iter(walk_dirs)).

zmievsa commented 1 year ago

Honestly, I am quite baffled by your discovery. I agree that changing even small implementation details is dangerous on such an old function. However, I also would never think about using walk this way so I will leave the decision to the adults here :)

I do not think that this affects our changes for pathlib.Path.walk though because no old code is going to be broken.

Thank you for finding this out and notifying us!

P.S. If we do decide to agree with you on this and implement "late dirnames modifications", then we must also add this as a test case to prevent any breaking changes in the future.

jonburdo commented 1 year ago

It might have resulted in cleaner, more predictable behavior if os.walk had always been designed to copy dirs right before beginning to iterate over it, preventing this late editing. Editing lists while iterating over them is generally discouraged for good reason. Note that in this example with recursive os.walk, you get the same results if you change del demo[1][1] to del demo[1][0].

But this has existed for a while, and people may rely on it, so it would probably be ideal to maintain the old behavior.

I looked into this a bit and your suggestions @pochmann :

            stack.append((
                new_path
                for dirname in dirs
                for new_path in [join(top, dirname)]
                if followlinks or not islink(new_path)
            ))
Something like this should work, although note that top will be changed between calls to next on this generator, which causes problems. We need the same value that top is set to when this generator is created. Something like this could work although it's pretty ugly:
stack.append((
new_path
for top, dirs in [(top, dirs)]
for dirname in dirs
for new_path in [join(top, dirname)]
if followlinks or not islink(new_path)
))
It might also be better to just do stack.append((top, dirs)) and handle the joining and link logic when these values are used from the stack.

pochmann commented 1 year ago

note that top will be changed between calls to next on this generator, [...] Something like this could work although it's pretty ugly

Another option, with itertools (untested, having computer problems):

            new_paths = map(join, repeat(top), dirs)
            stack.append(
                new_paths if followlinks else
                filterfalse(islink, new_paths)
            )

I see you have bigger changes in mind in your PR. Will look later.

jonburdo commented 1 year ago

Is there any consensus on this issue of whether or not to support this old behavior of supporting late editing of dirs?

zmievsa commented 1 year ago

I would vote against it. But I do not think there is a clear consensus. We do not have any proof for or against the hypothesis that someone uses this behavior. And even if someone does, it feels more like a bug that we have not fixed for years.

jonburdo commented 1 year ago

And even if someone does, it feels more like a bug that we have not fixed for years.

Yeah, I think I'd agree. I'll leave the PR to support it ( https://github.com/python/cpython/pull/100703 ) open for now.

I also converted to a draft my PR for an iterative os.fwalk ( https://github.com/python/cpython/pull/100347 ), since this will introduce the same situation.

barneygale commented 1 year ago

Thanks for your contributions, everyone. I've logged the late edits issue separately:

102932

Is there anything left to do here?

jonburdo commented 1 year ago

Is there anything left to do here?

shutil.rmtree is still recursive. I was thinking of giving that a shot when I have time.

I also started a PR for os.fwalk linked here, but marked it as a draft, since I wasn't sure what the decision on handling late edits would be. I'm happy to revisit it if it seems worthwhile. That would at least make os.fwalk consistent with os.walk.

Should those be tracked in separate issues?

barneygale commented 1 year ago

shutil.rmtree is still recursive. I was thinking of giving that a shot when I have time.

I also started a PR for os.fwalk linked here, but marked it as a draft, since I wasn't sure what the decision on handling late edits would be. I'm happy to revisit it if it seems worthwhile. That would at least make os.fwalk consistent with os.walk.

Should those be tracked in separate issues?

I think it's sensible to make fwalk() consistent with walk(), even if it's affected by #102932. Do you think we could then implement rmtree() atop walk() / fwalk()?

Both problems should be tracked here, I think, because they directly relate to the original report.

jonburdo commented 1 year ago

Okay, makes sense. I'll revisit that os.fwalk PR.

Do you think we could then implement rmtree() atop walk() / fwalk()?

I'm not sure right off. I'll look into it.

jonburdo commented 1 year ago

Following up on implementing rmtree with walk / fwalk, there are some os.DirEntry calls, entry.is_junction() and entry.is_symlink(). These probably could be done with os.path functions, but I'm not sure how much of a difference this would make. It still might just be more straightforward either way to have rmtree use its own stack rather than using walk / fwalk.

It seems like a walk function that returns the os.DirEntry objects in the dirs list would be nice - both for this case and maybe as a user-facing function. Even if the status of entry.stat() cache is unknown, it would still be useful. Any thoughts? See related topic: https://discuss.python.org/t/get-direntry-objects-collected-during-os-walk/8143

barneygale commented 1 year ago

Following up on implementing rmtree with walk / fwalk, there are some os.DirEntry calls, entry.is_junction() and entry.is_symlink(). These probably could be done with os.path functions, but I'm not sure how much of a difference this would make.

I think the entry.is_symlink() call can be replaced with os.path.islink() quite readily. It's there's to address a timing issue, and arguably shouldn't rely on the entry at all.

The entry.is_junction() call is interesting. Perhaps walk() should treat junctions like symlinks, which seems to be the behaviour in some other os functions. Alternatively we could add a followjunctions argument to os.walk() (and fwalk()).

I hereby summon @eryksun for his Windows filesystem expertise!

It seems like a walk function that returns the os.DirEntry objects in the dirs list would be nice - both for this case and maybe as a user-facing function.

Maybe. There's a really high threshold to meet for additions to CPython, and I'd like to see if we can make walk() work first.

barneygale commented 1 year ago

Going deeper, should os.DirEntry.is_dir(follow_symlinks=False) return False for junctions, as it would for symlinks to directories?

eryksun commented 1 year ago

POSIX has no file type that corresponds to a Windows filesystem junction, which is a name-surrogate reparse point that's tagged IO_REPARSE_TAG_MOUNT_POINT. POSIX implements mountpoints differently, using a centralized table, whereas Windows uses a hybrid system that's partly decentralized using reparse points. Python generally handles junctions as directories, such as when copying (e.g. shutil.copytree()). However, junctions are special cased to be handled like symlinks when deleting (e.g. os.unlink() and shutil.rmtree()). Windows shells such as CMD and Explorer handle junctions similarly.

os.unlink() special cases directory symlinks and junctions by calling WinAPI RemoveDirectoryW(). The latter has special handling when deleting a junction that targets the root path of a volume GUID device name (e.g. "\??\Volume{12345678-1234-1234-1234-123456781234}\", i.e. a volume mount point). In this case, it calls DeleteVolumeMountPointW(), which attempts to remove the mountpoint from the system's mountpoint database. The latter is the basis for the canonical DOS path of a volume, as returned by GetFinalPathNameByHandleW(), so it's important to keep it accurate. (If the caller lacks administrator access, updating the mountpoint database will fail, but the directory might still get deleted. To avoid this, volume mountpoints should be secured to only allow deletion by an administrator.)

In some respects, a junction is similar to a bind mountpoint on POSIX. Of note here is that the path parsing behaviors of mountpoints and symlinks are distinctly different. For example, when opening "C:\Mount\mountpoint\spamlink", the system uses "C:\Mount\mountpoint" as the opened path when resolving "spamlink". If, for example, "spamlink" targets "..\spam", the system will resolve it as "C:\Mount\spam". In other words, "mountpoint" is handled like a POSIX directory mount point, not like a symlink. This is similar to how "/mnt/mountpoint/spamlink" would be resolved on POSIX as "/mnt/spam". On the other hand, when opening "C:\Links\symlink\spamlink", the resolved path of "spamlink" will depend on the resolved path of "symlink", which could be a path on another device or even on another machine.

The other distinct difference with junctions is that they're always resolved on the machine where they reside. The system enforces this by requiring the target of a junction to be a directory on a local filesystem. Say, for example, you have a local directory that's shared as "\\server\share" and you want to link a local directory "F:\spam" as "\\server\share\spam". You can easily implement this using a junction since reparsing the junction is handled locally on your server. Now say that a client user makes a local copy of "\\server\share". The client needs to copy the junction "\\server\share\spam" as a regular directory. It shouldn't be copied as a reparse point. At best that would be meaningless on the client system, and at worst it would be dangerously wrong if the client coincidentally has a local "F:\spam" directory.

Symlinks, on the other hand, are always resolved on the client machine. Thus if you tried to implement "\\server\share\spam" as a symlink to "F:\spam", the client would attempt to resolve the symlink as a path on the client's local drive "F:". By default the system won't even allow this (thankfully) because the default policy is to forbid evaluating remote-to-local symlinks. To avoid the dangerous nonsense of a remote-to-local symlink, you could share "F:\spam" as "\\server\spam_share" and link to that. This is sensible, but it's probably disallowed. The default symlink policy forbids evaluating remote-to-remote symlinks. (Local-to-local and local-to-remote symlinks are allowed by default.)

barneygale commented 1 year ago

Thank you Eryk, that's very useful to know.

Some possible ways forward:

We could add a followjunctions argument to os.walk(), defaulting to True. We could then re-implement rmtree() atop walk(), setting followjunctions to False
We could add a more generic walk() API that exposes the os.DirEntry objects, and then re-implement rmtree() atop that.
We could rewrite rmtree() to be iterative, but without making it use walk().

I'm keen to avoid (3) if we can. The iterative versions of these functions are more complicated than their recursive predecessors, and so I'd rather have a small number of high-quality implementations.

eryksun commented 1 year ago

Adding followjunctions=True to os.walk() works for me. By default, junctions should be traversed the same as directories, such as in shutil.copytree(). OTOH, deleting a tree would use followjunctions=False, such as in shutil.rmtree().

jonburdo commented 1 year ago

Adding followjunctions=True to os.walk() works for me

I'll give this a shot.

barneygale commented 1 year ago

I spotted a problem with using os.walk() / os.fwalk() to implement rmtree(): for historical reasons, os.walk() places symlinks to directories in the dirnames list even if followlinks is set to False.

This is a wart we removed when we added pathlib.Path.walk(). So there's a possible alternative here:

Add a fwalk() method to pathlib.Path
Add a follow_junctions argument to pathlib.Path.walk()
Implement shutil.rmtree() atop those methods.

I'm going to have a crack at this.

jonburdo commented 1 year ago

Won't this be significantly slower? On my system, with a tree with ~5000 directories under it, Path.walk takes about 20-30% longer than os.walk. shutil.rmtree takes a lot longer of course, but even a 5% slowdown seems significant.

Why not adjust os.walk to allow returning os.DirEntry objects? This could even be an internal function for now.

eryksun commented 1 year ago

os.walk() places symlinks to directories in the dirnames list even if followlinks is set to False.

Preferably dirnames would be ignored when implementing rmtree() using a bottom-up walk. The directories in that list should have been removed already. To get the desired behavior regarding symlinks/junctions, a parameter could be added to os.walk() to enable yielding non-followed symlinks/junctions in filenames instead of dirnames.

barneygale commented 1 year ago

Won't this be significantly slower? On my system, with a tree with ~5000 directories under it, Path.walk takes about 20-30% longer than os.walk. shutil.rmtree takes a lot longer of course, but even a 5% slowdown seems significant.

Yep good point, it might be too slow. I also measure Path.walk() as ~20% slower than os.walk, though my implementation of Path.fwalk() is within 5% of os.fwalk().

OTOH, Path.walk() treats symlinks exactly as required by rmtree(), unlike os.walk().

Why not adjust os.walk to allow returning os.DirEntry objects? This could even be an internal function for now.

What does the API look like?

barneygale commented 1 year ago

Preferably dirnames would be ignored when implementing rmtree() using a bottom-up walk. The directories in that list should have been removed already.

In both os.walk() and Path.walk(), if an os.DirEntry.is_dir() call succeeds but a subsequence scandir() of the same directory fails, then the directory will appear in dirnames of its parent but not be visited afterward. If rmtree() is then implemented on top, this behaviour causes test failures in test_tempfile.

Arguably Path.walk() could yield a result with empty dirnames and filenames when scandir() fails.

To get the desired behavior regarding symlinks/junctions, a parameter could be added to os.walk() to enable yielding non-followed symlinks/junctions in filenames instead of dirnames.

follow_symlinks_for_realsies? :)

eryksun commented 1 year ago

rmtree() would use a bottom-up walk instead of the default top-down walk. Thus if os.scandir() fails for a directory, it does so before dirnames gets emitted for the parent directory. Each directory would be deleted via os.rmdir(dirpath), after first removing all of the yielded filenames. At this point, subdirectories have already been visited and removed, and dirnames would be completely ignored.

follow_symlinks_for_realsies? :)

I was being serious. Maybe strict_dirnames=False.

barneygale commented 1 year ago

rmtree() would use a bottom-up walk instead of the default top-down walk. Thus if os.scandir() fails for a directory, it does so before dirnames gets emitted for the parent directory. Each directory would be deleted via os.rmdir(dirpath), after first removing all of the yielded filenames. At this point, subdirectories have already been visited and removed, and dirnames would be completely ignored.

Even with topdown=False, the function internally visits directories from top to bottom (there's no other way), and what I wrote about there being no 3-tuple yielded for un-scandir()able directories still applies.

I was being serious. Maybe strict_dirnames=False.

Sorry, I didn't mean to imply you weren't being serious, only that I thought the naming wasn't obvious and that follow_symlinks should already take care of this. strict_dirnames is OK but I'm generally not a fan of complicating a function signature to allow users to make things behave how they should always have behaved. And I don't really see the need to add this when we already have pathlib.Path.walk(), which gets symlink handling right -- in fact, I believe this improvement was one of the things Brett Cannon thought tipped the balance in favour of adding Path.walk() at all.

eryksun commented 1 year ago

what I wrote about there being no 3-tuple yielded for un-scandir()able directories still applies.

Say os.walk() calls scandir('d1'), which contains subdirectory "d2" and no files. It pushes ('d1', ['d2'], []) on the stack (a list), and then it pushes "d2" on the stack. The next iteration pops "d2" from the stack and calls os.scandir('d1/d2'). If we assume it fails, then the default onerror handler of rmtree() re-raises the exception, and rmtree() fails, as it should. If os.scandir('d1/d2') succeeds, and "d2" contains file "f", then ('d1/d2', [], ['f']) gets pushed on the stack. The latter gets yielded to rmtree(), which calls os.unlink('d1/d2/f') and then os.rmdir('d1/d2'). Finally, ('d1', ['d2'], []) gets yielded to rmtree(), which calls os.rmdir('d1'), and we're done.

barneygale commented 1 year ago

You're totally right, of course Eryk :-). I must have made a mistake when I tried that approach previously, as when I try it now I get zero test failures from test_tempfile.

jonburdo commented 1 year ago

I also measure Path.walk() as ~20% slower than os.walk, though my implementation of Path.fwalk() is within 5% of os.fwalk().

Maybe that is close enough to just use Path.fwalk(), although it's probably worth directly comparing any refactor of rmtree to the existing implementation to make sure.

Why not adjust os.walk to allow returning os.DirEntry objects? This could even be an internal function for now.

What does the API look like?

It could simply mean introducing a kwarg like direntry=False. This seems questionable to introduce a kwarg on os.walk itself, but there are a couple of options:

Introduce os._walk with direntry=True, so this isn't public-facing. Maybe os.walk just calls os._walk always with direntry=False.
More generally, implement whatever walk function would be most useful for rmtree and whatever else should make use of it (copytree maybe) as something like os._walk or os._fwalk. This might essentially be what you've already done with Path.fwalk. On the other hand, implementing a purely bottom up function that only returns os.DirEntry objects for dirs could end up being simpler and faster by enough to be worthwhile.

Either of these options could also help avoid introducing strict_dirnames=False. And whether it's used for this or not, I think Path.fwalk is good to have.

Personally I don't see an ideal solution, which can also comfortably made to be public-facing. Even if pathlib's walk function were as fast as os's , in some cases I specifically want string paths and avoiding Path creation altogether is preferable. I'd personally love to have a public facing os.walk2 or something which implements os.walk as we'd do it now if backwards compatibility weren't an issue. This is probably more extreme, ugly, confusing, etc than simply adding kwargs to os.walk though.

barneygale commented 1 year ago

What sort of thing would os._walk(direntry=True) yield? A 2-tuple of (path, entries)? A 3-tuple of (path, direntries, fileentries)? A 4-tuple of (path, dirnames, direntries, filenames)? Bearing in mind we want users to be able to influence the walk by adjusting the directory list, and distinguish between between files/directories/symlinks for the purposes of walking.

eryksun commented 1 year ago

public facing os.walk2 or something which implements os.walk as we'd do it now

If new functions are added, I suggest os.path.walk() and os.path.fwalk(). It's where I think they belong, in correspondence with pathlib.Path. I'd prefer if it yielded os.DirEntry instances for dirpath and for the items in dirnames and filenames. There's no point in throwing away the basic stat information. I'd prefer for the onerror handler to match shutil.rmtree(), with three parameters: function, path, and excinfo (or just the exception instance instead of excinfo).

jonburdo commented 1 year ago

What sort of thing would os._walk(direntry=True) yield?

It would simply replace members of dirnames and filenames with os.DirEntry objects. So this would basically be the current os.walk implementation with a two line change, replacing dirs.append(entry.name) with dirs.append(entry if direntry else entry.name) - same for nondirs. And the same could be done for fwalk.

Of course if we just want a separate implementation, public or not, then we could maybe just get rid of direntry and strict_dirnames kwargs altogether. Instead we handle symlinks properly and always return os.DirEntry objects (maybe still a string path for dirpath though).

python / cpython

Change shutil.rmtree and os.walk to support very deep hierarchies #89727

Linked PRs

102932