Open ad83914d-c8e2-4b82-8a39-fccff147f12c opened 3 years ago
It is possible to create deep directory hierarchies that cannot be removed via shutil.rmtree or walked via os.walk, because these functions exceed the interpreter recursion limit. This may have security implications for web services (e.g. various webdisks) that have to clean up user-created mess or walk through it.
[aep@aep-haswell ~]$ mkdir /tmp/badstuff
[aep@aep-haswell ~]$ cd /tmp/badstuff
[aep@aep-haswell badstuff]$ for x in `seq 2048` ; do mkdir $x ; cd $x ; done
[aep@aep-haswell 103]$ cd
[aep@aep-haswell ~]$ python
Python 3.9.7 (default, Oct 10 2021, 15:13:22)
[GCC 11.1.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import shutil
>>> shutil.rmtree('/tmp/badstuff')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3.9/shutil.py", line 726, in rmtree
_rmtree_safe_fd(fd, path, onerror)
File "/usr/lib/python3.9/shutil.py", line 663, in _rmtree_safe_fd
_rmtree_safe_fd(dirfd, fullname, onerror)
File "/usr/lib/python3.9/shutil.py", line 663, in _rmtree_safe_fd
_rmtree_safe_fd(dirfd, fullname, onerror)
File "/usr/lib/python3.9/shutil.py", line 663, in _rmtree_safe_fd
_rmtree_safe_fd(dirfd, fullname, onerror)
[Previous line repeated 992 more times]
File "/usr/lib/python3.9/shutil.py", line 642, in _rmtree_safe_fd
fullname = os.path.join(path, entry.name)
File "/usr/lib/python3.9/posixpath.py", line 77, in join
sep = _get_sep(a)
File "/usr/lib/python3.9/posixpath.py", line 42, in _get_sep
if isinstance(path, bytes):
RecursionError: maximum recursion depth exceeded while calling a Python object
>>> import os
>>> list(os.walk('/tmp/badstuff'))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3.9/os.py", line 418, in _walk
yield from _walk(new_path, topdown, onerror, followlinks)
File "/usr/lib/python3.9/os.py", line 418, in _walk
yield from _walk(new_path, topdown, onerror, followlinks)
File "/usr/lib/python3.9/os.py", line 418, in _walk
yield from _walk(new_path, topdown, onerror, followlinks)
[Previous line repeated 993 more times]
File "/usr/lib/python3.9/os.py", line 412, in _walk
new_path = join(top, dirname)
File "/usr/lib/python3.9/posixpath.py", line 77, in join
sep = _get_sep(a)
File "/usr/lib/python3.9/posixpath.py", line 42, in _get_sep
if isinstance(path, bytes):
RecursionError: maximum recursion depth exceeded while calling a Python object
>>>
This also affects pathlib.Path.walk()
I am unsure on the bug vs feature classification here. On one side it seems like a clear problem and the fixes don’t change function signatures or the way they are used, but on the other side there is no guarantee that extremely deep hierachies are supported, and I wonder if there could be negative effects from the ocde changes (for example, can the stack
list/deque grow extremely large during traversal?).
(Adding PR reviewers: @carljm @brettcannon @serhiy-storchaka)
I don't have strong feelings about calling it a bug vs a feature; I would tend to call it a bug because the function just fails in a subset of cases that aren't obviously outside its domain. I agree that it's a bug that we could just document as a known limitation, though I don't see any reason to do that when the fix is not that difficult. I guess the only real impact of how it's classified is whether the fix would be backported?
I wonder if there could be negative effects from the ocde changes (for example, can the stack list/deque grow extremely large during traversal?)
Sure, but in the current code the Python stack would instead grow very large in the same scenario. And the Python stack frames (plus everything they reference) are almost certainly larger than the stack elements tracked in the iterative version, so if anything I would expect the iterative version to also save memory in the case of a very deep traversal.
Yes exactly, the impact of my question is backporting or not. Thanks for the reply about the memory concern!
I take it you feel ok about backporting the fix; let’s wait to see what a core dev thinks.
I would classify this as a feature request. There are plenty of things in CPython for which you can you run out of stack space for and it isn't considered a bug in those instances (and that's just part of general limitations that CPython has).
(To be clear, I don't think there's a strong case for backporting this, so reasoning backwards from the conclusion I'm quite happy to call it a feature :P )
Are glob.glob(recursive=True)
and pathlib.Path.rglob()
also affected?
Are
glob.glob(recursive=True)
andpathlib.Path.rglob()
also affected?
Yes, it looks like it. _rlistdir
in Lib/os.py
and _RecursiveWildcardSelector
in Lib/pathlib.py
could be changed to fix it
Some previous discussion in #70732.
Am I right in thinking that os.walk(followlinks=True)
can now consume all available memory if a symlink loop is encountered? Previously it would raise RecursionError
.
I'm not convinced we should implement this feature request.
A well-designed portable application shouldn't create very deep directory trees. For one thing, you'll hit MAX_PATH
shenanigans on Windows long before you hit the Python recursion limit.
The remaining use case is where something has gone wrong, e.g. buggy code has created nonsense trees, or a malicious archive has been extracted. That seems sufficiently niche that calling sys.setrecursionlimit()
before shutil.rmtree()
is a reasonable solution.
I'm not convinced we should implement this feature request.
I understand the hesitation, but I think it's worth it and have some thoughts in addition to comments in support in this discussion thread
A well-designed portable application shouldn't create very deep directory trees. For one thing, you'll hit
MAX_PATH
shenanigans on Windows long before you hit the Python recursion limit.
Applications are not the only things that create directory trees. For many applications, portability is not a design goal, and many applications that are used are not well-designed. I'd also argue that your point highlights that shutil.rmtree
is not currently portable itself. By default, it can delete basically any normal directory tree on some systems but not on others.
The remaining use case is where something has gone wrong, e.g. buggy code has created nonsense trees, or a malicious archive has been extracted. That seems sufficiently niche that calling
sys.setrecursionlimit()
beforeshutil.rmtree()
is a reasonable solution.
I think deep directory trees are less common, but I wouldn't say wanting to delete them when they occur is niche.
A few other thoughts:
In any situation where the call stack is already close to the recursion limit, you don't need a deep directory tree in order to hit the recursion limit. A function might recursively traverses a graph and then call os.walk
or shutil.rmtree
.
Recursion is an implementation detail which is not exposed to the user in the documentation for all of these functions, but limits the user in unexpected ways. For shutil.rmtree
in particular, there's no implication that recursion is even used, and I was surprised to find that it was. Python users may easily have no idea that a recursion limit exists.
I think the speed improvements alone are significant enough to warrant these changes (over 6-7% in my os.walk
benchmarks)
In my view the cost in terms of complexity is very low, and the benefit is significant. Any significant downsides I might be missing?
Am I right in thinking that
os.walk(followlinks=True)
can now consume all available memory if a symlink loop is encountered? Previously it would raiseRecursionError
.
This seems like an important concern, @Ovsyanka83 could you comment on this situation?
I am currently gathering info on this issue. Will reply within ~2 hours with a short report on OS X, Windows, and Linux.
I have created a symlink loop on 3 OSes and tested how they act. I am obviously no expert and my results are obviously not covering the entire playing field but they might help us understand just how bad the situation is.
On all three OSes an OSError gets raised and caught on the line of code attached.
OSError(40, 'Too many levels of symbolic links')
after walk yields 41 times.OSError(22, 'The name of the file cannot be resolved by the system'
after walk yields 32 times. (I believe the main reason is the maximum path length and that we could get more yields if I started from C:
but that does not affect our results).OSError(62, 'Too many levels of symbolic links')
after walk yields 33 times.Seems like each OS is preventing us from shooting ourselves in the knee so I would not worry too much. However, that only covers some platforms where python is used. What about others?
Essentially my point is: if a platform does not have any protections against denial-of-service by symlink loops, then we are in trouble. However, I do not know of such platforms and all resources I visited never mentioned such platforms.
Thank you @jonburdo and @Ovsyanka83, you've put my mind at ease :-)
I think you changed os.walk
's behavior. The recursive one supported late modifications of subdirs lists, due to being lazier. The iterative one doesn't (if I'm not mistaken... I can't test).
With topdown, you can modify a directory's subdirs, for example remove one and it won't be visited. Usually you'd do that while your walk is on that parent directory. But with the recursive one, it was possible to do it later. With the iterative one, I don't think that works anymore.
Here's a demo where I remove the second subdir after having walked onto the first:
import os
# Create demo dir with three subdirs
os.makedirs('demo/a')
os.makedirs('demo/b')
os.makedirs('demo/c')
# Walk onto "demo"
walk = os.walk('demo')
demo = next(walk)
print(demo)
# Walk onto first subdir
first = next(walk)
print(first)
# Remove the second subdir
del demo[1][1]
# Walk onto the remaining subdirs
for x in walk:
print(x)
Output (Try it online!), note that the second subdir wasn't walked:
('demo', ['c', 'a', 'b'], [])
('demo/c', [], [])
('demo/b', [], [])
Excerpt from the recursive one:
for dirname in dirs:
new_path = join(top, dirname)
yield from _walk(new_path, topdown, onerror, followlinks)
The iteration of dirs
and the walking of the subdirs are intertwined. After walking a subdir, the paused iteration of dirs
resumes. That allows the late modifications of dirs
to have an effect.
Excerpt from the iterative one:
for dirname in reversed(dirs):
new_path = join(top, dirname)
stack.append(new_path)
This eagerly puts all subdirs onto the stack, before they're getting walked, and then dirs
is never used again. So modifying dirs
during the subdirs walking doesn't have an effect anymore.
Is that used in practice? The documentation seems to imply that dirnames should only be modified before iteration resumes (my emphasis):
When topdown is True, the caller can modify the dirnames list in-place (perhaps using del or slice assignment), and walk() will only recurse into the subdirectories whose names remain in dirnames; this can be used to prune the search, impose a specific order of visiting, or even to inform walk() about directories the caller creates or renames before it resumes walk() again.
I don't know whether it's used. But someone might, and I think justifiably, since:
That said, I agree it's unusual and wouldn't be surprised if nobody did that.
If we do want to restore the old behavior, I think we could instead put an iterator (forward, not reversed) on the stack:
stack.append((
new_path
for dirname in dirs
for new_path in [join(top, dirname)]
if followlinks or not islink(new_path)
))
And when taking an element from the stack and it's an iterator, then get its next element. The loop in the bottom-up case could then be replaced with just stack.append(iter(walk_dirs))
.
Honestly, I am quite baffled by your discovery. I agree that changing even small implementation details is dangerous on such an old function. However, I also would never think about using walk
this way so I will leave the decision to the adults here :)
I do not think that this affects our changes for pathlib.Path.walk
though because no old code is going to be broken.
Thank you for finding this out and notifying us!
P.S. If we do decide to agree with you on this and implement "late dirnames modifications", then we must also add this as a test case to prevent any breaking changes in the future.
It might have resulted in cleaner, more predictable behavior if os.walk
had always been designed to copy dirs
right before beginning to iterate over it, preventing this late editing. Editing lists while iterating over them is generally discouraged for good reason. Note that in this example with recursive os.walk
, you get the same results if you change del demo[1][1]
to del demo[1][0]
.
But this has existed for a while, and people may rely on it, so it would probably be ideal to maintain the old behavior.
I looked into this a bit and your suggestions @pochmann :
stack.append(( new_path for dirname in dirs for new_path in [join(top, dirname)] if followlinks or not islink(new_path) ))
Something like this should work, although note that
top
will be changed between calls tonext
on this generator, which causes problems. We need the same value thattop
is set to when this generator is created. Something like this could work although it's pretty ugly:stack.append(( new_path for top, dirs in [(top, dirs)] for dirname in dirs for new_path in [join(top, dirname)] if followlinks or not islink(new_path) ))
It might also be better to just do
stack.append((top, dirs))
and handle the joining and link logic when these values are used from the stack.
note that
top
will be changed between calls tonext
on this generator, [...] Something like this could work although it's pretty ugly
Another option, with itertools (untested, having computer problems):
new_paths = map(join, repeat(top), dirs)
stack.append(
new_paths if followlinks else
filterfalse(islink, new_paths)
)
I see you have bigger changes in mind in your PR. Will look later.
Is there any consensus on this issue of whether or not to support this old behavior of supporting late editing of dirs?
I would vote against it. But I do not think there is a clear consensus. We do not have any proof for or against the hypothesis that someone uses this behavior. And even if someone does, it feels more like a bug that we have not fixed for years.
And even if someone does, it feels more like a bug that we have not fixed for years.
Yeah, I think I'd agree. I'll leave the PR to support it ( https://github.com/python/cpython/pull/100703 ) open for now.
I also converted to a draft my PR for an iterative os.fwalk
( https://github.com/python/cpython/pull/100347 ), since this will introduce the same situation.
Thanks for your contributions, everyone. I've logged the late edits issue separately:
Is there anything left to do here?
Is there anything left to do here?
shutil.rmtree
is still recursive. I was thinking of giving that a shot when I have time.
I also started a PR for os.fwalk
linked here, but marked it as a draft, since I wasn't sure what the decision on handling late edits would be. I'm happy to revisit it if it seems worthwhile. That would at least make os.fwalk
consistent with os.walk
.
Should those be tracked in separate issues?
shutil.rmtree
is still recursive. I was thinking of giving that a shot when I have time.I also started a PR for
os.fwalk
linked here, but marked it as a draft, since I wasn't sure what the decision on handling late edits would be. I'm happy to revisit it if it seems worthwhile. That would at least makeos.fwalk
consistent withos.walk
.Should those be tracked in separate issues?
I think it's sensible to make fwalk()
consistent with walk()
, even if it's affected by #102932. Do you think we could then implement rmtree()
atop walk()
/ fwalk()
?
Both problems should be tracked here, I think, because they directly relate to the original report.
Okay, makes sense. I'll revisit that os.fwalk
PR.
Do you think we could then implement
rmtree()
atopwalk()
/fwalk()
?
I'm not sure right off. I'll look into it.
Following up on implementing rmtree
with walk
/ fwalk
, there are some os.DirEntry
calls, entry.is_junction()
and entry.is_symlink()
. These probably could be done with os.path
functions, but I'm not sure how much of a difference this would make. It still might just be more straightforward either way to have rmtree
use its own stack rather than using walk
/ fwalk
.
It seems like a walk
function that returns the os.DirEntry
objects in the dirs list would be nice - both for this case and maybe as a user-facing function. Even if the status of entry.stat()
cache is unknown, it would still be useful. Any thoughts? See related topic: https://discuss.python.org/t/get-direntry-objects-collected-during-os-walk/8143
Following up on implementing
rmtree
withwalk
/fwalk
, there are someos.DirEntry
calls,entry.is_junction()
andentry.is_symlink()
. These probably could be done withos.path
functions, but I'm not sure how much of a difference this would make.
I think the entry.is_symlink()
call can be replaced with os.path.islink()
quite readily. It's there's to address a timing issue, and arguably shouldn't rely on the entry
at all.
The entry.is_junction()
call is interesting. Perhaps walk()
should treat junctions like symlinks, which seems to be the behaviour in some other os
functions. Alternatively we could add a followjunctions argument to os.walk()
(and fwalk()
).
I hereby summon @eryksun for his Windows filesystem expertise!
It seems like a
walk
function that returns theos.DirEntry
objects in the dirs list would be nice - both for this case and maybe as a user-facing function.
Maybe. There's a really high threshold to meet for additions to CPython, and I'd like to see if we can make walk()
work first.
Going deeper, should os.DirEntry.is_dir(follow_symlinks=False)
return False
for junctions, as it would for symlinks to directories?
POSIX has no file type that corresponds to a Windows filesystem junction, which is a name-surrogate reparse point that's tagged IO_REPARSE_TAG_MOUNT_POINT
. POSIX implements mountpoints differently, using a centralized table, whereas Windows uses a hybrid system that's partly decentralized using reparse points. Python generally handles junctions as directories, such as when copying (e.g. shutil.copytree()
). However, junctions are special cased to be handled like symlinks when deleting (e.g. os.unlink()
and shutil.rmtree()
). Windows shells such as CMD and Explorer handle junctions similarly.
os.unlink()
special cases directory symlinks and junctions by calling WinAPI RemoveDirectoryW()
. The latter has special handling when deleting a junction that targets the root path of a volume GUID device name (e.g. "\??\Volume{12345678-1234-1234-1234-123456781234}\", i.e. a volume mount point). In this case, it calls DeleteVolumeMountPointW()
, which attempts to remove the mountpoint from the system's mountpoint database. The latter is the basis for the canonical DOS path of a volume, as returned by GetFinalPathNameByHandleW()
, so it's important to keep it accurate. (If the caller lacks administrator access, updating the mountpoint database will fail, but the directory might still get deleted. To avoid this, volume mountpoints should be secured to only allow deletion by an administrator.)
In some respects, a junction is similar to a bind mountpoint on POSIX. Of note here is that the path parsing behaviors of mountpoints and symlinks are distinctly different. For example, when opening "C:\Mount\mountpoint\spamlink", the system uses "C:\Mount\mountpoint" as the opened path when resolving "spamlink". If, for example, "spamlink" targets "..\spam", the system will resolve it as "C:\Mount\spam". In other words, "mountpoint" is handled like a POSIX directory mount point, not like a symlink. This is similar to how "/mnt/mountpoint/spamlink" would be resolved on POSIX as "/mnt/spam". On the other hand, when opening "C:\Links\symlink\spamlink", the resolved path of "spamlink" will depend on the resolved path of "symlink", which could be a path on another device or even on another machine.
The other distinct difference with junctions is that they're always resolved on the machine where they reside. The system enforces this by requiring the target of a junction to be a directory on a local filesystem. Say, for example, you have a local directory that's shared as "\\server\share" and you want to link a local directory "F:\spam" as "\\server\share\spam". You can easily implement this using a junction since reparsing the junction is handled locally on your server. Now say that a client user makes a local copy of "\\server\share". The client needs to copy the junction "\\server\share\spam" as a regular directory. It shouldn't be copied as a reparse point. At best that would be meaningless on the client system, and at worst it would be dangerously wrong if the client coincidentally has a local "F:\spam" directory.
Symlinks, on the other hand, are always resolved on the client machine. Thus if you tried to implement "\\server\share\spam" as a symlink to "F:\spam", the client would attempt to resolve the symlink as a path on the client's local drive "F:". By default the system won't even allow this (thankfully) because the default policy is to forbid evaluating remote-to-local symlinks. To avoid the dangerous nonsense of a remote-to-local symlink, you could share "F:\spam" as "\\server\spam_share" and link to that. This is sensible, but it's probably disallowed. The default symlink policy forbids evaluating remote-to-remote symlinks. (Local-to-local and local-to-remote symlinks are allowed by default.)
Thank you Eryk, that's very useful to know.
Some possible ways forward:
os.walk()
, defaulting to True
. We could then re-implement rmtree()
atop walk()
, setting followjunctions to False
walk()
API that exposes the os.DirEntry
objects, and then re-implement rmtree()
atop that.rmtree()
to be iterative, but without making it use walk()
.I'm keen to avoid (3) if we can. The iterative versions of these functions are more complicated than their recursive predecessors, and so I'd rather have a small number of high-quality implementations.
Adding followjunctions=True
to os.walk()
works for me. By default, junctions should be traversed the same as directories, such as in shutil.copytree()
. OTOH, deleting a tree would use followjunctions=False
, such as in shutil.rmtree()
.
Adding
followjunctions=True
toos.walk()
works for me
I'll give this a shot.
I spotted a problem with using os.walk()
/ os.fwalk()
to implement rmtree()
: for historical reasons, os.walk()
places symlinks to directories in the dirnames
list even if followlinks
is set to False
.
This is a wart we removed when we added pathlib.Path.walk()
. So there's a possible alternative here:
fwalk()
method to pathlib.Path
follow_junctions
argument to pathlib.Path.walk()
shutil.rmtree()
atop those methods.I'm going to have a crack at this.
Won't this be significantly slower? On my system, with a tree with ~5000 directories under it, Path.walk
takes about 20-30% longer than os.walk
. shutil.rmtree
takes a lot longer of course, but even a 5% slowdown seems significant.
Why not adjust os.walk
to allow returning os.DirEntry
objects? This could even be an internal function for now.
os.walk()
places symlinks to directories in thedirnames
list even iffollowlinks
is set toFalse
.
Preferably dirnames
would be ignored when implementing rmtree()
using a bottom-up walk. The directories in that list should have been removed already. To get the desired behavior regarding symlinks/junctions, a parameter could be added to os.walk()
to enable yielding non-followed symlinks/junctions in filenames
instead of dirnames
.
Won't this be significantly slower? On my system, with a tree with ~5000 directories under it, Path.walk takes about 20-30% longer than
os.walk
.shutil.rmtree
takes a lot longer of course, but even a 5% slowdown seems significant.
Yep good point, it might be too slow. I also measure Path.walk()
as ~20% slower than os.walk
, though my implementation of Path.fwalk()
is within 5% of os.fwalk()
.
OTOH, Path.walk()
treats symlinks exactly as required by rmtree()
, unlike os.walk()
.
Why not adjust
os.walk
to allow returningos.DirEntry
objects? This could even be an internal function for now.
What does the API look like?
Preferably
dirnames
would be ignored when implementingrmtree()
using a bottom-up walk. The directories in that list should have been removed already.
In both os.walk()
and Path.walk()
, if an os.DirEntry.is_dir()
call succeeds but a subsequence scandir()
of the same directory fails, then the directory will appear in dirnames
of its parent but not be visited afterward. If rmtree()
is then implemented on top, this behaviour causes test failures in test_tempfile
.
Arguably Path.walk()
could yield a result with empty dirnames
and filenames
when scandir()
fails.
To get the desired behavior regarding symlinks/junctions, a parameter could be added to
os.walk()
to enable yielding non-followed symlinks/junctions infilenames
instead ofdirnames
.
follow_symlinks_for_realsies
? :)
rmtree()
would use a bottom-up walk instead of the default top-down walk. Thus if os.scandir()
fails for a directory, it does so before dirnames
gets emitted for the parent directory. Each directory would be deleted via os.rmdir(dirpath)
, after first removing all of the yielded filenames
. At this point, subdirectories have already been visited and removed, and dirnames
would be completely ignored.
follow_symlinks_for_realsies? :)
I was being serious. Maybe strict_dirnames=False
.
rmtree()
would use a bottom-up walk instead of the default top-down walk. Thus ifos.scandir()
fails for a directory, it does so beforedirnames
gets emitted for the parent directory. Each directory would be deleted viaos.rmdir(dirpath)
, after first removing all of the yieldedfilenames
. At this point, subdirectories have already been visited and removed, anddirnames
would be completely ignored.
Even with topdown=False
, the function internally visits directories from top to bottom (there's no other way), and what I wrote about there being no 3-tuple yielded for un-scandir()
able directories still applies.
I was being serious. Maybe
strict_dirnames=False
.
Sorry, I didn't mean to imply you weren't being serious, only that I thought the naming wasn't obvious and that follow_symlinks
should already take care of this. strict_dirnames
is OK but I'm generally not a fan of complicating a function signature to allow users to make things behave how they should always have behaved. And I don't really see the need to add this when we already have pathlib.Path.walk()
, which gets symlink handling right -- in fact, I believe this improvement was one of the things Brett Cannon thought tipped the balance in favour of adding Path.walk()
at all.
what I wrote about there being no 3-tuple yielded for un-
scandir()
able directories still applies.
Say os.walk()
calls scandir('d1')
, which contains subdirectory "d2" and no files. It pushes ('d1', ['d2'], [])
on the stack (a list), and then it pushes "d2" on the stack. The next iteration pops "d2" from the stack and calls os.scandir('d1/d2')
. If we assume it fails, then the default onerror
handler of rmtree()
re-raises the exception, and rmtree()
fails, as it should. If os.scandir('d1/d2')
succeeds, and "d2" contains file "f", then ('d1/d2', [], ['f'])
gets pushed on the stack. The latter gets yielded to rmtree()
, which calls os.unlink('d1/d2/f')
and then os.rmdir('d1/d2')
. Finally, ('d1', ['d2'], [])
gets yielded to rmtree()
, which calls os.rmdir('d1')
, and we're done.
You're totally right, of course Eryk :-). I must have made a mistake when I tried that approach previously, as when I try it now I get zero test failures from test_tempfile
.
I also measure
Path.walk()
as ~20% slower thanos.walk
, though my implementation ofPath.fwalk()
is within 5% ofos.fwalk()
.
Maybe that is close enough to just use Path.fwalk()
, although it's probably worth directly comparing any refactor of rmtree
to the existing implementation to make sure.
Why not adjust
os.walk
to allow returningos.DirEntry
objects? This could even be an internal function for now.What does the API look like?
It could simply mean introducing a kwarg like direntry=False
. This seems questionable to introduce a kwarg on os.walk
itself, but there are a couple of options:
os._walk
with direntry=True
, so this isn't public-facing. Maybe os.walk
just calls os._walk
always with direntry=False
.rmtree
and whatever else should make use of it (copytree
maybe) as something like os._walk
or os._fwalk
. This might essentially be what you've already done with Path.fwalk
. On the other hand, implementing a purely bottom up function that only returns os.DirEntry
objects for dirs could end up being simpler and faster by enough to be worthwhile.Either of these options could also help avoid introducing strict_dirnames=False
. And whether it's used for this or not, I think Path.fwalk
is good to have.
Personally I don't see an ideal solution, which can also comfortably made to be public-facing. Even if pathlib
's walk function were as fast as os
's , in some cases I specifically want string paths and avoiding Path
creation altogether is preferable. I'd personally love to have a public facing os.walk2
or something which implements os.walk
as we'd do it now if backwards compatibility weren't an issue. This is probably more extreme, ugly, confusing, etc than simply adding kwargs to os.walk
though.
What sort of thing would os._walk(direntry=True)
yield? A 2-tuple of (path, entries)
? A 3-tuple of (path, direntries, fileentries)
? A 4-tuple of (path, dirnames, direntries, filenames)
? Bearing in mind we want users to be able to influence the walk by adjusting the directory list, and distinguish between between files/directories/symlinks for the purposes of walking.
public facing
os.walk2
or something which implementsos.walk
as we'd do it now
If new functions are added, I suggest os.path.walk()
and os.path.fwalk()
. It's where I think they belong, in correspondence with pathlib.Path
. I'd prefer if it yielded os.DirEntry
instances for dirpath
and for the items in dirnames
and filenames
. There's no point in throwing away the basic stat information. I'd prefer for the onerror
handler to match shutil.rmtree()
, with three parameters: function
, path
, and excinfo
(or just the exception instance instead of excinfo
).
What sort of thing would
os._walk(direntry=True)
yield?
It would simply replace members of dirnames
and filenames
with os.DirEntry
objects. So this would basically be the current os.walk
implementation with a two line change, replacing dirs.append(entry.name)
with dirs.append(entry if direntry else entry.name)
- same for nondirs
. And the same could be done for fwalk
.
Of course if we just want a separate implementation, public or not, then we could maybe just get rid of direntry
and strict_dirnames
kwargs altogether. Instead we handle symlinks properly and always return os.DirEntry
objects (maybe still a string path for dirpath
though).
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields: ```python assignee = None closed_at = None created_at =
labels = ['library', '3.9', 'type-crash']
title = 'shutil.rmtree and os.walk are implemented using recursion, fail on deep hierarchies'
updated_at =
user = 'https://bugs.python.org/AlexanderPatrakov'
```
bugs.python.org fields:
```python
activity =
actor = 'Alexander.Patrakov'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Library (Lib)']
creation =
creator = 'Alexander.Patrakov'
dependencies = []
files = []
hgrepos = []
issue_num = 45564
keywords = []
message_count = 1.0
messages = ['404687']
nosy_count = 1.0
nosy_names = ['Alexander.Patrakov']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = 'crash'
url = 'https://bugs.python.org/issue45564'
versions = ['Python 3.9']
```
Linked PRs