Closed palaviv closed 8 years ago
This issue was fixed in #5.
Working on the release notes for 0.4, I realised the solution here isn't correct, as it means dir_paths
emits directory paths beyond the requested depth limit, even though the underlying walk doesn't traverse those directories.
That is, for the given hierarchy, dir_paths(filtered_walk('test', depth=1))
produces the following with the current PyPI release:
test
test/test2
test/test4
With current master, it produces:
test
test/test2
test/test4
test/test2/test3
test/test4/test5
Unlike the file listings (where the parent depth is clearly the depth of interest), those are depth 2 directory paths in a depth 1 walk. While it's arguable whether or not those paths should be shown when all_paths
is applied to a depth 1 walk, I now think it's definitely wrong for dir_paths
.
In terms of the way the limit_depth
filter works, what's happening is that it only clears the subdirectory listing after the parent directory has been processed. This means that the directories don't get visited, but the updated dir_paths and all_paths still display them (since they now use the pre-clear subdirectory list, rather than the post-clear traversal).
I'll propose a new approach to handling that in a separate comment.
In some ways, this is a variant of the original dirsymlink
problem covered in https://github.com/ncoghlan/walkdir/pull/2: there are multiple options for handling leaf directories in limit_depth
, and the current API only supports one of them. Specifically, the options are:
dir_paths
and all_paths
, as well as in the filtered_walk
and limit_depth
resultsdir_paths
and all_paths
when combined with limit_depth
, at the expense of changing the behaviour of filtered_walk
and limit_depth
(since the subdirectory list for leaf directories will now be empty during processing).dir_paths
, but changes the behaviour of file_paths
and all_paths
(to include subdirectories of leaf directories) and of filtered_walk
and limit_depth
(since subdirectories of leaf directories will appear in the file list rather than the subdirectory list)Those are the only options the os.walk triplet model provides - there's no way for an underlying walk to report "subdirectory that will not be traversed" except to move it to the file list.
So here's my idea: what if we changed limit_depth
to clear the subdirectory listing early by default, but added a new leaf_subdirs_as_files
option that included the subdirectories of leaf directories in the file listing, but appended os.sep
to them in the process. file_paths
and all_paths
would then be adjusted to filter out any file listing entries that ended with os.sep
.
With that change, the expected output when iterating over filtered_walk("test", depth=1)
would revert to what it currently produces with 0.3.
dir_paths
:
test
test/test2
test/test4
file_paths
:
test/file1.txt
test/file2.txt
test/test2/file1.txt
test/test2/file2.txt
test/test4/file1.txt
all_paths
:
test
test/file1.txt
test/file2.txt
test/test2
test/test4
test/test2/file1.txt
test/test2/file2.txt
test/test4/file1.txt
With this model, all_paths
could also gain a new include_leaf_subdirs
option to skip filtering out the entries that end with os.sep
, and instead strip the added suffix, giving:
test
test/file1.txt
test/file2.txt
test/test2
test/test4
test/test2/file1.txt
test/test2/file2.txt
test/test2/test3
test/test4/file1.txt
test/test4/test5
This approach is by no means perfect (since the file lists emitted by limit_depth
may contain items that aren't actually files), but seems to offer a reasonable trade-off between hiding information about the subdirectories of the leaf directories by default (matching what 0.3 currently does at the convenience iterator level) while still getting access to that info when desired.
Unfortunately, I'm not currently working on a project that does a lot of filesystem traversal (that's why walkdir development stalled for so long - I moved on from the project I wrote it to help with), so I don't have a great intuition for the most useful API here :(
There's also a simpler option (which avoids touching the file list at all), which is to make the option "include_leaf_subdirs", and have it be off by default for both limit_depth
and filtered_walk
(preserving the current behaviour of both dir_paths
and all_paths
).
Setting it to on would restore the current behaviour of limit_depth
(where it doesn't clear the subdirectory list until after iteration) in the underlying walk, with the corresponding changes to the output of dir_paths
and all_paths
.
I am against changing filtered_walk
as it is very useful without the *_paths
functions and the current implementation of filtered_walk
works great and is not changed from 0.3.
What I think would be a better solution is just changing dir_path
documentation to explain that it will list all subdirs of the folders in depth depth
.
After sleeping on it, what do you think of the idea of reverting dir_paths
to its 0.3 implementation (so it goes back to only reporting visited directories) and adding a subdir_paths
that uses the current implementation where new root directories are still reported when visited, but subdirectories are reported when their parent directory is visited (thus capturing subdirectories of leaf directories in a depth limited walk, as well as symlinks to directories when link following is turned off).
Then the only user visible change in behaviour for 0.4 would be that all_paths
would become an interleaved combination of subdir_paths
and file_paths
, rather than dir_paths
and file_paths
.
Since adding subdir_paths
and reverting the behaviour of dir_paths
is a separate idea (now filed as #12) and doesn't impact the new behaviour of all_paths
, closing this again.
If I have the following directory tree:
When calling
all_paths(filtered_walk('test', depth=1))
"test3, test5" directory's will not be listed.