ncoghlan / walkdir

Other
11 stars 3 forks source link

``all_paths`` don't yield subdirs of last directory when depth is given #4

Closed palaviv closed 8 years ago

palaviv commented 8 years ago

If I have the following directory tree:

test
|--file1.txt
|--file2.txt
|--test2
\----file1.txt
\----file2.txt
\----test3
|--test4
\----file1.txt
\----test5

When calling all_paths(filtered_walk('test', depth=1)) "test3, test5" directory's will not be listed.

palaviv commented 8 years ago

This issue was fixed in #5.

ncoghlan commented 8 years ago

Working on the release notes for 0.4, I realised the solution here isn't correct, as it means dir_paths emits directory paths beyond the requested depth limit, even though the underlying walk doesn't traverse those directories.

That is, for the given hierarchy, dir_paths(filtered_walk('test', depth=1)) produces the following with the current PyPI release:

test
test/test2
test/test4

With current master, it produces:

test
test/test2
test/test4
test/test2/test3
test/test4/test5

Unlike the file listings (where the parent depth is clearly the depth of interest), those are depth 2 directory paths in a depth 1 walk. While it's arguable whether or not those paths should be shown when all_paths is applied to a depth 1 walk, I now think it's definitely wrong for dir_paths.

In terms of the way the limit_depth filter works, what's happening is that it only clears the subdirectory listing after the parent directory has been processed. This means that the directories don't get visited, but the updated dir_paths and all_paths still display them (since they now use the pre-clear subdirectory list, rather than the post-clear traversal).

I'll propose a new approach to handling that in a separate comment.

ncoghlan commented 8 years ago

In some ways, this is a variant of the original dirsymlink problem covered in https://github.com/ncoghlan/walkdir/pull/2: there are multiple options for handling leaf directories in limit_depth, and the current API only supports one of them. Specifically, the options are:

Those are the only options the os.walk triplet model provides - there's no way for an underlying walk to report "subdirectory that will not be traversed" except to move it to the file list.

ncoghlan commented 8 years ago

So here's my idea: what if we changed limit_depth to clear the subdirectory listing early by default, but added a new leaf_subdirs_as_files option that included the subdirectories of leaf directories in the file listing, but appended os.sep to them in the process. file_paths and all_paths would then be adjusted to filter out any file listing entries that ended with os.sep.

With that change, the expected output when iterating over filtered_walk("test", depth=1) would revert to what it currently produces with 0.3.

dir_paths:

test
test/test2
test/test4

file_paths:

test/file1.txt
test/file2.txt
test/test2/file1.txt
test/test2/file2.txt
test/test4/file1.txt

all_paths:

test
test/file1.txt
test/file2.txt
test/test2
test/test4
test/test2/file1.txt
test/test2/file2.txt
test/test4/file1.txt

With this model, all_paths could also gain a new include_leaf_subdirs option to skip filtering out the entries that end with os.sep, and instead strip the added suffix, giving:

test
test/file1.txt
test/file2.txt
test/test2
test/test4
test/test2/file1.txt
test/test2/file2.txt
test/test2/test3
test/test4/file1.txt
test/test4/test5

This approach is by no means perfect (since the file lists emitted by limit_depth may contain items that aren't actually files), but seems to offer a reasonable trade-off between hiding information about the subdirectories of the leaf directories by default (matching what 0.3 currently does at the convenience iterator level) while still getting access to that info when desired.

Unfortunately, I'm not currently working on a project that does a lot of filesystem traversal (that's why walkdir development stalled for so long - I moved on from the project I wrote it to help with), so I don't have a great intuition for the most useful API here :(

ncoghlan commented 8 years ago

There's also a simpler option (which avoids touching the file list at all), which is to make the option "include_leaf_subdirs", and have it be off by default for both limit_depth and filtered_walk (preserving the current behaviour of both dir_paths and all_paths).

Setting it to on would restore the current behaviour of limit_depth (where it doesn't clear the subdirectory list until after iteration) in the underlying walk, with the corresponding changes to the output of dir_paths and all_paths.

palaviv commented 8 years ago

I am against changing filtered_walk as it is very useful without the *_paths functions and the current implementation of filtered_walk works great and is not changed from 0.3. What I think would be a better solution is just changing dir_path documentation to explain that it will list all subdirs of the folders in depth depth.

ncoghlan commented 8 years ago

After sleeping on it, what do you think of the idea of reverting dir_paths to its 0.3 implementation (so it goes back to only reporting visited directories) and adding a subdir_paths that uses the current implementation where new root directories are still reported when visited, but subdirectories are reported when their parent directory is visited (thus capturing subdirectories of leaf directories in a depth limited walk, as well as symlinks to directories when link following is turned off).

Then the only user visible change in behaviour for 0.4 would be that all_paths would become an interleaved combination of subdir_paths and file_paths, rather than dir_paths and file_paths.

ncoghlan commented 8 years ago

Since adding subdir_paths and reverting the behaviour of dir_paths is a separate idea (now filed as #12) and doesn't impact the new behaviour of all_paths, closing this again.