proposal: io/fs,path/filepath: add iterator form of WalkDir

rogpeppe commented 11 months ago

It's often more convenient to use a for loop rather than a callback-based API, and fs.WalkDir and filepath.Walk are both callback-based APIs. I've been using github.com/kr/fs for years to work around this. The new iterate-over-func functionality (#61405) now makes it straightforward to adapt the stdlib functions into iterators, so perhaps we could provide iterator-based APIs as part of the standard library now.

Here's a straw-man implementation:

package fs

type WalkDirEntry struct {
    Path    string
    Entry   fs.DirEntry
    Err     error
    skipDir *bool
}

// SkipDir causes the iteration to skip the contents
// of the entry. This will have no effect if called outside the iteration
// for this particular entry.
func (entry WalkDirEntry) SkipDir() {
    *entry.skipDir = true
}

// WalkDirIter returns an iterator that can be used to iterate over the contents
// of a directory. It uses WalkDir under the hood. To skip a directory,
// call the SkipDir method on an entry.
func WalkDirIter(fsys fs.FS, root string) func(func(WalkDirEntry) bool) {
    return func(yield func(WalkDirEntry) bool) {
        fs.WalkDir(fsys, root, func(path string, d fs.DirEntry, err error) error {
            skipDir := false
            info := WalkDirEntry{
                Path:    path,
                Entry:   d,
                Err:     err,
                skipDir: &skipDir,
            }
            if !yield(info) {
                return fs.SkipAll
            }
            if skipDir {
                return fs.SkipDir
            }
            return nil
        })
    }
}

The main point of contention from my point of view is the semantics of the SkipDir method. Calling a method on an iterated item to determine the subsequent course of iteration seems somewhat unusual, but I don't see any other decent alternative.

[Edited to remove incorrect remarks about GC behavior].

mateusz834 commented 11 months ago

Any reason not to return the error as a second value? like this:

type WalkDirEntry struct {
    Path    string
    Entry   fs.DirEntry
    skipDir *bool
}

func WalkDirIter(fsys fs.FS, root string) func(func(WalkDirEntry, error) bool) {

Seems like the conclusion of #61405 is to return errors this way. https://github.com/golang/go/issues/61405#issuecomment-1708800305

rogpeppe commented 11 months ago

Any reason not to return the error as a second value? like this:

That was @Merovius's suggestion too. I wasn't sure, because the Path and Entry fields can be valid even when the error isn't, making the docs a bit awkward to write, which made me think it might not be the correct approach.

Nonetheless, here's a stab at it. Maybe it's not so awkward after all.

// WalkDirIter returns an iterator that can be used to iterate over the contents
// of a directory. The implementation uses [WalkDir].
//
// Each iteration produces an entry and an error that will be non-nil
// for unreadable directories. The [WalkDirEntry.Path] field is always valid;
// the [WalkDirEntry.Entry] field is always valid except for when the
// initial Stat on the root directory fails.
//
// If a directory's ReadDir method fails, the iteration will produce two
// entries, both with the directory's path: first with a nil error,
// giving the body of the loop the chance to stop or call [WalkDirEntry.SkipDir],
// and secondly with the error from ReadDir.
//
// To skip a directory, call the SkipDir method on an entry.
func WalkDirIter(fsys fs.FS, root string) func(func(WalkDirEntry, error) bool) {
    var skipDir bool // outside the loop so we'll only allocate once.
    return func(yield func(WalkDirEntry, error) bool) {
        fs.WalkDir(fsys, root, func(path string, d fs.DirEntry, err error) error {
            info := WalkDirEntry{
                Path:    path,
                Entry:   d,
                skipDir: &skipDir,
            }
            skipDir = false
            if !yield(info, err) {
                return fs.SkipAll
            }
            if skipDir {
                return fs.SkipDir
            }
            return nil
        })
    }
}

mateusz834 commented 11 months ago

I am also not a big fan of the SkipDir method, it looks weird to skip something (change the state of the iterator) using the result of the iterator, but on the other hand I don't know how this could be done differently with range-over-func. Was this discussed somewhere in #61405?

Merovius commented 11 months ago

It has been mentioned at least once

gazerro commented 11 months ago

I've observed (https://sourcegraph.com/search?q=context%3Aglobal+lang%3Ago+fs.WalkDir&patternType=standard&sm=1&groupBy=repo) that most fs.WalkDirFunc functions first check for the existence of an error. This makes me inclined to sympathize with @mateusz834's suggestion. However, what I observe most is that when the error is not checked first, access to the DirEntry argument occurs without verifying that it is not nil. (It is nil if the stat on the root directory has failed.)

This lack of checking obviously leads to a panic. There are also two missing checks in two tests in the Go repository.

I suggest not passing the error as a parameter, that is, leaving only one parameter, but ensuring that WalkDirIter returns an error in case the stat on the root directory fails:

files, err := fs.WalkDirIter(fsys, root)
if err != nil {
      // stat on the root directory failed
      ...
}
for file := range files {
    _ = file.Entry.Name() // this cannot panic because file.Entry is always guaranteed to be non-nil
}

rogpeppe commented 11 months ago

@gazerro Yes, I tend to agree, although it's a bit unfortunate that that makes it considerably harder (or perhaps no possible?) to implement WalkDirIter in terms of WalkDir AFAICS. Perhaps it might be preferable to reverse that, because I think it should be easier to implement WalkDir in terms of WalkDirIter with those semantics.

gazerro commented 11 months ago

Perhaps it might be preferable to reverse that, because I think it should be easier to implement WalkDir in terms of WalkDirIter with those semantics.

Yes, I agree. It would be more appropriate to implement WalkDir in terms of WalkDirIter and possibly deprecate WalkDir.

gazerro commented 11 months ago

I also suggest naming the new function DirWalker (or perhaps simply Entries? Though, this name might be overly simplistic). While WalkDir appropriately denotes an action, the new function simply returns an iterator (or a sequence, as one prefers to say). So it would be used as:

entries, err := fs.DirWalker(fsys, root)
if err != nil {
      // stat on the root directory failed
      ...
}
for entry := range entries {
    // ...
}

earthboundkid commented 11 months ago

Individual subdirectories and files can fail, so I don’t know what it saves you to return an error for the initial stat.

gazerro commented 11 months ago

@carlmjohnson if the first stat fails, the DirEntry.Entry field is nil. Therefore, before accessing DirEntry.Entry, it is necessary to check that it is not nil. This should be done both within an if err != nil { ... } statement and in cases where the error is not handled because we are not interested in handling it.

I have noticed many cases (https://sourcegraph.com/search?q=context%3Aglobal+lang%3Ago+fs.WalkDir&patternType=standard&sm=1&groupBy=repo) where this check is not done.

In addition, this is my opinion, I find it incorrect for the function to return an iterator on a root directory that does not exist.

golang / go

proposal: io/fs,path/filepath: add iterator form of WalkDir #64341