os: add ReadDir method for lightweight directory reading

rsc commented 3 years ago

os.File provides two ways to read a directory: Readdirnames returns a list of the names of the directory entries, and Readdir returns the names along with stat information.

On Plan 9 and Windows, Readdir can be implemented with only a directory read - the directory read operation provides the full stat information.

But many Go users use Unix systems.

On most Unix systems, the directory read does not provide full stat information. So the implementation of Readdir reads the names from the directory and then calls Lstat for each file. This is fairly expensive.

Much of the time, such as in the implementation of file system walking, the only information the caller of Readdir really needs is the name and whether the name denotes a directory. On most Unix systems, that single bit of information—is this name a directory?—is available from the plain directory read, without an additional stat. If the caller is only using that bit, the extra Lstat calls are unnecessary and slow. (Goimports, for example, has its own directory walker to avoid this cost.) In fact, a survey of existing Go code found that only about 10% of uses of ReadDir actually need more than names and is-directory bits.

It appears that a third way to read directories should be added, to let all this code be written more efficiently. Expanding on a suggestion by @mpx, I propose to add:

// ReadDir reads the contents of the directory associated with the file f
// and returns a slice of DirEntry values in directory order.
// Subsequent calls on the same file will yield later DirEntry records in the directory.
//
// If n > 0, ReadDir returns at most n DirEntry records.
// In this case, if ReadDir returns an empty slice, it will return an error explaining why.
// At the end of a directory, the error is io.EOF.
//
// If n <= 0, ReadDir returns all the DirEntry records remaining in the directory.
// When it succeeds, it returns a nil error (not io.EOF).
func (f *File) ReadDir(n int) ([]DirEntry, error) 

// A DirEntry is an entry read from a directory (using the ReadDir method).
type DirEntry interface {
    // Name returns the name of the file (or subdirectory) described by the entry.
    // This name is only the final element of the path, not the entire path.
    // For example, Name would return "hello.go" not "/home/gopher/hello.go".
    Name() string

    // IsDir reports whether the entry describes a subdirectory.
    IsDir() bool

    // Info returns the FileInfo for the file or subdirectory described by the entry.
    // The returned FileInfo may be from the time of the original directory read
    // or from the time of the call to Info. If the file has been removed or renamed
    // since the directory read, Info may return an error satisfying errors.Is(err, ErrNotExist).
    // If the entry denotes a symbolic link, Info reports the information about the link itself,
    // not the link's target.
    Info() (FileInfo, error)
}

The FS proposal would then adopt this ReadDir and ignore Readdir entirely.

In #41188 I wrote:

Various people have proposed adding a third directory reading option of one form or another, to get names and IsDir bits. This would certainly address the slow directory walk issue on Unix systems, but it seems like overfitting to Unix.

I still believe that, but the survey convinced me that nearly all existing Readdir uses fall into this category, so it's not quite so bad to provide an optimized path for Unix systems. The DirEntry.Info method specification above allows both the eager info loading of Plan 9/Windows and the lazy loading needed on Unix. In contrast to #41188, the laziness is explicitly allowed from the beginning, and failures of the lazy loading can be reported in the error result.

Thoughts?

Update: A few clarifications to common questions:

Info returns the result of Lstat on systems where symbolic links happen.
On file systems that do not return even IsDir bits in the directory listing, ReadDir will have to do the full lstat loop that Readdir does today. That seems OK: slow file systems are slow.
IsDir returns false for symlinks to directories, since Lstat's FileInfo's IsDir does too.
ReadDir vs Readdir corrects the spelling of the name but does not differ only in case. It also differs in signature, so mistakes will not silently misbehave; they will loudly miscompile. In any event, the motivation here is to set a good, short name for the FS proposal, and for a method many people will need to implement, ReadDir is better than the longer ReadDirEntries.

Update 2: A few changes were made along the way to acceptnce. See https://github.com/golang/go/issues/41467#issuecomment-708536303 for the final version.

benhoyt commented 3 years ago

I like this a lot (it's similar to Python's os.scandir, but simpler). A few questions:

1) On Unix systems, the d_type field can be DT_UNKNOWN if the file system doesn't support it. Docs says "Currently, only some file systems (among them: Btrfs, ext2, ext3, and ext4) have full support for returning the file type in d_type. All applications must properly handle a return of DT_UNKNOWN." So we need to support this case. Either ReadDir could call Lstat at the time of the read if it gets any d_type==DT_UNKNOWN values, or we need to change the signature to IsDir() (bool, error) to reflect the fact that it too may call Lstat realtime -- I vote keeping the signature as Russ has it and having ReadDir call Lstat in that case (should be rare, so falling back to slower but correct seems fine). 2) The name ReadDir seems to similar to the existing Readdir -- differentiating just on the case of one letter seems like asking for trouble. What about ReadEntries? 3) Presumably IsDir and Info return info about the entry itself, without following symlinks? In other words, it's like Lstat and not Stat? Does this need mentioning explicitly? Or maybe it's clear from "the entry"? 4) What does IsDir return in the case of a symbolic link (to a directory)? Presumably false given the above.

mpx commented 3 years ago

This looks great.

I felt that there may be some advantage to making DirEntry a subset of FileInfo interface, hence I didn't include the Info method and started with the smaller interface. However, I can't convince myself there there is any tangible disadvantage to adding Info. There certainly needs to be a way of "upgrading" a DirEntry into a FileInfo.

One comment from #41188 regarding this suggestion was that d_type is not always available for IsDir. In those cases extra work needs to be done to provide that information. Name and IsDir are the minimum required to support implementing walk. Upgrading to FileInfo would be free on those systems, so it's no more expensive to do it this way.

It might be good to rename ReadDir due to the unfortunate similarity with Readdir. I understand it works in practice, but it might cause some confusion ("ReadDir" with the capital "D" or lowercase "d"?).

From my point of view, I think the ReadDir effectively replaces the need for Readdir and Readdirnames. If it existed from Go1 I wouldn't see the need for adding the other calls. Obviously the existing methods shouldn't be removed, but they may not be needed in io/fs.

I have a number tools that will benefit significantly from this proposal.

mpx commented 3 years ago

@benhoyt IsDir should return bool. As I mentioned on #41188 there is an advantage to providing interfaces that are guaranteed to work. Systems that cannot provide "IsDir" will need to do extra work up front. However, it will be effectively free for them to upgrade to FileInfo, so it should be no slower to "walk" a tree.

Good point regarding IsDir and symlinks. I think it needs to return false. Following a symlink will necessarily require another Stat to determine whether the target is a directory. This is probably worth documenting.

richardwilkes commented 3 years ago

Presumably IsDir and Info return info about the entry itself, without following symlinks? In other words, it's like Lstat and not Stat? Does this need mentioning explicitly? Or maybe it's clear from "the entry"?

Regardless of which way this ends up working (Lstat vs Stat), it should be mentioned explicitly in the comments/docs for the method, as what is "obvious" to one person will not be to another.

diamondburned commented 3 years ago

I felt that there may be some advantage to making DirEntry a subset of FileInfo interface, hence I didn't include the Info method and started with the smaller interface. However, I can't convince myself there there is any tangible disadvantage to adding Info. There certainly needs to be a way of "upgrading" a DirEntry into a FileInfo.

I personally don't see a point in this. Implementations could simply just return an internal FileInfo or itself when Info() is called.

rsc commented 3 years ago

@benhoyt, added answers above but (1) I agree with you; (2) ReadEntries is the wrong long-term name, and in the long term Readdir will just be deprecated (but not removed) and fade away; (3) yes, Lstat; (4) yes, false.

networkimprov commented 3 years ago

Russ, glad to see a new API proposed, thanks. Some Q's

This entails a per-item lstat() when the type is absent in native dirent. You don't want that if you don't need to check DirEntry.IsDir() for every result. Can it be optional?

Does DirEntry.Info() cache its result if it had to lstat(), or if ReadDir() had to?

Could we flush the FileInfo cache, so a subsequent .Info() reloads it?

Wouldn't the different behavior of DirEntry.Info() on unix & Windows cause trouble in cross-platform apps? Such apps would call os.File.Readdir() when they need FileInfo for all results, and xfs.File.ReadDir() otherwise. The mix of those two could be confusing.

There is no way to get the Windows fileId (analagous to inode) via FileInfo. Could a way be added to DirEntry?

Unix dirents include the inode -- needed for tree replication. Could DirEntry provide that?

For the latter two, I think you'd return a type FileId interface { ... }

EDIT: Background... I'm building a desktop app for Windows, MacOS, and Linux -- and a server app for Linux -- that make heavy use of the filesystem. Go has thrown me a fair number of curve balls, and I expect more :-/

diamondburned commented 3 years ago

This entails a per-item lstat() when the type is absent in native dirent. You don't want that if you don't need to check DirEntry.IsDir() for every result. Can it be optional?

As far as I know, no, this cannot be optional. Perhaps it could be if a more optimized Readdirnames could be made, but personally, I think that should stay there.

Does DirEntry.Info() cache its result if it had to lstat(), or if ReadDir() had to?

Could we flush the FileInfo cache, so a subsequent .Info() reloads it?

Wouldn't the different behavior of DirEntry.Info() on unix & Windows cause trouble in cross-platform apps? Such apps would call os.File.Readdir() when they need FileInfo for all results, and xfs.File.ReadDir() otherwise. The mix of those two could be confusing.

If we're going with my idea of returning an internal FileInfo, then I can see the behavior differing a little bit: on Windows, a FileInfo can be returned, but in Linux, a new FileInfo would be obtained on every Info() call.

In my honest opinion, this behavior is fine. In fact, it would be preferable if FileInfo of the Linux implementation is cached. The reason being that if the caller wants the newest information, then they should call Lstat or Stat explicitly.

There is no way to get the Windows fileId (analagous to inode) via FileInfo. Could a way be added to DirEntry?

Unix dirents include the inode -- needed for tree replication. Could DirEntry provide that?

Worst case scenario, the DirEnt interface could perhaps be overloaded to accommodate for the above flaws, but I feel like that would be overdesigning and overcomplicating it.

mpx commented 3 years ago

Could we flush the FileInfo cache, so a subsequent .Info() reloads it?

Any caller needing to guarantee a fresh stat should call Lstat/Stat themselves.

There is no way to get the Windows fileId (analagous to inode) via FileInfo. Could a way be added to DirEntry?

I've wanted to retrieve unique file identifiers for both Linux (inode) and Windows (fileId) to improve performance for several applications, so I'd appreciate a solution too. However, I think DirEntry should be kept simple. Most callers don't need a unique identifier, and it would unnecessarily complicate many implementations which don't use/need it. I think this could be solved via a separate proposal after this "DirEntry" proposal is settled. Eg, perhaps it could be a provided via an optional interface FileID() uint64.

rsc commented 3 years ago

I agree with what @diamondburned and @mpx said but just to reply directly as well.

This entails a per-item lstat() when the type is absent in native dirent. You don't want that if you don't need to check DirEntry.IsDir() for every result. Can it be optional?

In practice, when is that type absent? I'm not too worried if, say, VFAT file systems are slower to access.

Does DirEntry.Info() cache its result if it had to lstat(), or if ReadDir() had to?

The doc says the Info is either from the time of the ReadDir or the time of the Info call. So a stat during ReadDir can be cached, but not a stat during an earlier Info.

Could we flush the FileInfo cache, so a subsequent .Info() reloads it?

Adding cache state & manipulation to the API is needlessly complex. Lstat still exists.

Wouldn't the different behavior of DirEntry.Info() on unix & Windows cause trouble in cross-platform apps?

There is no magic wand here. Unix and Windows are different. They will always be different.

The path separator alone causes trouble in cross-platform apps. The answer is to use APIs like filepath.Clean and filepath.Join as appropriate. If you don't do that, you have trouble. It's not our job to make trouble impossible, only avoidable.

The docs on ReadDir are very clear about what you can rely on and what you cannot. If you rely on more than that, again, you will have trouble, same as hard-coding use of / or \ in your program.

Such apps would call os.File.Readdir() when they need FileInfo for all results, and xfs.File.ReadDir() otherwise. The mix of those two could be confusing.

Pretty much all apps should probably move to calling ReadDir or, if it matters, Lstat/Stat directly.

What's confusing is having two different APIs, but we clearly need a new one, and can't remove the old one. So be it.

There is no way to get the Windows fileId (analagous to inode) via FileInfo. Could a way be added to DirEntry? Unix dirents include the inode -- needed for tree replication. Could DirEntry provide that? For the latter two, I think you'd return a type FileId interface { ... }

[Note that Go would spell it FileID not FileId (the file does not have a mind).]

The concept seems too special-purpose for a general interface, and a bit difficult to use correctly. The problem is that Windows file IDs and Unix inode numbers are not really identifiers: they only identify a file within a particular file system. Another file system can have a file with the same file ID/inode number. So to use them properly you need to combine them with some kind of identifier for the file system itself (like fsid_t or dev_t on Unix). It gets messy fast.

Also, if you are writing "tree replication", then you are already stat'ing all the files to get the other metadata, and you're already writing very OS-specific code to preserve all the OS-specific attributes. That same code can easily grab the info you need as far as inode number and file system identifier.

It's not our job to union together all the possible APIs on all the possible systems. Our job is to find a simple API that is enough for the vast majority of Go programs, with an option for the rest to get at what they need (FileInfo.Sys in this case).

ianlancetaylor commented 3 years ago

What about Readdirnames?

If we expect all users of Readdirnames to switch to ReadDir over time, then I'm not quite convinced that IsDir should not return (bool, error). Neither AIX nor Solaris record the type of the file in the directory entry. Even on GNU/Linux there are various obscure file systems that do not record the dirent type.

The downside of IsDir returning (bool, error) seems fairly low to me.

networkimprov commented 3 years ago

Ian, wouldn't the error you're suggesting be returned by ReadDir() when it tries lstat() after seeing entries without types?

Or are you suggesting that ReadDir() should never lstat()?

ianlancetaylor commented 3 years ago

I'm suggesting that ReadDir should never call lstat, just as the current Readdirnames never calls lstat.

jimmyfrasche commented 3 years ago

Using the same name with different returns would mean a single type couldn't be both a DirEntry and a FileInfo

diamondburned commented 3 years ago

If we expect all users of Readdirnames to switch to ReadDir over time, then I'm not quite convinced that IsDir should not return (bool, error).

I don't think this should be the case, which is why I said "that should stay there." I think Readdirnames should use its own dirEnts to not require calling lstat or stat, which should allow people who don't need IsDir to skip having to call a few extra stats.

I also think that IsDir() (bool, error) would be a bit too verbose.

networkimprov commented 3 years ago

[ reposted after edits for clarity ]

It will cause confusion and bugs if .Info() & .IsDir() return new data every time on some filesystems but cached data on others. We performed a read, so we'd expect them to return whatever was retrieved by ReadDir() (possibly nothing). To get new data there's Lstat() as you said.

You could attain either best performance or cross-filesystem consistency if you could ask ReadDir() to:

satisfy .Name() -- never lstat (Ian's suggestion)
satisfy .Name(), .IsDir() -- possibly lstat (Russ' suggestion)
satisfy .Name(), .IsDir(), .Info() -- always lstat except on Windows (os.File.Readdir)

Let's not sacrifice sanity for the sake of simplicity.

if you are writing "tree replication", then you are already stat'ing all the files to get the other metadata

Actually, you don't need to lstat() if the inode matches one you've already seen.

an option for the rest to get at what they need (FileInfo.Sys in this case).

To clarify, there is no way to get the Windows fileID from FileInfo.Sys. You have to patch the stdlib at present; see https://groups.google.com/g/golang-dev/c/raE01Fa2Kmo/m/3Ah9JslnBgAJ

BTW Windows accepts / path separators; you only need Windows-native syntax in rare cases :-)

rsc commented 3 years ago

It will cause confusion and bugs if .Info() & .IsDir() return new data every time on some filesystems but cached data on others.

Sorry, but that's the design constraint here: it must be possible to either return info learned during ReadDir or info learned during Info. Otherwise you are overfitting to either Unix or non-Unix and penalizing the other.

rsc commented 3 years ago

Neither AIX nor Solaris record the type of the file in the directory entry.

Thanks for pointing this out. That's unfortunate but good to know.

jimmyfrasche commented 3 years ago

Sorry, but that's the design constraint here: it must be possible to either return info learned during ReadDir or info learned during Info. Otherwise you are overfitting to either Unix or non-Unix and penalizing the other.

I think the argument is about the second call to Info, not the first:

de.Info() // maybe cached depending on os/fs
de.Info() // always cached in all cases for this and all subsequent calls

networkimprov commented 3 years ago

os.File.Readdirnames() yields predictable results & performance. os.File.Readdir() yields predictable results, but unpredictable performance.

The proposed File.ReadDir() yields unpredictable results & performance. It won't supplant os.File.Readdir or os.File.Readdirnames -- we'll still need both.

We can fix that with a simple adjustment to the API:

func (f *File) ReadDir(k deKind, n int) ([]DirEntry, error) // k specifies cached results

type deKind int
const (
   KindName deKind = iota  // cache Name; possibly IsDir & Info
   KindIsdir               // cache Name & IsDir; possibly Info
   KindInfo                // cache all
)

type DirEntry interface {   // methods do not lstat()
   Name() string
   IsDir() (bool, error)    // return error if dirent type not cached
   Info() (FileInfo, error) // return error if FileInfo not cached
}

I think the argument is about the second call to Info, not the first:

The second .Info() call is a problem, but so is the first if it occurs after the directory (or any parent) is moved or renamed.

. cc @tv42 @israel-lugo

diamondburned commented 3 years ago

A new enum argument is too verbose for something that simple, in my opinion. Furthermore, needing both Readdir and Readdirnames is fine, or perhaps, this could be ReadDirEnt.

Why should ReadDirEnt replace Readdir and Readdirnames anyway? Readdir is for when you need the entire FileInfo, and Readdirnames is for when you only need the names of the directory. I see no reasons for ReadDirEnt to replace both of them other than trying to save a few lines of code in the implementation of those two functions.

I would also like to point out again that having IsDir() return an error is really verbose. I would much prefer an lstat fallback over having to handle errors from IsDir(). Not to mention, having that function call lstat later and then cache only makes it more unpredictable.

networkimprov commented 3 years ago

After f.ReadDir(fs.KindIsdir, n) you don't have to check the .IsDir() error. After f.ReadDir(fs.KindInfo, n) you don't have to check the .Info() or .IsDir() errors.

There is no caching by DirEntry methods. My amendment entails zero lstats by DirEntry methods.

Note: the proposed File.ReadDir would be the only directory API in io/fs.

EDIT: Predictable = produce the same results for a raceless sequence of events on all filesystems. My amendment provides that for all cases. (The following post is incorrect.)

diamondburned commented 3 years ago

My amendment entails zero lstats by DirEntry methods.

IsDir() would need an lstat to get the boolean though. The proposal only ensures predictable behavior for some few use-cases that could be done with the old Readdir.

Now that I think about it, I think trying to guarantee a consistent behavior with this API is almost moot. There isn't a global file system lock to be acquired, so changes could still happen while Readdir is looping. I don't think there's a point in ensuring predictable behavior if we cannot fully get rid of it, since the cost of having a more verbose API is worse, in my opinion.

networkimprov commented 3 years ago

I amended my previous comment to clarify.

mpx commented 3 years ago

Allowing/requiring a DirEntry to perform lstat on demand for IsDir would result in a poor API:

It may fail, and would require require an error return which complicates using the API
IsDir would have a different signature to FileInfo.IsDir (somewhat confusing). It will be desirable for some io/fs implementations to use the same type for DirEntry and FileInfo - we shouldn't prevent that. This will likely be common in many simpler implementations.
Implementations might choose caching the IsDir lstat response (more complicated non-immutable DirEntry), or discarding the response and causing duplicate lstat for callers via Info. This goes against the reason for this interface in the first place.

We should not require complex caching/reuse logic to be implemented for DirEntry. I'd strongly prefer encouraging immutable implementations for DirEntry: they are easier to implement correctly and harder to misuse inadvertantly. ReadDir should return an immutable value with all the details needed to satisfy the interface (IsDir).

How common is Readdirnames in performance code? I suspect there will be less overall performance impact since most platforms provide enough details to support walking a tree within their dirents, and that the current walk performance is a larger problem for most users. If so, keeping IsDir() bool in the DirEntry interface is probably still a better outcome overall.

mpx commented 3 years ago

Alternatively, we could consider making IsDir() bool an optional interface.

This would enable DirEntry implementations to provide an immutable type with all the details available in the dirent, and allowing efficient tree walking where practical. Each platform could provide the most efficient implementation available (names only, names + directory?, names + stat). This improves the case for ReadDir replacing Readdirnames and Readdir everywhere (which is good for the io/fs proposal).

Using an optional method complicates using the interface, but less than providing an error since presence of a method is binary (the information either exists, or it doesn't). Providing an error would require checking for specific errors and handling implementation variation (as well as the disadvantages above).

mpx commented 3 years ago

Separately, I'd prefer to see something like FileType() FileType instead of IsDir() bool to fix the conflation of file types and permissions, and allow more options when walking DirEntry. File types are often available in dirents (unlike permissions). File types are an enumeration but permissions are a bit field - FileMode merges them and makes it awkward to provide access to file types via DirEntry.

type DirEntry interface {
   Name() string
   FileType() FileType   // Optional?
   Info() (FileInfo, error)
}

I understand there are reasons for sticking with IsDir() bool since it's a common pre-existing signature. I'd take IsDir() bool if it was the only acceptable choice for a new interface that supports efficiently processing dirents (optional or not).

Using optional interfaces might make it more practical to introduce later (although a second change to the interface is probably unlikely).

tv42 commented 3 years ago

Alternatively, we could consider making IsDir() bool an optional interface.

The best current practice of optional interfaces would mean that optional IsDir would have to return (bool, error), just to enable wrapping.

rsc commented 3 years ago

We're not putting flags into the argument to ReadDir. That's too much API complexity.

Let's call the systems that don't return IsDir bits in the readdir system call "old". For concreteness, those systems are AIX, Solaris, and certain old file systems on other systems (unclear which, but not very many).

It seems like the options are:

Make IsDir lazy on old systems but return false if it runs into a problem. This is a lie.
Make IsDir lazy on old systems but returning (bool, error) (all systems). Then it ends up different from FileInfo and complicates call sites.
Make ReadDir prefetch the IsDir results on old systems. This is a little slower (like Readdir today) but doesn't affect most systems, avoids an API difference with FileInfo, and avoids the need to worry about synchronization between lazy IsDir and lazy Info.

Given how rare the problem seems to be, it seems like we should probably take option (3), keeping these old systems from causing API complexity on all the others.

earthboundkid commented 3 years ago

If this is added, will package io/fs get a fast Walk implementation that uses ReadDir? Will filepath.Walk be deprecated? It would be exciting if this could close #16399 too.

rsc commented 3 years ago

@carlmjohnson Yes, one of the goals here is to establish an API for io/fs that enables a standard fast walk API, probably similar to github.com/kr/fs Walker, and close #16399.

jimmyfrasche commented 3 years ago

Would old fses cache the Info like on Windows?

Should Info be an optional method with an Info function that grabs the provided info or falls back to making a Stat?

rsc commented 3 years ago

Would old fses cache the Info like on Windows?

Yes. For clarity, there are four cases:

Plan 9 and Windows return full metadata during directory read. On those systems, ReadDir would save that info. IsDir and Info would both use that saved info.
Most Unix systems using most file systems return name, inode number, and inode type bits during directory read. On those systems, ReadDir would save that (more limited) info. IsDir would use the type bits to answer. Info would call Lstat each time it is called. If the Lstat fails, it returns an error. If the inode number or type bits don't match, it also returns an error (ErrNotExist probably).
Old Unix systems (AIX, Solaris) return no type bits during directory read. On those systems, ReadDir would call Lstat for each directory entry and save that info, as Readdir does on all Unix systems. IsDir and Info would both use that saved info.
Newer Unix systems can return no type bits (DT_UNKNOWN) during directory read for certain old file systems. On those systems, ReadDir would recognize DT_UNKNOWN and fall back to the "old Unix system" behavior.

Code using ReadDir and Info must be written so that it does not care whether the returned information is from the time of the ReadDir or the time of the Info call, in order to avoid optimizing for and overfitting to one operating system over another. A typical ReadDir+loop over entries inspecting them immediately fits this requirement - if there are racing changes to the underlying directory, it is already indeterminate whether the change would be observed or not, regardless of whether the observation is made during ReadDir or Info.

rsc commented 3 years ago

Should Info be an optional method with an Info function that grabs the provided info or falls back to making a Stat?

I'm sorry, I didn't understand that, but it seems important to me that Info is a required method. Otherwise portable code that just wants to get some extra info is greatly complicated. That's a use case we do want to support well.

jimmyfrasche commented 3 years ago

My thought was an optional Info() FileInfo method only on entries with cached data and an Info(DirEntry) (FileInfo, error) that returned the cached data or made the stat, but it would probably need to peek into the DirEntry's implementation to figure out what to stat without more API so retracted.

nightlyone commented 3 years ago

Could you add IsRegular() too? So that one can do the special case handling only on !(IsRegular() || IsDir()) then? Otherwise it is not decidable whether one needs to descend or not given IsDir() == false only. I am not sure how cheap IsRegular() is to determine on Windows, but maybe @alexbrainman could chime in there.

benhoyt commented 3 years ago

@nightlyone, do you mean in the case of a symlink (pointing to a directory)? It strikes me that usually you want to walk a directory tree not following symlinks (I could be wrong here?), but in the cases where you do want to walk a tree following symlinks, just IsDir isn't enough. Because if IsDir returns false, it might be a symlink (pointing to a directory), and you should recurse in. But you don't know that unless you call Lstat on every non-directory entry ... which is a waste.

So I think we need IsRegular too (or IsSymlink, but there's a prececent for IsRegular in FileMode). Otherwise anyone who wants to walk following symlinks (on Linux) is back to basically an lstat syscall per file, defeating the purpose of this new API.

Alternatively, instead of IsDir we could have a new FileType method as @mpx suggested above (though I'd suggest the signature EntryType() EntryType) which would return EntryDir, EntryFile, EntrySymlink, or EntryOther. And IsDir() would becomes e.EntryType() == os.EntryDir (or we could keep IsDir in addition).

cespare commented 3 years ago

FWIW I took a look at all the filepath.Walk calls in the Go code I have handy (a large private Go codebase plus a few open source projects) and classified them based on how they used the os.FileInfo argument:

what is used	count
`info` is unused	1
only `info.Name()`	2
only `info.IsDir()`	13
`info.Name()` and `info.IsDir()`	1
only `info.Mode().IsRegular()`	1
`info.IsDir()` plus `info.Mode()&os.ModeSymlink`	1
`info.IsDir()` plus `info.Mode()` (using the entire `FileMode` as, for example, an arg to `os.Chmod`)	1
multiple methods (`ModTime`, `Size`, etc)	3

diamondburned commented 3 years ago

Given that the file mode is accessed several times for purposes other than IsDir(), maybe DirEntry should return an EntryType as @benhoyt suggested?

networkimprov commented 3 years ago

I'm sorry, I cannot support an API that yields unpredictable results[1] for DirEntry.Info(). It's contrary to the effect of os.Readdir, and users would miss the fine distinction. The resulting bugs would be silent and hard to find. This would affect programs ported between Windows and Linux or MacOS, a common scenario.

Any lstat() after ReadDir() should be explicit. @jimmyfrasche's suggestion in https://github.com/golang/go/issues/41467#issuecomment-697924028 and mine in https://github.com/golang/go/issues/41467#issuecomment-696891178 would fix it. There are other options, as well.

Re lstats in order to fulfill .IsDir() on "old unix" filesystems, it's odd to deny those systems an io/fs API which accomplishes os.Readdirnames. But at least the results are predictable.

I agree that .EntryType() is preferable to .IsDir(). But let's provide both.

, [1] Predictable = the same results for a raceless sequence of events on all filesystems.

diamondburned commented 3 years ago

I'm sorry, I cannot support an API that yields unpredictable results[1] for DirEntry.Info(). It's contrary to the effect of os.Readdir, and users would miss the fine distinction. The resulting bugs would be silent and hard to find. This would affect programs ported between Windows and Linux or MacOS, a common scenario. Any lstat() after ReadDir() should be explicit. @jimmyfrasche's suggestion in #41467 (comment) and mine in #41467 (comment) would fix it. There are other options, as well

Why not use Readdir then? I don't think either Readdir or Readdirnames are going away, so if you want completely "predictable" behaviors (which aren't completely predictable), why not just use the former 2?

I agree that .EntryType() is preferable to .IsDir(). But let's provide both.

Providing both would be pretty redundant. I think it should be one or the other, and if it's ever EntryType, then one could perhaps do ent.EntryType().IsDir().

ianlancetaylor commented 3 years ago

@networkimprov On Unix there is inevitably a separation between the time that the directory is read and the time that the program calls lstat on the directory entry. Further, there is inevitably a separation between the time the program calls lstat and the time that the program uses the returned FileInfo. Races are always possible, and there is no way to avoid that with current Unix APIs. So it's not clear to me what kind of cases can lead to real problems. Can you sketch out an example? Thanks.

networkimprov commented 3 years ago

Ian, I did say "raceless sequence of events". For example, I reuse directory listings:

f, err := os.Open("mydir")
if err ...
defer f.Close()
dir, err := f.ReadDir(0)
if err ...

for _, de := range dir {
   info, err := de.Info()                        // not required for failure below
   if err ...
   if info.ModTime().After(timeMin) {
      err = os.Rename("mydir/"+ de.Name(), "later/"+ de.Name())
      if err ...
   }
}
...
for _, de := range dir {
   if strings.HasSuffix(de.Name(), ".pdf") {
      info, err := de.Info()                    // intermittent failure on Linux & MacOS
      if err ...
      if info.ModTime().After(timeMin) { continue }
      // further work
   }
}

That at least results in an error. The situation is worse if you edit a file, changing its size & modtime.

@diamondburned pls read https://go.googlesource.com/proposal/+/master/design/draft-iofs.md

diamondburned commented 3 years ago

@networkimprov This issue only concerns itself with os.File though (and so is your example code). If you want consistent results on future reads for whatever API that will be added to io/fs, just call Info() once on all of the entries to guarantee that they're saved.

ianlancetaylor commented 3 years ago

@networkimprov Thanks. If I understand correctly, your concern is that if you call the Info method multiple times, it will not always return the same result. I think we can address that through documentation. After all, it's already true of os.Lstat.

mpx commented 3 years ago

@diamondburned The proposal is for Info to return the FileInfo from the time ReadDir was called, or the time of current call to Info. There is no intention for Info implementations to cache their result - and we shouldn't encourage it either for the reasons I outline above.

Similar to calling Stat, any code that requires a fixed FileInfo to operate correctly needs to store and reuse it themselves. Each call to Stat or Info potentially results in a different value, there is no sensible way around it.

networkimprov commented 3 years ago

Ian, the problem is that .Info() always returns the initial result on Windows & "old unix", but always reads from storage ~cannot have an initial result~ on Linux & MacOS. I do not believe documentation is an acceptable remedy.

os.Lstat() is an explicit request to read storage, on all platforms. Its result is predictable in raceless code.

@diamondburned you have mistakenly contradicted me several times; kindly assume that my facts are correct.

tv42 commented 3 years ago

Ian, the problem is that .Info() always returns the initial result on Windows & "old unix", but cannot have an initial result on Linux & MacOS. I do not believe documentation is an acceptable remedy.

What's an "initial result"? getdents+lstat is exactly as "non-initial" as getdents+sleep+lstat, the only difference is the size of the race window.

Your two loops assume they see the same os.FileInfo. If ReadDir Info is documented to not have that guarantee, there's no problem here, and you just collect a []os.FileInfo when you require a snapshot.

mpx commented 3 years ago

Expanding on the desire for a FileType... (or EntryType -- I'm less concerned with the naming and more interested in the concept).

When walking a tree there are 3 file types that are generally useful to recognise:

directories: potentially recurse
regular files: potentially process
symbolic links: potentially stat to make a decision (follow links? directory or file?)

Custom handling for other irregular/special files is less common (eg, fifos, devices, unix sockets,..).

IsDir only provides the first which helps, but isn't enough to reduce the need to call lstat for the other 2 cases (regular files, symlinks).

A single method could provide these details avoiding the need for another lstat, or separate IsDir, IsRegular, IsSymlink methods. Typically file types either exist in dirents, or they don't. Hence I don't see any reason why they shouldn't be made available via DirEntry (vs being limited to IsDir).

There are probably 2 options for providing the FileType enumeration:

Reuse os.FileMode but explicitly document that permissions will never be set. Pro: doesn't require another type/constants.
Create a new enumeration type which avoids the bitfield. Pro: easier to use than the bitfield, less confusion over whether permissions exist or not.

I prefer option 2, but option 1 would still solve the problem.

..and while limited, IsDir is an improvement over the status quo.

earthboundkid commented 3 years ago

FWIW, filepath.Walk does not follow symbolic links. My assumption is that fs.Walk will not either. Following symlinks without getting lost requires some kind of Tarjan algorithm for detecting cycles, and I don't think we want to burden regular fs.Walk with that, especially since part of the goal is resolving #16399 by making file walking faster, not slower. As long as IsDir() is false for symlinks, fs.Walk should skip past the symlink, and everything is fine, no?

golang / go

os: add ReadDir method for lightweight directory reading #41467