Closed rsc closed 3 years ago
I like this a lot (it's similar to Python's os.scandir
, but simpler). A few questions:
1) On Unix systems, the d_type
field can be DT_UNKNOWN
if the file system doesn't support it. Docs says "Currently, only some file systems (among them: Btrfs, ext2, ext3, and ext4) have full support for returning the file type in d_type. All applications must properly handle a return of DT_UNKNOWN." So we need to support this case. Either ReadDir
could call Lstat at the time of the read if it gets any d_type==DT_UNKNOWN
values, or we need to change the signature to IsDir() (bool, error)
to reflect the fact that it too may call Lstat realtime -- I vote keeping the signature as Russ has it and having ReadDir call Lstat in that case (should be rare, so falling back to slower but correct seems fine).
2) The name ReadDir
seems to similar to the existing Readdir
-- differentiating just on the case of one letter seems like asking for trouble. What about ReadEntries
?
3) Presumably IsDir
and Info
return info about the entry itself, without following symlinks? In other words, it's like Lstat and not Stat? Does this need mentioning explicitly? Or maybe it's clear from "the entry"?
4) What does IsDir
return in the case of a symbolic link (to a directory)? Presumably false given the above.
This looks great.
I felt that there may be some advantage to making DirEntry
a subset of FileInfo
interface, hence I didn't include the Info
method and started with the smaller interface. However, I can't convince myself there there is any tangible disadvantage to adding Info
. There certainly needs to be a way of "upgrading" a DirEntry
into a FileInfo
.
One comment from #41188 regarding this suggestion was that d_type
is not always available for IsDir
. In those cases extra work needs to be done to provide that information. Name
and IsDir
are the minimum required to support implementing walk. Upgrading to FileInfo
would be free on those systems, so it's no more expensive to do it this way.
It might be good to rename ReadDir
due to the unfortunate similarity with Readdir
. I understand it works in practice, but it might cause some confusion ("ReadDir" with the capital "D" or lowercase "d"?).
From my point of view, I think the ReadDir
effectively replaces the need for Readdir
and Readdirnames
. If it existed from Go1 I wouldn't see the need for adding the other calls. Obviously the existing methods shouldn't be removed, but they may not be needed in io/fs
.
I have a number tools that will benefit significantly from this proposal.
@benhoyt IsDir
should return bool
. As I mentioned on #41188 there is an advantage to providing interfaces that are guaranteed to work. Systems that cannot provide "IsDir" will need to do extra work up front. However, it will be effectively free for them to upgrade to FileInfo, so it should be no slower to "walk" a tree.
Good point regarding IsDir and symlinks. I think it needs to return false. Following a symlink will necessarily require another Stat to determine whether the target is a directory. This is probably worth documenting.
Presumably IsDir and Info return info about the entry itself, without following symlinks? In other words, it's like Lstat and not Stat? Does this need mentioning explicitly? Or maybe it's clear from "the entry"?
Regardless of which way this ends up working (Lstat vs Stat), it should be mentioned explicitly in the comments/docs for the method, as what is "obvious" to one person will not be to another.
I felt that there may be some advantage to making
DirEntry
a subset ofFileInfo
interface, hence I didn't include theInfo
method and started with the smaller interface. However, I can't convince myself there there is any tangible disadvantage to addingInfo
. There certainly needs to be a way of "upgrading" aDirEntry
into aFileInfo
.
I personally don't see a point in this. Implementations could simply just return an internal FileInfo
or itself when Info()
is called.
@benhoyt, added answers above but (1) I agree with you; (2) ReadEntries is the wrong long-term name, and in the long term Readdir will just be deprecated (but not removed) and fade away; (3) yes, Lstat; (4) yes, false.
Russ, glad to see a new API proposed, thanks. Some Q's
This entails a per-item lstat()
when the type is absent in native dirent. You don't want that if you don't need to check DirEntry.IsDir()
for every result. Can it be optional?
Does DirEntry.Info()
cache its result if it had to lstat()
, or if ReadDir()
had to?
Could we flush the FileInfo
cache, so a subsequent .Info()
reloads it?
Wouldn't the different behavior of DirEntry.Info()
on unix & Windows cause trouble in cross-platform apps? Such apps would call os.File.Readdir()
when they need FileInfo
for all results, and xfs.File.ReadDir()
otherwise. The mix of those two could be confusing.
There is no way to get the Windows fileId (analagous to inode) via FileInfo
. Could a way be added to DirEntry
?
Unix dirents include the inode -- needed for tree replication. Could DirEntry
provide that?
For the latter two, I think you'd return a type FileId interface { ... }
EDIT: Background... I'm building a desktop app for Windows, MacOS, and Linux -- and a server app for Linux -- that make heavy use of the filesystem. Go has thrown me a fair number of curve balls, and I expect more :-/
This entails a per-item
lstat()
when the type is absent in native dirent. You don't want that if you don't need to checkDirEntry.IsDir()
for every result. Can it be optional?
As far as I know, no, this cannot be optional. Perhaps it could be if a more optimized Readdirnames
could be made, but personally, I think that should stay there.
Does
DirEntry.Info()
cache its result if it had tolstat()
, or ifReadDir()
had to?Could we flush the
FileInfo
cache, so a subsequent.Info()
reloads it?Wouldn't the different behavior of
DirEntry.Info()
on unix & Windows cause trouble in cross-platform apps? Such apps would callos.File.Readdir()
when they needFileInfo
for all results, andxfs.File.ReadDir()
otherwise. The mix of those two could be confusing.
If we're going with my idea of returning an internal FileInfo
, then I can see the behavior differing a little bit: on Windows, a FileInfo
can be returned, but in Linux, a new FileInfo
would be obtained on every Info()
call.
In my honest opinion, this behavior is fine. In fact, it would be preferable if FileInfo
of the Linux implementation is cached. The reason being that if the caller wants the newest information, then they should call Lstat
or Stat
explicitly.
There is no way to get the Windows fileId (analagous to inode) via
FileInfo
. Could a way be added toDirEntry
?Unix dirents include the inode -- needed for tree replication. Could
DirEntry
provide that?
Worst case scenario, the DirEnt
interface could perhaps be overloaded to accommodate for the above flaws, but I feel like that would be overdesigning and overcomplicating it.
Could we flush the FileInfo cache, so a subsequent .Info() reloads it?
Any caller needing to guarantee a fresh stat should call Lstat
/Stat
themselves.
There is no way to get the Windows fileId (analagous to inode) via FileInfo. Could a way be added to DirEntry?
I've wanted to retrieve unique file identifiers for both Linux (inode) and Windows (fileId) to improve performance for several applications, so I'd appreciate a solution too. However, I think DirEntry
should be kept simple. Most callers don't need a unique identifier, and it would unnecessarily complicate many implementations which don't use/need it. I think this could be solved via a separate proposal after this "DirEntry" proposal is settled. Eg, perhaps it could be a provided via an optional interface FileID() uint64
.
I agree with what @diamondburned and @mpx said but just to reply directly as well.
This entails a per-item lstat() when the type is absent in native dirent. You don't want that if you don't need to check DirEntry.IsDir() for every result. Can it be optional?
In practice, when is that type absent? I'm not too worried if, say, VFAT file systems are slower to access.
Does DirEntry.Info() cache its result if it had to lstat(), or if ReadDir() had to?
The doc says the Info is either from the time of the ReadDir or the time of the Info call. So a stat during ReadDir can be cached, but not a stat during an earlier Info.
Could we flush the FileInfo cache, so a subsequent .Info() reloads it?
Adding cache state & manipulation to the API is needlessly complex. Lstat still exists.
Wouldn't the different behavior of DirEntry.Info() on unix & Windows cause trouble in cross-platform apps?
There is no magic wand here. Unix and Windows are different. They will always be different.
The path separator alone causes trouble in cross-platform apps. The answer is to use APIs like filepath.Clean and filepath.Join as appropriate. If you don't do that, you have trouble. It's not our job to make trouble impossible, only avoidable.
The docs on ReadDir are very clear about what you can rely on and what you cannot. If you rely on more than that, again, you will have trouble, same as hard-coding use of / or \ in your program.
Such apps would call os.File.Readdir() when they need FileInfo for all results, and xfs.File.ReadDir() otherwise. The mix of those two could be confusing.
Pretty much all apps should probably move to calling ReadDir or, if it matters, Lstat/Stat directly.
What's confusing is having two different APIs, but we clearly need a new one, and can't remove the old one. So be it.
There is no way to get the Windows fileId (analagous to inode) via FileInfo. Could a way be added to DirEntry? Unix dirents include the inode -- needed for tree replication. Could DirEntry provide that? For the latter two, I think you'd return a type FileId interface { ... }
[Note that Go would spell it FileID not FileId (the file does not have a mind).]
The concept seems too special-purpose for a general interface, and a bit difficult to use correctly. The problem is that Windows file IDs and Unix inode numbers are not really identifiers: they only identify a file within a particular file system. Another file system can have a file with the same file ID/inode number. So to use them properly you need to combine them with some kind of identifier for the file system itself (like fsid_t or dev_t on Unix). It gets messy fast.
Also, if you are writing "tree replication", then you are already stat'ing all the files to get the other metadata, and you're already writing very OS-specific code to preserve all the OS-specific attributes. That same code can easily grab the info you need as far as inode number and file system identifier.
It's not our job to union together all the possible APIs on all the possible systems. Our job is to find a simple API that is enough for the vast majority of Go programs, with an option for the rest to get at what they need (FileInfo.Sys in this case).
What about Readdirnames
?
If we expect all users of Readdirnames
to switch to ReadDir
over time, then I'm not quite convinced that IsDir
should not return (bool, error)
. Neither AIX nor Solaris record the type of the file in the directory entry. Even on GNU/Linux there are various obscure file systems that do not record the dirent type.
The downside of IsDir
returning (bool, error)
seems fairly low to me.
Ian, wouldn't the error you're suggesting be returned by ReadDir()
when it tries lstat()
after seeing entries without types?
Or are you suggesting that ReadDir()
should never lstat()
?
I'm suggesting that ReadDir
should never call lstat
, just as the current Readdirnames
never calls lstat
.
Using the same name with different returns would mean a single type couldn't be both a DirEntry and a FileInfo
If we expect all users of Readdirnames to switch to ReadDir over time, then I'm not quite convinced that IsDir should not return (bool, error).
I don't think this should be the case, which is why I said "that should stay there." I think Readdirnames
should use its own dirEnt
s to not require calling lstat
or stat
, which should allow people who don't need IsDir
to skip having to call a few extra stat
s.
I also think that IsDir() (bool, error)
would be a bit too verbose.
[ reposted after edits for clarity ]
It will cause confusion and bugs if .Info()
& .IsDir()
return new data every time on some filesystems but cached data on others. We performed a read, so we'd expect them to return whatever was retrieved by ReadDir()
(possibly nothing). To get new data there's Lstat()
as you said.
You could attain either best performance or cross-filesystem consistency if you could ask ReadDir()
to:
.Name()
-- never lstat
(Ian's suggestion).Name()
, .IsDir()
-- possibly lstat
(Russ' suggestion).Name()
, .IsDir()
, .Info()
-- always lstat
except on Windows (os.File.Readdir)Let's not sacrifice sanity for the sake of simplicity.
if you are writing "tree replication", then you are already stat'ing all the files to get the other metadata
Actually, you don't need to lstat()
if the inode matches one you've already seen.
an option for the rest to get at what they need (FileInfo.Sys in this case).
To clarify, there is no way to get the Windows fileID from FileInfo.Sys
. You have to patch the stdlib at present; see
https://groups.google.com/g/golang-dev/c/raE01Fa2Kmo/m/3Ah9JslnBgAJ
BTW Windows accepts /
path separators; you only need Windows-native syntax in rare cases :-)
It will cause confusion and bugs if .Info() & .IsDir() return new data every time on some filesystems but cached data on others.
Sorry, but that's the design constraint here: it must be possible to either return info learned during ReadDir or info learned during Info. Otherwise you are overfitting to either Unix or non-Unix and penalizing the other.
Neither AIX nor Solaris record the type of the file in the directory entry.
Thanks for pointing this out. That's unfortunate but good to know.
Sorry, but that's the design constraint here: it must be possible to either return info learned during ReadDir or info learned during Info. Otherwise you are overfitting to either Unix or non-Unix and penalizing the other.
I think the argument is about the second call to Info, not the first:
de.Info() // maybe cached depending on os/fs
de.Info() // always cached in all cases for this and all subsequent calls
os.File.Readdirnames()
yields predictable results & performance.
os.File.Readdir()
yields predictable results, but unpredictable performance.
The proposed File.ReadDir()
yields unpredictable results & performance. It won't supplant os.File.Readdir
or os.File.Readdirnames
-- we'll still need both.
We can fix that with a simple adjustment to the API:
func (f *File) ReadDir(k deKind, n int) ([]DirEntry, error) // k specifies cached results
type deKind int
const (
KindName deKind = iota // cache Name; possibly IsDir & Info
KindIsdir // cache Name & IsDir; possibly Info
KindInfo // cache all
)
type DirEntry interface { // methods do not lstat()
Name() string
IsDir() (bool, error) // return error if dirent type not cached
Info() (FileInfo, error) // return error if FileInfo not cached
}
I think the argument is about the second call to Info, not the first:
The second .Info()
call is a problem, but so is the first if it occurs after the directory (or any parent) is moved or renamed.
. cc @tv42 @israel-lugo
A new enum argument is too verbose for something that simple, in my opinion. Furthermore, needing both Readdir
and Readdirnames
is fine, or perhaps, this could be ReadDirEnt
.
Why should ReadDirEnt
replace Readdir
and Readdirnames
anyway? Readdir
is for when you need the entire FileInfo
, and Readdirnames
is for when you only need the names of the directory. I see no reasons for ReadDirEnt
to replace both of them other than trying to save a few lines of code in the implementation of those two functions.
I would also like to point out again that having IsDir()
return an error is really verbose. I would much prefer an lstat
fallback over having to handle errors from IsDir()
. Not to mention, having that function call lstat
later and then cache only makes it more unpredictable.
After f.ReadDir(fs.KindIsdir, n)
you don't have to check the .IsDir()
error.
After f.ReadDir(fs.KindInfo, n)
you don't have to check the .Info()
or .IsDir()
errors.
There is no caching by DirEntry
methods. My amendment entails zero lstats by DirEntry
methods.
Note: the proposed File.ReadDir
would be the only directory API in io/fs.
EDIT: Predictable = produce the same results for a raceless sequence of events on all filesystems. My amendment provides that for all cases. (The following post is incorrect.)
My amendment entails zero lstats by
DirEntry
methods.
IsDir()
would need an lstat
to get the boolean though. The proposal only ensures predictable behavior for some few use-cases that could be done with the old Readdir
.
Now that I think about it, I think trying to guarantee a consistent behavior with this API is almost moot. There isn't a global file system lock to be acquired, so changes could still happen while Readdir
is looping. I don't think there's a point in ensuring predictable behavior if we cannot fully get rid of it, since the cost of having a more verbose API is worse, in my opinion.
I amended my previous comment to clarify.
Allowing/requiring a DirEntry
to perform lstat
on demand for IsDir
would result in a poor API:
error
return which complicates using the APIIsDir
would have a different signature to FileInfo.IsDir
(somewhat confusing). It will be desirable for some io/fs
implementations to use the same type for DirEntry
and FileInfo
- we shouldn't prevent that. This will likely be common in many simpler implementations.IsDir
lstat
response (more complicated non-immutable DirEntry
), or discarding the response and causing duplicate lstat
for callers via Info
. This goes against the reason for this interface in the first place.We should not require complex caching/reuse logic to be implemented for DirEntry
. I'd strongly prefer encouraging immutable implementations for DirEntry
: they are easier to implement correctly and harder to misuse inadvertantly. ReadDir
should return an immutable value with all the details needed to satisfy the interface (IsDir
).
How common is Readdirnames in performance code? I suspect there will be less overall performance impact since most platforms provide enough details to support walking a tree within their dirents
, and that the current walk performance is a larger problem for most users. If so, keeping IsDir() bool
in the DirEntry
interface is probably still a better outcome overall.
Alternatively, we could consider making IsDir() bool
an optional interface.
This would enable DirEntry
implementations to provide an immutable type with all the details available in the dirent, and allowing efficient tree walking where practical. Each platform could provide the most efficient implementation available (names only, names + directory?, names + stat). This improves the case for ReadDir
replacing Readdirnames
and Readdir
everywhere (which is good for the io/fs
proposal).
Using an optional method complicates using the interface, but less than providing an error
since presence of a method is binary (the information either exists, or it doesn't). Providing an error
would require checking for specific errors and handling implementation variation (as well as the disadvantages above).
Separately, I'd prefer to see something like FileType() FileType
instead of IsDir() bool
to fix the conflation of file types and permissions, and allow more options when walking DirEntry
. File types are often available in dirents (unlike permissions). File types are an enumeration but permissions are a bit field - FileMode
merges them and makes it awkward to provide access to file types via DirEntry
.
type DirEntry interface {
Name() string
FileType() FileType // Optional?
Info() (FileInfo, error)
}
I understand there are reasons for sticking with IsDir() bool
since it's a common pre-existing signature. I'd take IsDir() bool
if it was the only acceptable choice for a new interface that supports efficiently processing dirents (optional or not).
Using optional interfaces might make it more practical to introduce later (although a second change to the interface is probably unlikely).
Alternatively, we could consider making IsDir() bool an optional interface.
The best current practice of optional interfaces would mean that optional IsDir would have to return (bool, error)
, just to enable wrapping.
We're not putting flags into the argument to ReadDir. That's too much API complexity.
Let's call the systems that don't return IsDir bits in the readdir system call "old". For concreteness, those systems are AIX, Solaris, and certain old file systems on other systems (unclear which, but not very many).
It seems like the options are:
Given how rare the problem seems to be, it seems like we should probably take option (3), keeping these old systems from causing API complexity on all the others.
If this is added, will package io/fs get a fast Walk implementation that uses ReadDir? Will filepath.Walk be deprecated? It would be exciting if this could close #16399 too.
@carlmjohnson Yes, one of the goals here is to establish an API for io/fs that enables a standard fast walk API, probably similar to github.com/kr/fs Walker, and close #16399.
Would old fses cache the Info like on Windows?
Should Info be an optional method with an Info function that grabs the provided info or falls back to making a Stat?
Would old fses cache the Info like on Windows?
Yes. For clarity, there are four cases:
Plan 9 and Windows return full metadata during directory read. On those systems, ReadDir would save that info. IsDir and Info would both use that saved info.
Most Unix systems using most file systems return name, inode number, and inode type bits during directory read. On those systems, ReadDir would save that (more limited) info. IsDir would use the type bits to answer. Info would call Lstat each time it is called. If the Lstat fails, it returns an error. If the inode number or type bits don't match, it also returns an error (ErrNotExist probably).
Old Unix systems (AIX, Solaris) return no type bits during directory read. On those systems, ReadDir would call Lstat for each directory entry and save that info, as Readdir does on all Unix systems. IsDir and Info would both use that saved info.
Newer Unix systems can return no type bits (DT_UNKNOWN) during directory read for certain old file systems. On those systems, ReadDir would recognize DT_UNKNOWN and fall back to the "old Unix system" behavior.
Code using ReadDir and Info must be written so that it does not care whether the returned information is from the time of the ReadDir or the time of the Info call, in order to avoid optimizing for and overfitting to one operating system over another. A typical ReadDir+loop over entries inspecting them immediately fits this requirement - if there are racing changes to the underlying directory, it is already indeterminate whether the change would be observed or not, regardless of whether the observation is made during ReadDir or Info.
Should Info be an optional method with an Info function that grabs the provided info or falls back to making a Stat?
I'm sorry, I didn't understand that, but it seems important to me that Info is a required method. Otherwise portable code that just wants to get some extra info is greatly complicated. That's a use case we do want to support well.
My thought was an optional Info() FileInfo
method only on entries with cached data and an Info(DirEntry) (FileInfo, error)
that returned the cached data or made the stat, but it would probably need to peek into the DirEntry's implementation to figure out what to stat without more API so retracted.
Could you add IsRegular() too? So that one can do the special case handling only on !(IsRegular() || IsDir())
then? Otherwise it is not decidable whether one needs to descend or not given IsDir() == false
only. I am not sure how cheap IsRegular()
is to determine on Windows, but maybe @alexbrainman could chime in there.
@nightlyone, do you mean in the case of a symlink (pointing to a directory)? It strikes me that usually you want to walk a directory tree not following symlinks (I could be wrong here?), but in the cases where you do want to walk a tree following symlinks, just IsDir
isn't enough. Because if IsDir
returns false, it might be a symlink (pointing to a directory), and you should recurse in. But you don't know that unless you call Lstat
on every non-directory entry ... which is a waste.
So I think we need IsRegular
too (or IsSymlink
, but there's a prececent for IsRegular
in FileMode
). Otherwise anyone who wants to walk following symlinks (on Linux) is back to basically an lstat syscall per file, defeating the purpose of this new API.
Alternatively, instead of IsDir
we could have a new FileType
method as @mpx suggested above (though I'd suggest the signature EntryType() EntryType
) which would return EntryDir
, EntryFile
, EntrySymlink
, or EntryOther
. And IsDir()
would becomes e.EntryType() == os.EntryDir
(or we could keep IsDir
in addition).
FWIW I took a look at all the filepath.Walk
calls in the Go code I have handy (a large private Go codebase plus a few open source projects) and classified them based on how they used the os.FileInfo
argument:
what is used | count |
---|---|
info is unused |
1 |
only info.Name() |
2 |
only info.IsDir() |
13 |
info.Name() and info.IsDir() |
1 |
only info.Mode().IsRegular() |
1 |
info.IsDir() plus info.Mode()&os.ModeSymlink |
1 |
info.IsDir() plus info.Mode() (using the entire FileMode as, for example, an arg to os.Chmod ) |
1 |
multiple methods (ModTime , Size , etc) |
3 |
Given that the file mode is accessed several times for purposes other than IsDir()
, maybe DirEntry
should return an EntryType
as @benhoyt suggested?
I'm sorry, I cannot support an API that yields unpredictable results[1] for DirEntry.Info()
. It's contrary to the effect of os.Readdir
, and users would miss the fine distinction. The resulting bugs would be silent and hard to find. This would affect programs ported between Windows and Linux or MacOS, a common scenario.
Any lstat()
after ReadDir()
should be explicit. @jimmyfrasche's suggestion in https://github.com/golang/go/issues/41467#issuecomment-697924028 and mine in https://github.com/golang/go/issues/41467#issuecomment-696891178 would fix it. There are other options, as well.
Re lstats in order to fulfill .IsDir()
on "old unix" filesystems, it's odd to deny those systems an io/fs API which accomplishes os.Readdirnames
. But at least the results are predictable.
I agree that .EntryType()
is preferable to .IsDir()
. But let's provide both.
, [1] Predictable = the same results for a raceless sequence of events on all filesystems.
I'm sorry, I cannot support an API that yields unpredictable results[1] for
DirEntry.Info()
. It's contrary to the effect ofos.Readdir
, and users would miss the fine distinction. The resulting bugs would be silent and hard to find. This would affect programs ported between Windows and Linux or MacOS, a common scenario. Anylstat()
afterReadDir()
should be explicit. @jimmyfrasche's suggestion in #41467 (comment) and mine in #41467 (comment) would fix it. There are other options, as well
Why not use Readdir
then? I don't think either Readdir
or Readdirnames
are going away, so if you want completely "predictable" behaviors (which aren't completely predictable), why not just use the former 2?
I agree that
.EntryType()
is preferable to.IsDir()
. But let's provide both.
Providing both would be pretty redundant. I think it should be one or the other, and if it's ever EntryType
, then one could perhaps do ent.EntryType().IsDir()
.
@networkimprov On Unix there is inevitably a separation between the time that the directory is read and the time that the program calls lstat
on the directory entry. Further, there is inevitably a separation between the time the program calls lstat
and the time that the program uses the returned FileInfo
. Races are always possible, and there is no way to avoid that with current Unix APIs. So it's not clear to me what kind of cases can lead to real problems. Can you sketch out an example? Thanks.
Ian, I did say "raceless sequence of events". For example, I reuse directory listings:
f, err := os.Open("mydir")
if err ...
defer f.Close()
dir, err := f.ReadDir(0)
if err ...
for _, de := range dir {
info, err := de.Info() // not required for failure below
if err ...
if info.ModTime().After(timeMin) {
err = os.Rename("mydir/"+ de.Name(), "later/"+ de.Name())
if err ...
}
}
...
for _, de := range dir {
if strings.HasSuffix(de.Name(), ".pdf") {
info, err := de.Info() // intermittent failure on Linux & MacOS
if err ...
if info.ModTime().After(timeMin) { continue }
// further work
}
}
That at least results in an error. The situation is worse if you edit a file, changing its size & modtime.
@diamondburned pls read https://go.googlesource.com/proposal/+/master/design/draft-iofs.md
@networkimprov This issue only concerns itself with os.File
though (and so is your example code). If you want consistent results on future reads for whatever API that will be added to io/fs
, just call Info()
once on all of the entries to guarantee that they're saved.
@networkimprov Thanks. If I understand correctly, your concern is that if you call the Info
method multiple times, it will not always return the same result. I think we can address that through documentation. After all, it's already true of os.Lstat
.
@diamondburned The proposal is for Info
to return the FileInfo
from the time ReadDir
was called, or the time of current call to Info
. There is no intention for Info
implementations to cache their result - and we shouldn't encourage it either for the reasons I outline above.
Similar to calling Stat
, any code that requires a fixed FileInfo
to operate correctly needs to store and reuse it themselves. Each call to Stat
or Info
potentially results in a different value, there is no sensible way around it.
Ian, the problem is that .Info()
always returns the initial result on Windows & "old unix", but always reads from storage ~cannot have an initial result~ on Linux & MacOS. I do not believe documentation is an acceptable remedy.
os.Lstat()
is an explicit request to read storage, on all platforms. Its result is predictable in raceless code.
@diamondburned you have mistakenly contradicted me several times; kindly assume that my facts are correct.
Ian, the problem is that .Info() always returns the initial result on Windows & "old unix", but cannot have an initial result on Linux & MacOS. I do not believe documentation is an acceptable remedy.
What's an "initial result"? getdents+lstat is exactly as "non-initial" as getdents+sleep+lstat, the only difference is the size of the race window.
Your two loops assume they see the same os.FileInfo. If ReadDir Info is documented to not have that guarantee, there's no problem here, and you just collect a []os.FileInfo
when you require a snapshot.
Expanding on the desire for a FileType
... (or EntryType
-- I'm less concerned with the naming and more interested in the concept).
When walking a tree there are 3 file types that are generally useful to recognise:
Custom handling for other irregular/special files is less common (eg, fifos, devices, unix sockets,..).
IsDir
only provides the first which helps, but isn't enough to reduce the need to call lstat
for the other 2 cases (regular files,
symlinks).
A single method could provide these details avoiding the need for another lstat
, or separate IsDir
, IsRegular
, IsSymlink
methods. Typically file types either exist in dirents, or they don't. Hence I don't see any reason why they shouldn't be made available via DirEntry
(vs being limited to IsDir
).
There are probably 2 options for providing the FileType
enumeration:
os.FileMode
but explicitly document that permissions will never be set. Pro: doesn't require another type/constants.I prefer option 2, but option 1 would still solve the problem.
..and while limited, IsDir
is an improvement over the status quo.
FWIW, filepath.Walk does not follow symbolic links. My assumption is that fs.Walk will not either. Following symlinks without getting lost requires some kind of Tarjan algorithm for detecting cycles, and I don't think we want to burden regular fs.Walk with that, especially since part of the goal is resolving #16399 by making file walking faster, not slower. As long as IsDir() is false for symlinks, fs.Walk should skip past the symlink, and everything is fine, no?
os.File provides two ways to read a directory: Readdirnames returns a list of the names of the directory entries, and Readdir returns the names along with stat information.
On Plan 9 and Windows, Readdir can be implemented with only a directory read - the directory read operation provides the full stat information.
But many Go users use Unix systems.
On most Unix systems, the directory read does not provide full stat information. So the implementation of Readdir reads the names from the directory and then calls Lstat for each file. This is fairly expensive.
Much of the time, such as in the implementation of file system walking, the only information the caller of Readdir really needs is the name and whether the name denotes a directory. On most Unix systems, that single bit of information—is this name a directory?—is available from the plain directory read, without an additional stat. If the caller is only using that bit, the extra Lstat calls are unnecessary and slow. (Goimports, for example, has its own directory walker to avoid this cost.) In fact, a survey of existing Go code found that only about 10% of uses of ReadDir actually need more than names and is-directory bits.
It appears that a third way to read directories should be added, to let all this code be written more efficiently. Expanding on a suggestion by @mpx, I propose to add:
The FS proposal would then adopt this ReadDir and ignore Readdir entirely.
In #41188 I wrote:
I still believe that, but the survey convinced me that nearly all existing Readdir uses fall into this category, so it's not quite so bad to provide an optimized path for Unix systems. The DirEntry.Info method specification above allows both the eager info loading of Plan 9/Windows and the lazy loading needed on Unix. In contrast to #41188, the laziness is explicitly allowed from the beginning, and failures of the lazy loading can be reported in the error result.
Thoughts?
Update: A few clarifications to common questions:
Update 2: A few changes were made along the way to acceptnce. See https://github.com/golang/go/issues/41467#issuecomment-708536303 for the final version.