golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License

os: add ReadDir method for lightweight directory reading #41467

Closed rsc closed 3 years ago

rsc commented 3 years ago

os.File provides two ways to read a directory: Readdirnames returns a list of the names of the directory entries, and Readdir returns the names along with stat information.

On Plan 9 and Windows, Readdir can be implemented with only a directory read - the directory read operation provides the full stat information.

But many Go users use Unix systems.

On most Unix systems, the directory read does not provide full stat information. So the implementation of Readdir reads the names from the directory and then calls Lstat for each file. This is fairly expensive.

Much of the time, such as in the implementation of file system walking, the only information the caller of Readdir really needs is the name and whether the name denotes a directory. On most Unix systems, that single bit of information (is this name a directory?) is available from the plain directory read, without an additional stat. If the caller is only using that bit, the extra Lstat calls are unnecessary and slow. (Goimports, for example, has its own directory walker to avoid this cost.) In fact, a survey of existing Go code found that only about 10% of uses of Readdir actually need more than names and is-directory bits.

It appears that a third way to read directories should be added, to let all this code be written more efficiently. Expanding on a suggestion by @mpx, I propose to add:

// ReadDir reads the contents of the directory associated with the file f
// and returns a slice of DirEntry values in directory order.
// Subsequent calls on the same file will yield later DirEntry records in the directory.
//
// If n > 0, ReadDir returns at most n DirEntry records.
// In this case, if ReadDir returns an empty slice, it will return an error explaining why.
// At the end of a directory, the error is io.EOF.
//
// If n <= 0, ReadDir returns all the DirEntry records remaining in the directory.
// When it succeeds, it returns a nil error (not io.EOF).
func (f *File) ReadDir(n int) ([]DirEntry, error) 

// A DirEntry is an entry read from a directory (using the ReadDir method).
type DirEntry interface {
    // Name returns the name of the file (or subdirectory) described by the entry.
    // This name is only the final element of the path, not the entire path.
    // For example, Name would return "hello.go" not "/home/gopher/hello.go".
    Name() string

    // IsDir reports whether the entry describes a subdirectory.
    IsDir() bool

    // Info returns the FileInfo for the file or subdirectory described by the entry.
    // The returned FileInfo may be from the time of the original directory read
    // or from the time of the call to Info. If the file has been removed or renamed
    // since the directory read, Info may return an error satisfying errors.Is(err, ErrNotExist).
    // If the entry denotes a symbolic link, Info reports the information about the link itself,
    // not the link's target.
    Info() (FileInfo, error)
}

The FS proposal would then adopt this ReadDir and ignore Readdir entirely.

In #41188 I wrote:

Various people have proposed adding a third directory reading option of one form or another, to get names and IsDir bits. This would certainly address the slow directory walk issue on Unix systems, but it seems like overfitting to Unix.

I still believe that, but the survey convinced me that nearly all existing Readdir uses fall into this category, so it's not quite so bad to provide an optimized path for Unix systems. The DirEntry.Info method specification above allows both the eager info loading of Plan 9/Windows and the lazy loading needed on Unix. In contrast to #41188, the laziness is explicitly allowed from the beginning, and failures of the lazy loading can be reported in the error result.

Thoughts?

Update: A few clarifications to common questions:

Update 2: A few changes were made along the way to acceptance. See https://github.com/golang/go/issues/41467#issuecomment-708536303 for the final version.

mpx commented 3 years ago

Just to clarify, I don't think fs.Walk should follow symlinks, but ideally it should be possible to implement efficiently outside of the standard library when needed (eg, similar to find -L, du -L,..). As above, a more concrete understanding of the file type may be necessary to efficiently choose the correct action, even if that action is ignoring symlinks/devices/fifos/...

mpx commented 3 years ago

I originally framed providing a FileType method as "nice-to-have", but I now think it's actually required to obtain a lot of the performance benefit on modern Unix systems.

Directories are typically less common than files. IsDir avoids an lstat for directories, but an lstat is still needed to confirm whether each remaining entry is a regular file (or something else).

Otherwise, developers will be encouraged to incorrectly assume !IsDir() indicates a regular file, resulting in latent bugs. Symlinks may be incorrectly opened (directory or other?), or FIFOs/devices incorrectly opened. This could result in otherwise avoidable errors/failures, or worse. The alternative would be sacrificing performance with another lstat.

interface {
  // EntryType provides the filetype.
  EntryType() EntryType

  // OR //

  // Only ModeType bits will be set. Permissions bits are not provided.
  FileType() os.FileMode
}

type EntryType int

const (
  EntryDir EntryType = iota + 1
  EntryFile
  EntrySymlink
  // ...
)

EntryType is similar to @benhoyt's recommendation above, but without providing EntryOther. Providing a "catch-all" entry type would prevent adding further constants since the meaning of EntryOther would change (hence broken compatibility).

My preference is for EntryType since it's easier to correctly implement and consume as an API. However, os.FileMode would still provide the desired performance benefit.

networkimprov commented 3 years ago

I am opposed to this proposal as currently drafted. To recap, the new {os,fs}.File.ReadDir won't retrieve FileInfo in some cases, unlike os.File.Readdir, so DirEntry.Info() sometimes reads from storage. Such implicit, unpredictable reads post-ReadDir will cause bugs, for example https://github.com/golang/go/issues/41467#issuecomment-698045156.

I've spent years writing cross-platform code calling the Go filesystem API. (Given the bugs and gaps I've found, I'd guess few folks have done much of that.) I've a sound basis to assert that cross-filesystem consistency is crucial. The problem must be fixed, not "documented" and left to bite ppl who expect the stdlib to be rational.

A fix outlined in https://github.com/golang/go/issues/41467#issuecomment-696891178 and clarified in https://github.com/golang/go/issues/41467#issuecomment-696928425 is trivial, and has other benefits.

Alternatively, io/fs can go ahead with its original API, ReadDir(n int) ([]FileInfo, error)

@jimmyfrasche has also raised this issue in https://github.com/golang/go/issues/41467#issuecomment-697924028 and https://github.com/golang/go/issues/41467#issuecomment-697838941.

jimmyfrasche commented 3 years ago

My current position is that I think it's a little weird that Info may either always return a cached result or always stat, and that it would be more uniform to cache the first call of Info on fses that don't already cache it. However, this probably won't be an issue, since you would generally call Info once and then discard the DirEntry in favor of the FileInfo, and call stat yourself if you needed up-to-date info later. Maybe documentation or examples would be sufficient to note that.

networkimprov commented 3 years ago

What does "call Info once and then discard the DirEntry in favor of the FileInfo" look like? That sounds like fd.Readdir(n).

mpx commented 3 years ago

Compromise will be required to provide a simple API for implementers/users that performs well on most systems.

The proposed single ReadDir call (with EntryType) would simplify io/fs implementations and provide good performance for most usage patterns & platforms. Caching is required to avoid throwing away work/performance on some platforms (due to the names+type and names+stat variation).

All the alternatives I've seen so far either limit performance by favouring a particular system API (names only, names+type, names+stat), or they are more complex:

I think the documented caching in this proposal will be unlikely to cause a problem in practice - based on @rsc's review and the cross platform tooling I've written. Documentation can make it clear that any code requiring a fresh FileInfo/DirEntry should call Stat (which is true anyway).

I still think ReadDir w/EntryType is the best compromise.

-- ^ Multiple methods force the caller to choose how much detail they need in advance, but this is typically an impossible choice to make per directory or entry.

rsc commented 3 years ago

@networkimprov, I hear you about Linux and Windows behaving differently. That is an explicit goal - maybe the explicit goal - of this proposal, so that it is possible to write a fast file walk on both systems. We are not going to compromise that.

rsc commented 3 years ago

As for the other bits, we could potentially add Type() os.FileMode that only includes the type bits, not the permission bits nor setuid/setgid/sticky. That would be OK. Then we would have:

// A DirEntry is an entry read from a directory (using the ReadDir method).
type DirEntry interface {
    // Name returns the name of the file (or subdirectory) described by the entry.
    // This name is only the final element of the path, not the entire path.
    // For example, Name would return "hello.go" not "/home/gopher/hello.go".
    Name() string

    // IsDir reports whether the entry describes a subdirectory.
    IsDir() bool

    // Type returns the type bits for the entry.
    // The type bits are a subset of the usual os.FileMode bits, the ones that cannot be changed after file creation.
    // That is, the type bits exclude the permission bits as well as ModeAppend, ModeTemporary,
    // ModeSetuid, ModeSetgid, and ModeSticky.
    Type() os.FileMode

    // Info returns the FileInfo for the file or subdirectory described by the entry.
    // The returned FileInfo may be from the time of the original directory read
    // or from the time of the call to Info. If the file has been removed or renamed
    // since the directory read, Info may return an error satisfying errors.Is(err, ErrNotExist).
    // If the entry denotes a symbolic link, Info reports the information about the link itself,
    // not the link's target.
    Info() (FileInfo, error)
}

People who want IsRegular can then use Type().IsRegular().

Are there any remaining objections to this proposal?

randall77 commented 3 years ago

Type().IsDir() works, so do we need IsDir also?

earthboundkid commented 3 years ago

FileMode already has a Perm method that returns m & ModePerm. Should it gain a Type method that returns m & ModeType just to make DirEntry.Type() more clear?

networkimprov commented 3 years ago

@rsc, the fix I outlined supports maximum performance on all filesystems, and includes the cached-IsDir model from your proposal. Pasting again:

func (f *File) ReadDir(k deKind, n int) ([]DirEntry, error) // k specifies cached results

type deKind int
const (
   KindName deKind = iota  // cache Name; possibly IsDir & Info
   KindIsdir               // cache Name & IsDir; possibly Info
   KindInfo                // cache all
)

type DirEntry interface {   // methods do not lstat()
   Name() string
   IsDir() (bool, error)    // return error if dirent type not cached
   Info() (FileInfo, error) // return error if FileInfo not cached
}

After f.ReadDir(fs.KindIsdir, n) you don't have to check the .IsDir() error. After f.ReadDir(fs.KindInfo, n) you don't have to check the .Info() or .IsDir() errors.

There is no need to discard cross-filesystem consistency to acquire performance. The only criticism of the above has been "it's verbose" which seems petty given its benefits.

ianlancetaylor commented 3 years ago

@networkimprov My understanding is that the benefit of your suggestion is that it specifies the exact time of the lstat (or equivalent) call. Is there another benefit?

That benefit is already achievable with @rsc's suggestion, by calling and caching the methods that you need as soon as you get the DirEntry. It's a little bit more work, but only in the uncommon case.

So unless there is another benefit to your suggestion, I think it is a valid criticism to say that it is verbose.

I'll also note that your suggestion could actually be implemented in terms of @rsc's suggestion.

earthboundkid commented 3 years ago

One consideration is that Readdir (little d) may be deprecated, but it's never (?) going to go away. Users who want KindName behavior can call Readdirnames and users who want KindInfo behavior can call Readdir.

rsc commented 3 years ago

@networkimprov, your suggestion - in addition to being more complex - does not work. Consider a file system walk that wants to recursively find and look at info for all *.go files in a directory tree. That walk needs de.Name() and de.IsDir() for every directory entry but only calls de.Info() if strings.HasSuffix(de.Name(), ".go") is true.

On Windows, the plain directory read will retrieve all the information. The de.Info() for the *.go files can return the cached result and avoid a stat for each *.go file.

On (new enough) Unix, the plain directory read will retrieve the names and type bits. The de.Info() for the *.go files can call stat on demand, as needed, and avoid a stat for all the other files.

If this code uses your fs.KindIsDir, it does unnecessary work on Windows. If this code uses your fs.KindInfo, it does unnecessary work on Unix. The API being proposed in this issue does exactly what is needed and nothing more.

rsc commented 3 years ago

@randall77, we don't strictly need IsDir, but we don't strictly need it in os.FileInfo either, and I'd rather not force people to remember which of these two interfaces have it and which don't.

networkimprov commented 3 years ago

Russ, fs.KindIsdir yields precisely your proposed ReadDir; it does no unnecessary work on Windows.

Ian, Go design typically considers Hyrum's Law. The primary benefit of my suggestion is consistent API behavior. If the source of the lstat is unspecified, programs will rely on whatever occurs where they originate.

Other benefits:

The latter is not uncommon; Russ' analysis of Readdir use found 12% invoke a FileInfo method other than Name() & IsDir(), https://github.com/golang/go/issues/41188#issuecomment-690879673. And the "little bit more work" is:

fi := make([]FileInfo, len(dir))
for i := range dir {
   var err error
   fi[i], err = dir[i].Info()
   if err != nil {
      // evasive action
   }
}

@carlmjohnson io/fs does not offer Readdirnames & Readdir.

earthboundkid commented 3 years ago

io/fs works with filesystems with different underlying capabilities. It seems like a pain to force them to implement all three modes when users are only going to want the best mode the system is capable of, and users may not be able to use a heuristic to find out which mode it is. So then the choice is lowest common denominator or best effort.

ianlancetaylor commented 3 years ago

@networkimprov The point of @rsc's example is that in order to implement that with your suggestion, it is necessary to pass fs.KindInfo, because in some cases the program needs the full Info. But that will cause extra unnecessary work for the cases where the program does not need the full Info. Or, you can pass fs.KindIsDir and call Info later if necessary, but that has the same problems that you are pointing out in @rsc's suggestion. So it seems that your suggestion either does extra work, or has the problems that your suggestion is intended to avoid. So again it seems like a valid criticism to say that your suggestion seems verbose.

networkimprov commented 3 years ago

Ian, fs.KindInfo is not necessary for that example, and fs.KindIsdir has no such problems. Below, there is no mystery about where storage is read, and the extra work is a single Lstat:

dir, err := f.ReadDir(fs.KindIsdir, 0)
if err ...
for i := range dir {
   if ok, _ := dir[i].IsDir(); ok {
      // recur
   }
   if strings.HasSuffix(dir[i].Name(), ".go") {
      fi, err := dir[i].Info()
      if err != nil {
         fi, err = os.Lstat(parent + dir[i].Name())
         if err ...
      }
   }
}

An extra argument, plus an Lstat in 1%-12% of cases, are a small price to pay for an API which cannot break when a program is ported -- or simply run on different hardware. (How is that even debatable??)

@carlmjohnson an io/fs implementation is free to satisfy .Info() for all three options, as Windows does.

ianlancetaylor commented 3 years ago

"An extra argument, plus an Lstat in 1%-12% of cases, are a small price to pay for an API which cannot break when a program is ported -- or simply run on different hardware. (How is that even debatable??)"

It's not debatable. But I don't agree with the premise. Your argument seems to hinge on the notion that Info might behave differently when called multiple times. But that is also true with the code you wrote just above. It's not true in exactly the same way, but it's still true.

networkimprov commented 3 years ago

.Info() delivers inconsistent results in race-free code if a directory item is changed, or the directory is renamed, before .Info() is invoked. Calling .Info() once will expose that; calling it multiple times is irrelevant.

My code above clarifies the low cost of the API fix I suggested (which you and Russ separately mischaracterized in prior posts). And its storage access is explicit.

I illustrate the design flaw that the fix addresses in https://github.com/golang/go/issues/41467#issuecomment-698045156.

ianlancetaylor commented 3 years ago

What do you mean by "race-free code" or, as you said earlier, a "raceless sequence of events"? I don't know how to understand that in conjunction with your suggestion here that a directory item can be changed, or a directory renamed, before Info is invoked.

A directory can be changed or renamed between the time that the directory entry is called and when lstat is called. That is true whether lstat is called by ReadDir or whether it is called by user code after ReadDir returns.

networkimprov commented 3 years ago

I spelled it out in https://github.com/golang/go/issues/41467#issuecomment-698045156 which iterated the []DirEntry twice in a realistic way. Here's a trivial example:

dir, err := f.ReadDir(n)
err = os.Rename(dirname, newname) // or any change to directory contents
for i := range dir {
   fi, err := dir[i].Info()       // will fail on most Linux & MacOS (works everywhere today)
   ...
}

"directory can be changed or renamed between the time that the directory entry is called and when lstat is called"

You've described a race condition due to concurrent directory access. The above code fails without that.

mpx commented 3 years ago

@networkimprov, a developer wanting to guarantee the full Info is available can simply do what os.File.Readdir does now - call Info and discard entries that are "not found". This "race condition" cannot be avoided on Unix - it must be handled somewhere.

Eg, Developers could use a helper function if they want to guarantee FileInfo is available (with the corresponding lstat performance disadvantage on modern Unix):

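import (
  "errors"
  "io/fs"
  "os"
)

// ReadDirInfo reads up to n entries from f (all remaining entries if n <= 0)
// and returns their FileInfo, skipping entries that no longer exist.
func ReadDirInfo(f fs.File, n int) ([]os.FileInfo, error) {
  d, ok := f.(fs.ReadDirFile)
  if !ok {
    return nil, errors.New("not a directory")
  }
  entries, err := d.ReadDir(n)
  if err != nil {
    return nil, err
  }
  infos := make([]os.FileInfo, 0, len(entries))
  for _, e := range entries {
    fi, err := e.Info()
    if errors.Is(err, fs.ErrNotExist) {
      // The entry was removed or renamed after the directory read.
      continue
    } else if err != nil {
      return infos, err
    }
    infos = append(infos, fi)
  }
  return infos, nil
}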

Accessing the filesystem is fundamentally "racy". Code must be written for this reality (handle errors, coordinate via locking/conventions, etc.).

mpx commented 3 years ago

Re documentation, I think it would be more important to mention that DirEntry.Type only returns the file type bits contained in os.ModeType, rather than listing the bits that aren't contained.

Adding func (os.FileMode) Type() os.FileMode would make it cleaner for implementations to guarantee they are only returning the type bits. Perm already exists, so it would be good to add the corresponding method for the type as well.

The proposal looks pretty good now. I have a mild preference for:

...but I understand there are advantages to maintaining existing conventions and avoiding a new type. I hope it can land in Go1.16, I definitely want to start using it (along with other proposals).

rsc commented 3 years ago

@networkimprov, you've had your say. Please stop. You are derailing the discussion at this point.

earthboundkid commented 3 years ago

I think if there isn't going to be a new EntryType, it makes sense to add func (os.FileMode) Type() os.FileMode to help out other filesystems. E.g. a Zip file FS could just call m.Type() in order to not accidentally provide too much info in a ReadDir call. We could also just document that it's m & os.ModeType, but it seems more likely to be done correctly if there's a method.

rsc commented 3 years ago

@mpx, I'm not sure about the need to add an os.FileMode.Type method, but I'd be happy to define an os.ModeType bit set and say only those bits are included. The positive definition of the type bits is given in the prose: "The type bits are ... the ones that cannot be changed after file creation."

Edit: Now I see ModeType already exists. Even better.

mpx commented 3 years ago

os.FileMode.Type certainly isn't needed, but I think it would be a low cost way of giving a cognitive nudge toward returning only the file type. It will make Type more prominent in the documentation.

There is already a bunch of duplication with DirEntry/FileMode:

I think adding os.FileMode.Type helps more than it hurts since os.FileMode would now be explicitly used to represent type-only, just like it is currently used to represent perm-only. It's also odd that a Perm method exists but Type doesn't.

As above, a better way to avoid confusion would be providing a dedicated file-type type instead of continuing to overload os.FileMode (combined type/perm). However, I understand the desire to avoid representing file types 2 different ways (os.FileMode, EntryType).

rsc commented 3 years ago

Given that we have Perm, I'm OK with adding Type (== m & ModeType) as well. Note that there are bits that are neither in Perm nor in Type (various extended attributes like setuid).

So it sounds like we are at:

// ReadDir reads the contents of the directory associated with the file f
// and returns a slice of DirEntry values in directory order.
// Subsequent calls on the same file will yield later DirEntry records in the directory.
//
// If n > 0, ReadDir returns at most n DirEntry records.
// In this case, if ReadDir returns an empty slice, it will return an error explaining why.
// At the end of a directory, the error is io.EOF.
//
// If n <= 0, ReadDir returns all the DirEntry records remaining in the directory.
// When it succeeds, it returns a nil error (not io.EOF).
func (f *File) ReadDir(n int) ([]DirEntry, error) 

// A DirEntry is an entry read from a directory (using the ReadDir method).
type DirEntry interface {
    // Name returns the name of the file (or subdirectory) described by the entry.
    // This name is only the final element of the path, not the entire path.
    // For example, Name would return "hello.go" not "/home/gopher/hello.go".
    Name() string

    // IsDir reports whether the entry describes a subdirectory.
    IsDir() bool

    // Type returns the type bits for the entry.
    // The type bits are a subset of the usual FileMode bits, those returned by the FileMode.Type method.
    Type() os.FileMode

    // Info returns the FileInfo for the file or subdirectory described by the entry.
    // The returned FileInfo may be from the time of the original directory read
    // or from the time of the call to Info. If the file has been removed or renamed
    // since the directory read, Info may return an error satisfying errors.Is(err, ErrNotExist).
    // If the entry denotes a symbolic link, Info reports the information about the link itself,
    // not the link's target.
    Info() (FileInfo, error)
}

// Type returns the type bits (m & ModeType).
func (m FileMode) Type() FileMode { return m & ModeType }

Are there any problems with this?

rsc commented 3 years ago

For the record, I (and @ianlancetaylor and the rest of the proposal review committee) understand and note @networkimprov's strong objections to the "The returned FileInfo may be from the time of the original directory read or from the time of the call to Info." semantics.

But again, that choice is fundamental to having an API that can be implemented efficiently on a variety of systems.

networkimprov commented 3 years ago

I don't wish to resume the debate or aggravate anyone, but I believe I've shown this statement to be inaccurate:

"... that choice is fundamental to having an API that can be implemented efficiently on a variety of systems."

rsc commented 3 years ago

Based on the discussion above, this seems like a likely accept.

israel-lugo commented 3 years ago

This LGTM. The only main thing that seems missing here from #40352 would be the ability to efficiently and portably retrieve the file's ID.

@rsc commented earlier:

The concept seems too special-purpose for a general interface, and a bit difficult to use correctly. The problem is that Windows file IDs and Unix inode numbers are not really identifiers: they only identify a file within a particular file system. Another file system can have a file with the same file ID/inode number. So to use them properly you need to combine them with some kind of identifier for the file system itself (like fsid_t or dev_t on Unix). It gets messy fast.

Also, if you are writing "tree replication", then you are already stat'ing all the files to get the other metadata, and you're already writing very OS-specific code to preserve all the OS-specific attributes. That same code can easily grab the info you need as far as inode number and file system identifier.

A general crawler would care about the file's inode if it's doing hard link matching. This is an important optimization for a backup/archival client, or e.g. for some batch processing.

If I need to stat the file, I'm not going to just stat it; I'm going to fopen then fstat the open file, to avoid any renaming races (because in the general case my next step will be to read the file). So that's 2 extra syscalls only to find out that the file is a hard link and give up :) By contrast, if I can access the inode for free from the dirent, I can skip the hard links directly, with 2*N fewer syscalls.

Of course, one still needs to track the file system, but stat'ing the directory as you descend gives you that for all files within it.

More generally, the thing is there isn't really any portable API to retrieve a file's ID. You have to assert FileInfo.Sys(), and even then that doesn't work on Windows since you'd need an extra syscall.

Would you please reconsider adding the func (d *DirEntry) FileID() (uint64, error) interface?

tv42 commented 3 years ago

"Of course, one still needs to track the file system, but stat'ing the directory as you descend gives you that for all files within it."

No, it doesn't. (On Linux.) A mount can be a file, and that file can belong to a different device. And across devices, inodes are meaningless.

$ echo local >one
$ touch two
$ sudo mount --bind /etc/os-release two
$ ls -li
total 8
7277784 -rw-r--r-- 1 tv   tv     6 10-08 14:52 one
4408303 -rw-r--r-- 1 root root 261 09-19 15:39 two
$ stat one
  File: one
  Size: 6               Blocks: 8          IO Block: 4096   regular file
Device: 40h/64d Inode: 7277784     Links: 1
Access: (0644/-rw-r--r--)  Uid: ( 1000/      tv)   Gid: ( 1000/      tv)
Access: 2020-10-08 14:52:51.855492173 -0600
Modify: 2020-10-08 14:52:51.855492173 -0600
Change: 2020-10-08 14:52:51.855492173 -0600
 Birth: -
$ stat two
  File: two
  Size: 261             Blocks: 8          IO Block: 4096   regular file
Device: 30h/48d Inode: 4408303     Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2020-10-07 14:20:50.072374181 -0600
Modify: 2020-09-19 15:39:00.000000000 -0600
Change: 2020-09-28 09:05:03.077958127 -0600
 Birth: -

To completely avoid races, you'd need openat(2) with O_PATH, then fstat(2) that to get the full device+inode info for real uniqueness and cross-device checks, and then you can use that fd in future operations, with openat(2) and for files AT_EMPTY_PATH.

The Go stdlib is not prepared to really cope with the above, at this time.

mpx commented 3 years ago

As above, a FileID method is problematic since any identifier is platform/filesystem specific and should cope with current and future platforms (maybe it will need to be expanded to 128bits, multiple fields, or different type?). Even then, platform specific knowledge is required to interpret it. If supported, it would be better handled similarly to os.FileInfo - with a Sys method:


type DirEntry interface {
  // ...

  // underlying data source (can return nil)
  Sys() interface{}
}

Another dirent type would need to be defined in syscall at a later time to make use of it. The syscall.Dirent struct on Unix platforms is intended for parsing dirents, and isn't suited to general use (eg, names are not strings, they are byte arrays with a separate d_reclen field for length).

Uses of Sys would be very rare, with the cost of requiring all implementations to provide an implementation (usually return nil). However, it would provide an escape hatch to enable platform specific optimisations later on (and perhaps avoid the need to re-implement platform specific directory/dirent processing elsewhere).

diamondburned commented 3 years ago

Given that DirEntry is an interface, couldn't we have a Syser interface or perhaps allow DirEntry to be type-asserted to other OS-specific types?

earthboundkid commented 3 years ago

I had the same thought sequence as mpx and diamondburned. I think we should accept the proposal as is, and then if we find that we really need a particular OS variant, we can add it as an interface assertion later, like

type MauveOSFD interface {
     GetMauveFD() float64  // really cutting edge research OS
}
if mfd, ok := entry.(MauveOSFD); ok {
    // ...
}

The main thing is fixing #16399 (which has been open for a long time) now. We can deal with other system specific optimizations later and having an assertable interface should make that possible.

mpx commented 3 years ago

To clarify, I'm leaning toward accepting the proposal as is. FileInfo already provides Sys, so adding a Sys method or similar to DirEntry is a performance optimisation for rare use cases. Admittedly performance is one of the main reasons for wanting file identifiers, so FileInfo.Sys is unlikely to help there. Very few implementations and consumers would use DirEntry.Sys, so it's probably best left out?

I'd appreciate a DirEntry solution for concrete filesystems, but perhaps this could be an optional interface that the os/syscall implementation can provide later. It would make even less sense for virtual file systems unless the concept of a "system file identifier" is generalised -- which is well beyond any reasonable scope for this and the io/fs proposal.

nightlyone commented 3 years ago

The mapping from DT_UNKNOWN to either IsDir == false or IsDir == true is still not clear. Does IsDir == false imply that I need to call Info or not?

mpx commented 3 years ago

@nightlyone DT_UNKNOWN is an implementation detail that callers don't need to know about: implementations will call lstat (and cache the result) to return the DirEntry with the necessary details. IsDir will always return true when the entry represents a directory, otherwise false.
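From the caller's side this might look like the sketch below, which assumes the os.ReadDir helper and DirEntry type as they shipped in Go 1.16. splitEntries and demo are hypothetical helpers written for this example; the caller only asks IsDir and never sees whether an lstat happened underneath.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// splitEntries lists dir once and separates subdirectories from everything
// else using only IsDir; no FileInfo is requested.
func splitEntries(dir string) (dirs, others []string, err error) {
	entries, err := os.ReadDir(dir)
	if err != nil {
		return nil, nil, err
	}
	for _, e := range entries {
		if e.IsDir() {
			dirs = append(dirs, e.Name())
		} else {
			others = append(others, e.Name())
		}
	}
	return dirs, others, nil
}

// demo builds a throwaway directory holding one subdirectory and one file,
// then classifies its entries.
func demo() (dirs, others []string, err error) {
	tmp, err := os.MkdirTemp("", "readdir-demo")
	if err != nil {
		return nil, nil, err
	}
	defer os.RemoveAll(tmp)

	if err := os.Mkdir(filepath.Join(tmp, "sub"), 0o755); err != nil {
		return nil, nil, err
	}
	if err := os.WriteFile(filepath.Join(tmp, "file.txt"), []byte("hi"), 0o644); err != nil {
		return nil, nil, err
	}
	return splitEntries(tmp)
}

func main() {
	dirs, others, err := demo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("directories:", dirs)
	fmt.Println("others:", others)
}
```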

networkimprov commented 3 years ago

DT_UNKNOWN causes an Lstat for that item during ReadDir and caches the FileInfo, so that .Info() calls are instantaneous. Performance is like (f *File) Readdir(n int) ([]FileInfo, error).
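A small simulation of that behaviour, not the actual os implementation: the dt* constants, rawDirent, entry, resolve, and alwaysDir are all hypothetical names, and the lstat function is injected so the number of calls can be counted.

```go
package main

import (
	"fmt"
	"io/fs"
)

// Hypothetical stand-ins for the d_type byte of a Unix dirent.
const (
	dtUnknown = iota
	dtDir
	dtReg
)

// rawDirent models what the plain directory read hands back.
type rawDirent struct {
	name  string
	dtype int
}

// entry keeps only the name and the type bits.
type entry struct {
	name string
	typ  fs.FileMode
}

// resolve converts raw dirents into entries, calling lstat only when the
// directory read did not supply a type. It also reports how many lstat
// calls were needed.
func resolve(raw []rawDirent, lstat func(name string) (fs.FileMode, error)) ([]entry, int, error) {
	entries := make([]entry, 0, len(raw))
	calls := 0
	for _, d := range raw {
		var typ fs.FileMode
		switch d.dtype {
		case dtDir:
			typ = fs.ModeDir
		case dtReg:
			typ = 0 // regular file: no type bits set
		default:
			calls++
			mode, err := lstat(d.name)
			if err != nil {
				return nil, calls, err
			}
			typ = mode.Type()
		}
		entries = append(entries, entry{name: d.name, typ: typ})
	}
	return entries, calls, nil
}

// alwaysDir is a fake lstat that reports every name as a directory.
func alwaysDir(name string) (fs.FileMode, error) {
	return fs.ModeDir | 0o755, nil
}

func main() {
	raw := []rawDirent{
		{name: "src", dtype: dtDir},
		{name: "main.go", dtype: dtReg},
		{name: "mnt", dtype: dtUnknown},
	}
	entries, calls, err := resolve(raw, alwaysDir)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, e := range entries {
		fmt.Printf("%s dir=%v\n", e.name, e.typ.IsDir())
	}
	fmt.Println("lstat calls:", calls)
}
```

With every dtype set to dtUnknown the call count equals the number of entries, which is the per-item cost being objected to here.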

This is one of the mistakes in this proposal: you can't know whether f.ReadDir() will make a single syscall, or one per directory item. And you can't know whether de.Info() will cause an Lstat.

I've been complaining about this for weeks; no one else seems to care.

israel-lugo commented 3 years ago

I am happy with everything else in this proposal, e.g. I think the optional interface to get inodes is an elegant solution if it ends up being implemented for concrete OSes.

That said, the issue that networkimprov raised is a problem. It's why I had included a flag in the other proposal #40352, to control whether the internal Lstat happens.

If I can get the file type for free, that's awesome. I can process directories differently for free. But if not, then I don't want extra lstats for all entries. I'm going to openat and fstat each non-directory file anyway to get its details without racing, so the intermediate lstat is a useless performance hit.

The internal lstat should not be transparent IMO, since it affects whether this API is an O(1) performance optimization or an O(n) performance hit. It may be useful for some clients, but in that case it should at least be optional, or an explicit step.
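The access pattern described here can be approximated as follows. This is only a sketch: it assumes os.ReadDir and DirEntry.Type as shipped in Go 1.16, opens each file by path with os.Open rather than with openat, and restricts itself to regular files so that opening a FIFO or device cannot block. regularSizes, statOpened, and demo are hypothetical helper names.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// statOpened opens path and stats the open descriptor, so the details
// describe the file that was actually opened.
func statOpened(path string) (int64, error) {
	f, err := os.Open(path)
	if err != nil {
		return 0, err
	}
	defer f.Close()

	info, err := f.Stat() // stat on the open file, not on the path
	if err != nil {
		return 0, err
	}
	return info.Size(), nil
}

// regularSizes uses the type from the directory read to skip everything
// that is not a regular file, then stats each remaining file after opening it.
func regularSizes(dir string) (map[string]int64, error) {
	entries, err := os.ReadDir(dir)
	if err != nil {
		return nil, err
	}
	sizes := make(map[string]int64)
	for _, e := range entries {
		if !e.Type().IsRegular() {
			continue // directories, symlinks, devices, and so on
		}
		size, err := statOpened(filepath.Join(dir, e.Name()))
		if err != nil {
			return nil, err
		}
		sizes[e.Name()] = size
	}
	return sizes, nil
}

// demo builds a throwaway directory with one subdirectory and one 5-byte file.
func demo() (map[string]int64, error) {
	tmp, err := os.MkdirTemp("", "fstat-demo")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(tmp)

	if err := os.Mkdir(filepath.Join(tmp, "sub"), 0o755); err != nil {
		return nil, err
	}
	if err := os.WriteFile(filepath.Join(tmp, "data.bin"), []byte("hello"), 0o644); err != nil {
		return nil, err
	}
	return regularSizes(tmp)
}

func main() {
	sizes, err := demo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(sizes)
}
```

In this pattern any lstat that the directory read performs internally is extra work, because every regular file is stat'ed again after it is opened.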

Note some relevant retrospective comments from the author of the equivalent Python functionality, here: https://github.com/golang/go/issues/40352#issuecomment-669656904

benhoyt commented 3 years ago

FWIW, I'm fine with this proposal now that we have Type() -- let's not go round and round on this again. It's the same with the Python version -- it's implementation (and with DT_UNKNOWN, filesystem) dependent whether the likes of IsDir will call stat. But that's okay. You need that information, so it has to be called. The "internal lstat" is not transparent in the Python version, and that hasn't been a problem.

networkimprov commented 3 years ago

Ben, it's not the same as the Python version; .IsDir() never calls lstat because f.ReadDir() does when it sees DT_UNKNOWN. That is baggage in the scenario @israel-lugo described.

benhoyt commented 3 years ago

@networkimprov Ah yes, you're right, sorry. Either way, I don't think it matters.

rsc commented 3 years ago

@israel-lugo, the rationale for DT_UNKNOWN handling is in https://github.com/golang/go/issues/41467#issuecomment-697776129.

Part of the rationale for making ReadDir do the lstat calls is that (1) Readdir already does and (2) we have yet to identify a common case where the lstat is needed.

Older systems - notably AIX and Solaris - do not include a type byte in the dirent structure at all. Those will need lstat.

On other systems, it is technically possible to get type DT_UNKNOWN, but does it happen in any common cases? If so, what are they? What operating systems, what file systems, and what additional conditions? Thanks.

gopherbot commented 3 years ago

Change https://golang.org/cl/261540 mentions this issue: os: add File.ReadDir method and DirEntry type

networkimprov commented 3 years ago

I've read that USB flash drive and CD-ROM filesystems often yield DT_UNKNOWN.

rsc commented 3 years ago

I've read that USB flash drive and CD-ROM filesystems often yield DT_UNKNOWN.

Assuming USB flash drive means FAT (VFAT, etc) and CD-ROM means ISO9660, both of those file systems lay out the file type bits right next to the file name in the physical storage. If a driver has loaded the name, it has the file type bits just sitting there in memory waiting to be used. If it insists on using DT_UNKNOWN instead, then the driver is written inefficiently. I don't believe we should make Go's APIs more complex just because inefficient drivers exist.

Even so, I looked into this a bit. Linux, FreeBSD, and OpenBSD all appear (from driver inspection) to set a proper type in their VFAT file system implementations and to leave DT_UNKNOWN in their ISO9660 implementations. Again, there's no reason they couldn't do the right thing for ISO9660 if users needed it. I can't find the source code for macOS's file system drivers.