golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License
123.66k stars 17.62k forks source link

proposal: path/filepath: Add SameFile and InTree #42202

Open rasky opened 3 years ago

rasky commented 3 years ago

This proposal is alternative to https://github.com/golang/go/issues/42201.

When working with pathnames on filesystems, it is common to inspect the relations between two different paths. In prototypes, it is common to assume that pathnames are uniques and thus two pathnames refer to the same file iff the pathnames are equal. Since many applications operate on tree of files, it is also common to check whether a file is part of a specified tree given its root; this can naively be done with strings.HasPrefix(path, root).

In real world, filesystems offer a variety of different types of links which can make things harder. On top of this, Windows and UNIX have different filesystems with different concepts and incompatible APIs. So finding a common semantic (which is one of the goals of path/filepath) is complicated, and Go does not provide useful primitives for this.

In other languages such as C, it became common under UNIX to use a function such as realpath(3) to obtain a so-called "canonical name" of a path. By using such primitive, it is technically possible to implement equality and containment tests in the following way:

(Notice that there is a different kind of containment test, in which we don't want to follow leaf links, but this can be done with something like realpath(dir(leaf)) starts with realpath(root) so it's not really worth defining a different operation).

There are a few problems with canonicalization:

My proposal is to avoid the temptation to define a canonicalization function, and rather provide helpers to implement pathname tests for applications, to cover their needs. Thus, I propose to add the following functions:

package filepath

// SameFile checks whether path1 and path2 refer to the same underlying file, after all links
// have been resolved. It is a thin wrapper over os.SameFile, that provides convenience.
// On UNIX, SameFile works across symlinks, hardlinks and bind mounts.
// On Windows, SameFile works across symlinks, hardlinks, junctions, drive mappings.
func SameFile(path1, path2 string) bool

// InTree checks whether the file or directory referenced by leaf is contained within the
// directory tree pointed by root. All links (in either leaf or root) are resolved before checking
// for containment.
// One typical case in which this function is useful is to implement a jail in a virtual filesystem,
// to make sure that no I/O is performed outside of the specified root.
// On UNIX, InTree works across symlinks, hardlinks, and bind mounts.
// On Windows, InTree works across symlinks, hardlinks, junctions, drive mappings.
func InTree(leaf, root string) bool

It seems to me that these two functions should provide two primitives that can be used to solve canonicalization problems. I would be interested in hearing use cases which are not covered by these two functions but are solved by realpath or GetFinalPathNameByHandle.

ericwj commented 3 years ago

Go imho still has use for a (set of) function(s) that resolve links and a proper, safe and recommendable way when this is accepted. I hence don't see this as an alternative to #42201, but as a very useful addition.

Also I think this API will perform poorly, which is fine if used sparingly, but will add up checking lots of files or directories in a very deep directory structure. For the high volume calling scenario, I would add the fundamentals on top of which these functions are built to the public API for that reason, such that the application has the perhaps at least twice faster alternative available if it prefers, although it would make the application code slightly more complex.

ianlancetaylor commented 3 years ago

It seems to me that SameFile is the same as calling os.Stat twice followed by os.SameFile. Is there a more efficient way that it can be done in path/filepath?

ericwj commented 3 years ago

Indeed. InTree however will swamp Stat, potentially actually hitting the disk for it and that is totally not needed with some kind of context object that keeps track of a number of os.FileInfos for parent directories, or even just their file and volume ID. Using GetFileInformationByHandleEx with FileIdInfo instead of Stat on recent versions of Windows that have it will be even more efficient - and even required to automagically support ReFS volumes.

rasky commented 3 years ago

It seems to me that SameFile is the same as calling os.Stat twice followed by os.SameFile. Is there a more efficient way that it can be done in path/filepath?

Even if it was just a thin wrapper (which I'm not 100% sure yet), I think it's important to have it as part of path/filepath because equality of paths is a common scenario for which users will look for a solution there. I myself wasn't aware of the existence of os.SameFile until two weeks ago, and I've been importing third-party modules to have a realpath alternative to implement equality (even in a UNIX-only scenario) because I didn't know about it.

filepath.ToSlash and filepath.FromSlash are even thinner wrappers over strings.ReplaceAll so I think there's precedence there in wanting to provide common path manipulation functions irrespective of the actual complexity of building them on top of the rest of std.