golang / dep

Go dependency management tool experiment (deprecated)
https://golang.github.io/dep/
BSD 3-Clause "New" or "Revised" License

vendor/ verification #121

Closed sdboyer closed 6 years ago

sdboyer commented 7 years ago

dep needs a way of verifying that the contents of vendor/ are what's expected. This serves three purposes:

  1. Performance: without a verification mechanism, the only safe choice dep can make is to drop and recreate vendor in its entirety on every operation. This is wasteful and slow; any operation that avoids unnecessary disk writes to determine if vendor needs updating will be faster, and would be a huge boon.
  2. Transparency: dep status is very limited in what it can report about vendor/ if it can't verify that vendor is actually what's expected.
  3. Security: without a verification mechanism, we can't know that the code we're building with is the code we expect it should be. We haven't articulated a security model yet, but this is obviously a significant part of it.

The simple, obvious solution is to deterministically hash (SHA256, I imagine) the file trees of dependencies, then store the resulting digest in the lock. However, the need to perform vendor stripping (#120) complicates this, as that creates perfectly valid situations in which the code under vendor/ might differ from the upstream source. The more choice around stripping modes we create, the more complex and error-prone this challenge becomes.

Or, we might simply rely on the hashing performed by the underlying VCS - indeed, we already record "revisions" in the lock, so we're already sorta doing this. Relying on it is triply disqualified, though:

The only path forward I've come up with so far is to compute and record hashes for individual files contained in dependencies (perhaps before stripping, perhaps after). This would allow us to be more adaptive about assessing package integrity - e.g., we can know that it's OK if certain files are missing.

The downside is that these lists of file hash digests would be potentially huge (e.g., if you import k8s), though that might be mitigated by placing them under vendor/ rather than directly in the lock - sdboyer/gps#69. Also, without the larger security model in place, I don't know if disaggregating hashing to the individual file level might compromise some broader notion of integrity that will prove important.

We might also record just the project unit-level hash against some future day when we've reinvented GOPATH and are no longer relying on vendor.
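
For illustration, a minimal sketch of the per-file option described above, using only the standard library. Nothing here is dep's implementation: the path in main is purely illustrative, symlinks and directories are skipped, and error handling is kept minimal.

package main

import (
    "crypto/sha256"
    "fmt"
    "io"
    "os"
    "path/filepath"
)

// fileDigests walks the tree rooted at root and records a SHA-256 digest per
// regular file, keyed by its slash-separated path relative to root.
func fileDigests(root string) (map[string]string, error) {
    digests := make(map[string]string)
    err := filepath.Walk(root, func(path string, info os.FileInfo, err error) error {
        if err != nil || !info.Mode().IsRegular() {
            return err
        }
        f, err := os.Open(path)
        if err != nil {
            return err
        }
        defer f.Close()
        h := sha256.New()
        if _, err := io.Copy(h, f); err != nil {
            return err
        }
        rel, err := filepath.Rel(root, path)
        if err != nil {
            return err
        }
        digests[filepath.ToSlash(rel)] = fmt.Sprintf("%x", h.Sum(nil))
        return nil
    })
    return digests, err
}

func main() {
    m, err := fileDigests("vendor/github.com/pkg/errors")
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
    for p, d := range m {
        fmt.Printf("%s  %s\n", d, p)
    }
}

A per-file map like this could then be stored (or itself digested) in the lock, which is what would make it possible to know that it's OK if certain files are missing.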

calebhearth commented 7 years ago

I've tackled some verification of binaries in the past for a tool I built using GPG and SHA256: https://github.com/thoughtbot/ftree/tree/master/dist#verifying

sdboyer commented 7 years ago

@calebthompson Thanks for the link! I'll have a look through your stuff as soon as I can. Quickly, though - our goal here is verifying a source tree, not the compiled binary (maybe that matters, maybe it doesn't). I suspect we also have some thinking to do about what we're verifying against.

calebhearth commented 7 years ago

Definitely. We could pipe the entire tree into either tool to generate the signatures.

$ git ls-files **/*.go | xargs cat | gpg --detach-sign --armor -o tree.gpg
$ gpg --verify tree.gpg **/*.go
gpg: Signature made Tue Jan 24 10:00:37 2017 MST using RSA key ID A0ACE70A
gpg: Good signature from "Caleb Thompson <caleb@calebthompson.io>" [ultimate]
gpg:                 aka "Caleb Thompson <caleb@heroku.com>" [ultimate]
$ echo "foo" > foo.go
$ gpg --verify tree.gpg **/*.go
gpg: Signature made Tue Jan 24 10:00:37 2017 MST using RSA key ID A0ACE70A
gpg: BAD signature from "Caleb Thompson <caleb@calebthompson.io>" [ultimate]

Dr-Terrible commented 7 years ago

Performance: without a verification mechanism, the only safe choice dep can make is to drop and recreate vendor in its entirety on every operation. This is wasteful and slow; any operation that avoids unnecessary disk writes to determine if vendor needs updating will be faster, and would be a huge boon.

Security: without a verification mechanism, we can't know that the code we're building with is the code we expect it should be. We haven't articulated a security model yet, but this is obviously a significant part of it.

Some time ago the folks at Gentoo had to tackle the same exact problem for their package manager, Portage; they ended up generating a digest for the content of every first-level directory of their package manager's tree (which is file-based and quite huge).

The only path forward I've come up with so far is to compute and record hashes for individual files contained in dependencies (perhaps before stripping, perhaps after).

That is completely unnecessary. golang/dep can use the same solution as Gentoo's Portage by check-summing the vendor/ directories itself; yet hashing a directory's data poses non-trivial challenges. First things first: to guarantee the validity of the hash, the order of the directory's data must be predictable across all the OSes supported by golang/dep. For example, on *BSD systems a find -s vendor/ -type f -exec md5sum {} \; | md5sum¹ is fine because "find -s" guarantees the order of the directory listings, but on Linux and Windows you need to use a hack in the form of find vendor/ -type f -exec md5sum {} \; | sort -k 2 | md5sum to obtain the same result. Linux and Windows don't guarantee that file systems maintain their directory listings in a stable, predictable order.

Hence, listing the directory's data is a naive approach, and prone to errors; a better solution would be to hash the directory's data and its metadata together (stored as an archive): find vendor/ | cpio -o | md5sum. This way even changes to file permissions and ownership are detected, if that behaviour is required by golang/dep.

Yet even considering data+metadata doesn't solve the problem that content sorting differs from one file system to another, even on the same OS. Therefore, listing the directory's data+metadata through external tools can't produce consistent hashes on different OSes, leading to undefined behaviour and false positives.

The only viable solution (which is the same used by Gentoo's Portage and Git's tree-ish ID, btw) is doing content sorting directly within golang/dep: filter the content into a predictable order (such as alphabetical order, or another reliable heuristic), store it in a temporary archive, and then check-sum it. That is far better and more efficient than traversing and recording hashes for individual files contained in vendor/, not to mention that you can safely ignore #120 because you have control over how you traverse the vendor/ directory.

The Go standard library already offers everything that is required, from file-system abstraction to archive generation, in a cross-platform way, thus guaranteeing the uniqueness of the hashes across different platforms. As a final note, this solution would be marginally susceptible to length-extension attacks, depending on the hash function used.

¹ md5sum is used only as an example for brevity, but any hash function should work here.
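
For illustration, a minimal sketch of the archive-style approach described above, using only archive/tar and crypto/sha256 from the standard library: walk the tree, sort the paths, and stream a stripped-down tar rendering of each entry straight into the hash (streaming avoids even the temporary archive). This is not dep's implementation; timestamps and ownership are deliberately left out so the digest stays reproducible across machines, and could be added back if that behaviour were wanted.

package main

import (
    "archive/tar"
    "crypto/sha256"
    "fmt"
    "io"
    "os"
    "path/filepath"
    "sort"
)

// hashTreeAsArchive streams a stripped-down tar rendering of the tree rooted
// at dir into a SHA-256 hash. Entries are visited in sorted order and only
// stable header fields (name, type, permissions, size, link target) are kept,
// so the digest does not depend on readdir order, timestamps, or ownership.
func hashTreeAsArchive(dir string) (string, error) {
    var paths []string
    err := filepath.Walk(dir, func(path string, info os.FileInfo, err error) error {
        if err != nil {
            return err
        }
        paths = append(paths, path)
        return nil
    })
    if err != nil {
        return "", err
    }
    sort.Strings(paths) // deterministic ordering, independent of the file system

    h := sha256.New()
    tw := tar.NewWriter(h) // archive bytes stream straight into the hash
    for _, path := range paths {
        info, err := os.Lstat(path)
        if err != nil {
            return "", err
        }
        rel, err := filepath.Rel(dir, path)
        if err != nil {
            return "", err
        }
        hdr := &tar.Header{
            Name: filepath.ToSlash(rel),
            Mode: int64(info.Mode().Perm()),
        }
        switch {
        case info.Mode()&os.ModeSymlink != 0:
            target, err := os.Readlink(path)
            if err != nil {
                return "", err
            }
            hdr.Typeflag = tar.TypeSymlink
            hdr.Linkname = target
        case info.IsDir():
            hdr.Typeflag = tar.TypeDir
        case info.Mode().IsRegular():
            hdr.Typeflag = tar.TypeReg
            hdr.Size = info.Size()
        default:
            continue // sockets, devices, etc. are out of scope for this sketch
        }
        if err := tw.WriteHeader(hdr); err != nil {
            return "", err
        }
        if hdr.Typeflag == tar.TypeReg {
            f, err := os.Open(path)
            if err != nil {
                return "", err
            }
            if _, err := io.Copy(tw, f); err != nil {
                f.Close()
                return "", err
            }
            f.Close()
        }
    }
    if err := tw.Close(); err != nil {
        return "", err
    }
    return fmt.Sprintf("%x", h.Sum(nil)), nil
}

func main() {
    digest, err := hashTreeAsArchive("vendor")
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
    fmt.Println(digest)
}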

sdboyer commented 7 years ago

This is great info and context - thanks. Some quick thoughts:

Some time ago the folks at Gentoo had to tackle the same exact problem

I'm not sure it's exactly the same. This process is occurring on a user's system, probably when a lock.json is generated. The first sentence of that link reads:

MetaManifest provides a means of verifiable distribution from Gentoo Infrastructure to a user system

So, that's central-to-user; we have a user-to-user problem. That may not end up mattering, if only because there's no uniform, reliable way to verify a trusted upstream source. I'm just noting it while I absorb this information.

The only path forward I've come up with so far is to compute and record hashes for individual files contained in dependencies (perhaps before stripping, perhaps after).

That is completely unnecessary.

I think you may have missed the original reasoning for this: if we are to admit the possibility of vendor pruning (#120), and especially multiple/customizable types of pruning, then it would be valid to have a vendor/ directory with only a subset of the original files present in the upstream source. (I see you've commented over there, though, so this is definitely not news). Maybe that's OK - maybe changing any of those configuration options for what gets pruned necessarily changes the lock with the results of hashing the new vendored set.

But, there also may have to be additional files in vendor to allow for processing/code generation type tasks. That's much harder to integrate.

The only viable solution (which is the same used by Gentoo's Portage and Git's tree-ish ID, btw) is doing content sorting directly within golang/dep

The info about the issues with command-driven tooling is useful, but don't worry - we were never considering doing this any way other than directly within the tool, using the facilities provided in stdlib.

mikijov commented 7 years ago

Security: without a verification mechanism, we can't know that the code we're building with is the code we expect it should be. We haven't articulated a security model yet, but this is obviously a significant part of it.

On the security front, git, hg, bzr, and svn all rely on the broken SHA1 for basic integrity checks. (Even when git was first released, they made it explicitly clear that their security model is not based on the internal use of SHA1.)

I wonder if we need to tackle security at all? We inherit security from the underlying vcs. We inherit both vulnerabilities and future improvements this way. But more specifically, if security is paramount, the user should commit vendored code into their own repo. In this situation the only two possibilities for the code to be compromised are during the initial clone or in the user's repo. If the initial clone is suspect, then no security of ours can detect that. If the user's repo was compromised, then the lock file could be changed as well. In other words, what is the use case of layering extra security on top of the vcs?

sdboyer commented 7 years ago

@mikijov Security is its own issue, and yeah, we do have to seriously address it. However, this issue is about verification - related to, but independent of, security. Without this verification mechanism, it's much more difficult for us to check that the contents of a vendor/ directory are what we expect them to be (according to the lock file). Without the ability to check that, we have to fully regenerate vendor/ on every dep ensure, which is very wasteful. There are other implications for further down the line, too.

@karrick was working on this during the Gophercon hack day, and I think we have an algorithm together to do the hashing: https://github.com/karrick/dep/commit/9449c598c5fe5c4e50085ee699b16d065c6ad6a5

karrick commented 7 years ago

@sdboyer,

While working on the algorithm to recursively hash the file system contents rooted at a specified file system object, I came across a possible gotcha when we follow symbolic links. It made me consider that the reason filepath.Walk elides following symbolic links might be to avoid the infinite loops that a circular symbolic link would cause.

The naive solution to preventing an infinite loop is quite simple, but does not scale.

We could punt on this issue for now, and pretend that there are no symbolic links that would cause an infinite loop, and we can continue on the solution we discussed at GopherCon, but I wanted you to be made aware of it.

Or if you have any suggestions, please let me know.

karrick commented 7 years ago

Maybe you have a suggestion for quickly detecting whether the file system node has been visited already.

My initial idea was to use a map and perform a quick key check, even though this method involves a constantly growing map on the heap. However, because the runtime insulates us from the underlying os-specific data structures, I have found no way to grab the underlying fileStat (not FileInfo) information returned by the runtime in order to create a hash key from the node.

Therefore the only way I can think of to correctly determine whether the current node has been visited is to use the os.SameFile function with the current FileInfo and a list of FileInfo instances already visited.

package fs

import (
    "crypto/sha256"
    "fmt"
    "io"
    "os"
    "path/filepath"
    "sort"
    "strconv"

    "github.com/pkg/errors"
)

var pathSeparator = string(os.PathSeparator)

// HashFromNode returns a deterministic hash of the file system node specified
// by pathname, performing a breadth-first traversal of directories, while
// ignoring any directory child named "vendor".
//
// While filepath.Walk could have been used, that standard library function
// skips symbolic links, and for now, it's a design requirement for this
// function to follow symbolic links.
//
// WARNING: This function loops indefinitely when it encounters a loop caused by
// poorly created symbolic links.
func HashFromNode(pathname string) (hash string, err error) {
    // bool argument: whether or not to prevent file system loops due to
    // symbolic links
    return hashFromNode(pathname, false)
}

func hashFromNode(pathname string, preventLoops bool) (hash string, err error) {
    h := sha256.New()
    var fileInfos []os.FileInfo

    // Initialize a work queue with the os-agnostic cleaned up pathname.
    pathnameQueue := []string{filepath.Clean(pathname)}

    for len(pathnameQueue) > 0 {
        // NOTE: pop a pathname from the queue
        pathname, pathnameQueue = pathnameQueue[0], pathnameQueue[1:]

        fi, er := os.Stat(pathname)
        if er != nil {
            err = errors.Wrap(er, "cannot stat")
            return
        }

        fh, er := os.Open(pathname)
        if er != nil {
            err = errors.Wrap(er, "cannot open")
            return
        }

        // NOTE: Optionally disable checks to prevent infinite recursion when a
        // symbolic link causes an infinite loop, because this method does not
        // scale.
        if preventLoops {
            // Have we visited this node already?
            for _, seen := range fileInfos {
                if os.SameFile(fi, seen) {
                    goto skipNode
                }
            }
            fileInfos = append(fileInfos, fi)
        }

        // NOTE: Write pathname to hash, because hash ought to be as much a
        // function of the names of the files and directories as their
        // contents. An added benefit is that empty directories affect the
        // final hash value.
        //
        // Ignore return values from writing to the hash, because hash write
        // always returns nil error.
        _, _ = h.Write([]byte(pathname))

        if fi.IsDir() {
            childrenNames, er := fh.Readdirnames(0) // 0: read names of all children
            if er != nil {
                err = errors.Wrap(er, "cannot read directory")
                return
            }
            // NOTE: Sort children names to ensure deterministic ordering of
            // contents of each directory, so hash remains same even if
            // operating system returns same values in a different order on
            // subsequent invocation.
            sort.Strings(childrenNames)

            for _, childName := range childrenNames {
                switch childName {
                case ".", "..", "vendor":
                    // skip
                default:
                    pathnameQueue = append(pathnameQueue, pathname+pathSeparator+childName)
                }
            }
        } else {
            // NOTE: Format the file size as a base 10 integer, and ignore
            // return values from writing to the hash, because hash write always
            // returns a nil error.
            _, _ = h.Write([]byte(strconv.FormatInt(fi.Size(), 10)))
            _, er = io.Copy(h, fh)
            err = errors.Wrap(er, "cannot read file") // errors.Wrap only wraps non-nil, so elide checking here
        }

    skipNode:
        // NOTE: Close the file handle to the open directory or file.
        if er = fh.Close(); err == nil {
            err = errors.Wrap(er, "cannot close")
        }
        if err != nil {
            return // early termination if error
        }
    }

    hash = fmt.Sprintf("%x", h.Sum(nil))
    return
}

sdboyer commented 7 years ago

@karrick i've actually spent way, way too much time thinking about the roiling hellscape that is symlinks. i've spent some time investigating extant symlink cycle detection algorithms, and they...suck. just, categorically, suck. turns out, graphs are hard.

however, for once, i think i have a happy answer to a symlink-related question! ๐Ÿ˜„

basically: the hasher should never, ever traverse symlinks. if it encounters one, then it should instead hash the contents/address of the link itself (basically, the output of os.Readlink()). this is essentially the behavior of pkgtree.ListPackages() right now, which matters because that function is the one that creates a virtual representation of the filesystem that, in turn, is used to decide what things actually exist.

i believe this is sufficient for all cases because of some key invariants:

  1. only the bits in files that are reachable as descendants of the project tree matter
  2. all of the bits in the tree are always incorporated by the hasher
  3. the ability to hash a tree does not guarantee that there are no broken symlinks in that tree

(it is not a coincidence that these invariants mirror those of snapshot-based version control systems)

i'm not entirely sure of this approach (i never am with symlinks, because symlinks ๐Ÿ˜ฑ๐Ÿ’ฅ๐Ÿ’€), so lemme walk it through.

symlinks can vary along a number of independent dimensions that are relevant for our purposes:

  1. referent is file or directory (or another symlink)
  2. symlink is absolute or relative
  3. symlink's referent escapes the tree that encompasses the original link (so, if relative, it contains at least one ..)

as long as we do not embed a guarantee in the hashing process that symlinks reaching outside of a project tree must exist/be valid, then we're fine to simply record the target of the symlink. if the symlink escapes the tree (which, if we want to defend against, is something we should enforce elsewhere), then we're making no correctness guarantee anyway. if it does not, then the structure of the hashing algorithm guarantees that the bits contained in/under the referent will be reached through normal, non-symlink means. given that, there's no added value to hashing the bits again; it's sufficient to simply record that a link exists.

the only real question in my mind with this is how we record the fact - in a cross-platform way - that a given file is a symlink in the hashing inputs itself, so that the hasher disambiguates between a normal file containing an address and a symlink with that same address.
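
One possible encoding, sketched with hypothetical names (this is not dep's actual hasher): give every entry a type marker in the hash input, so a symlink contributes an "L" marker plus its os.Readlink target, while a regular file contributes an "F" marker plus its size and bytes. The markers and NUL separators keep a regular file whose contents happen to equal a link target from colliding with the link itself.

package fs

import (
    "fmt"
    "hash"
    "io"
    "os"
)

// writeEntry feeds one file system node into the tree hasher. fi is expected
// to come from os.Lstat (or an lstat-based walk), so symlinks are visible as
// symlinks rather than as their referents.
func writeEntry(h hash.Hash, path string, fi os.FileInfo) error {
    // NUL separators keep the concatenated fields unambiguous.
    fmt.Fprintf(h, "%s\x00", path)
    switch {
    case fi.Mode()&os.ModeSymlink != 0:
        target, err := os.Readlink(path)
        if err != nil {
            return err
        }
        _, err = fmt.Fprintf(h, "L\x00%s\x00", target)
        return err
    case fi.IsDir():
        _, err := fmt.Fprintf(h, "D\x00")
        return err
    default:
        f, err := os.Open(path)
        if err != nil {
            return err
        }
        defer f.Close()
        if _, err := fmt.Fprintf(h, "F\x00%d\x00", fi.Size()); err != nil {
            return err
        }
        _, err = io.Copy(h, f)
        return err
    }
}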

karrick commented 7 years ago

@sdboyer, thanks for the feedback.

I spent a bit of time today looking through the surrounding dep code for where I'd hook this in, but I did not run across it yet. But I did see that other parts of dep were ignoring symlinks, and was curious about the necessity of following them in the hashing function if we ignore them elsewhere.

I'll update my fork tomorrow based on your feedback, and send a PR your way for the hashing code.

sdboyer commented 7 years ago

sounds great!

I spent a bit of time today looking at the surrounding dep code looking for where I'd hook this into, but I did not run across it yet.

yeah, i hadn't previously carved out a place for it. i have some basic thoughts on where to fit it in:

first and simplest would be to just add a workhorse func to internal/gps/pkgtree: HashPackageTree(path string). i suspect that returning a []byte is probably best for future flexibility, but i could be convinced of a fixed-length array, too.

then, we'll want something like gps.VerifyDepTree(basedir string, l Lock) (map[ProjectRoot]MismatchType, error). the basedir would either be an actual vendor dir, or at least have the same semantics. MismatchType would basically be Go's anemic version of an enum:

type MismatchType uint8
const (
    NoMismatch MismatchType = iota
    EmptyDigestInLock
    DigestMismatchInLock
    NotInLock
    NotInTree
)

i feel like these signatures and names may be enough to suggest the desired behavior, but lmk if more detail would be helpful. (also, we need a better name than MismatchType, as that's too general for this special-purpose type)

eventually, i'd like to see this func called in a more flexible way from up in dep - e.g., we dispatch the hasher in the background, prior to running the solver. we should be able to get some parallelism benefits there; both exert CPU pressure, but to the extent they are I/O-bound, the hashing code is disk-intensive, whereas the solver is often more network-intensive.

but, let's have a first PR just focus on these two functions, and the concomitant additions to gps.LockedProject (it needs the new hash digest). once we're happy with those, we can actually start making use of the verification mechanism during normal operations.
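
For illustration, a hypothetical sketch of how a caller in dep might consume the result of the proposed VerifyDepTree. The gps identifiers are the ones proposed above rather than an existing API, and reconcileVendor is a made-up name.

// reconcileVendor decides what to do with each project root based on the
// proposed verification result: leave it, remove it, or regenerate it.
func reconcileVendor(statuses map[gps.ProjectRoot]gps.MismatchType) (rewrite, remove []gps.ProjectRoot) {
    for root, status := range statuses {
        switch status {
        case gps.NoMismatch:
            // vendor/<root> already matches Gopkg.lock; leave it untouched
        case gps.NotInLock:
            // dep owns vendor/ entirely, so anything not in the lock is removed
            remove = append(remove, root)
        default:
            // EmptyDigestInLock, DigestMismatchInLock, NotInTree: regenerate
            rewrite = append(rewrite, root)
        }
    }
    return rewrite, remove
}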

karrick commented 7 years ago

Someone feel free to correct me if I'm wrong, but IIRC one of the reasons git prefixes a file's contents with its size when hashing is that SHA-1 was chosen for its speed and "good enough" quality, despite not being a very long hash. However, in dep we're free to use SHA-256, which is both fast and long, and it seems a bit cargo-cultish to include file sizes when creating a hash of the contents.

This comment doesn't really change the algorithm much, but I thought I'd point that out. We can always remove the file size from the hash in the future, and worst case people will re-download their dependencies another time.

sdboyer commented 7 years ago

SHA-1 was chosen for its speed and "good enough" quality, despite not being a very long hash. However, in dep we're free to use SHA-256, which is both fast and long, and it seems a bit cargo-cultish to include file sizes when creating a hash of the contents.

i'm very much neither a cryptographer nor a cryptanalyst, so i should start by categorically admitting that what's going on here is at least somewhat, if not entirely, cargo culting. and, on re-reading Linus' response to the SHA1 collision earlier this year, it seems like the length prefixing strategy may be less helpful in our case, as we do not have the equivalent of a blob object with headers (which include length) to generate, and then incorporate back up through the tree object, and into the commit object & the Merkle DAG.

then again, that might not be that relevant; if the attacker has some feasibly cheap attack on the hashing algorithm, then repeating it a few times for each of those levels probably isn't prohibitive.

i think the real question here is whether the necessity of generating correlated data could thwart a particular attack. that is, if the attacker has found some weakness in SHA256 such that they can write a data generation algorithm with a better-than-random chance of producing a targeted digest, then would forcing that algorithm to also consider the length of the data it is generating at each step undermine its effectiveness? basically, we're reducing degrees of freedom.

again, i'm not a cryptanalyst, and i've no idea what algorithms like these look like. but, even with so many more bits in SHA256 than SHA1 it seems like at least some attacks would probably be mitigated by this. so, given how easy it is for us to add the length prefix, and that there's no (?) downside for us in doing so, i'm still inclined to include it. (but, still very open to being convinced it's pointless ๐Ÿ˜„)

We can always remove the file size from the hash in the future, and worst case people will re-download their dependencies another time.

while it's definitely true that changing the algorithm would not have the same catastrophic effects as it would in git - again, we aren't building Merkle trees - i still don't think we should look at this as a road easily taken. there'll still be the body of committed Gopkg.lock files with the old data, and we'll have to make the tool interoperate with them. that's a non-zero cost. so, i'd like to be as sure as we can reasonably be that we're happy with our implementation from right out the gate.

sdboyer commented 7 years ago

oh, interesting - again, SUPER out of my depth here, but perhaps it's even more important to do this, as including the length prefix may (???) thwart length extension attacks, and i believe that at least crypto/sha256 is based on SHA2/FIPS 180-2 which is known to be vulnerable to such attacks. (golang.org/x/crypto/sha3 does have new algos without such vulnerabilities, though)

sdboyer commented 7 years ago

ohhh - another, more pedestrian question - the impl you pasted in above appears to feed everything into one giant SHA256 hash writer for all the files in a tree. does that not mean that, while feeding the hasher, the algorithm will need to hold open memory equal to the size of all files in the tree? i'm now picturing us OOMing on, say, k8s.

if so, could we avoid that by mimicking git a bit further - hash each file individually, then build a tree of the digests + paths that, in turn, we hash for the final digest value?

karrick commented 7 years ago

@sdboyer, excellent question about one giant hash instance. It turns out that Go hash implementations do not store every byte ever written to them. They do have a few internal buffers used to chunk bytes to the hash in fixed-size quantities: SHA-256 has a 64-byte chunking buffer and a 32-byte hash buffer, which is continuously updated as bytes are written to the hash.

In other words it is perfectly fine to instantiate a single hash, keep writing bytes, and to take the final sum when complete. For our particular use, I did not see the purpose of creating new hashes for every file system node, just to hash the results of their hashes into some master hash.
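
A tiny self-contained demonstration of that streaming property, using nothing beyond the standard library: the same bytes fed to a single SHA-256 hasher in one write or one byte at a time produce the same digest, and memory use stays constant either way.

package main

import (
    "bytes"
    "crypto/sha256"
    "fmt"
)

func main() {
    data := []byte("the quick brown fox jumps over the lazy dog")

    // Hash the whole input with a single write...
    all := sha256.New()
    all.Write(data)

    // ...and again one byte at a time. The hasher keeps only its fixed-size
    // internal state between writes, so memory use does not grow with input.
    streamed := sha256.New()
    for _, b := range data {
        streamed.Write([]byte{b})
    }

    fmt.Println(bytes.Equal(all.Sum(nil), streamed.Sum(nil))) // prints: true
}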

karrick commented 7 years ago

@sdboyer, when I answered your previous question, I had meant to throw a new question back your way.

I understand your description of the proposed HashPackageTree function, along with the MismatchType enum-like construct. However, this feature involves creating a check that ensures whatever library is written in the vendor/ directory matches what is in the Gopkg.lock file, by checking to ensure each library in vendor/ has a directory hash that matches some pre-computed value.

I think it is a safe assumption that libraries are the list of directories that are three levels down from the original vendor/ directory. Here's an example from the dep project to illustrate the assumption I'm going to make:

The following directories would be considered libraries in the vendor directory, and a hash of their directories would be calculated and compared to what we will soon store in the Gopkg.lock file:

vendor/github.com/armon/go-radix
vendor/github.com/go-yaml/yaml
vendor/github.com/Masterminds/semver
vendor/github.com/Masterminds/vcs
vendor/github.com/pelletier/go-buffruneio
vendor/github.com/pelletier/go-toml
vendor/github.com/pkg/errors
vendor/github.com/sdboyer/constext

The following directories are more than three levels down from the vendor/ directory, and would be ignored:

vendor/github.com/pelletier/go-toml/cmd
vendor/github.com/pelletier/go-toml/cmd/tomljson
vendor/github.com/pelletier/go-toml/cmd/tomll

If we cannot make this assumption, and code libraries could be at a different depth from the vendor/ directory, then I will have to rethink my current implementation plan. Let me know if you think this assumption is invalid.

karrick commented 7 years ago

@sdboyer, I agree with your point concerning reducing the degrees of freedom for a determined crypto attack. Thanks for bringing it up.

sdboyer commented 7 years ago

However, this feature involves creating a check that ensures whatever library is written in the vendor/ directory matches what is in the Gopkg.lock file, by checking to ensure each library in vendor/ has a directory hash that matches some pre-computed value.

yes - that's pretty much the flow i'm imagining for us :) is there an alternative?

I think it is a safe assumption that libraries are the list of directories that are three levels down from the original vendor/ directory.

if only it were ๐Ÿค•

launchpad, which supports two-level roots (e.g. launchpad.net/gocheck) is an obvious exception, but the actual ruleset makes no guarantee of three levels.

nevertheless, i don't think you need to rethink implementation much. while we can't know a priori how deep the root is along any given path, the root location is knowledge the rest of the system already has - and it's represented in the gps.Lock, via Lock.Projects(). the HashPackageTree(path string) func i described will need to be called for each of those roots.

the only slightly tricky bit would be the NotInLock case - when there's directories in vendor that are not equal to or logical children of any of the roots given in the locked project list. still, that shouldn't be too bad - gps already makes liberal use of github.com/armon/go-radix for tasks like these, and isPathPrefixOrEqual() should get you most of the rest of the way. (it's also possible that we may not actually need to hash such dirs at all).

karrick commented 7 years ago

the only slightly tricky bit would be the NotInLock case

Exactly why I proposed the 3-layer down rule, but glad you responded with the launchpad.net counter example so we could rule that out. Furthermore, if there are 2-layer down libraries, there equally might be 4-layer down libraries.

The problem is identifying, while walking the directory tree, when the function comes upon a directory that is the top level of a library.

... is there an alternative?

I think I might have one. Walk the vendor/ tree with the project dependency list in hand. When a particular directory matches one of the dependencies declared in the list, then hash that particular directory as one would normally do, to ensure its contents match what is stored in the Gopkg.lock file. Every other file system node will be put in a list of extras, and that extra list ought to be empty. dep ensure could either return an aptly worded error message or choose to remove the file system nodes in that list, to ensure nothing can pollute the project build.

While being able to logically group extra directories into libraries might be desired, there is no way to identify where the top level of a library begins without heuristics, such as looking for the presence of a README.md file, a LICENSE file, a CONTRIBUTING file, or maybe even one of the VCS directories.

Perhaps we consider using a hybrid approach where directories matching one of the dependencies in Gopkg.lock would be called out as a dependency, yet we are free to use heuristics for extra directories found in vendor/.

sdboyer commented 7 years ago

Furthermore, if there are 2-layer down libraries, there equally might be 4-layer down libraries.

Or six. Or ten. There are no guarantees here. go get HTML metadata, or infixed vcs extensions, allow this position to be arbitrary. The problems arising from this are one of the reasons why i'd like to see fixed-length roots for registries - https://github.com/golang/dep/issues/174#issuecomment-275589961

The problem is identifying, while walking the directory tree, when the function comes upon a directory that is the top level of a library.

Ah, yes, so, this is one crucial bit.

dep makes the assumption that the repository root is the same as the project root (for dependencies - you can make things work a little differently if e.g. you want to manage a monorepo that's never imported). This equivalency relationship is defined in large part because it makes figuring out where project roots are vastly easier. We just have to figure out the repo root - which means relying on import path deduction in the same way go get does today.

That deduction behavior is encapsulated in SourceManager.DeduceProjectRoot(). That method is generally quite optimized (and all *SourceMgr method impls are threadsafe), and so is safe to call under normal circumstances.

...though I think that calling it on the leading elements of a path (e.g. ghe.somecompany.com/sdboyer but not yet ghe.somecompany.com/sdboyer/gps) could end up inducing an HTTP request to look for go get metadata, and then not cache the failure to retrieve it. The system wasn't optimized for iteratively walking down paths in this way.

In any case:

there is no way to identify where the top level of a library begins without heuristics

We don't have to rely on heuristics for this, because the decision is made entirely on the basis of import path structure. That makes classification much, much easier.

to ensure nothing can pollute the project build.

Another crucial bit, here - dep has the viewpoint that it wholly owns the vendor/ directory, and today dep ensure blows it away with great prejudice. While this verification logic will allow us to be more selective in modifying vendor/, we're not changing the basic assumption of complete ownership. In the land of dep, there is only one known good state for vendor/: exactly what is described in Gopkg.lock, nothing less, and nothing more.

Perhaps we consider using a hybrid approach where directories matching one of the dependencies in Gopkg.lock would be called out as a dependency, yet we are free to use heuristics for extra directories found in vendor/.

Because of the two "crucial bits" above - we have a clear system for deciding where roots are, and we aggressively and non-optionally trim detritus from vendor/ - I think that we can generally get away without giving heuristics any role here.

Walk the vendor/ tree with the project dependency list in hand. When a particular directory matches one of the dependencies declared in the list, then hash that particular directory as one would normally do, to ensure its contents match what is stored in the Gopkg.lock file. Every other file system node will be put in a list of extras, and that extra list ought to be empty.

This also sounds generally fine, though it has the issue that you'll need to keep some kind of a trail - i.e., the dir github.com itself will need to be marked as an extra until dependencies beneath it are found that match entries in Gopkg.lock.

Off the top of my head, an approach i might take using a radix tree would be:

  1. Iterate over the Lock.Projects(), and for each of the root path entries in there...
  2. Split the entries on /, inserting all the leading elements with a marker indicating "prefix of known root, but not a root" - say, 1
  3. Insert the actual root elements with a marker indicating as much - say, 2
  4. Walk down vendor/, checking for the longest prefix match against the radix tree:
     i. If it's a 2, we're on a root - stop walking down and call the hasher
     ii. If it's a 1 and the current path == prefix, add any local non-directory files to the extra list, then keep walking
     iii. If it's a 1 and the current path is longer than the prefix, stop walking and record the whole dir path as an extra
     iv. If there's no match at all, stop walking and record the path as an extra

Unless i've missed something, this approach would obviate the need for doing root deduction at all. It does, however, mean that some context may be lost if we were to choose to show any messages - e.g., if the last project from bitbucket.org is being removed, then messages will only reflect that bitbucket.org is being removed, not the actual project. I don't think that's a big problem, though.
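
A minimal sketch of steps 1-4 using github.com/armon/go-radix, with illustrative names only. A real implementation would also need a path-boundary check along the lines of isPathPrefixOrEqual(), since LongestPrefix matches string prefixes rather than path segments.

package main

import (
    "fmt"
    "strings"

    radix "github.com/armon/go-radix"
)

// markers stored in the radix tree, mirroring steps 2 and 3 above
const (
    prefixOfRoot = 1 // leading element of a known root, but not a root itself
    projectRoot  = 2 // an actual root from Lock.Projects()
)

// buildRootIndex inserts every locked project root plus its leading path
// elements into a radix tree (steps 1-3).
func buildRootIndex(roots []string) *radix.Tree {
    tree := radix.New()
    for _, root := range roots {
        parts := strings.Split(root, "/")
        for i := 1; i < len(parts); i++ {
            prefix := strings.Join(parts[:i], "/")
            if _, ok := tree.Get(prefix); !ok {
                tree.Insert(prefix, prefixOfRoot)
            }
        }
        tree.Insert(root, projectRoot)
    }
    return tree
}

// classify reports what step 4 would do with a directory found while walking
// vendor/ (path is slash-separated and relative to vendor/).
func classify(tree *radix.Tree, path string) string {
    prefix, val, ok := tree.LongestPrefix(path)
    switch {
    case !ok:
        return "extra: no match at all"
    case val == projectRoot:
        return "root: stop walking and call the hasher"
    case path == prefix:
        return "prefix of a root: keep walking, record loose files as extras"
    default:
        return "extra: under a known prefix but not under a known root"
    }
}

func main() {
    tree := buildRootIndex([]string{
        "github.com/pkg/errors",
        "github.com/Masterminds/semver",
    })
    fmt.Println(classify(tree, "github.com"))                 // keep walking
    fmt.Println(classify(tree, "github.com/pkg/errors"))      // hash this root
    fmt.Println(classify(tree, "github.com/unknown/project")) // extra
}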

karrick commented 7 years ago

@sdboyer,

I've got some interesting preliminary benchmarks from a new custom directory walking function.

First we establish a baseline using the find utility.

[karrick@linux walk-stdlib]$ time find ~ >/dev/null

real    0m17.218s
user    0m2.437s
sys     0m5.279s

Then we walk the directory tree using the standard library filepath.Walk function.

[karrick@linux walk-stdlib]$ cd ..
[karrick@linux examples]$ time walk-stdlib/walk-stdlib ~ >/dev/null

real    0m20.882s
user    0m7.407s
sys     0m13.117s

Then we walk the directory tree using a new faster walk function.

[karrick@linux examples]$ time walk-fast/walk-fast ~ >/dev/null

real    0m10.332s
user    0m10.557s
sys     0m2.827s

I know some of that is due to virtual file system caching, but it's at least a promising start.

karrick commented 7 years ago

After many more benchmarks, the find utility finishes the same directory structure in less than 4 seconds. So what I posted above was not accurate because the OS caches were not primed.

However, the fast walking example consistently performs at 2x speed as compared to the version that uses the standard library filepath.Walk function.

karrick commented 7 years ago

The modified code works well on Linux and Mac, but not on Windows, because of the different syscalls available, as expected. I think godirwalk ought to have two different code paths with the correct build tags, so it naturally runs on different operating systems.

karrick commented 7 years ago

I have extracted the directory walking logic to https://github.com/karrick/godirwalk, which now works with unix and Windows. Incidentally it even corrects a subtle bug on Windows in the standard library: on Windows, a symlink to a directory has multiple type mode bits set. The logic in filepath.Walk fails to detect that situation and attempts to follow those symlinked directories. By reversing the bit checks, the godirwalk project correctly detects a symlinked directory and either ignores it or follows it, depending on how the walk function is invoked. I have updated my local copy of dep to use godirwalk.Walk and all tests still pass. Weren't there a few tests that had to be deactivated on Windows?
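
A minimal sketch of the ordering described above - checking the symlink bit before the directory bit, so that a symlink to a directory is treated as a link rather than descended into. The function name is illustrative; this is not godirwalk's code.

package fs

import "os"

// nodeKind classifies a node encountered during a walk, using lstat-style
// FileInfo. The symlink check comes first because a symlink to a directory
// can carry more than one type bit on Windows; checking IsDir first would
// make a walker descend through the link.
func nodeKind(fi os.FileInfo) string {
    mode := fi.Mode()
    switch {
    case mode&os.ModeSymlink != 0:
        return "symlink" // record the os.Readlink target; do not descend
    case mode.IsDir():
        return "directory" // descend
    default:
        return "file"
    }
}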

sdboyer commented 7 years ago

sounds like a good time for a PR to switch out our internal impl and use godirwalk for the hasher, then ๐Ÿ˜„

CrushedPixel commented 6 years ago

Just wondering what the state of this issue is - it seems like @karrick came up with a cross-platform method of creating a hash of each vendored dependency. What steps have to be taken next to push resolution of this important issue forward?

qknight commented 6 years ago

@sdboyer in nixos the problem of generating a hash of a git repository is already solved. i've added some notes here: https://github.com/nixcloud/dep2nix#golang-dep-git-repo-hash-extension

i love the dep effort and using dep2nix it is really easy now to finally nail down all the dependency hashes to package a go project. but here is the catch:

using nix for go deployment we have to build a Gopkg.lock -> deps.nix manually, each time the project dependencies change. manually means lots of effort for the maintainer of the package.

here is the difference to yarn:

that said, using the NAR file specification along with the store concept and the hash implementation we have with nix, we pretty much already have all the tools dep would need to generate a hash. should we now use that c++ code and create a go library from it which can be used by dep?

libraries needed for building the hash (implemented as go library using cgo):

qknight commented 6 years ago

@sdboyer do you have any license concerns using nix related code as a dependency?

sdboyer commented 6 years ago

Finally got this all wrapped up!

@qknight i appreciate the offer (and my apologies for not responding sooner), but we had a planned roadmap for all of this already, and have had a CRLF-normalizing hasher in place since last fall. That's what i utilized when implementing #1912.