Closed - sdboyer closed this issue 6 years ago.
I've tackled some verification of binaries in the past for a tool I built using GPG and SHA256: https://github.com/thoughtbot/ftree/tree/master/dist#verifying
@calebthompson Thanks for the link! I'll have a look through your stuff as soon as I can. Quickly, though - our goal here is verifying a source tree, not the compiled binary (maybe that matters, maybe it doesn't). I suspect we also have some thinking to do about what we're verifying against.
Definitely. We could pipe the entire tree into either tool to generate the signatures.
$ git ls-files **/*go | xargs cat | gpg --detach-sign --armor -o tree.gpg
$ gpg --verify tree.gpg **/*.go
gpg: Signature made Tue Jan 24 10:00:37 2017 MST using RSA key ID A0ACE70A
gpg: Good signature from "Caleb Thompson <caleb@calebthompson.io>" [ultimate]
gpg: aka "Caleb Thompson <caleb@heroku.com>" [ultimate]
$ echo "foo" > foo.go
$ gpg --verify tree.gpg **/*.go
gpg: Signature made Tue Jan 24 10:00:37 2017 MST using RSA key ID A0ACE70A
gpg: BAD signature from "Caleb Thompson <caleb@calebthompson.io>" [ultimate]
Performance: without a verification mechanism, the only safe choice dep can make is to drop and recreate vendor in its entirety on every operation. This is wasteful and slow; any operation that can avoid unnecessary disk writes by determining whether vendor actually needs updating will be faster, which would be a huge boon.

Security: without a verification mechanism, we can't know that the code we're building with is the code we expect it to be. We haven't articulated a security model yet, but this is obviously a significant part of it.
Some time ago the folks at Gentoo had to tackle exactly the same problem for their package manager, Portage; they ended up generating a digest for the contents of every first-level directory of their package manager's tree (which is file-based and quite huge).
The only path forward I've come up with so far is to compute and record hashes for individual files contained in dependencies (perhaps before stripping, perhaps after).
That is completely unnecessary. golang/dep can use the same solution as Gentoo's Portage by checksumming the vendor/ directory itself; yet hashing a directory's data poses non-trivial challenges. First things first: to guarantee the validity of the hash, the order of the directory's data must be predictable across all the operating systems supported by golang/dep. For example, on *BSD systems find -s vendor/ -type f -exec md5sum {} \; | md5sum ¹ is fine because find -s guarantees the order of the directory listings, but on Linux and Windows you need a workaround in the form of find vendor/ -type f -exec md5sum {} \; | sort -k 2 | md5sum to obtain the same result: Linux and Windows don't guarantee that file systems maintain directory listings in a stable, predictable order.

Hence, listing the directory's data is a naive approach, and prone to errors; a better solution would be to hash the directory's data and its metadata together (stored as an archive): find vendor/ | cpio -o | md5sum. This way even changes to file permissions and ownership are detected, if this behaviour is required by golang/dep.

Yet even considering data+metadata doesn't solve the problem that content ordering differs from one file system to another, even on the same OS. Therefore, listing the directory's data+metadata through external tools can't produce consistent hashes on different OSes, leading to undefined behaviour and false positives.
The only viable solution (which is the same one used by Gentoo's Portage and Git's tree-ish IDs, btw) is doing the content sorting directly within golang/dep: filter the content in a predictable order (such as alphabetical order, or another reliable heuristic), store it in a temporary archive, and then checksum that. It's far more efficient than traversing and recording hashes for the individual files contained in vendor/, not to mention that you can safely ignore #120 because you have control over how you traverse the vendor/ directory.

The Go standard library already offers everything that is required, from file-system abstraction to archive generation, in a cross-platform way, thus guaranteeing the uniqueness of the hashes across different platforms. As a final note, this solution would be marginally susceptible to length-extension attacks, depending on the hash function used.
¹ md5sum is used only as an example for brevity; any hash function should work here.
This is great info and context - thanks. Some quick thoughts:
Some time ago the folks at Gentoo had to tackle the same exact problem
I'm not sure it's exactly the same. This process is occurring on a user's system, probably when a lock.json is generated. The first sentence of that link reads:
MetaManifest provides a means of verifiable distribution from Gentoo Infrastructure to a user system
So, that's central-to-user; we have a user-to-user problem. That may not end up mattering, if only because there's no uniform, reliable way to verify a trusted upstream source. I'm just noting it while I absorb this information.
The only path forward I've come up with so far is to compute and record hashes for individual files contained in dependencies (perhaps before stripping, perhaps after).
That is completely unnecessary.
I think you may have missed the original reasoning for this: if we are to admit the possibility of vendor pruning (#120), and especially multiple/customizable types of pruning, then it would be valid to have a vendor/ directory with only a subset of the original files present in the upstream source. (I see you've commented over there, though, so this is definitely not news.) Maybe that's OK - maybe changing any of those configuration options for what gets pruned necessarily changes the lock with the results of hashing the new vendored set.
But, there also may have to be additional files in vendor to allow for processing/code generation type tasks. That's much harder to integrate.
The only viable solution (which is the same used by Gentoo's Portage and Git's tree-ish ID, btw) is doing content sorting directly within golang/dep
The info about the issues with command-driven tooling is useful, but don't worry - we were never considering doing this any way other than directly within the tool, using the facilities provided in stdlib.
Security: without a verification mechanism, we can't know that the code we're building with is the code we expect it should be. We haven't articulated a security model yet, but this is obviously a significant part of it.
On the security front: git, hg, bzr, and svn all rely on the broken SHA-1 for basic integrity checks. (Even when git was first released, they made it explicitly clear that their security model is not based on the internal use of SHA-1.)
I wonder if we need to tackle security at all? We inherit security from the underlying VCS. We inherit both vulnerabilities and future improvements this way. But more specifically, if security is paramount, users should commit vendored code into their own repo. In this situation the only two possibilities for the code to be compromised are during the initial clone or in the user's repo. If the initial clone is suspect, then no security of ours can detect that. If the user's repo was compromised, then the lock file could be changed as well. In other words, what is the use case of layering extra security on top of the VCS?
@mikijov Security is its own issue, and yeah, we do have to seriously address it. However, this issue is about verification - related to, but independent of, security. Without this verification mechanism, it's much more difficult for us to check that the contents of a vendor/ directory are what we expect them to be (according to the lock file). Without the ability to check that, we have to fully regenerate vendor/ on every dep ensure, which is very wasteful. There are other implications further down the line, too.
@karrick was working on this during the Gophercon hack day, and I think we have an algorithm together to do the hashing: https://github.com/karrick/dep/commit/9449c598c5fe5c4e50085ee699b16d065c6ad6a5
@sdboyer,
While working on the algorithm to recursively hash the file system contents rooted at a specified file system object, I came across a possible gotcha when we follow symbolic links, which made me suspect that the reason filepath.Walk declines to follow symbolic links is to avoid the infinite loops that cyclic symbolic links can cause.
The naive solution to preventing an infinite loop is quite simple, but does not scale.
We could punt on this issue for now, and pretend that there are no symbolic links that would cause an infinite loop, and we can continue on the solution we discussed at GopherCon, but I wanted you to be made aware of it.
Or if you have any suggestions, please let me know.
Maybe you have a suggestion for quickly detecting whether the file system node has been visited already.
My initial idea was to use a map and perform a quick key check, even though this method involves a constantly growing map on the heap. However, in order to insulate callers from the underlying OS-specific data structures, I have found no way to grab the underlying fileStat (not FileInfo) information returned by the runtime in order to create a hash key from the node. Therefore the only way I can think of to correctly determine whether the current node has been visited is to use the os.SameFile function with the current FileInfo and a list of FileInfo instances already visited.
package fs

import (
	"crypto/sha256"
	"fmt"
	"io"
	"os"
	"path/filepath"
	"sort"
	"strconv"

	"github.com/pkg/errors"
)

var pathSeparator = string(os.PathSeparator)

// HashFromNode returns a deterministic hash of the file system node specified
// by pathname, performing a breadth-first traversal of directories, while
// ignoring any directory child named "vendor".
//
// While filepath.Walk could have been used, that standard library function
// skips symbolic links, and for now, it's a design requirement for this
// function to follow symbolic links.
//
// WARNING: This function loops indefinitely when it encounters a loop caused
// by poorly created symbolic links.
func HashFromNode(pathname string) (hash string, err error) {
	// bool argument: whether or not to prevent file system loops due to
	// symbolic links
	return hashFromNode(pathname, false)
}

func hashFromNode(pathname string, preventLoops bool) (hash string, err error) {
	h := sha256.New()
	var fileInfos []os.FileInfo

	// Initialize a work queue with the os-agnostic cleaned up pathname.
	pathnameQueue := []string{filepath.Clean(pathname)}

	for len(pathnameQueue) > 0 {
		// NOTE: pop a pathname from the queue
		pathname, pathnameQueue = pathnameQueue[0], pathnameQueue[1:]

		fi, er := os.Stat(pathname)
		if er != nil {
			err = errors.Wrap(er, "cannot stat")
			return
		}

		fh, er := os.Open(pathname)
		if er != nil {
			err = errors.Wrap(er, "cannot open")
			return
		}

		// NOTE: Optionally disable the checks that prevent infinite recursion
		// when a symbolic link causes a loop, because this method does not
		// scale.
		if preventLoops {
			// Have we visited this node already?
			for _, seen := range fileInfos {
				if os.SameFile(fi, seen) {
					goto skipNode
				}
			}
			fileInfos = append(fileInfos, fi)
		}

		// NOTE: Write pathname to the hash, because the hash ought to be as
		// much a function of the names of the files and directories as of
		// their contents. An added benefit is that empty directories affect
		// the final hash value.
		//
		// Ignore return values from writing to the hash, because a hash
		// write always returns a nil error.
		_, _ = h.Write([]byte(pathname))

		if fi.IsDir() {
			childrenNames, er := fh.Readdirnames(0) // 0: read names of all children
			if er != nil {
				err = errors.Wrap(er, "cannot read directory")
				return
			}
			// NOTE: Sort children names to ensure deterministic ordering of
			// the contents of each directory, so the hash remains the same
			// even if the operating system returns the same values in a
			// different order on a subsequent invocation.
			sort.Strings(childrenNames)
			for _, childName := range childrenNames {
				switch childName {
				case ".", "..", "vendor":
					// skip
				default:
					pathnameQueue = append(pathnameQueue, pathname+pathSeparator+childName)
				}
			}
		} else {
			// NOTE: Format the file size as a base 10 integer, and ignore
			// return values from writing to the hash, because a hash write
			// always returns a nil error.
			_, _ = h.Write([]byte(strconv.FormatInt(fi.Size(), 10)))
			_, er = io.Copy(h, fh)
			err = errors.Wrap(er, "cannot read file") // errors.Wrap only wraps non-nil, so elide checking here
		}

	skipNode:
		// NOTE: Close the file handle to the open directory or file.
		if er = fh.Close(); err == nil {
			err = errors.Wrap(er, "cannot close")
		}
		if err != nil {
			return // early termination on error
		}
	}

	hash = fmt.Sprintf("%x", h.Sum(nil))
	return
}
@karrick i've actually spent way, way too much time thinking about the roiling hellscape that is symlinks. i've spent some time investigating extant symlink cycle detection algorithms, and they...suck. just, categorically, suck. turns out, graphs are hard.
however, for once, i think i have a happy answer to a symlink-related question!
basically: the hasher should never, ever traverse symlinks. if it encounters one, then it should instead hash the contents/address of the link itself (basically, the output of os.Readlink()). this is essentially the behavior of pkgtree.ListPackages() right now, which matters because that function is the one that creates a virtual representation of the filesystem that, in turn, is used to decide what things actually exist.
i believe this is sufficient for all cases because of some key invariants (it is not a coincidence that these invariants mirror those of snapshot-based version control systems).

i'm not entirely sure of this approach (i never am with symlinks), so lemme walk it through. symlinks can vary along a number of independent dimensions that are relevant for our purposes - for example, whether their targets escape the project tree (via .. or an absolute path).

as long as we do not embed a guarantee in the hashing process that symlinks reaching outside of a project tree must exist/be valid, then we're fine to simply record the target of the symlink. if the symlink escapes the tree (which, if we want to defend against it, is something we should enforce elsewhere), then we're making no correctness guarantee anyway. if it does not, then the structure of the hashing algorithm guarantees that the bits contained in/under the referent will be reached through normal, non-symlink means. given that, there's no added value in hashing the bits again; it's sufficient to simply record that a link exists.
the only real question in my mind with this is how we record the fact - in a cross-platform way - that a given file is a symlink in the hashing inputs themselves, so that the hasher disambiguates between a normal file containing an address and a symlink with that same address.
@sdboyer, thanks for the feedback.
I spent a bit of time today looking through the surrounding dep code for where I'd hook this in, but I did not run across it yet. But I did see that other parts of dep were ignoring symlinks, and was curious about the necessity of following them in the hashing function if we ignore them elsewhere.
I'll update my fork tomorrow based on your feedback, and send a PR your way for the hashing code.
sounds great!
I spent a bit of time today looking at the surrounding dep code looking for where I'd hook this into, but I did not run across it yet.
yeah, i hadn't previously carved out a place for it. i have some basic thoughts on where to fit it in:

first and simplest would be to just add a workhorse func to internal/gps/pkgtree: HashPackageTree(path string). i suspect that returning a []byte is probably best for future flexibility, but i could be convinced of a fixed-length array, too.

then, we'll want something like gps.VerifyDepTree(basedir string, l Lock) (map[ProjectRoot]MismatchType, error). the basedir would either be an actual vendor dir, or at least have the same semantics. MismatchType would basically be Go's anemic version of an enum:
type MismatchType uint8

const (
	NoMismatch MismatchType = iota
	EmptyDigestInLock
	DigestMismatchInLock
	NotInLock
	NotInTree
)
i feel like these signatures and names may be enough to suggest the desired behavior, but lmk if more detail would be helpful. (also, we need a better name than MismatchType, as that's too general for this special-purpose type)
eventually, i'd like to see this func called in a more flexible way from up in dep - e.g., we dispatch the hasher in the background, prior to running the solver. we should be able to get some parallelism benefits there; both exert CPU pressure, but to the extent they are I/O-bound, the hashing code is disk-intensive, whereas the solver is often more network-intensive.

but, let's have a first PR just focus on these two functions, and the concomitant additions to gps.LockedProject (it needs the new hash digest). once we're happy with those, we can actually start making use of the verification mechanism during normal operations.
Someone feel free to correct me if I'm wrong, but IIRC one of the reasons git prefixes a file's contents with its size when creating a hash was that SHA-1 was chosen for its speed and "good enough" quality, despite not being a very long hash. However, in dep we're free to use SHA-256, which is both fast and long, and it seems a bit cargo-cultish to include file sizes when creating a hash of their contents.

This comment doesn't really change the algorithm much, but I thought I'd point that out. We can always remove the file size from the hash in the future, and in the worst case people will re-download their dependencies one more time.
because SHA-1 was chosen because of its speed and "good enough" quality, despite not being a very long hash. However, in dep we're free to use SHA-256, which is both fast and long, and it seems a bit cargo-cultish to include file sizes when creating a hash for its contents.
i'm very much neither a cryptographer nor a cryptanalyst, so i should start by categorically admitting that what's going on here is at least somewhat, if not entirely, cargo culting. and, on re-reading Linus' response to the SHA1 collision earlier this year, it seems like the length prefixing strategy may be less helpful in our case, as we do not have the equivalent of a blob object with headers (which include length) to generate, and then incorporate back up through the tree object, and into the commit object & the Merkle DAG.
then again, that might not be that relevant; if the attacker has some feasibly cheap attack on the hashing algorithm, then repeating it a few times for each of those levels probably isn't prohibitive.
i think the real question here is whether the necessity of generating correlated data could thwart a particular attack. that is, if the attacker has found some weakness in SHA256 such that they can write a data generation algorithm with a better-than-random chance of producing a targeted digest, then would forcing that algorithm to also consider the length of the data it is generating at each step undermine its effectiveness? basically, we're reducing degrees of freedom.
again, i'm not a cryptanalyst, and i've no idea what algorithms like these look like. but, even with so many more bits in SHA256 than SHA1, it seems like at least some attacks would probably be mitigated by this. so, given how easy it is for us to add the length prefix, and that there's no (?) downside for us in doing so, i'm still inclined to include it. (but, still very open to being convinced it's pointless)
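As a toy illustration of the degrees-of-freedom point (this is not dep's actual scheme): without length prefixes, two differently-split inputs feed identical byte streams into one hasher and therefore collide; prefixing each part with its length removes that ambiguity.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"strconv"
)

// digest hashes a sequence of parts, optionally prefixing each part with
// its decimal length. Without the prefix, only the concatenation matters.
func digest(parts []string, withLen bool) string {
	h := sha256.New()
	for _, p := range parts {
		if withLen {
			h.Write([]byte(strconv.Itoa(len(p))))
			h.Write([]byte("\n"))
		}
		h.Write([]byte(p))
	}
	return fmt.Sprintf("%x", h.Sum(nil))
}

func main() {
	a := []string{"ab", "c"}
	b := []string{"a", "bc"}
	fmt.Println(digest(a, false) == digest(b, false)) // true: streams are both "abc"
	fmt.Println(digest(a, true) == digest(b, true))   // false: lengths disambiguate
}
```

The same reasoning applies to file boundaries in a tree hash: length prefixes pin down where one file's bytes end and the next begin.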
We can always remove the file size from the hash in the future, and worst case people will re-download their dependencies another time.
while it's definitely true that changing the algorithm would not have the same catastrophic effects as it would in git - again, we aren't building Merkle trees - i still don't think we should look at this as a road easily taken. there'll still be the body of committed Gopkg.lock files with the old data, and we'll have to make the tool interoperate with them. that's a non-zero cost. so, i'd like to be as sure as we can reasonably be that we're happy with our implementation right out of the gate.
oh, interesting - again, SUPER out of my depth here, but perhaps it's even more important to do this, as including the length prefix may (???) thwart length extension attacks, and i believe that at least crypto/sha256 is based on SHA2/FIPS 180-2 which is known to be vulnerable to such attacks. (golang.org/x/crypto/sha3 does have new algos without such vulnerabilities, though)
ohhh - another, more pedestrian question - the impl you pasted in above appears to feed everything into one giant SHA256 hash writer for all the files in a tree. does that not mean that, while feeding the hasher, the algorithm will need to hold open memory equal to the size of all files in the tree? i'm now picturing us OOMing on, say, k8s.
if so, could we avoid that by mimicking git a bit further - hash each file individually, then build a tree of the digests + paths that, in turn, we hash for the final digest value?
@sdboyer, excellent question about the one giant hash instance. It turns out that Go hash implementations do not store every byte ever written to them. They have a few internal buffers used to feed bytes to the hash in fixed-size chunks; SHA-256, for example, has a 64-byte chunking buffer and a 32-byte state buffer, which is continuously updated as bytes are written to the hash.
In other words it is perfectly fine to instantiate a single hash, keep writing bytes, and to take the final sum when complete. For our particular use, I did not see the purpose of creating new hashes for every file system node, just to hash the results of their hashes into some master hash.
@sdboyer, when I answered your previous question, I had meant to throw a new question back your way.
I understand your description of the proposed HashPackageTree function, along with the MismatchType enum-like construct. However, this feature involves creating a check that ensures whatever library is written in the vendor/ directory matches what is in the Gopkg.lock file, by checking that each library in vendor/ has a directory hash that matches some pre-computed value.

I think it is a safe assumption that libraries are the list of directories that are three levels down from the original vendor/ directory. Here's an example from the dep project to illustrate the assumption I'm going to make.

The following directories would be considered libraries in the vendor directory, and a hash of their contents would be calculated and compared to what we will soon store in the Gopkg.lock file:
vendor/github.com/armon/go-radix
vendor/github.com/go-yaml/yaml
vendor/github.com/Masterminds/semver
vendor/github.com/Masterminds/vcs
vendor/github.com/pelletier/go-buffruneio
vendor/github.com/pelletier/go-toml
vendor/github.com/pkg/errors
vendor/github.com/sdboyer/constext
The following directories are more than three levels down from the vendor/ directory, and would be ignored:
vendor/github.com/pelletier/go-toml/cmd
vendor/github.com/pelletier/go-toml/cmd/tomljson
vendor/github.com/pelletier/go-toml/cmd/tomll
If we cannot make this assumption, and code libraries could be at a different depth from the vendor/ directory, then I will have to rethink my current implementation plan. Let me know if you think this assumption is invalid.
@sdboyer, I agree with your point concerning reducing the degrees of freedom for a determined crypto attack. Thanks for bringing it up.
However, this feature involves creating a check that ensures whatever library is written in the vendor/ directory matches what is in the Gopkg.lock file, by checking to ensure each library in vendor/ has a directory hash that matches some pre-computed value.
yes - that's pretty much the flow i'm imagining for us :) is there an alternative?
I think it is a safe assumption that libraries are the list of directories that are three levels down from the original vendor/ directory.

if only it were!

launchpad, which supports two-level roots (e.g. launchpad.net/gocheck), is an obvious exception, but the actual ruleset makes no guarantee of three levels.
nevertheless, i don't think you need to rethink the implementation much. while we can't know a priori how deep the root is along any given path, the root location is knowledge the rest of the system already has - it's represented in the gps.Lock, via Lock.Projects(). the HashPackageTree(path string) func i described will need to be called for each of those roots.

the only slightly tricky bit would be the NotInLock case - when there are directories in vendor that are not equal to, or logical children of, any of the roots given in the locked project list. still, that shouldn't be too bad - gps already makes liberal use of github.com/armon/go-radix for tasks like these, and isPathPrefixOrEqual() should get you most of the rest of the way. (it's also possible that we may not actually need to hash such dirs at all.)
the only slightly tricky bit would be the NotInLock case

Exactly why I proposed the 3-layer down rule, but glad you responded with the launchpad.net counterexample so we could rule that out. Furthermore, if there are 2-layer down libraries, there equally might be 4-layer down libraries.
... is there an alternative?
I think I might have one. Walk the vendor/ tree with the project dependency list in hand. When a particular directory matches one of the dependencies declared in the list, then hash that particular directory as one normally would, to ensure its contents match what is stored in the Gopkg.lock file. Every other file system node will be put in a list of extras, and that extras list ought to be empty. dep ensure could either return an aptly worded error message or choose to remove the file system nodes in that list, to ensure nothing can pollute the project build.

While being able to logically group extra directories into libraries might be desired, there is no way to identify where the top level of a library begins without heuristics, such as looking for the presence of a README.md file, a LICENSE file, a CONTRIBUTING file, or maybe even one of the VCS directories.

Perhaps we consider using a hybrid approach where directories matching one of the dependencies in Gopkg.lock would be called out as dependencies, yet we are free to use heuristics for extra directories found in vendor/.
Furthermore, if there are 2-layer down libraries, there equally might be 4-layer down libraries.
Or six. Or ten. There are no guarantees here. go get HTML metadata, or infixed vcs extensions, allow this position to be arbitrary. The problems arising from this are one of the reasons why i'd like to see fixed-length roots for registries - https://github.com/golang/dep/issues/174#issuecomment-275589961
The problem is while walking the directory tree, identifying when the function comes upon a directory that is the top level of a library.
Ah, yes, so, this is one crucial bit.
dep makes the assumption that the repository root is the same as the project root (for dependencies - you can make things work a little differently if e.g. you want to manage a monorepo that's never imported). This equivalency relationship is defined in large part because it makes figuring out where project roots are vastly easier. We just have to figure out the repo root - which means relying on import path deduction in the same way go get does today.

That deduction behavior is encapsulated in SourceManager.DeduceProjectRoot(). That method is generally quite optimized (and all *SourceMgr method impls are threadsafe), and so is safe to call under normal circumstances.

...though I think that calling it on the leading elements of a path (e.g. ghe.somecompany.com/sdboyer but not yet ghe.somecompany.com/sdboyer/gps) could end up inducing an HTTP request to look for go get metadata, and then not cache the failure to retrieve it. The system wasn't optimized for iteratively walking down paths in this way.
In any case:
there is no way to identify where the top level of a library begins without heuristics
We don't have to rely on heuristics for this, because the decision is made entirely on the basis of import path structure. That makes classification much, much easier.
to ensure nothing can pollute the project build.
Another crucial bit, here - dep has the viewpoint that it wholly owns the vendor/ directory, and today dep ensure blows it away with great prejudice. While this verification logic will allow us to be more selective in modifying vendor/, we're not changing the basic assumption of complete ownership. In the land of dep, there is only one known good state for vendor/: exactly what is described in Gopkg.lock, nothing less, and nothing more.
Perhaps we consider using a hybrid approach where directories matching one of the dependencies in Gopkg.lock would be called out as a dependency, yet we are free to use heuristics for extra directories found in vendor/.
Because of the two "crucial bits" above - we have a clear system for deciding where roots are, and we aggressively and non-optionally trim detritus from vendor/ - I think that we can generally get away without heuristics.
Walk the vendor/ tree with the project dependency list in hand. When a particular directory matches one of the dependencies declared in the list, then hash that particular directory as one would normally do, to ensure its contents match what is stored in the Gopkg.lock file. Every other file system node will be put in a list of extras, and that extra list ought to be empty.
This also sounds generally fine, though it has the issue that you'll need to keep some kind of a trail - i.e., the dir github.com itself will need to be marked as an extra until dependencies beneath it are found that match entries in Gopkg.lock.
Off the top of my head, an approach i might take using a radix tree would be:

1. Grab Lock.Projects(), and for each of the root path entries in there, insert it into the tree with a marker indicating "known root" - say, 2 - inserting all the leading elements with a marker indicating "prefix of known root, but not a root" - say, 1
2. Walk down the dir tree under vendor/, checking for the longest prefix match against the radix tree:
i. If it's a 2, we're on a root - stop walking down and call the hasher
ii. If it's a 1 and the current path == prefix, add any local non-directory files to the extra list, then keep walking
iii. If it's a 1 and the current path is longer than the prefix, stop walking and record the whole dir path as an extra
iv. If there's no match at all, stop walking and record the path as an extra

Unless i've missed something, this approach would obviate the need for doing root deduction at all. It does, however, mean that some context may be lost if we were to choose to show any messages - e.g., if the last project from bitbucket.org is being removed, then messages will only reflect that bitbucket.org is being removed, not the actual project. I don't think that's a big problem, though.
@sdboyer,
I've got some interesting preliminary benchmarks from a new custom directory walking function.
First we establish a baseline using the find utility.
[karrick@linux walk-stdlib]$ time find ~ >/dev/null
real 0m17.218s
user 0m2.437s
sys 0m5.279s
Then we walk the directory tree using the standard library filepath.Walk function.
[karrick@linux walk-stdlib]$ cd ..
[karrick@linux examples]$ time walk-stdlib/walk-stdlib ~ >/dev/null
real 0m20.882s
user 0m7.407s
sys 0m13.117s
Then we walk the directory tree using a new faster walk function.
[karrick@linux examples]$ time walk-fast/walk-fast ~ >/dev/null
real 0m10.332s
user 0m10.557s
sys 0m2.827s
I know some of that is due to virtual file system caching, but it's at least a promising start.
After many more benchmarks, the find utility finishes the same directory structure in less than 4 seconds. So what I posted above was not accurate, because the OS caches were not primed.

However, the fast walking example consistently performs at 2x the speed of the version that uses the standard library filepath.Walk function.
The modified code works well on Linux and Mac, but not on Windows, because of the different syscalls available, as expected. I think godirwalk ought to have two different code paths behind the appropriate build tags, so it naturally runs on different operating systems.
I have extracted the directory walking logic to https://github.com/karrick/godirwalk, which now works on unix and Windows. Incidentally, it even corrects a subtle bug in the standard library on Windows: there, a symlink to a directory has multiple type mode bits set. The logic in filepath.Walk fails to detect the situation and attempts to follow those symlinked directories. By reversing the bit checks, godirwalk correctly detects a symlinked directory and either ignores it or follows it, depending on how the walk function is invoked. I have updated my local copy of dep to use godirwalk.Walk, and all tests still pass. Weren't there a few tests that had to be deactivated on Windows?
sounds like a good time for a PR to switch out our internal impl and use godirwalk for the hasher, then!
Just wondering what the state of this issue is - it seems like @karrick came up with a cross-platform method of creating a hash of each vendored dependency. What steps have to be taken next to push resolution of this important issue forward?
@sdboyer in NixOS the exact problem of generating a hash of a git repository is already solved. I've added some notes here: https://github.com/nixcloud/dep2nix#golang-dep-git-repo-hash-extension
I love the dep effort, and using dep2nix it is really easy now to finally nail down all the dependency hashes to package a Go project. But here is the catch: using nix for Go deployment, we have to build the Gopkg.lock -> deps.nix conversion manually each time the project dependencies change, and "manually" means lots of effort for the maintainer of the package.
Here is the difference from yarn:
- yarn.lock adds a sha256 hash per dependency (tar.bz2), so we can generate a yarn.nix file on the fly
- Gopkg.lock lacks such a hash of the git repo, as there is no way to serialize a git repository cross-platform
That said, using the NAR file specification along with the store concept and the hash implementation we have with nix, we pretty much already have all the tools dep would need to generate a hash. Should we now take that C++ code and create a Go library from it which can be used by dep?
Libraries needed for building the hash (implemented as a Go library using cgo): libnixutil and libnixformat, see https://lastlog.de/blog/posts/media/nix-instantiate-dependencies.jpg
@sdboyer do you have any license concerns about using nix-related code as a dependency?
Finally got this all wrapped up!
@qknight i appreciate the offer (and my apologies for not responding sooner), but we had a planned roadmap for all of this already, and have had a CRLF-normalizing hasher in place since last fall. That's what i utilized when implementing #1912.
dep needs a way of verifying that the contents of vendor/ are what's expected. This serves three purposes - performance, security, and accurate status reporting: in particular, dep status is very limited in what it can report about vendor/ if it can't verify that vendor is actually what's expected.
The simple, obvious solution is to deterministically hash (SHA256, I imagine) the file trees of dependencies, then store the resulting digest in the lock. However, the need to perform vendor stripping (#120) complicates this, as it creates perfectly valid situations in which the code under vendor/ might differ from the upstream source. The more choices around stripping modes we create, the more complex and error-prone this challenge becomes. Or, we might simply rely on the hashing performed by the underlying VCS - indeed, we already record "revisions" in the lock, so we're already sorta doing this. Relying on it is triply disqualified, though.
The only path forward I've come up with so far is to compute and record hashes for individual files contained in dependencies (perhaps before stripping, perhaps after). This would allow us to be more adaptive about assessing package integrity - e.g., we can know that it's OK if certain files are missing.
The downside is that these lists of file hash digests would be potentially huge (e.g., if you import k8s), though that might be mitigated by placing them under vendor/ rather than directly in the lock - sdboyer/gps#69. Also, without the larger security model in place, I don't know if disaggregating hashing to the individual file level might compromise some broader notion of integrity that will prove important.
We might also record just the project-unit-level hash, against some future day when we've reinvented GOPATH and are no longer relying on vendor.