slsa-framework / slsa-verifier

Verify provenance from SLSA compliant builders
Apache License 2.0

Support directory hashes #730

Open laurentsimon opened 6 months ago

laurentsimon commented 6 months ago

As part of the effort to bring SLSA to ML (https://github.com/google/model-transparency), we need to be able to sign directories. This requires defining a new "hash", i.e. a way to serialize a directory. We have a PoC for this in the repo linked above and need to implement it in slsa-verifier.

laurentsimon commented 6 months ago

/cc @mihaimaruseac @ramonpetgrave64

laurentsimon commented 6 months ago

@smeiklej

netomi commented 5 months ago

jsonnet-bundler has a small utility method to generate the hash of a directory which might be useful here as well: https://github.com/jsonnet-bundler/jsonnet-bundler/blob/master/pkg/packages.go#L351

laurentsimon commented 5 months ago

This code is not safe from a cryptographic-hash point of view: for example, you can rename files to change their meaning without affecting the hash. The hash we have in the model repo also handles parallel hashing using a tree. See the comments in https://github.com/google/model-transparency/issues/49

laurentsimon commented 5 months ago

An even greater problem with the hash is that it lacks delimiters between files. So the following two directories produce the same hash:

F1: "hello", F2: "world"

F1: "hell", F2: "oworld"
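
A minimal sketch of this collision, in Python. The `naive_hash` function is a hypothetical stand-in for any scheme that feeds file contents into one hash with no framing (it is not the jsonnet-bundler code itself). `framed_hash` shows one possible fix, assumed here purely for illustration: length-prefix every file name and every file's contents.

```python
import hashlib


def naive_hash(files: dict) -> str:
    """Hash file contents back to back, with no delimiters (hypothetical)."""
    h = hashlib.sha256()
    for name in sorted(files):
        h.update(files[name])
    return h.hexdigest()


def framed_hash(files: dict) -> str:
    """Length-prefix each name and each content so boundaries are unambiguous."""
    h = hashlib.sha256()
    for name in sorted(files):
        for field in (name.encode("utf-8"), files[name]):
            h.update(len(field).to_bytes(8, "big"))  # 8-byte big-endian length
            h.update(field)
    return h.hexdigest()


dir_a = {"F1": b"hello", "F2": b"world"}
dir_b = {"F1": b"hell", "F2": b"oworld"}

# Both directories concatenate to b"helloworld", so the naive hashes collide.
print(naive_hash(dir_a) == naive_hash(dir_b))
# With framing, the boundary between F1 and F2 is part of the hashed input.
print(framed_hash(dir_a) == framed_hash(dir_b))
```

Because `framed_hash` also hashes the file names, renaming a file changes the result as well.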

netomi commented 5 months ago

OK, I did not realize that the directory hash should also take that into account.

Maybe tree hashes as calculated by git would be useful. Here is a test I performed: I created files with the same content but different filenames in different directories and looked at how git calculates the hashes.

If the filename is the same, the tree hash is the same; if the filename differs, the tree hash differs as well.

tn@proteus:~/workspace/eclipse/EclipseFdn/tmp$ git ls-tree HEAD
040000 tree 1e6dbf97adb05c42dcb537cd717e368812dc23b5    test
040000 tree 844053933521d6c52f2f96e288dc9175a2e6aea0    test2
040000 tree 1e6dbf97adb05c42dcb537cd717e368812dc23b5    test3

tn@proteus:~/workspace/eclipse/EclipseFdn/tmp$ git ls-tree -r HEAD
100644 blob 557db03de997c86a4a028e1ebd3a1ceb225be238    test/test.txt
100644 blob 557db03de997c86a4a028e1ebd3a1ceb225be238    test2/test2.txt
100644 blob 557db03de997c86a4a028e1ebd3a1ceb225be238    test3/test.txt

mihaimaruseac commented 5 months ago

This could work, but it requires a .git directory to exist and ties us to git's hashing algorithm.

netomi commented 5 months ago

Sorry for the misunderstanding: I did not intend to suggest using git itself, but rather its mechanism for generating tree hashes.
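
To make that mechanism concrete, here is a rough Python sketch of git-style blob and tree hashing that needs no .git directory. It works on an in-memory nested dict (a hypothetical representation: `bytes` values are files, `dict` values are subdirectories), treats every file as mode 100644, and ignores symlinks and executable bits, so it is an illustration rather than a drop-in replacement for git.

```python
import hashlib


def git_object_id(kind: str, body: bytes) -> bytes:
    """SHA-1 over the git object header ("<kind> <size>\\0") plus the body."""
    header = f"{kind} {len(body)}".encode() + b"\x00"
    return hashlib.sha1(header + body).digest()


def tree_id(tree: dict) -> bytes:
    """Hash a nested dict the way git hashes a tree object."""
    entries = []
    for name, node in tree.items():
        if isinstance(node, dict):
            # git orders directory names as if they ended with "/".
            entries.append((name + "/", b"40000", name, tree_id(node)))
        else:
            entries.append((name, b"100644", name, git_object_id("blob", node)))
    body = b""
    for _, mode, name, oid in sorted(entries, key=lambda e: e[0].encode()):
        # Each entry: "<mode> <name>\0" followed by the raw 20-byte object id.
        body += mode + b" " + name.encode() + b"\x00" + oid
    return git_object_id("tree", body)


same_a = tree_id({"test.txt": b"some content\n"})
same_b = tree_id({"test.txt": b"some content\n"})
renamed = tree_id({"test2.txt": b"some content\n"})
print(same_a == same_b)   # same name and content
print(same_a == renamed)  # same content, different name
```

The file name is part of each tree entry, which is why a rename changes the tree hash even though the blob hash stays the same. The sketch uses SHA-1 only to mirror git's default object format; the concern raised above about being tied to git's hashing algorithm still applies, and the same layout could be fed to another hash function.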

mihaimaruseac commented 5 months ago

Oh, fair point. Thanks for the clarification.

ramonpetgrave64 commented 5 months ago

Just adding to the conversation:

Merkle trees seem like they could be a good way to hash directories, and someone has tried this in Go.
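
A rough sketch of the idea in Python, not the implementation referred to above: each file becomes a leaf record (length-prefixed path plus content digest, a layout assumed here for illustration), and the leaves are combined into a binary Merkle tree. The leaf/node prefixes and the split rule follow the Merkle tree hash construction from RFC 6962 (Certificate Transparency); all function names are hypothetical.

```python
import hashlib


def leaf_hash(data: bytes) -> bytes:
    # The 0x00 prefix keeps leaf hashes distinct from interior node hashes.
    return hashlib.sha256(b"\x00" + data).digest()


def node_hash(left: bytes, right: bytes) -> bytes:
    # The 0x01 prefix marks an interior node.
    return hashlib.sha256(b"\x01" + left + right).digest()


def merkle_root(leaves: list) -> bytes:
    n = len(leaves)
    if n == 0:
        return hashlib.sha256(b"").digest()
    if n == 1:
        return leaf_hash(leaves[0])
    k = 1
    while k * 2 < n:  # largest power of two strictly less than n
        k *= 2
    # The two halves are independent, so they could be hashed in parallel.
    return node_hash(merkle_root(leaves[:k]), merkle_root(leaves[k:]))


def file_record(path: str, content: bytes) -> bytes:
    """Leaf payload: length-prefixed relative path plus the content digest."""
    encoded = path.encode("utf-8")
    return len(encoded).to_bytes(8, "big") + encoded + hashlib.sha256(content).digest()


def directory_root(files: dict) -> bytes:
    """Root hash over {relative path: content}, in sorted path order."""
    return merkle_root([file_record(p, files[p]) for p in sorted(files)])


print(directory_root({"F1": b"hello", "F2": b"world"}).hex())
print(directory_root({"F1": b"hell", "F2": b"oworld"}).hex())
```

The two example directories from earlier in the thread get different roots here, because each file is hashed as its own leaf and its path is part of the leaf record.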

Re: your comments, I think we could add an optional CLI switch to slsa-verifier, such as --enforce-subject-name-and-path; then, if slsa-github-generator doesn't already do so, it could put the relative paths in subject.name.

mihaimaruseac commented 5 months ago

Thank you! We're now also experimenting with a manifest file instead of a hash of everything, but this probably won't work for SLSA (https://github.com/google/model-transparency/issues/111). Let's continue experimenting.

laurentsimon commented 5 months ago

SLSA will replace the manifest format with a provenance format; the rest can probably remain the same.