google / deps.dev

Resources for the deps.dev API
https://deps.dev
Apache License 2.0

Request gitoid:sha1 and gitoid:sha256 as hash types for query api #14

Open edwarnicke opened 1 year ago

edwarnicke commented 1 year ago

There's a standing problem with folks using scanners to try to determine SBOMs. They produce lots of false positives. False positives lead to lots of wasted effort trying to rule out CVEs from those false positives. OmniBOR would allow capturing the precise artifact dependency graph from source files up. What would be needed to convert this into an SBOM would be the ability to map the hash of the 'leaves' (source files) to (component name, version, supplier) tuples. Naturally, deps.dev's query API looks like a great solution to that. OmniBOR uses gitoids as identifiers, because the most interesting artifacts in the artifact dependency graph, the leaf source code files, are typically stored in git and indexed by gitoid.

Currently https://docs.deps.dev/api/v3alpha/#query supports many different hash types. This is good :)

It would be very useful if it could support Git Object IDs (gitoids) as a hash type.

Today git supports two kinds of gitoids: gitoid:sha1 and gitoid:sha256. gitoid:sha256 was recently introduced, with the option per repo to use it. As of yet it has seen little use. Therefore it's important to support both gitoid:sha1 and gitoid:sha256.

gitoids for blobs are easy to compute. You simply prepend the 'git object header' to the file contents and compute the hash (either sha1 or sha256) over the result. A 'git object header' for a blob is 'blob␣${size}\0', where '␣' represents the space character (0x20) and '\0' represents the null byte (0x00). ${size} is the number of bytes in ${content}, written as a base-10 string.

Simple golang gitoid computation code can be found here for reference. Further checks can be done using the git hash-object command.
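The blob computation described above can be sketched in a few lines of Python. This is only an illustration, not the referenced Go code; the function name `gitoid_blob` is made up for this example, and it covers blobs only.

```python
import hashlib


def gitoid_blob(content: bytes, algorithm: str = "sha1") -> str:
    """Return the hex gitoid of `content` treated as a git blob.

    `gitoid_blob` is a hypothetical helper name, not part of any library.
    """
    if algorithm not in ("sha1", "sha256"):
        raise ValueError("algorithm must be 'sha1' or 'sha256'")
    # Git object header for a blob: "blob", a space, the size in bytes
    # as a base-10 string, then a single null byte.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    # The gitoid is the hash over header + content.
    return hashlib.new(algorithm, header + content).hexdigest()


if __name__ == "__main__":
    # Same input as: echo 'hello world' | git hash-object --stdin
    print(gitoid_blob(b"hello world\n"))
    # prints 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
```

The sha1 result can be cross-checked with `git hash-object` on the same bytes; for sha256, the same check should work in a repository initialized with the sha256 object format.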

edwarnicke commented 1 year ago

Oh... also... there is a URI scheme for gitoids if that proves helpful.

sschuberth commented 1 year ago

As a side note, Software Heritage's SWHIDs also are basically just gitoids, and having compatibility here would be great.

sarnesjo commented 1 year ago

Hi @edwarnicke! If I've understood your feature request correctly, you want to query by a hash of a source code file and get a list of matching package versions. If so, I agree that would be neat, but it's not something we can currently support (regardless of whether the hash is expressed as in the current Query endpoint or using a gitoid). For the most part, we don't have a reliable link between a package version and the repo commit it was built from. The exceptions are Go (where the repo is the distribution format) and the small-but-growing number of npm package versions for which SLSA provenance attestations are available. We are working on expanding our support for that, however, so hopefully this will eventually be a feature we could support.

sschuberth commented 1 year ago

> you want to query by a hash of a source code file and get a list of matching package versions. If so, I agree that would be neat, but it's not something we can currently support

The last sentence confused me, as the docs of https://docs.deps.dev/api/v3alpha/#query sounded as if that was already supported. But I guess the term "content hash" refers to something other than hashes of files (no matter by which algorithm). On the other hand, further docs say "hashes are matched against multiple artifacts that comprise package versions, and any given artifact may appear in many package versions", which again does sound as if "artifacts" were files.

@sarnesjo could you maybe clarify what the supported content / artifact hashes are? Like, in the case of maven, would it be the hash of the binary / source JAR?

sarnesjo commented 1 year ago

Right, that wording could be clearer; I'll update the docs. What it should say is "hashes are matched against multiple release artifacts that comprise package versions". The exact meaning of this varies from system to system. For Maven, yes, it's the various .jar (and .war, etc.) files uploaded to one of the Maven repositories that we track.
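To make the distinction concrete, here is a rough Python sketch of what querying by the hash of a release artifact (say, the bytes of a .jar) might look like. It assumes the Query endpoint lives at `https://api.deps.dev/v3alpha/query` and takes `hash.type` and a base64-encoded `hash.value` as query parameters; check those details against the API docs linked above. `build_query_url` is a hypothetical helper name, and the sketch only builds the URL without sending a request.

```python
import base64
import hashlib
from urllib.parse import urlencode

# Assumed base URL of the Query endpoint; verify against the API docs.
QUERY_ENDPOINT = "https://api.deps.dev/v3alpha/query"


def build_query_url(artifact_bytes: bytes) -> str:
    """Build a query URL for the plain SHA-256 of a release artifact.

    This hashes the artifact bytes directly (no git object header),
    which is different from a gitoid of the same bytes.
    """
    digest = hashlib.sha256(artifact_bytes).digest()
    params = {
        "hash.type": "SHA256",  # assumed parameter name and value
        "hash.value": base64.b64encode(digest).decode("ascii"),  # assumed base64
    }
    # urlencode percent-escapes the '+', '/' and '=' that base64 can produce.
    return QUERY_ENDPOINT + "?" + urlencode(params)


if __name__ == "__main__":
    # Placeholder bytes; in practice, read the artifact from disk.
    print(build_query_url(b"example artifact bytes"))
```

Note that the input here is the whole uploaded file, not an individual source file inside it, which is the gap this thread is discussing.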

As I understood this feature request, it's about matching source code files (e.g. .java files).

sschuberth commented 1 year ago

> As I understood this feature request, it's about matching source code files (e.g. .java files).

Right, but I have a hunch that this request was made out of the same misunderstanding of the API as I had: in @edwarnicke's view, source code files seem to be artifacts (and leaves of the dependency tree), and the API docs sounded as if hashing of such "artifacts" is already supported, so that just gitoids would need to be added as an alternative hash algorithm.

Anyway, let's see what @edwarnicke responds 😀

PS: Personally, I don't share the view that source code files are the leaves of dependency trees. They're just the building blocks the leaf artifacts are made of.

Matthiasvanderhallen commented 10 months ago

For Java, besides the .jar and .war files on the one hand, and the .java source code files on the other, there are also the compiled .class files that one could hash and query. I imagine being able to query at the level of .class files would be useful when encountering fat jars?