lumjjb closed this issue 6 months ago
@loosebazooka fyi!
Just some notes from chatting with Brandon earlier.
graph TD;
App1-->X;
App1-.->X:1.0.0:hash1
App2-->X;
X-->X:1.0.0;
X:1.0.0-->X:1.0.0:hash1;
X:1.0.0-->X:1.0.0:hash2;
App2-.->X:1.0.0:hash2
X:1.0.0:hash1-->A
X:1.0.0:hash2-->B
@knrc mentioned that he will look into this!
@lumjjb Please assign this issue to me and I'll start working on it later today
This issue has been resolved by https://github.com/guacsec/guac/pull/1367
The snowflake package identifier problem
This issue is being created from a problem raised by @knrc and @dejanb around SBOMs for the Java ecosystem. However, a similar issue has also been seen in the Debian ecosystem, in how SBOMs express package identifiers. We will therefore give the problem a more generic name: the “snowflake package identifier” problem (the name is up for change, but I think it is representative of the issue, since we are talking about the same package used in different contexts, which makes each use subtly different).
The problem
Java example
In Java, you can declare that you want to use a package but exclude some of its transitive dependencies and use a different library in their place. This is similar to scenarios where, for compliance reasons, certain FIPS-approved cryptographic libraries must be used, so during compilation we leave out the original library and include a compatible one that provides all required symbols. The way this can be expressed in Java is discussed in more detail in this issue created by @knrc.
This gets tricky because two builds of a Java application that use the same package A can each use different transitive packages (via package overriding), yet both applications refer to package A by the same name and version. This is a problem for SBOMs and for GUAC: once represented in the data model, there is no way to differentiate between the two usages of package A.
To illustrate, consider two built Java applications:
The main issue is that the identifiers used for the packages are PURLs, and the amount of semantic meaning a PURL carries varies. Some PURLs provide a way to verify the content of the package and its descendants, and some do not. As a result, two separate uses of a package that have different descendants end up aggregated into one, and it is no longer possible to reason properly about the use of the package in each context.
For example, in this graph, App1 and App2 both point to X after ingestion, and because the identifier of package X does not capture which descendants each build used, the two uses collapse into a single node. This is not only an issue in GUAC but in any system that wants to consume SBOMs (https://github.com/CycloneDX/cyclonedx-maven-plugin/pull/306).
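A minimal sketch of how this aggregation happens, assuming an ingester that keys nodes only by PURL. The PURLs, the edge-list representation, and the `ingest` helper are all hypothetical and are not GUAC's actual data model:

```python
from collections import defaultdict

# Illustrative PURLs only; none of these come from a real SBOM.
X = "pkg:maven/org.example/x@1.0.0"
A = "pkg:maven/org.example/a@1.0.0"
B = "pkg:maven/org.example/b@1.0.0"
APP1 = "pkg:maven/org.example/app1@1.0.0"
APP2 = "pkg:maven/org.example/app2@1.0.0"

# Each SBOM is modelled as a list of (parent, child) dependency edges.
sbom_app1 = [(APP1, X), (X, A)]  # App1's build of X pulls in A
sbom_app2 = [(APP2, X), (X, B)]  # App2's build of X overrides A with B


def ingest(*sboms):
    """Merge all edges into one graph whose nodes are keyed only by PURL."""
    graph = defaultdict(set)
    for edges in sboms:
        for parent, child in edges:
            graph[parent].add(child)
    return graph


graph = ingest(sbom_app1, sbom_app2)

# X now appears to depend on both A and B, and nothing in the graph
# records which application's build used which dependency.
print(sorted(graph[X]))
```

After ingestion, a query such as "does App1 transitively depend on B?" would answer yes, even though App1's build never included B.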
As a tangent, one may argue that the SBOM should be expressed as a flattened list of dependencies and represented as a one-layer tree (a star node). However, flattening loses context that can be used for vulnerability remediation. So even if a flattened list is provided, it is still helpful to express the dependency relationships.
Debian example
Another example that we ran into was with Debian. In this case, some container images used information from the package manager to describe a package, but certain containers had been minimized, so some of the package's files were not present. Because there is no way to express this difference in the package's inventory, we end up with the same issue: a container depends on a package that claims to contain a certain file, but the file is not actually there.
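The mismatch can be illustrated with a small sketch. The file paths and the `missing_files` helper are hypothetical and only stand in for the package manager's file list and a scan of the image:

```python
# Files the package manager metadata claims the package provides
# (hypothetical paths for an imaginary package "tool").
declared = {
    "/usr/bin/tool",
    "/usr/share/doc/tool/README",
    "/usr/share/man/man1/tool.1.gz",
}

# Files actually found in a minimized container image.
present = {"/usr/bin/tool"}


def missing_files(declared, present):
    """Return the declared files that are absent from the image, sorted."""
    return sorted(declared - present)


# Both the full and the minimized install are described by the same
# package identifier, so this difference is invisible in the SBOM.
print(missing_files(declared, present))
```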
Proposed Solution(s)
In the CycloneDX PR (https://github.com/CycloneDX/cyclonedx-maven-plugin/pull/306), the proposal is to add a hash to the reference, where the hash acts as a Merkle tree over the PURLs that a pkg depends on.
In GUAC, we can take a similar approach: when parsing an SBOM, hash the descendants of each package and express the result in our pkg data model as a qualifier (qualifiers are used to express specific instances of a library). Concretely, we would serialize the GUAC pkg predicates of the descendants, order them lexically by their serialization, and compute a Merkle tree hash over them; that hash becomes the qualifier value.
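A rough sketch of the idea, not the GUAC implementation. It assumes PURL strings stand in for the serialized pkg predicates, that the dependency graph is acyclic, that the PURLs carry no subpath, and that the qualifier is called `deps_hash` (a made-up name):

```python
import hashlib

# Illustrative PURLs only.
X = "pkg:maven/org.example/x@1.0.0"
A = "pkg:maven/org.example/a@1.0.0"
B = "pkg:maven/org.example/b@1.0.0"


def subtree_hash(purl, deps):
    """Merkle-style hash of a package and all of its descendants.

    deps maps a PURL to the PURLs it directly depends on. Children are
    visited in lexical order so the result does not depend on the order
    in which the SBOM listed them. Assumes the graph has no cycles.
    """
    h = hashlib.sha256()
    h.update(purl.encode("utf-8"))
    for child in sorted(deps.get(purl, [])):
        h.update(b"\n")
        h.update(subtree_hash(child, deps).encode("ascii"))
    return h.hexdigest()


def with_deps_qualifier(purl, deps, name="deps_hash"):
    """Append the subtree hash to the PURL as a qualifier.

    The qualifier name is hypothetical; PURLs with a subpath are not handled.
    """
    sep = "&" if "?" in purl else "?"
    return f"{purl}{sep}{name}={subtree_hash(purl, deps)}"


deps_app1 = {X: [A]}  # X as built for App1
deps_app2 = {X: [B]}  # X as built for App2, with A overridden by B

# The two uses of X now get different identifiers.
print(with_deps_qualifier(X, deps_app1))
print(with_deps_qualifier(X, deps_app2))
```

Because each node's hash covers its children's hashes, a change anywhere below X changes X's qualifier, which is what lets the two uses be told apart.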
The ideal situation is that the Java ecosystem would encode a way to differentiate between such instances, or provide the identifiers needed to do this analysis, possibly as a qualifier on a PURL.