lumjjb commented 1 year ago

The snowflake package identifier problem

This issue was created from a problem raised by @knrc and @dejanb around SBOMs for the Java ecosystem. However, a similar issue has also been seen in the Debian ecosystem in how SBOMs express package identifiers. Thus, we will give the problem a more generic name: the “snowflake package identifier” problem (the name is up for change, but I think it is representative of the issue, since we are talking about the same package used in different contexts, which makes each use subtly different).

The problem

Java example

In Java, you can declare that you want to use a package but exclude some of its transitive dependencies and use a different library instead. This is similar to scenarios where, for compliance reasons, certain FIPS-approved cryptographic libraries must be used, so during compilation we choose not to include the original library but a compatible one with all the required symbols. How this can be expressed in Java is discussed in more detail in this issue created by @knrc.

This gets tricky because two builds of a Java application that use the same package A can each pull in different transitive packages (via package overriding). However, both applications refer to package A by the same name and version. This is a problem for SBOMs and GUAC because, once represented in the data model, there is no way to differentiate between the two usages of package A.

To illustrate, we have two built Java applications:

App 1 uses Package X that uses a transitive package A.
App 2 uses Package X but overrides package A with package B.
When GUAC ingests the information about both applications, both uses of Package X are recorded under the same identifier.

The main issue is that the identifiers used for the packages are PURLs, and the amount of semantic meaning in a PURL varies. Some PURLs provide a way to verify the content of the package and its descendants, and some do not. As a result, two separate uses of a package that have different descendants end up being aggregated, and one can no longer properly reason about the use of the package in different contexts.

For example, in this graph, App1 and App2 after ingestion point to X, but because the identifier of package X does not distinguish between both uses, they are no longer distinguishable. This is not only an issue in GUAC but in any system that wants to consume SBOMs (https://github.com/CycloneDX/cyclonedx-maven-plugin/pull/306).
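
To make the collision concrete, here is a minimal Python sketch. It is not GUAC's actual ingestion code: the PURLs and the `ingest` helper are hypothetical, and the only assumption carried over from the description is that graph nodes are keyed by PURL alone.

```python
from collections import defaultdict

# Hypothetical PURLs, for illustration only.
APP1 = "pkg:maven/org.example/app1@1.0.0"
APP2 = "pkg:maven/org.example/app2@1.0.0"
X = "pkg:maven/org.example/x@1.0.0"
A = "pkg:maven/org.example/a@1.0.0"
B = "pkg:maven/org.example/b@1.0.0"

# Dependency edges as each SBOM might report them: (dependent, dependency).
sbom_app1 = [(APP1, X), (X, A)]
sbom_app2 = [(APP2, X), (X, B)]  # this build overrides A with B


def ingest(*sboms):
    """Merge SBOM edges into one graph whose nodes are keyed by PURL alone."""
    graph = defaultdict(set)
    for edges in sboms:
        for dependent, dependency in edges:
            graph[dependent].add(dependency)
    return graph


graph = ingest(sbom_app1, sbom_app2)

# X now appears to depend on both A and B, and nothing records
# which application's build resolved to which dependency.
print(sorted(graph[X]))
```

After the merge, the single node for X carries the union of both resolutions, which is exactly the information loss described above.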

As a tangent, one may argue that the SBOM should instead be expressed as a flattened list of dependencies, represented as a one-layer tree (or a star node). However, this loses context that can be used for vulnerability remediation. So even if the list is flattened, it is still helpful to express the dependency-use relationships.

Debian example

Another example we ran into was with Debian. In this case, some container images used information from the package manager to describe a package, but certain containers had been minimized, so some files were missing from certain uses of the package. Because there was no way to express the difference in the package's inventory, we end up with the same issue: a container depends on a package that says it contains a certain file, but the file is not there.

Proposed Solution(s)

In the CycloneDX PR (https://github.com/CycloneDX/cyclonedx-maven-plugin/pull/306), the proposal is to add a hash to the reference, which acts as a Merkle tree over the PURLs that a package depends on.

In GUAC, we can take a similar approach: hash the descendants of a package when parsing the SBOMs, and express the result in our pkg data model as a qualifier (qualifiers are used to express specific instances of a library). This can be done by serializing the GUAC pkg predicates for the descendants in lexical order, computing a Merkle tree hash over them, and using that hash as the qualifier.
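
A rough Python sketch of this idea follows. It is only an illustration under stated assumptions, not the GUAC implementation: it hashes plain PURL strings rather than serialized GUAC pkg predicates, it folds the package's own PURL into the hash, the qualifier name `deps_hash` is made up, and the PURLs are hypothetical and assumed to carry no subpath.

```python
import hashlib

# Hypothetical PURLs, for illustration only.
X = "pkg:maven/org.example/x@1.0.0"
A = "pkg:maven/org.example/a@1.0.0"
B = "pkg:maven/org.example/b@1.0.0"


def merkle_hash(purl, deps):
    """Hash a package together with the hashes of its resolved dependencies.

    deps maps each PURL to the list of PURLs it directly depends on.
    Children are visited in lexical order so the result is deterministic.
    Assumes the dependency graph is acyclic.
    """
    h = hashlib.sha256()
    h.update(purl.encode("utf-8"))
    for child in sorted(deps.get(purl, [])):
        h.update(b"\x00")  # separator between entries
        h.update(merkle_hash(child, deps).encode("ascii"))
    return h.hexdigest()


def with_deps_qualifier(purl, deps):
    """Append the (truncated) hash as a qualifier; 'deps_hash' is a made-up name."""
    sep = "&" if "?" in purl else "?"
    return f"{purl}{sep}deps_hash={merkle_hash(purl, deps)[:16]}"


# The same package X, resolved differently in two builds.
x_in_app1 = with_deps_qualifier(X, {X: [A]})
x_in_app2 = with_deps_qualifier(X, {X: [B]})

print(x_in_app1)
print(x_in_app2)
```

The two identifiers share the same name and version but differ in the qualifier, so the two uses of X stay distinguishable after ingestion. Because a change anywhere in the subtree changes the hash, the qualifier also differs when the override happens several levels down.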

The ideal situation is that the Java ecosystem would encode a way to differentiate between such instances, or provide the identifiers needed to do this analysis, possibly as a qualifier on a PURL.

lumjjb commented 1 year ago

@loosebazooka fyi!

loosebazooka commented 1 year ago

Just some notes from chatting with Brandon earlier.

X is probably versioned (1.0.0, 2.1.3, etc.), so should X:1.0.0 have its own subtree for each resolution (A vs. B resolution strategy)?

graph TD;
App1-->X;
App1-.->X:1.0.0:hash1
App2-->X;
X-->X:1.0.0;
X:1.0.0-->X:1.0.0:hash1;
X:1.0.0-->X:1.0.0:hash2;
App2-.->X:1.0.0:hash2
X:1.0.0:hash1-->A
X:1.0.0:hash2-->B
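
The name, version, and instance hierarchy in the graph above could be sketched in Python as nested maps. This is only an illustration: the `register` helper and the `name:version:hash` identifier format are hypothetical and simply mirror the node labels in the graph.

```python
from collections import defaultdict

# name -> version -> instance hash -> resolved direct dependencies
index = defaultdict(lambda: defaultdict(dict))


def register(name, version, instance_hash, dependencies):
    """Record one resolved instance of a package and return its identifier."""
    index[name][version][instance_hash] = sorted(dependencies)
    return f"{name}:{version}:{instance_hash}"


# Each application points at the specific instance it was built with.
app1_dep = register("X", "1.0.0", "hash1", ["A"])
app2_dep = register("X", "1.0.0", "hash2", ["B"])

# Queries at the name or version level still see every instance.
print(sorted(index["X"]["1.0.0"]))
```

With this shape, a question such as "who uses X:1.0.0?" can be answered at the version level, while a question about a specific transitive dependency can be answered at the instance level.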

This problem is not limited to Java; this strategy has to work for all ecosystems (go.mod has replace, etc.).

lumjjb commented 1 year ago

@knrc mentioned that he will look into this!

knrc commented 1 year ago

@lumjjb Please assign this issue to me and I'll start working on it later today.

lumjjb commented 6 months ago

This issue has been resolved by https://github.com/guacsec/guac/pull/1367
