spdx / ntia-conformance-checker

Check SPDX SBOM for NTIA minimum elements
Apache License 2.0
47 stars 18 forks source link

check_dependency_relationships test does not seem correct #186

Open DanielOjalvo opened 1 month ago

DanielOjalvo commented 1 month ago

Hello,

It seems like the "check_dependency_relationships" test does not conform to the specification in 2.3. To my understanding, there are no requirements for an SPDX ID besides uniqueness and starting with SPDX-Ref". (See: https://spdx.github.io/spdx-spec/v2.3/package-information/)

From what I'm understanding a relationship between the SPDX Document and a Package in the document with type "DESCRIBE" is required in the "relationships" list. "An SPDX document WildFly.spdx describes package ‘WildFly’. Note this is a logical relationship to help organize related items within an SPDX document that is mandatory if more than one package or set of files (not in a package) is present." https://spdx.github.io/spdx-spec/v2.3/relationships-between-SPDX-elements/

As we can see, the check for this test looks for the literal substring "Package" in the related_spdx_element_id. https://github.com/spdx/ntia-conformance-checker/blob/main/ntia_conformance_checker/sbom_checker.py#L74

Did I miss something in the specification or should this test be different?

jspeed-meyers commented 1 month ago

@DanielOjalvo, thank you for the bug report!

Let me see if I understand.

Are you saying that the check_dependency_relationships function is incorrect? And you're saying it's incorrect because

https://github.com/spdx/ntia-conformance-checker/blob/c9b48a496316fe0c307c35a5f083a14a08687f83/ntia_conformance_checker/sbom_checker.py#L74

should actually be something like:

spdx.package [??] in rel.related_spdx_element_id for rel in describes_relationships

?

This is because, as you describe:

the check for this test looks for the literal substring "Package" in the related_spdx_element_id.

And you're suggesting a check for the literal substring `Package" is incorrect, right?

Do I have it right?

When I read the SPDX specification for this NTIA check, I see this language:

The document must DESCRIBES at least one package.

I almost see the error, though I admit to being a little confused as I look at this. PRs are most welcome! A PR might also make your proposal more concrete to me and teach me!

Anyways, thank you for the bug report!

DanielOjalvo commented 1 month ago

Hi @jspeed-meyers, thank you for the quick response! Yes, I think we're on the same page.

From the description of an SPDX identifier, it just gives the desired format below:

"SPDXRef-"[idstring] where [idstring] is a unique string containing letters, numbers, ., and/or -.

https://spdx.github.io/spdx-spec/v2.3/package-information/#721-description

I think all you would really need to do is create a set of the SPDX IDs of all package SPDX IDs and check if any rel.related_spdx_element_id exists in that set. I'll try to make a PR if I have time.

Respectful sidenote: I don't understand why this requirement exists in the first place. It would seem obviously inferable that a document describes all of the packages in its "packages" field. Is there a particular reason why this is required?

Again thank you for your response and I appreciate the work!

jspeed-meyers commented 1 month ago

Again thank you for your response and I appreciate the work!

Right back at you. I appreciate the bug reports.

Respectful sidenote: I don't understand why this requirement exists in the first place. It would seem obviously inferable that a document describes all of the packages in its "packages" field. Is there a particular reason why this is required?

I defer to @goneall on all things SPDX spec!

I think all you would really need to do is create a set of the SPDX IDs of all package SPDX IDs and check if any rel.related_spdx_element_id exists in that set. I'll try to make a PR if I have time.

PR welcome. And if anyone else wants to create a PR, feel free to create a comment here and then send a PR.

jspeed-meyers commented 1 month ago

Also, I think this will be a breaking change. If anyone disagrees, please say something!

goneall commented 3 weeks ago

Respectful sidenote: I don't understand why this requirement exists in the first place. It would seem obviously inferable that a document describes all of the packages in its "packages" field. Is there a particular reason why this is required?

I can see your point - it is a different use or interpretation of "describes".

The document describes is intended to help the consumer identify the "root" of the graph or tree of packages - perhaps "focus" or "starting point" would be a better term. An example where this may be useful - you have an SBOM describing package A and a number of dependencies of various types (e.g. runtime, test, documentation) on other packages also included in the SPDX document. Knowing that the document "describes" the starting point as package A, you can now traverse the graph of dependencies making intelligent decisions on what dependencies will impact things like security or licensing. Without the describes, you may think a test dependency will definitely impact security when in fact it (most likely) does not.

kestewart commented 3 weeks ago

Reason for use of "describes" was the NTIA minimum needed to point to a "root" element in an SBOM document. Challenge was that an SPDX document could have multiple "roots" in it, for an SBOM there was the desire to have only a single point per document. (https://spdx.github.io/spdx-spec/v2.3/how-to-use/#k2-satisfying-ntia-minimum-elements-for-an-sbom-using-spdx)

DanielOjalvo commented 3 weeks ago

I created a pull request. It still passes current tests. I would add another one, but I am not sure of the framework.

https://github.com/spdx/ntia-conformance-checker/pull/189

jspeed-meyers commented 3 weeks ago

@DanielOjalvo -- For adding tests. I always welcome new tests. Question: Do you think you OUGHT to add a test? Is there a benefit to the developers of this project, especially future developers, or users of the project? If you think so, then I do recommend adding a test.

DanielOjalvo commented 3 weeks ago

I'm not sure. I was thinking about adding one for this particular issue, but I'm not sure how the tests are structured.

jspeed-meyers commented 3 weeks ago

The tests are here: https://github.com/spdx/ntia-conformance-checker/blob/main/tests/test_checker.py

Feel free to take a look and to ask questions! Also, if you don't think it's helpful, then there's no need to add a test.

DanielOjalvo commented 1 week ago

I was giving more consideration towards this issue. I realized that this causes a blocking problem for how we are currently structuring our SBOMs.

Basically, we are developing a product that deploys as a pod of containers. Our SBOM strategy has been to create a tarfile of a directory of SBOM files with an SBOM file at the root using the "externalDocumentRefs" field to point to sub-SBOMs (spdx files) for each container in the product.

I can create another issue for this, but I was wondering if anyone had suggestions for how to structure SPDX files in a container environment or is doing something similar.

To be specific, the "root" SBOM would only have "externalDocumentRefs", but not packages. This seems like it would break the existing requirements.

jspeed-meyers commented 1 week ago

cc @puerco on containers and SBOMs

Let me think about this @DanielOjalvo. hmmm....

jspeed-meyers commented 4 days ago

To be specific, the "root" SBOM would only have "externalDocumentRefs", but not packages. This seems like it would break the existing requirements.

I thought about this. First, this does seem like a separate issue. The question to me is: should such an issue be associated with this repository? I view this codebase as trying to faithfully implement this table. I think you are saying (but please tell me if I mis-interpreting it) that you would like to potentially redefine (or perhaps expand) the definition of the "Relationship" field in that table. If you agree with that perspective, then I think an issue might be more appropriate opened on https://github.com/spdx/spdx-spec with a proposal to amend the table I linked above.

Thoughts?

DanielOjalvo commented 1 day ago

It's a difficult question. I think the relationships should be an optional field and we develop standard ways to infer data from an SBOM that doesn't need to be captured in a relationship. The problem with capturing "all" relationships is that combinatorial blowups can blow up pretty quickly from my experience. Even without relationships, SBOMs are useful enough as a way to determine what a given product "contains". Relationships, while powerful, aren't necessarily required to get use out of an SBOM. I'm not sure how tortured a reading of the minimum elements document is necessary to get to that conclusion. ;)

The main reason that we are using an ENTRYPOINT file with "externalDocumentRefs" is that it's a useful way to logically group our SBOMs based on the containers within a delivered product. The secondary reason that ENTRYPOINT files may be useful as a way to break up the data into manageable-sized files.

Another potential idea would be to enlarge the scope of SBOM "minimum elements" requirements to apply to a collection of SBOMs released for a given product no matter how it's logically organized.

puerco commented 1 day ago

OK, some thoughts from me:

It would seem obviously inferable that a document describes all of the packages in its "packages" field. Is there a particular reason why this is required?

No, this is not correct, as Kate and Gary pointed out above, the DESCRIBES type points out the root element(s) in the SBOM. A document can list hundreds of packages. One (or a few of them) are the software piece(s) described by the SBOM. Without this relationship from the doc to the root elements, you cannot know the entrypoint to the SBOM graph.

The relationships should be there because you need to know which packages relate to which of the roots. Or their dependencies, or their dependencies... and so on until the last leaf node (package or file) of the graph. There is no inferring, you cannot assume any relationships from looking at the data in the packages.

combinatorial blowups can blow up pretty quickly from my experience.

Indeed, and this is what you want: a richer graph with good resolution to understand how software is composed. It's better to have a supercharged sbom and dumb it down if a tool cannot process it than missing data that translates to blind spots on areas of the software composition.

Regarding your SBOM structure:

To be specific, the "root" SBOM would only have "externalDocumentRefs", but not packages. This seems like it would break the existing requirements.

OK, this is an interesting use case. I'm not sure it breaks anything in the spec but it's weird.

The ExternalDocumentRef element in the SBOM is not part of the SBOM graph itself. The SBOM data graph is composed of Packages and files connected with relationships. Documents are transparent containers that only show up when crafting the element identifier for the relationships in other documents.

The ExternalDocumentRef element is meant to provide data to validate any documents that need to be pulled when referencing Packages or Files in other SBOMs. If you create a relationship to a package in another file, and that file does not have an ExternalDocumentRef providing its hash and ID to validate it, you will get a validation error. Adding an ExternalDocumentRef without referencing any package or file or package in the document effectively means nothing.

Once a document is "loaded", that is to say, that it is accounted for to be validated with an external reference, only then you can use its namespace to point to packages and files in it (or to the document itself)

with an SBOM file at the root using the "externalDocumentRefs" field to point to sub-SBOMs

I think what you are trying to create here is an SBOM that serves as a manifest. But this approach is effectively meaningless semantically. The ExternalDocRefs can serve as a pointer to the sub-sbom locations and may be useful to tools that understand your specific use case but semantically that is an empty SBOM.

You could create a semantically meaningful root SBOM that serves as a manifest by giving it this structure:

- Root Package (represents the tarfile, cannot be hashed as the root SBOM file is included in it)
  - CONTAINS -> https://docnamespace1/SPDXRef-DOCUMENT
  - CONTAINS -> https://docnamespace2/SPDXRef-DOCUMENT
  - CONTAINS -> https://docnamespace3/SPDXRef-DOCUMENT

- ExternalRef1 (document1)
- ExternalRef1 (document2)
- ExternalRef1 (document3)

This adds a graph to the root SBOM and makes it semantically valid. In this case, each sub-SBOM would still be an independent graph that knows nothing of each other or the root.

But, on the other hand, from the root SBOM's point of view (from its namespace), you can now reference any package or file in the sub SBOMs as they are now addressable.

DanielOjalvo commented 11 hours ago

Thanks for the response. I like that idea.

As I'm understanding it, from this perspective, we would treat the software release file (say, a container, ISO, OVA, etc) as a "package" and use the DESCRIBES relationship to describe that package, then add CONTAINS to add references to the sub-SBOMs.

Have I got the gist correctly?

Would DESCRIBED_BY possibly make more sense? It seems like using CONTAINS might be overloading the term. CONTAINS is defined as "...to be used when SPDXRef-A contains SPDXRef-B" which would seem to refer specifically to files/packages within a package. The document wouldn't necessarily be delivered within the product. But, DESCRIBED_BY is defined as "...to be used when SPDXRef-A is described by SPDXREF-Document."

Indeed, and this is what you want: a richer graph with good resolution to understand how software is composed.

I think that's a bigger problem than recognized. This example is a bit contrived, but bear with me. Let's say we want to completely fill out the "CONTAINS" relationship structure. Package A CONTAINS Package B and then Package B CONTAINS Package C. To fully capture all CONTAINS, then you would need 3 relationships:

(Package A, Package B, CONTAINS) (Package B, Package C, CONTAINS) (Package A, Package C, CONTAINS)

With N packages, you end up with O(N^2) as a worst case upper-bound depending on the structure of the binary relationship being captured. On a product with, say, 1000 packages, you can potentially end up with on the order of a million extra relationships to be contained in the SBOM. Memory is cheap, but in practice this can make for extremely unwieldy files to process. I would argue for a paradigm where we provide only enough data about a software product to where that can be used to calculate the answer to questions about the SBOM presented. Such as, is there a path from Package A to Package C using the CONTAINS relationship?