Where do we start in a serialized JSON-LD?

goneall commented 1 year ago

In the current JSON-LD example, all elements are in a flat array. If I receive a file, how do I know where to start? If I send an SBOM along with a package, there may be tens of thousands of elements, including SBOMs within SBOMs. How do I know which element to start with?

In SPDX 2.X the SPDX document is 1:1 with the serialization and has a describes relationship which gives you the starting point(s).

goneall commented 1 year ago

One possibility is to restrict JSON-LD serializations to one and only one ElementCollection type.

davaya commented 1 year ago

JSON (#430) is also a flat list of elements because the purpose of serialization is to put chunks of the graph into documents, not for documents to model graphs. A document is a forest (multiple trees) and the root of each tree is an ElementCollection type (plus disconnected elements not referenced by any collection in the document). That's just the first graph layer; the second is created by Relationship elements.

Serialization (and thus documents) shouldn't care about graphs. After parsing a document there will be zero, one, or more ElementCollections, and those are where you start. If collections are nested, a topo sort will find the root(s). Whatever is left over are "leaves on the ground" :-).
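
A minimal Python sketch of the root-finding step described above. It assumes each parsed element is a dict with `spdxId`, `type`, and (for collections) an `element` list of member IDs. Those field names and the set of collection type names are assumptions for illustration, not something fixed by the model. It is not a full topological sort: it only looks for collections that no other collection in the document contains, which is enough to find the roots as long as containment has no cycles.

```python
# Assumed collection type names; illustrative only.
COLLECTION_TYPES = {"ElementCollection", "Bundle", "Bom", "Sbom", "SpdxDocument"}


def find_roots(elements):
    """Return (root_collection_ids, loose_element_ids) for a flat element list."""
    collections = [e for e in elements if e.get("type") in COLLECTION_TYPES]

    # Every ID that some collection in this document lists as a member.
    referenced = set()
    for c in collections:
        referenced.update(c.get("element", []))

    # Roots: collections that no other collection in the document contains.
    roots = [c["spdxId"] for c in collections if c["spdxId"] not in referenced]

    # "Leaves on the ground": non-collection elements no collection refers to.
    loose = [
        e["spdxId"] for e in elements
        if e.get("type") not in COLLECTION_TYPES and e["spdxId"] not in referenced
    ]
    return roots, loose


doc = [
    {"spdxId": "ex:outer", "type": "Sbom", "element": ["ex:inner", "ex:pkg"]},
    {"spdxId": "ex:inner", "type": "Sbom", "element": ["ex:file"]},
    {"spdxId": "ex:pkg", "type": "Package"},
    {"spdxId": "ex:file", "type": "File"},
    {"spdxId": "ex:stray", "type": "File"},
]
roots, loose = find_roots(doc)
```

Here `roots` holds only the outer Sbom and `loose` holds the one file that no collection lists.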

goneall commented 1 year ago

JSON (#430) is also a flat list of elements because the purpose of serialization is to put chunks of the graph into documents, not for documents to model graphs. A document is a forest (multiple trees) and the root of each tree is an ElementCollection type (plus disconnected elements not referenced by any collection in the document). That's just the first graph layer; the second is created by Relationship elements.

Serialization (and thus documents) shouldn't care about graphs. After parsing a document there will be zero, one, or more ElementCollections, and those are where you start. If collections are nested, a topo sort will find the root(s). Whatever is left over are "leaves on the ground" :-).

In my opinion, this is much more complicated than the 2.X serialization, where we always have a top level with a description of where to start. Although it may be technically possible to find the trees, it doesn't seem very easy. It seems like it is left to the consumer of an SPDX serialization to reverse engineer what the producer intended to send rather than it being explicitly stated (as in SPDX 2.X).

maxhbr commented 1 year ago

One possibility is to restrict JSON-LD serializations to one and only one ElementCollection type.

That sounds fairly restrictive

goneall commented 1 year ago

The use case I'm concerned about is:

A software supplier produces a file containing an SBOM representing the specific package/version being distributed
A consumer of the software would like to read the SBOM

I believe the use case above is extremely common and it needs to be very straightforward.

Parsing the entire graph is very complicated in my opinion.

Are there other possible solutions?

maxhbr commented 1 year ago

I agree, we should make that use case easier and maybe sacrifice some of the convenience that the graph based side currently has.

Right now there is not even an easy way to read a provided serialized SPDX blob and find out what kind of Document/SBOM it is and what to expect from its content ("is it an SBOM for the component I am interested in?"). This information currently needs to be conveyed externally, e.g. in the file name.

goneall commented 1 year ago

Discussed on serialization call 14 September 2023:

What is the difference between the rootElement property in collections and the describes relationship? Opened https://github.com/spdx/spdx-3-model/issues/495
The rootElement property matches the semantics of "where to start"
The proposed solution won't work since it disallows multiple SBOMs in the same Payload
We agreed that we have to support multiple ElementCollections within a Payload
Proposal 1 - in the serialization specification, a Payload must contain a single ElementCollection at the top level
Proposal 2 - in the serialization specification, a Payload must contain a single Element at the top level. If more than one element needs to be described, there is a question of how we handle multiple elements:
- sub-proposal 1 - the top level element is always an ElementCollection. The top level ElementCollection can also contain inner or sub ElementCollections

General consensus on call - Proposal 2 with sub-proposal 1

zvr commented 1 year ago

Discussed on serialization call 14 November 2023

@goneall I knew that the serialization calls produced great work; I had not realized that they are so advanced that they happen two months in the future! 😉

goneall commented 1 year ago

Oops - fixed the date.

goneall commented 1 year ago

I just realized that the solution discussed on 14 September will not work for RDF which typically stores a graph without an implied hierarchy.

One solution we discussed on the call was using a specific type to denote the "wrapper" of the payload. This could be the "X" class.

davaya commented 1 year ago

NamespaceMap's Solution B is serialized in JSON-LD using context. It would be bad to force every serialized value to include an unnecessary Solution A "Y-collection", "X-collection" or other wrapper element, particularly for the common case where the serialized value has a single ElementCollection (Sbom).

The "where do you start" in a hierarchy of Sboms can be answered by rootElement - it would be populated with just the top-level Sbom(s), not any Sboms or Bundles that they reference.

So if namespaceMap can be serialized using JSON-LD context, can't rootElement also be serialized that way, without adding an artificial wrapper element to the graph?

goneall commented 1 year ago

@davaya - I created a reply to your comment above in a namespaceMap pull request comment.

I looked through the JSON-LD token keywords and didn't see anything we could obviously use for a rootElement. If there was such a keyword, it would solve the problem for JSON-LD. Let me know if you see any candidates.

If we choose a type wrapper as proposed above, we would need to distinguish that type from a copy of an "X-Collection" which is not intended to be the starting point of this serialization - so I keep coming back to needing two types:

A general "X-Collection" that represents the "original" serialization of the elements listed in the "X-Collection"
A specific "Y-Collection" which represents the elements serialized in "this" serialization

I don't like having 2 types - seems more complex - but it is the only way I can think of finding the rootElements when we must serialize all the elements in a graph.

The other problem with the type solution is you would create the "Y-Collection" when initially serializing, but refer to the same class as an "X-Collection" type downstream when the original "Y-Collection" is not the target for the downstream serialization.

If we make the "Y-Collection" a subclass of the "X-Collection" it may work in the model, but it still feels confusing to me.

goneall commented 1 year ago

Let me throw out one more alternative.

We could have a special type to give us the starting point ("Y-Collection"), but instead of being a subclass of the "X-Collection", it would contain one and only one property that would refer to one and only one "X-Collection". The "Y-Collection" would not be part of the model, it would be part of the serialization spec for the sole purpose of communicating where to start.

davaya commented 1 year ago

Discard the last sentence of my suggestion - it's wrong. There doesn't need to be a special JSON-LD keyword for rootElement - it is just a normal property of a normal ElementCollection instance that can be serialized like all other element values.

As discussed in #491, I believe the starting point of a subgraph of Sboms can be a property of ElementCollection (Bundle, Bom, Sbom) without needing to define separate X-Collection or Y-Collection types. That is now possible after Bob defined X-Collection to be a specific collection of elements rather than an "intent" to be applied to future collections created by the same or different producers.

Let's try some examples.

davaya commented 1 year ago

The "Y-Collection" would not be part of the model, it would be part of the serialization spec

This is on the right track, recognizing that the serialization spec is a data model separate from the logical model. With that understood, Y-Collection isn't actually needed. Visualize elements as playing cards in a deck - the producer can sort and wrap the deck any way he wants, and could put the jokers (starting points) on top. Once the deck is unwrapped (parsed into logical elements), the jokers are still marked as jokers by a property that is in the logical model (rootElement), and can be re-serialized at the top of the file again.

If the producer doesn't serialize the deck with root elements on top, the consumer can still recognize them in the middle of the deck while parsing and mark them with the little colored "sign here" tabs used for contract paperwork.
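
A small Python sketch of the "deck of cards" idea, under the assumption that parsed elements are dicts with an `spdxId` and that collections carry a `rootElement` list of IDs (the dict layout is hypothetical). `starting_points` marks the roots in one pass regardless of where they sit in the file, and `roots_first` re-sorts the deck so they come out on top. With nested collections this returns the roots of every collection, inner ones included; narrowing to top-level collections would need a containment check as well.

```python
def starting_points(elements):
    """Collect IDs named by any collection's rootElement, in file order."""
    marked = []
    seen = set()
    for e in elements:  # one pass; position in the file does not matter
        for root_id in e.get("rootElement", []):
            if root_id not in seen:
                seen.add(root_id)
                marked.append(root_id)
    return marked


def roots_first(elements):
    """Re-order elements so that root elements come first ("jokers on top")."""
    roots = set(starting_points(elements))
    # sorted() is stable, so the relative order is otherwise preserved.
    return sorted(elements, key=lambda e: e["spdxId"] not in roots)


payload = [
    {"spdxId": "ex:file1", "type": "File"},
    {"spdxId": "ex:sbom", "type": "Sbom",
     "rootElement": ["ex:pkg"], "element": ["ex:pkg", "ex:file1"]},
    {"spdxId": "ex:pkg", "type": "Package"},
]
marked = starting_points(payload)
reordered = [e["spdxId"] for e in roots_first(payload)]
```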

sbarnum commented 1 year ago

Solution A for NamespaceMap was intended to address the namespacemap issue, this issue, and the issue of relating a serialized file to a serializable collection in a single clean approach.

I believe the cleanest, most appropriate solution would be for there to be a SerializableCollection class (SolutionA X-Collection) that would be part of a serialized payload to give a starting point as well as convey producer suggestions, and a SerializedCollection subclass of SerializableCollection (SolutionB X-Collection) that would be created by the consumer of a serialized payload and capture what was actually in the payload. This combination would very simply address the "where to start" question including cleanly handling layered and multi-peer collections. It would also support the targeted use cases for both SolutionA and SolutionB on the namespacemap question and provide a simple way to validate that what was actually serialized matches what was asserted as serialized. It also would work cleanly with any serialization format and would support provenance of serialization/deserialization exchange throughout the ecosystem.

I do not believe it would be appropriate or practical to restrict a payload to only one ElementCollection type or instance. That would not support SPDX 3.0 required use cases.

goneall commented 1 year ago

I was going to update https://github.com/spdx/spdx-3-model/blob/main/serialization/json_ld/examples/converted_from_spdx_2.json with the results of our meeting and I realized that if we named the “SerializableCollection” “SpdxDocument”, it’s already updated to our decisions.

While reviewing, I realized we have the same inconsistency between the actual serialization and the serialized model object that we had with namespaces. If the “element” property represents everything serialized AND if the “SerializableCollection” is serialized in the collection itself AND the serialization method makes it clear which elements are included in the serialization (which I think all file based serializations do), then you can get the list of elements serialized out of the serialization itself. This leads to the possible inconsistency where what is actually serialized is different from what is in the “element” property of the “SerializableCollection”.

I can think of a couple approaches to handling the inconsistency while continuing to support the decisions made in the serialization meeting on 2023 Nov 21:

Have a rule that IF the serialization format natively supports the "element" property by clearly delineating which elements are included in the serialization, the "element" property is NOT serialized as part of the "SerializableCollection" and on deserialization, the actual elements in the serialization are used to repopulate the "element" property of the "SerializableCollection".
Invalidate any serialization where the "SerializableCollection" element disagrees with what is actually serialized

Option 1 has the advantage of not duplicating the list of elements if only one SBOM is in the serialization - the issue Max raised on our call - while still retaining the element property in the model which should satisfy the concern raised by Sean on making the list of elements explicit in the model. It also has an advantage that no serialization can possibly contain the inconsistency.

Option 2 has the advantage of easily serializing and deserializing the "SerializableCollection" since the serialization matches the model. It also has an advantage that any inconsistency between what the creator of a serialization intended and what they actually did could be checked.

My current opinion is to go with option 1 since it is simpler to avoid inconsistencies than to allow them and have rules to check and verify.
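
To make the two options concrete, here is a rough Python sketch. It assumes the payload has been parsed into a flat list of dicts with `spdxId`, and that the collection's `element` property is a list of IDs that does not include the collection itself; both are assumptions for illustration. `repopulate_element` corresponds to option 1 and `check_consistency` to option 2.

```python
def repopulate_element(collection, payload_elements):
    """Option 1: rebuild `element` from what is actually in the payload."""
    rebuilt = dict(collection)  # shallow copy; the input is left untouched
    rebuilt["element"] = [
        e["spdxId"] for e in payload_elements
        if e["spdxId"] != collection["spdxId"]
    ]
    return rebuilt


def check_consistency(collection, payload_elements):
    """Option 2: return (missing, unlisted) ID sets; both empty means valid."""
    declared = set(collection.get("element", []))
    actual = {e["spdxId"] for e in payload_elements} - {collection["spdxId"]}
    return declared - actual, actual - declared


payload = [
    {"spdxId": "ex:doc", "type": "SpdxDocument", "element": ["ex:a", "ex:b"]},
    {"spdxId": "ex:a", "type": "File"},
    {"spdxId": "ex:c", "type": "File"},
]
doc = payload[0]
missing, unlisted = check_consistency(doc, payload)
fixed = repopulate_element(doc, payload)
```

In this example option 2 would reject the payload (`ex:b` is declared but absent, `ex:c` is present but undeclared), while option 1 cannot produce the mismatch in the first place because the list is derived from the payload.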

@sbarnum @maxhbr @davaya @nishakm - thoughts?

goneall commented 1 year ago

Looking a bit closer at the element array in converted_from_spdx_2.json, I think there is a change we need to make. Rather than being a list of strings, it should be a list of IDs - e.g. change:

     "element": [
        "spdx-example:SPDXRef-Actor-LicenseFind-1.0",
        "spdx-example:SPDXRef-Actor-ExampleCodeInspect",
        "spdx-example:SPDXRef-Actor-JaneDoe",
...

to:


     "element": [
        {"spdxId": "spdx-example:SPDXRef-Actor-LicenseFind-1.0"},
        {"spdxId": "spdx-example:SPDXRef-Actor-ExampleCodeInspect"},
        {"spdxId": "spdx-example:SPDXRef-Actor-JaneDoe"},
...

davaya commented 1 year ago

@goneall:

Have a rule that IF the serialization format natively supports the "element" property by clearly delineating which elements are included in the serialization, the "element" ...

I submitted PR #500 to define the B portion of the solution in the model: every serialization format must natively support the element property because in serialized data that's what element values are. That is the payload's one and only property unless namespaceMap (and/or the proposed creationInfoMap) are present and used to compact the bytes.

When a consumer parses the bytes, the consumer gets element instances. The value of the element property is the element instances, not a separate list of IRIs that could disagree with the elements.

The consumer also gets any non-element data that the producer included in the bytes: namespaceMap and/or creationInfoMap used to de-compact the bytes. The consumer may also use that non-element data in new serialized bytes and/or include it in new elements produced by consumer.

davaya commented 1 year ago

@sbarnum:

I do not believe it would be appropriate or practical to restrict a payload to only one ElementCollection type or instance. That would not support SPDX 3.0 required use cases.

A serialized bytes instance (payload) is not restricted in any way. I agree that it may contain zero, one, or ten ElementCollection instances and, if you want to define X-Collection or Y-Collection as new types, any number of those as well. I am concerned only with serialization and ensuring that it is independent of the element graph and can carry any connected or disconnected subsets of the graph.

Do we all agree that:

A payload may contain two file elements.
A payload may contain two file elements and an Sbom listing those two files
If the model defines an X-Collection type separate from ElementCollection, a payload may contain two file elements, an Sbom listing those two files, and an X-Collection listing the Sbom and two files
If there is an X-Collection listing the Sbom and two files, a payload may contain the X-Collection alone, the Sbom alone, one of the files alone, any two of those elements, any three of those elements, all four of those elements, or all four of those elements plus some other elements of any type?

goneall commented 1 year ago

{"spdxId": "spdx-example:SPDXRef-Actor-LicenseFind-1.0"},

From discussion with Sean - we can include the ID type in the context for "element"
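
For illustration, JSON-LD lets a context declare a term with `"@type": "@id"`, so plain string values of that term are read as node references rather than string literals. The Python sketch below hand-rolls only that coercion step so the effect is visible; it is not a JSON-LD processor (a real one would also expand the term name and the compact IRI), and the property IRI in the context is a placeholder, not the actual SPDX term.

```python
# Hypothetical context fragment; the property IRI is a placeholder.
context = {
    "element": {"@id": "https://example.org/spdx-terms/element", "@type": "@id"},
}


def coerce_ids(node, ctx):
    """Mimic JSON-LD type coercion for terms declared with "@type": "@id"."""
    out = {}
    for key, value in node.items():
        term = ctx.get(key)
        if isinstance(term, dict) and term.get("@type") == "@id":
            values = value if isinstance(value, list) else [value]
            # Strings become node references; existing objects pass through.
            out[key] = [{"@id": v} if isinstance(v, str) else v for v in values]
        else:
            out[key] = value
    return out


doc = {"element": ["spdx-example:SPDXRef-Actor-JaneDoe"]}
expanded = coerce_ids(doc, context)
```

With the type declared in the context, the serialized `element` array can stay a compact list of strings while still being interpreted as a list of IDs.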

davaya commented 1 year ago

@goneall @sbarnum @maxhbr

From Thursday's minutes:

The "SerializableCollection" and "SerializedCollection" classes work in concert to express what 'should' be in the serialized content and what 'is' in the serialized content and are machine comparable to provide clear verification.

I agree with the first part of the statement. But verification as a rationale doesn't apply. A manifest in a shipping box is perfectly clear, the box can have items missing and the manifest identifies what should be present. But when a producer serializes a set of elements, the serialized data is no longer a box that things can go missing from, it is a single epoxied blob of elements. Verifying the blob verifies that no elements are modified or missing without using a manifest.

If the consumer reads the same blob that the producer serialized, then the manifest is redundant. If the consumer doesn't read the same blob that the producer serialized, then a manifest, if included in the blob, can also be modified from the producer's manifest. So in addition to being redundant, the manifest does not support verification.

The only verification use case would be for the producer to not include the manifest in the serialized data but for consumers to obtain it separately. That would support verification only if the manifest has integrity and the serialized data does not.
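
A minimal Python sketch of verifying the blob as a whole instead of checking it against a manifest. The thread does not name a mechanism; SHA-256 via `hashlib` is an assumption here, as is the idea that the consumer obtains the expected digest through some channel it trusts.

```python
import hashlib


def digest(payload_bytes):
    """Hex SHA-256 digest of the serialized payload, byte for byte."""
    return hashlib.sha256(payload_bytes).hexdigest()


def verify(payload_bytes, expected_digest):
    """True only if the received bytes match the producer's digest."""
    return digest(payload_bytes) == expected_digest


# Illustrative payloads; the JSON layout is made up for the example.
produced = b'{"@graph": [{"spdxId": "ex:a"}, {"spdxId": "ex:b"}]}'
published = digest(produced)

# A copy with one element dropped no longer matches.
tampered = b'{"@graph": [{"spdxId": "ex:a"}]}'
ok = verify(produced, published)
bad = verify(tampered, published)
```

Any dropped or altered element changes the bytes and therefore the digest, which is the sense in which verifying the blob covers missing elements without a separate list.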

goneall commented 11 months ago

This is resolved with the recent serialization pull requests

spdx / spdx-3-model

Where do we start in a serialized JSON-LD? #478