Discussion: How to handle different deserialization scenarios

goneall commented 3 weeks ago

The serialization documentation has fairly detailed descriptions of how to serialize, but not as much on deserialization approaches and scenarios.

Specifically, it would be good to (decide and) document how the resultant model would be represented in the following scenarios:

1. Deserializing a JSON-LD file with a single Element
2. Deserializing a JSON-LD file with multiple Elements and no SpdxDocuments
3. Deserializing a JSON-LD file with multiple Elements and a single SpdxDocument
4. Deserializing a JSON-LD file with multiple Elements and multiple SpdxDocuments

How do we handle creating the in-memory SPDX documents in each of these scenarios?

goneall commented 3 weeks ago

For 1. and 2. above, suggest creating an SpdxDocument in memory "on the fly" with all of the Element(s) represented as root elements.

For 3., should we assume the single SpdxDocument represents the serialization information? Is there any validation we could do to confirm this? If we assume it represents the serialization information, then we can augment the serialized SpdxDocument with the information from the file itself to complete the in-memory representation.

Scenario 4. is the most challenging. It's quite likely one of the SpdxDocuments represents the serialization itself - but which one? We would need some way of determining which one is the SpdxDocument for the serialization - or we treat it the same as not having any SpdxDocument.
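The suggestions above could be sketched roughly as follows. This is only an illustration, not code from any SPDX library: the function names are hypothetical, and it assumes the JSON-LD uses the `type`, `spdxId` and `rootElement` keys and that a single-Element file may omit `@graph`.

```python
import json


def load_elements(text):
    """Return the list of serialized elements in a JSON-LD string."""
    data = json.loads(text)
    if "@graph" in data:
        return list(data["@graph"])
    # Assumption: without @graph, the top-level object is the only element.
    return [{k: v for k, v in data.items() if k != "@context"}]


def resolve_document(elements):
    """Return (document, synthetic) for the in-memory representation."""
    docs = [e for e in elements if e.get("type") == "SpdxDocument"]
    if len(docs) == 1:
        # Scenario 3: assume the single SpdxDocument describes the serialization.
        return docs[0], False
    # Scenarios 1, 2 and 4: create an SpdxDocument on the fly with every
    # element that has an ID listed as a root element.
    roots = [e["spdxId"] for e in elements if "spdxId" in e]
    return {"type": "SpdxDocument", "rootElement": roots}, True
```

Returning a `synthetic` flag lets a caller tell a document that was read from the file apart from one that was made up during deserialization.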

JPEWdev commented 3 weeks ago

I can tell you how the shacl2code bindings deal with this. First of all, since they are not SPDX specific, there is no requirement that an SpdxDocument is present. The bindings have a separate concept of a SHACLObjectSet, which is the container that represents a set of objects to be serialized, or the destination for deserialization. It also does some indexing book-keeping (e.g. so you can look up an object by its ID quickly), and performs "linking", where an object property that references another object by a string IRI is replaced with a reference to the actual object with that IRI, if it exists in the SHACLObjectSet. In this case, SpdxDocument is actually just a slightly special element handled at higher layers (e.g. the Yocto SPDX code tracks the SpdxDocument separately, makes sure there is only one per SHACLObjectSet, etc.).
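To make the indexing and "linking" idea concrete, here is a minimal sketch. It is not the shacl2code API: the `ObjectSet` class and its methods are hypothetical, objects are plain dicts, and `spdxId` is assumed to be the ID key. Real bindings know which properties are object references; this sketch simply tries every string value.

```python
class ObjectSet:
    """Hypothetical container for a set of (de)serialized objects."""

    def __init__(self):
        self.objects = []
        self.by_id = {}  # index: ID string -> object

    def add(self, obj):
        self.objects.append(obj)
        if "spdxId" in obj:
            self.by_id[obj["spdxId"]] = obj

    def find_by_id(self, object_id):
        return self.by_id.get(object_id)

    def link(self):
        """Replace string IRIs with the objects they name, when present."""
        for obj in self.objects:
            for key, value in list(obj.items()):
                if key == "spdxId":
                    continue  # never replace an object's own ID
                if isinstance(value, str):
                    obj[key] = self.by_id.get(value, value)
                elif isinstance(value, list):
                    obj[key] = [
                        self.by_id.get(v, v) if isinstance(v, str) else v
                        for v in value
                    ]
```

An IRI that does not resolve inside the set is left as a string, which is how a reference to an element in another file would stay representable.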

I really believe that this approach is the right way to go. Don't encumber users with the semantics of SpdxDocuments if they don't want it. It's frustrating for users if they need to (de)serialize 1 or 2 in your examples, but can't because the bindings have intertwined the concept of an SpdxDocument with "a set of things to (de)serialize". Code at a higher level can make it easier to deal with SpdxDocument, since that is the common case, but it's a "layer" on top, not the core functionality. The core bindings should avoid enforcing "policy" on users about how they do things and focus on the "mechanism" that enables them to do what they need. The "policy" is the responsibility of a higher level of abstraction that makes life easier for the common cases. If you force policy on the core bindings, your bindings are not going to be very flexible and you can end up with a lot of weird edge cases needing to be encoded because you made choices for the users that they didn't like :)

IOW, with the shacl2code Python bindings (and the C++ bindings I'm working on), none of these 4 scenarios is a problem at all, since SpdxDocument is not special.

goneall commented 3 weeks ago

@JPEWdev - I think your approach for the lower level language bindings is fine. The libraries I'm writing have to deal with the higher level semantics, hence the need to solve the issue.

The SpdxDocument represents metadata about the serialization itself, and in some scenarios it can be quite important. One example is verifying references to SPDX elements in external files. The information to verify is stored in the SpdxDocument. If we don't know which SpdxDocument contains the metadata, we can't verify the external document.

I'm starting to form the opinion that we need to fix this in the serialization schema - either add an optional property at the root level, or require that only one SpdxDocument can be present in the @graph such that the SpdxDocument data is unambiguous. The former would be a non-breaking change. Code which doesn't need the metadata can just ignore it.
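As a rough sketch of how a reader could apply either option: the root-level property name `serializationDocument` below is purely hypothetical (no such property exists in the schema today), as is the function name, and the `type` and `spdxId` keys are assumed.

```python
def find_serialization_document(data, root_key="serializationDocument"):
    """Return the SpdxDocument describing this serialization, or None."""
    docs = [e for e in data.get("@graph", []) if e.get("type") == "SpdxDocument"]
    # Option 1: a hypothetical optional root-level property names the document.
    wanted = data.get(root_key)
    if wanted is not None:
        for doc in docs:
            if doc.get("spdxId") == wanted:
                return doc
        return None  # the property points at something not in this file
    # Option 2: with at most one SpdxDocument allowed, one match is unambiguous.
    if len(docs) == 1:
        return docs[0]
    return None  # zero or several SpdxDocuments and no hint: ambiguous
```

Code that does not need the metadata never has to call this, which is what makes the optional property a non-breaking addition.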

spdx / spdx-3-model

Discussion: How to handle different deserialization scenarios #860