Mapping the namespaceMap to serialization specific namespaces

goneall commented 1 year ago

This came up in the serialization call on 15 June 2023.

The ElementCollection class has a namespaceMap which is used to expand/compact URIs used as identifiers within the ElementCollection.

Different serialization formats have different "native" mechanisms for expanding and compacting the URIs. Since an ElementCollection can contain other ElementCollections, all the serializations need to support nesting of the namespace mappings. We will need to document how this works for each of the serialization formats we decide to support:

[ ] JSON-LD
[ ] ... other formats TBD ... (don't read anything into your favorite format not being listed, we'll add them once the serialization group formally decides it will be supported in SPDX 3.0)

goneall commented 1 year ago

Some additional context.

On the call, I incorrectly stated that the serialization formats didn't support nesting. On further research, it is possible to nest these. In JSON-LD, you can attach a context to a specific property as described here. For XML, the namespaces are properties of the XML element, which can be nested (see this description).
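To illustrate the XML half of this, here is a minimal sketch using Python's standard xml.etree.ElementTree, in which a nested element redeclares the same prefix. The element names and both namespace URIs are made up for illustration and are not taken from the SPDX model.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the inner <collection> redeclares the "ex" prefix,
# so ex:item resolves to a different namespace depending on where it appears.
XML = """
<collection xmlns:ex="https://outer.example/ns#">
  <ex:item/>
  <collection xmlns:ex="https://inner.example/ns#">
    <ex:item/>
  </collection>
</collection>
"""

root = ET.fromstring(XML.strip())

# ElementTree reports prefixed tags in expanded "{namespace}local" form.
item_tags = [el.tag for el in root.iter() if el.tag != "collection"]
for tag in item_tags:
    print(tag)
```

The same prefix yields two different expanded names, which is the nesting behaviour the XML namespace mechanism provides natively.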

It looks like there is an implementation issue in that not many of the RDF libraries support nesting namespaces - you can only declare the namespace for the entire model. This may be solvable by serializing each ElementCollection independently and then somehow manually merging them. Not very convenient - but possible.

For serialization formats that do not natively support namespaces, we will need to define how these namespaces work. Some of the data modeling @davaya worked on may be helpful here.

maxhbr commented 1 year ago

It looks like there is an implementation issue in that not many of the RDF libraries support nesting namespaces [...]

So that would reduce the already small set of languages with libraries even further? Is only Java left then? We should avoid using such complex and niche features.

armintaenzertng commented 1 year ago

Conceptually, you can define local contexts for each ElementCollection individually. Thus, you can encode the namespaceMap locally for each Collection in the local context. See here for a general example of how this local context would work.

HOWEVER, we would lose the context and thereby the NamespaceMap during parsing, so we would probably still need to retain the property namespaceMap, resulting in duplicated information in the serialized file.

Also, as Gary and Max have pointed out, implementing a local context might be non-trivial. A quick search for how to do this with rdflib for Python has turned up nothing so far.

armintaenzertng commented 1 year ago

Are there plans to utilize the NamespaceMap for properties like suppliedBy or to (from Relationships), which reference other Elements via their spdxId? If yes and we don't support nesting, this would mean that the local context has to be repeated on every single Element that uses it. This might lead to clashing when the Element belongs to more than one Collection and you have to merge NamespaceMaps. I feel this will spiral out of control and lead into implementation hell.

goneall commented 1 year ago

Just FYI - I'm trying to implement the namespace mapping in the Java library and running into several implementation issues. We should think about primarily supporting the native namespace features of the serializations. We would lose the ability to round-trip between formats while preserving the original namespace definitions.

One other thought is to have the namespaceMap be informative and the native serialization mapping be the primary means of capturing namespace mapping. The serialization-defined namespaces would be what is used to deserialize the document, and the namespaceMap could be used as a hint when serializing, but it would only apply to elements within the Collection.
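As a rough sketch of what "a hint when serializing" could look like: a serializer might try to compact each IRI with the prefixes from the namespaceMap and fall back to the full IRI when none applies. The function name and the longest-match rule are assumptions made for illustration, not something the spec or the Java library defines.

```python
def compact(iri: str, namespace_map: dict[str, str]) -> str:
    """Shorten an IRI using the longest matching namespace, if any.

    namespace_map maps prefix -> namespace IRI. Hypothetical helper.
    """
    best = None
    for prefix, namespace in namespace_map.items():
        if iri.startswith(namespace):
            if best is None or len(namespace) > len(best[1]):
                best = (prefix, namespace)
    if best is None:
        # No hint applies: emit the IRI unchanged.
        return iri
    prefix, namespace = best
    return f"{prefix}:{iri[len(namespace):]}"


hint = {"ex": "https://local.namespace#"}
print(compact("https://local.namespace#File1", hint))
print(compact("https://other.example/File2", hint))
```

Because the map is only a hint here, an IRI outside every listed namespace is simply written out in full.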

maxhbr commented 1 year ago

[...] but it would only apply to elements within the Collection.

So no shortening for the IDs in the to list of a Relationship?

sbarnum commented 1 year ago

Here are a few thoughts on the discussion above.

With respect to the "nesting" question:

The scope of any namespace map is local to the Collection on which it is specified. This means that if Collection1 had three elements, then the prefixes and namespaces specified within the namespace map in Collection1 would apply to Collection1 itself and the three elements (and would assert no impact on ANY content outside of Collection1 and its contained/referenced elements). If one of the three elements was itself another Collection (Collection2) that itself specified a namespace map, then the prefixes and namespaces specified within the namespace map in Collection1 would apply to Collection1 itself and the two contained/referenced elements other than Collection2, AND the prefixes and namespaces specified within the namespace map in Collection2 would apply to Collection2 itself and any of its contained/referenced elements. The implementation mechanism in json-ld for this sort of compaction is to specify the prefixes within a context. While a json-ld context can be specified for an entire graph of serialized content, one can also be specified for any individual object, which would be the approach taken for namespace maps. This avoids the perceived "nesting" issue.
What we are discussing here is json-ld Expansion. See numbered list item 3 in the list in section 5.1.2 which explicitly states the local scoping of contexts as I have described above.
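The scoping rule described above can be sketched in a few lines of Python. The class and field names are hypothetical stand-ins rather than the SPDX model classes, and the sketch assumes a nested Collection uses only its own namespace map and inherits nothing from its parent.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    spdx_id: str  # may be written with a prefix, e.g. "ex:Package1"


@dataclass
class Collection(Element):
    namespace_map: dict[str, str] = field(default_factory=dict)
    elements: list[Element] = field(default_factory=list)


def expand(value: str, namespace_map: dict[str, str]) -> str:
    """Replace a known prefix with its namespace; leave anything else alone."""
    prefix, sep, local = value.partition(":")
    if sep and prefix in namespace_map:
        return namespace_map[prefix] + local
    return value


def expand_ids(collection: Collection) -> list[str]:
    """Expand IDs so that each Collection's map covers itself and its elements."""
    ns = collection.namespace_map
    result = [expand(collection.spdx_id, ns)]
    for element in collection.elements:
        if isinstance(element, Collection):
            # Assumption: the nested Collection is governed by its own map only.
            result.extend(expand_ids(element))
        else:
            result.append(expand(element.spdx_id, ns))
    return result


collection2 = Collection("ex:Collection2", {"ex": "https://inner.example/"},
                         [Element("ex:File1")])
collection1 = Collection("ex:Collection1", {"ex": "https://outer.example/"},
                         [Element("ex:Package1"), Element("ex:Package2"), collection2])
print(expand_ids(collection1))
```

In this sketch the prefix ex means one thing for Collection1 and its two plain elements, and another for Collection2 and its element.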

With respect to mentions above that seem to imply that the namespace content is only a model structure and would not be part of any serialized content:

As part of the modeled content for Collections, the namespace map properties and content should be explicitly expressed as part of any instance content. This means that any given serialization (e.g., json-ld) should contain both the explicit expression of the namespace map AND the implementation of the specified namespace map content as appropriate for the given serialization. The instance content should not contain ONLY the implementation of the namespace map content and leave the map itself out as the inclusion of the map itself is highly useful for conversion, deserialization/reserialization between serializations.

With respect to what content that prefixing would apply to:

What we are talking about here is basically just string macro compaction/expansion. Some serialization formats provide the capability natively while others may not.
Prefixing specified in a namespace map should be able to be applied to property keys (e.g., keys in json) or IRIs (both object IDs as well as formal IRIs for model classes, properties, etc.). It should not matter which properties in which classes, etc.
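A naive sketch of that "string macro" view, applied to both property keys and values of a JSON-like structure. The property names and the https://example.org/model/ namespace are placeholders, and the sketch treats every string value as a candidate IRI, which a real implementation would not do for free-text fields.

```python
def expand_term(term: str, namespace_map: dict[str, str]) -> str:
    """Expand "prefix:local" when the prefix is in the map."""
    prefix, sep, local = term.partition(":")
    # Skip things that already look like absolute IRIs ("scheme://...").
    if sep and prefix in namespace_map and not local.startswith("//"):
        return namespace_map[prefix] + local
    return term


def expand_all(node, namespace_map: dict[str, str]):
    """Recursively expand prefixes in keys and string values."""
    if isinstance(node, dict):
        return {expand_term(key, namespace_map): expand_all(value, namespace_map)
                for key, value in node.items()}
    if isinstance(node, list):
        return [expand_all(item, namespace_map) for item in node]
    if isinstance(node, str):
        return expand_term(node, namespace_map)
    return node


namespace_map = {
    "m": "https://example.org/model/",   # placeholder for model terms
    "ex": "https://local.namespace#",    # placeholder for element IDs
}
compact_doc = {"m:spdxId": "ex:Package1", "m:suppliedBy": ["ex:Agent1"]}
expanded_doc = expand_all(compact_doc, namespace_map)
print(expanded_doc)
```

The fully expanded result is also what a serialization without any prefix mechanism would have to emit.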

With regard to questions around what to do in serialization formats that do not provide any sort of compaction/expansion prefixing:

If a particular serialization precluded the use of namespace prefixes, that serialization should likely simply ignore the namespace map and provide full un-prefixed strings.

armintaenzertng commented 1 year ago

Thanks for the clarification, @sbarnum! I still see two problems:

In your comment you write "contained/referenced" as if this doesn't matter for the context expansion. But it does, as the following examples show: Here is one where the element https://local.namespace#Package1 is contained in the collection and thus subject to the local context. Have a look at the N-Quads to see that in the expanded result the ex:File1 has become https://local.namespace#File1. Here is the same example but this time the element https://local.namespace#Package1 is only referenced. The expanded result does not expand the ex:File1.
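The difference can be mimicked with a deliberately simplified model of context scoping. This is not a JSON-LD processor: it treats every string as an IRI candidate, and the element and contains property names are invented for the example. It only shows that a context attached to a node reaches nested nodes and not sibling nodes that are merely referenced by ID.

```python
def expand_node(node, active=None):
    """Expand prefixes using the contexts in scope at each node."""
    active = dict(active or {})  # copy, so sibling nodes never share scope
    if isinstance(node, dict):
        active.update(node.get("@context", {}))
        return {key: expand_node(value, active)
                for key, value in node.items() if key != "@context"}
    if isinstance(node, list):
        return [expand_node(item, active) for item in node]
    if isinstance(node, str):
        prefix, sep, local = node.partition(":")
        if sep and prefix in active:
            return active[prefix] + local
    return node


context = {"ex": "https://local.namespace#"}

# Case 1: Package1 is contained (inlined) in the collection.
contained = {
    "@context": context,
    "@id": "ex:Collection1",
    "element": [{"@id": "ex:Package1", "contains": "ex:File1"}],
}

# Case 2: Package1 is a separate top-level node, only referenced by ID.
referenced = [
    {"@context": context, "@id": "ex:Collection1", "element": ["ex:Package1"]},
    {"@id": "https://local.namespace#Package1", "contains": "ex:File1"},
]

contained_out = expand_node(contained)
referenced_out = expand_node(referenced)
print(contained_out["element"][0]["contains"])
print(referenced_out[1]["contains"])
```

In the second case the reference inside the collection is expanded, but ex:File1 on the separate Package1 node is left as-is because no context is in scope there.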

With the decision that we do not allow inlining/containing of Elements in SPDX (which would be the first example above), we are left with the case in the second example. This means that in order to utilize the context/NamespaceMap of the Collection in referenced Elements, we have to include the context in every single Element. For large collections with long namespaceMaps this would result in awesome amounts of duplicated content in serialized files, something we were trying to avoid already with CreationInfo...

Even if we go with the context duplication described above, there is still the "degenerate case", as you called it, of an Element being part of multiple collections with conflicting NamespaceMaps.
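To make the conflict concrete, here is a small sketch of a check a serializer could run before merging the maps of all Collections an Element belongs to. The function name is hypothetical, and the two maps are invented examples that reuse one prefix for different namespaces.

```python
def find_prefix_conflicts(*namespace_maps: dict[str, str]) -> dict[str, set[str]]:
    """Return every prefix that is bound to more than one namespace."""
    seen: dict[str, str] = {}
    conflicts: dict[str, set[str]] = {}
    for namespace_map in namespace_maps:
        for prefix, namespace in namespace_map.items():
            if prefix in seen and seen[prefix] != namespace:
                # Record the first binding together with each differing one.
                conflicts.setdefault(prefix, {seen[prefix]}).add(namespace)
            seen.setdefault(prefix, namespace)
    return conflicts


collection_a_map = {"sp": "http://foo.bar", "lic": "http://licenses.example/"}
collection_b_map = {"sp": "http://alpha.baz"}
conflicts = find_prefix_conflicts(collection_a_map, collection_b_map)
print(conflicts)
```

Any non-empty result means the Element cannot be serialized with a single merged map unless some prefix is renamed or the affected IDs are written out in full.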

davaya commented 1 year ago

As I said in a comment on #306, I believe namespaceMap should be removed from the model entirely and, as Gary suggests, made an optional property of SpdxDocument only. Even that is optional - SpdxDocument can have any attribute, like phaseOfTheMoon, that is not necessary for serialization.

namespaceMap appears exclusively in serialized data because its sole purpose is to optimize serialization. Once the optimization is removed (like unzipping a file), the internal opaque details of how the optimization was performed disappear.

@maxhbr had a use case of lawyers looking at a tag-value file and wanting to re-use namespaceMap across files. But that doesn't seem beneficial - the lawyers are looking at the prefixes used in the file in front of them with {"sp": "http://foo.bar"}, but if they are looking at a different payload it will have a different namespaceMap: {"sp": "http://alpha.baz"}. It will be up to the creators of the two files to agree on a common string "sp" for their different spdxId prefixes.

There isn't a "degenerate case" if namespaceMap is expunged from the logical model - an Element can be in multiple collections because collections don't have it.

goneall commented 1 year ago

This has been resolved with PR #411

spdx / spdx-3-model

Mapping the namespaceMap to serialization specific namespaces #390