spdx / spdx-3-model

The model for the information captured in SPDX version 3 standard.
https://spdx.dev/use/specifications/

Mapping the namespaceMap to serialization specific namespaces #390

Closed goneall closed 1 year ago

goneall commented 1 year ago

This came up in the serialization call on 15 June 2023.

The ElementCollection class has a namespaceMap which is used to expand/compact URIs used as identifiers within the ElementCollection.
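As a minimal sketch of what expanding and compacting means here, assuming the namespaceMap can be treated as a plain prefix-to-namespace dictionary (the helper names `expand` and `compact` and the `ex` prefix are illustrative, not part of the SPDX model):

```python
def expand(curie: str, namespace_map: dict[str, str]) -> str:
    """Expand 'prefix:local' to a full URI; leave anything else untouched."""
    prefix, sep, local = curie.partition(":")
    if sep and prefix in namespace_map:
        return namespace_map[prefix] + local
    return curie


def compact(uri: str, namespace_map: dict[str, str]) -> str:
    """Compact a full URI using the longest matching namespace, if any."""
    best = None
    for prefix, namespace in namespace_map.items():
        if uri.startswith(namespace) and (
            best is None or len(namespace) > len(namespace_map[best])
        ):
            best = prefix
    if best is None:
        return uri
    return f"{best}:{uri[len(namespace_map[best]):]}"


# Hypothetical map for one ElementCollection.
namespace_map = {"ex": "https://local.namespace#"}
full = expand("ex:File1", namespace_map)
short = compact(full, namespace_map)
```

An already expanded URI passes through `expand` unchanged, because its scheme (`https`) is not a declared prefix.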

Different serialization formats have different "native" mechanisms for expanding and compacting the URIs. Since an ElementCollection can contain other ElementCollections, all the serializations need to support nesting of the namespace mappings. We will need to document how this works for each of the serialization formats we decide to support:

goneall commented 1 year ago

Some additional context.

On the call, I incorrectly stated that the serialization formats didn't support nesting. On further research, it is possible to nest these. In JSON-LD, you can attach a context to a specific property as described here. For XML, the namespaces are properties of the XML element which can be nested (see this description).
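The nesting behaviour can be modelled independently of any one format. The sketch below uses `collections.ChainMap` to mimic roughly how XML namespace scoping behaves: an inner declaration shadows the outer one, and undeclared prefixes are inherited. The prefixes and the `outer.example` URLs are made up for illustration.

```python
from collections import ChainMap

# Outer scope: mappings declared on the enclosing collection (hypothetical values).
document_scope = ChainMap({
    "ex": "https://outer.example/ns#",
    "doc": "https://outer.example/doc#",
})

# Inner scope: a nested collection redeclares "ex" and inherits "doc".
inner_scope = document_scope.new_child({"ex": "https://local.namespace#"})


def expand(curie, scope):
    """Expand 'prefix:local' against the innermost scope that declares the prefix."""
    prefix, sep, local = curie.partition(":")
    if sep and prefix in scope:
        return scope[prefix] + local
    return curie


outer_file = expand("ex:File1", document_scope)
inner_file = expand("ex:File1", inner_scope)
inherited = expand("doc:Sbom1", inner_scope)
```

The same compact identifier `ex:File1` resolves to two different URIs depending on the scope it is read in, which is why each serialization has to define its nesting rules explicitly.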

It looks like there is an implementation issue in that not many of the RDF libraries support nesting namespaces - you can only declare the namespace for the entire model. This may be solvable by serializing each ElementCollection independently and then somehow manually merging them. Not very convenient - but possible.

For serialization formats that do not natively support namespaces, we will need to define how these namespaces work. Some of the data modeling @davaya worked on may be helpful here.

maxhbr commented 1 year ago

It looks like there is an implementation issue in that not many of the RDF libraries support nesting namespaces [...]

So that would reduce the already small set of languages with libraries even further? Is only Java left then? We should avoid using such complex and niche features.

armintaenzertng commented 1 year ago

Conceptually, you can define local contexts for each ElementCollection individually. Thus, you can encode the namespaceMap locally for each Collection in the local context. See here for a general example of how this local context would work.

HOWEVER, we would lose the context and thereby the NamespaceMap during parsing, so we would probably still need to retain the property namespaceMap, resulting in duplicated information in the serialized file.
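To make the duplication concrete, here is a rough sketch of one collection carrying both a local JSON-LD context and a namespaceMap property. It assumes a document-level context (not shown) that defines `ElementCollection`, `namespaceMap`, `prefix`, `namespace` and `element`; those key names and the `ex` prefix are assumptions for illustration, not the agreed serialization.

```python
import json

# Hypothetical prefix mapping for a single collection.
namespace_map = {"ex": "https://local.namespace#"}

collection = {
    # Local context: lets a JSON-LD processor expand "ex:..." inside this node.
    "@context": dict(namespace_map),
    "@id": "ex:Collection1",
    "@type": "ElementCollection",
    # The same information repeated as data so that it survives parsing.
    "namespaceMap": [
        {"prefix": prefix, "namespace": namespace}
        for prefix, namespace in namespace_map.items()
    ],
    "element": ["ex:Package1", "ex:File1"],
}

serialized = json.dumps(collection, indent=2)
# Every namespace URI now appears once in the context and once in the property.
occurrences = serialized.count("https://local.namespace#")
```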

Also, as Gary and Max have pointed out, implementing a local context might be non-trivial. A quick search for local-context support in rdflib for Python has turned up nothing so far.

armintaenzertng commented 1 year ago

Are there plans to utilize the NamespaceMap for properties like suppliedBy or to (from Relationships), which reference other Elements via their spdxId? If yes and we don't support nesting, this would mean that the local context has to be repeated on every single Element that uses it. This might lead to clashing when the Element belongs to more than one Collection and you have to merge NamespaceMaps. I feel this will spiral out of control and lead into implementation hell.

goneall commented 1 year ago

Just FYI - I'm trying to implement the namespace mapping in the Java library and running into several implementation issues. We should think about primarily supporting the native namespace features of the serializations. We would lose the ability to round-trip between formats preserving the original namespace definitions.

One other thought is to have the namespaceMap be informative and the native serialization mapping be the primary means of capturing namespace mapping. The serialization-defined namespaces would be what is used to deserialize the document, and the namespaceMap could be used as a hint when serializing, but it would only apply to elements within the Collection.

maxhbr commented 1 year ago

[...] but it would only apply to elements within the Collection.

So no shortening for the IDs in the to list of a Relationship?

sbarnum commented 1 year ago

Here are a few thoughts on the discussion above.

With respect to the "nesting" question:

With respect to mentions above that seem to imply that the namespace content is only a model structure and would not be part of any serialized content:

With respect to what content that prefixing would apply to:

With regard to questions around what to do in serialization formats that do not provide any sort of compaction/expansion prefixing:

armintaenzertng commented 1 year ago

Thanks for the clarification, @sbarnum! I still see two problems:

In your comment you write "contained/referenced" as if this doesn't matter for the context expansion. But it does, as the following examples show: Here is one where the element https://local.namespace#Package1 is contained in the collection and thus subject to the local context. Have a look at the N-Quads to see that in the expanded result the ex:File1 has become https://local.namespace#File1. Here is the same example but this time the element https://local.namespace#Package1 is only referenced. The expanded result does not expand the ex:File1.

With the decision that we do not allow inlining/containing of Elements in SPDX (which would be the first example above), we are left with the case in the second example. This means that in order to utilize the context/NamespaceMap of the Collection in referenced Elements, we have to include the context in every single Element. For large collections with long namespaceMaps this would result in awesome amounts of duplicated content in serialized files, something we were trying to avoid already with CreationInfo...

Even if we go with the context duplication described above, there is still the "degenerate case", as you called it, of an Element being part of multiple collections with conflicting NamespaceMaps.
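A small sketch of how such a conflict could be detected when the maps of several collections have to be merged for one Element. The function name `merge_namespace_maps`, the first-declaration-wins policy and the example URLs are assumptions made for illustration only.

```python
def merge_namespace_maps(*maps):
    """Merge prefix maps; the first declaration wins and clashes are reported."""
    merged = {}
    conflicts = {}
    for namespace_map in maps:
        for prefix, namespace in namespace_map.items():
            existing = merged.setdefault(prefix, namespace)
            if existing != namespace:
                # Same prefix bound to different namespaces in different collections.
                conflicts.setdefault(prefix, {existing}).add(namespace)
    return merged, conflicts


# Two hypothetical collections that both contain the same Element.
collection_a = {"ex": "https://local.namespace#", "lic": "https://a.example/licenses#"}
collection_b = {"ex": "https://other.namespace#"}

merged, conflicts = merge_namespace_maps(collection_a, collection_b)
```

Here `conflicts` reports that `ex` is bound to two namespaces, so any compact ID such as `ex:File1` on the shared Element would be ambiguous without an extra rule.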

davaya commented 1 year ago

As I wrote in a comment on #306, I believe namespaceMap should be removed from the model entirely and, as Gary suggests, made an optional property of SpdxDocument only. Even that is optional - SpdxDocument can have any attribute, like phaseOfTheMoon, that is not necessary for serialization.

namespaceMap appears exclusively in serialized data because its sole purpose is to optimize serialization. Once the optimization is removed (like unzipping a file), the internal opaque details of how the optimization was performed disappear.

@maxhbr had a use case of lawyers looking at a tag-value file and wanting to re-use namespaceMap across files. But that doesn't seem beneficial - the lawyers are looking at the prefixes used in the file in front of them with {"sp": "http://foo.bar"}, but if they are looking at a different payload it will have a different namespaceMap: {"sp": "http://alpha.baz"}. It will be up to the creators of the two files to agree on a common string "sp" for their different spdxId prefixes.

There isn't a "degenerate case" if namespaceMap is expunged from the logical model - an Element can be in multiple collections because collections don't have it.

goneall commented 1 year ago

This has been resolved with PR #411