spdx / spdx-3-model

The model for the information captured in SPDX version 3 standard.
https://spdx.dev/use/specifications/

Serialization needs property and enumeration ids #51

Open davaya opened 1 year ago

davaya commented 1 year ago

While the logical model is intended to be independent of serialization, serialized data needs to be defined in a way that can be hashed consistently. For example, class files define properties:

SPDX-License-Identifier: Community-Spec-1.0

# Annotation

## Summary

An assertion made in relation to one or more elements.

## Description

An Annotation is an assertion made in relation to one or more elements.

## Metadata

- name: Annotation
- SubclassOf: Element
- Instantiability: Concrete

## Properties

- annotationType
  - type: AnnotationTypeVocab
  - minCount: 1
  - maxCount: 1
- contentType
  - type: MediaType
- statement
  - type: xsd:string
  - minCount: 0
  - maxCount: 1
- subject
  - type: Element
  - minCount: 1
  - maxCount: 1

The properties logically have no ordering, but when serializing the order matters:

{ "subject": "http://acme.com/sboms/1948294/package59", "annotationType": "REVIEW", "statement": "Awesome!" }

is a different serialized value than:

{"statement": "Awesome!", "annotationType": "REVIEW", "subject": "http://acme.com/sboms/1948294/package59" }

and a different value than:

[ "http://acme.com/sboms/1948294/package59", "REVIEW", "Awesome!" ]

even though all are equivalent JSON serializations of the identical annotation.
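To make the hashing concern concrete, here is a minimal sketch that hashes the three serializations above. SHA-256 and the helper name `sha256_hex` are assumptions chosen for illustration; the thread does not name a hash algorithm.

```python
import hashlib
import json

# The three serializations from the examples above, copied verbatim.
a = '{ "subject": "http://acme.com/sboms/1948294/package59", "annotationType": "REVIEW", "statement": "Awesome!" }'
b = '{"statement": "Awesome!", "annotationType": "REVIEW", "subject": "http://acme.com/sboms/1948294/package59" }'
c = '[ "http://acme.com/sboms/1948294/package59", "REVIEW", "Awesome!" ]'


def sha256_hex(text: str) -> str:
    """Hash the exact bytes of a serialization (assumed algorithm: SHA-256)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Each distinct byte sequence yields its own digest.
digests = {sha256_hex(s) for s in (a, b, c)}

# Parsed as JSON objects, the first two carry identical data.
same_data = json.loads(a) == json.loads(b)

print(len(digests), same_data)
```

The hash sees only bytes, so the two objects that parse to the same data still produce different digests.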

PROPOSAL: Add an id field to all Property definitions to enable the model files to be the single source of truth for both the logical model and the information/serialization model:

## Properties

- subject
  - type: Element
  - minCount: 1
  - maxCount: 1
  - id: 1
  - link: true
- annotationType
  - type: AnnotationTypeVocab
  - minCount: 1
  - maxCount: 1
  - id: 2
- statement
  - type: xsd:string
  - minCount: 0
  - maxCount: 1
  - id: 3
- contentType
  - type: MediaType
  - id: 4

The id field serves several purposes in a serialization model:
1) as a column number/position when serializing as table rows
2) as a compressed property name when serializing properties and enumerated values in concise data formats
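A rough sketch of both uses, assuming the id values from the proposed Properties list. The names `PROPERTY_IDS`, `to_row`, and `to_compact` are hypothetical and not part of any SPDX tooling.

```python
# Hypothetical lookup table built from the proposed "id" fields.
PROPERTY_IDS = {"subject": 1, "annotationType": 2, "statement": 3, "contentType": 4}

annotation = {
    "statement": "Awesome!",
    "annotationType": "REVIEW",
    "subject": "http://acme.com/sboms/1948294/package59",
}


def to_row(obj: dict) -> list:
    """Use 1: put each value in the column given by its (1-based) id."""
    row = [None] * len(PROPERTY_IDS)
    for name, value in obj.items():
        row[PROPERTY_IDS[name] - 1] = value
    return row


def to_compact(obj: dict) -> dict:
    """Use 2: replace property names with integer ids, ordered by id."""
    names_by_id = sorted(obj, key=PROPERTY_IDS.get)
    return {PROPERTY_IDS[name]: obj[name] for name in names_by_id}


row = to_row(annotation)
compact = to_compact(annotation)
print(row)
print(compact)
```

Absent optional properties (here `contentType`) become empty columns in the row form and are simply omitted in the compact form.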

zvr commented 1 year ago

As you correctly point out, @davaya, this is about serialization(s), not the model.

Some (most?) serializations would consider the different orderings to be the same "value", since the underlying data are the same. For example, in the tag-value serialization,

AnnotationType: REVIEW
AnnotationStatement: <text>Awesome!</text>

and

AnnotationStatement: <text>Awesome!</text>
AnnotationType: REVIEW

express the same data.

The canonical serialization has the additional goal of producing a unique byte sequence. It specifies, for example, that when representing an object with properties, all properties appear in sorted order.
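As an illustration of the sorting rule only, the following sketch uses Python's standard `json` module. The function name `canonical_json` is hypothetical, and this is not the SPDX canonical serialization itself; a complete scheme would also have to pin down details such as number formatting and string escaping.

```python
import json


def canonical_json(obj) -> str:
    """Sort object properties by name and drop optional whitespace."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


first = {
    "subject": "http://acme.com/sboms/1948294/package59",
    "annotationType": "REVIEW",
    "statement": "Awesome!",
}
second = {
    "statement": "Awesome!",
    "annotationType": "REVIEW",
    "subject": "http://acme.com/sboms/1948294/package59",
}

# Both property orderings collapse to one byte sequence.
canonical_first = canonical_json(first)
canonical_second = canonical_json(second)
print(canonical_first)
```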

The canonical serialization of the example you provided would be:

{"annotationType":"REVIEW","statement":"Awesome!","subject":"http://acme.com/sboms/1948294/package59"}

davaya commented 1 year ago

There is a difference between ordered and unordered lists. For unordered lists, a normalized serialization can be defined as sorted by property name, but order must be preserved when it matters.
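A small sketch of that distinction, assuming Python's `json` module as the normalizer. The `steps` property is made up for illustration and is not an SPDX property.

```python
import json


def normalize(obj) -> str:
    """Sort property names; leave the element order of lists untouched."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":"))


# Unordered: the same properties written in a different order.
props_a = {"annotationType": "REVIEW", "statement": "Awesome!"}
props_b = {"statement": "Awesome!", "annotationType": "REVIEW"}

# Ordered: a hypothetical list whose element order carries meaning.
steps_a = {"steps": ["build", "test"]}
steps_b = {"steps": ["test", "build"]}

unordered_match = normalize(props_a) == normalize(props_b)
ordered_match = normalize(steps_a) == normalize(steps_b)
print(unordered_match, ordered_match)
```

Sorting by property name removes differences that carry no meaning, while the two lists stay distinct because their element order is preserved.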

In addition, for serializations like spreadsheets and SQL tables, a non-alphabetical ordering makes sense for presentation and comprehension even if column position does not affect the meaning. Log files, for example, put the timestamp first even though it isn't alphabetically first.

The id field is ignored in all situations where it doesn't matter, and is essential to support concise serializations (CBOR, Protobuf, Avro, etc.) which might be used in performance- or bandwidth-sensitive applications. Is there any disadvantage to adding it to the model files?

EDIT: Decided at the tech meeting:
1) a serialization/information model will be included in the spdx-3-model repo
2) individual SPDX-defined serializations may use parts of the serialization model but are not required to do so.

Its location (integrated into the logical model markdown files or defined in a separate folder) will be determined later when we start work on serialization.

goneall commented 7 months ago

Moving to 3.1 to be considered as part of any canonicalization / simple JSON approach