Open maxhbr opened 2 years ago
For license expressions at least, parsing according to a grammar with defined binding priority, operator commutativity, distributive property, paren insertion/deletion policy, etc., would always parse equivalent expressions into the same AST. Serializing would then always pick a single canonical serialization of that AST. For convenience, we could adopt as the canonical serialized string whatever the most popular correct tool does (sort operands of commutative operators lexically, always use undistributed form, never use (or always use) redundant parens, delete all whitespace around non-whitespace-delimited tokens, etc.).
The canonical value is hashed or signed, and inputs are canonicalized before comparing hashes or validating signature. There should never be a reason to preserve an original string. If a reason is discovered, that is proof that the canonicalization algorithm is underspecified.
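To make the "canonicalize before comparing hashes" idea concrete, here is a minimal Python sketch. The function names (normalize_whitespace, digest) are made up for illustration, and the canonicalizer implements only the whitespace rule mentioned above, so it is an assumption about what one canonicalization step could look like, not a complete algorithm.

```python
import hashlib
import re


def normalize_whitespace(expr: str) -> str:
    """Hypothetical canonicalizer covering only the whitespace rule."""
    # Collapse every run of whitespace into a single space.
    expr = re.sub(r"\s+", " ", expr.strip())
    # Parentheses delimit tokens by themselves, so drop whitespace around them.
    return re.sub(r"\s*([()])\s*", r"\1", expr)


def digest(value: str, canonicalize=normalize_whitespace) -> str:
    """Hash the canonical form, never the string as it was received."""
    return hashlib.sha256(canonicalize(value).encode("utf-8")).hexdigest()


# Inputs that differ only in whitespace now produce the same hash.
print(digest("( MIT )") == digest("(MIT)"))
# Redundant parens are not handled by this rule, so these still differ.
print(digest("(MIT)") == digest("MIT"))
```

The second comparison fails on purpose: every equivalence the canonicalizer does not cover shows up as a hash mismatch, which is what "underspecified" would look like in practice.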
For Relationship types, SPDX should unambiguously specify which pairs are inverses (have identical semantics if the type and operand order are both flipped). But because the Relationship Element is asymmetric (from 1 to many), deleting one type from each inverse pair could cause the number of Relationship instances to explode. If inverse pairs remain but canonicalizing uses only one of each pair, the canonicalized data could similarly explode from one Relationship to thousands.
There are some serializations of values that should parse to an equal object and thus should be serialized to one canonical representation. This is basically just a place to dump all the potential edge cases that came to my mind in discussions.
My assumption is that all (or at least most) of the examples listed here are automatically resolved at least by some libraries. So canonicalization would prevent these libraries from being used or would require manual "keeping track how it was on the input side".
License Expressions
MIT
The following license expressions are all equal, and any tool might silently replace them with MIT:
MIT
MIT AND MIT
MIT OR MIT
MIT AND NONE
MIT OR NONE
(MIT)
( MIT )
((MIT))
(MIT AND MIT)
(MIT OR MIT)
MIT AND (MIT)
MIT OR (MIT)
MIT AND BSD-3-Clause
MIT AND BSD-3-Clause is equal to BSD-3-Clause AND MIT
(MIT OR BSD-3-Clause) AND BSD-2-Clause
(MIT OR BSD-3-Clause) AND BSD-2-Clause is equal to MIT AND BSD-2-Clause OR BSD-2-Clause AND BSD-3-Clause
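The license expression equivalences above can be checked mechanically. The following Python sketch is one possible approach, not what any existing SPDX library does: it parses an expression with AND binding tighter than OR, expands it into a distributed (OR of ANDs) form, removes duplicates and absorbed terms, and sorts everything. All function names are hypothetical. It treats AND/OR as plain boolean operators, compares identifiers case-sensitively, and gives no special meaning to NONE, WITH or "+"; each of these is a simplifying assumption.

```python
import re
from itertools import product


def tokenize(expr):
    # Tokens are parentheses or runs of characters without spaces/parens.
    return re.findall(r"\(|\)|[^\s()]+", expr)


def parse(tokens):
    """Return the expression as a set of frozensets: an OR of AND-groups."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def atom():
        if peek() is None or peek() in (")", "AND", "OR"):
            raise ValueError("expected a license id or '('")
        tok = take()
        if tok == "(":
            inner = or_expr()
            if peek() != ")":
                raise ValueError("missing ')'")
            take()
            return inner
        return {frozenset([tok])}

    def and_expr():
        result = atom()
        while peek() == "AND":
            take()
            right = atom()
            # Distribute AND over OR: combine every pair of AND-groups.
            result = {a | b for a, b in product(result, right)}
        return result

    def or_expr():
        result = and_expr()
        while peek() == "OR":
            take()
            result = result | and_expr()
        return result

    parsed = or_expr()
    if pos != len(tokens):
        raise ValueError("unexpected token: " + tokens[pos])
    return parsed


def canonicalize(expr):
    groups = parse(tokenize(expr))
    # Absorption: drop any AND-group that strictly contains another one.
    groups = {g for g in groups if not any(o < g for o in groups)}
    terms = sorted(" AND ".join(sorted(g)) for g in groups)
    # AND binds tighter than OR, so the output needs no parentheses.
    return " OR ".join(terms)


print(canonicalize("MIT AND (MIT)"))
print(canonicalize("(MIT OR BSD-3-Clause) AND BSD-2-Clause"))
```

Note that this sketch picks the distributed form as canonical because it is simple to compute; choosing the undistributed form instead would require factoring and is not attempted here.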
PURLs
examples from the purl-spec
sorting of qualifiers
pkg:rpm/fedora/curl@7.50.3-1.fc25?arch=i386&distro=fedora-25 is equal to pkg:rpm/fedora/curl@7.50.3-1.fc25?distro=fedora-25&arch=i386
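As an illustration of the qualifier ordering case, a small Python sketch could sort the qualifiers by key using only string operations. The function name sort_purl_qualifiers is made up, and the sketch deliberately ignores the other purl normalization rules (case folding, percent-encoding, and so on), so it is not a full purl canonicalizer.

```python
def sort_purl_qualifiers(purl: str) -> str:
    """Reorder the qualifiers of a purl by key; leave everything else as-is."""
    base, question_mark, rest = purl.partition("?")
    if not question_mark:
        return purl  # no qualifiers present
    # A subpath, if any, follows the qualifiers after '#'.
    qualifiers, hash_mark, subpath = rest.partition("#")
    pairs = [pair for pair in qualifiers.split("&") if pair]
    pairs.sort(key=lambda pair: pair.partition("=")[0])
    return base + "?" + "&".join(pairs) + hash_mark + subpath


print(sort_purl_qualifiers(
    "pkg:rpm/fedora/curl@7.50.3-1.fc25?distro=fedora-25&arch=i386"
))
```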
CPEs
cpe:2.3:a:apache:commons_io:2.8.0:*:*:*:*:*:*:* is equal to cpe:2.3:a:apache:commons_io:2.8.0:* (AFAIK)
Relations
Some relation types have an inverse
DESCRIBES is the inverse of DESCRIBED_BY
A DESCRIBES B is equal to B DESCRIBED_BY A
Lists in relations?
A DESCRIBES [B,C] is equal to A DESCRIBES [B], A DESCRIBES [C]
Some relation types are their own inverse
COPY_OF is the inverse of itself: A COPY_OF B is equal to B COPY_OF A
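The relation cases above (inverse pairs, lists, self-inverse types) could be normalized with something like the following Python sketch. The tables INVERSE_OF and SYMMETRIC only contain the examples from this issue and are assumptions, not an authoritative list from the SPDX spec; the function name canonical_triples is hypothetical as well.

```python
# Assumed for illustration: maps the non-canonical type of an inverse pair
# to the type chosen as canonical.
INVERSE_OF = {"DESCRIBED_BY": "DESCRIBES"}
# Assumed for illustration: types that are their own inverse.
SYMMETRIC = {"COPY_OF"}


def canonical_triples(source, rel_type, targets):
    """Expand one from-1-to-many relationship into canonical triples."""
    triples = set()
    for target in targets:
        a, t, b = source, rel_type, target
        if t in INVERSE_OF:
            # Flip both the type and the operand order.
            a, t, b = b, INVERSE_OF[t], a
        elif t in SYMMETRIC and b < a:
            # Operand order carries no meaning, so sort the operands.
            a, b = b, a
        triples.add((a, t, b))
    return triples


print(canonical_triples("A", "DESCRIBES", ["B", "C"]))
print(canonical_triples("B", "DESCRIBED_BY", ["A"]))
print(canonical_triples("B", "COPY_OF", ["A"]))
```

One relationship with n targets becomes n triples here, which is the growth in Relationship instances that a canonical form built this way would have to accept.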
Some relations are not fully specified
Does OTHER have an inverse? Is it its own inverse?
Optional percent-escaping in URLs and IRIs
tbd.
Strings and escaping
optionally escaped unicode symbols
tbd.
Different ways of escaping unicode
tbd.
File Paths
/test/../test/file is equal to /test/file
test/../test/file is equal to test/file
.././test/../test/file is equal to ../test/file
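For the path examples above, Python's standard library already performs this kind of purely lexical normalization via posixpath.normpath. The snippet below just runs the three examples through it; whether a lexical rule is the right canonicalization for SPDX is an open question, since collapsing ".." can change the meaning of a path when symlinks are involved.

```python
import posixpath

examples = [
    "/test/../test/file",
    "test/../test/file",
    ".././test/../test/file",
]

for path in examples:
    # normpath collapses "." and ".." segments without touching the file system.
    print(path, "->", posixpath.normpath(path))
```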
More esoteric...
Paths in non-case sensitive file systems
Email addresses and case sensitivity