spdx / canonical-serialisation

SPDX Canonicalisation repo
https://spdx.github.io/canonical-serialisation/

Some thoughts about challenges in normalization #6

Open maxhbr opened 2 years ago

maxhbr commented 2 years ago

Some values have several serializations that should parse to an equal object and thus should be serialized to one canonical representation. This is basically just a place to dump all the potential edge cases that came to my mind in discussions.

My assumption is that all (or at least most) of the examples listed here are automatically resolved by at least some libraries. So canonicalization would either prevent these libraries from being used or require manually "keeping track of how it was on the input side".

License Expressions

MIT

the following license expressions are all equal and any tool might silently replace them with MIT:

MIT AND BSD-3-Clause

(MIT OR BSD-3-Clause) AND BSD-2-Clause

PURLs

examples from the purl-spec

| purl | canonical purl |
| --- | --- |
| pkg:GOLANG/google.golang.org/genproto#/googleapis/api/annotations/ | pkg:golang/google.golang.org/genproto#googleapis/api/annotations |
| pkg:GOLANG/google.golang.org/genproto@abcdedf#/googleapis/api/annotations/ | pkg:golang/google.golang.org/genproto@abcdedf#googleapis/api/annotations |
| pkg:bitbucket/birKenfeld/pyGments-main@244fd47e07d1014f0aed9c | pkg:bitbucket/birkenfeld/pygments-main@244fd47e07d1014f0aed9c |
| pkg:github/Package-url/purl-Spec@244fd47e07d1004f0aed9c | pkg:github/package-url/purl-spec@244fd47e07d1004f0aed9c |
| pkg:gem/jruby-launcher@1.1.2?Platform=java | pkg:gem/jruby-launcher@1.1.2?platform=java |
| pkg:Maven/org.apache.xmlgraphics/batik-anim@1.9.1?classifier=sources&repositorY_url=repo.spring.io/release | pkg:maven/org.apache.xmlgraphics/batik-anim@1.9.1?classifier=sources&repository_url=repo.spring.io/release |
| pkg:Maven/org.apache.xmlgraphics/batik-anim@1.9.1?extension=pom&repositorY_url=repo.spring.io/release | pkg:maven/org.apache.xmlgraphics/batik-anim@1.9.1?extension=pom&repository_url=repo.spring.io/release |
| pkg:Maven/net.sf.jacob-project/jacob@1.14.3?classifier=x86&type=dll | pkg:maven/net.sf.jacob-project/jacob@1.14.3?classifier=x86&type=dll |
| pkg:Nuget/EnterpriseLibrary.Common@6.0.1304 | pkg:nuget/EnterpriseLibrary.Common@6.0.1304 |
| pkg:PYPI/Django_package@1.11.1.dev1 | pkg:pypi/django-package@1.11.1.dev1 |
| pkg:Rpm/fedora/curl@7.50.3-1.fc25?Arch=i386&Distro=fedora-25 | pkg:rpm/fedora/curl@7.50.3-1.fc25?arch=i386&distro=fedora-25 |
| pkg:/maven/org.apache.commons/io | pkg:maven/org.apache.commons/io |
| pkg://maven/org.apache.commons/io | pkg:maven/org.apache.commons/io |
| pkg:///maven/org.apache.commons/io | pkg:maven/org.apache.commons/io |

sorting of qualifiers
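
To make the purl cases concrete, here is a rough Python sketch of the kind of normalization a library might apply silently. It is not the purl-spec reference algorithm: the function name `normalize_purl` and the set `LOWERCASE_NAME_TYPES` are made up for illustration, the per-type rules are only inferred from the examples listed here, and percent-encoding is ignored entirely.

```python
# Hypothetical helper, not part of any purl library.
# Types whose namespace and name are lowercased (assumption inferred
# from the examples in this issue, not an exhaustive list).
LOWERCASE_NAME_TYPES = {"bitbucket", "github", "golang", "pypi"}


def normalize_purl(purl: str) -> str:
    scheme, sep, rest = purl.partition(":")
    if not sep or scheme.lower() != "pkg":
        raise ValueError(f"not a purl: {purl!r}")

    # "pkg:/", "pkg://" and "pkg:///" are all treated like "pkg:".
    rest = rest.lstrip("/")

    # Split off subpath, then qualifiers, then version (right to left).
    rest, _, subpath = rest.partition("#")
    rest, _, qualifiers = rest.partition("?")
    version = ""
    if "@" in rest:
        rest, _, version = rest.rpartition("@")

    # The type is always lowercased.
    ptype, _, path = rest.partition("/")
    ptype = ptype.lower()
    segments = [s for s in path.split("/") if s]
    if not ptype or not segments:
        raise ValueError(f"purl needs a type and a name: {purl!r}")

    if ptype in LOWERCASE_NAME_TYPES:
        segments = [s.lower() for s in segments]
    if ptype == "pypi":
        # PyPI treats "_" and "-" in names as equivalent.
        segments[-1] = segments[-1].replace("_", "-")

    # Qualifier keys are lowercased, empty values dropped, keys sorted.
    pairs = []
    for item in qualifiers.split("&"):
        key, _, value = item.partition("=")
        if key and value:
            pairs.append((key.lower(), value))
    pairs.sort()

    result = f"pkg:{ptype}/" + "/".join(segments)
    if version:
        result += "@" + version
    if pairs:
        result += "?" + "&".join(f"{k}={v}" for k, v in pairs)

    # Leading, trailing and doubled slashes in the subpath are dropped.
    sub = "/".join(s for s in subpath.split("/") if s)
    if sub:
        result += "#" + sub
    return result
```

With this sketch, `normalize_purl("pkg:maven/a/b@1?type=dll&classifier=x86")` yields the qualifiers in sorted order, so two documents that differ only in qualifier order end up with the same string.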

CPEs

Relations

Some relation types have an inverse

Lists in relations?

Some relation types are their own inverse

Some relations are not fully specified

Optional percent-escaping in URLs and IRIs

tbd.

Strings and escaping

optionally escaped unicode symbols

tbd.

Different ways of escaping unicode

tbd.

File Paths

More esoteric...

Paths in non-case sensitive file systems

Email addresses and case sensitivity

davaya commented 2 years ago

For license expressions at least, parsing according to a grammar with defined binding priority, operator commutativity, distributive property, paren insertion/deletion policy, etc., would always parse equivalent expressions into the same AST. Serializing would then always pick a single canonical serialization of that AST. For convenience we could pick whatever the most popular correct tool does (sort operands of commutative operators lexically, always use undistributed form, never use (or always use) redundant parens, delete all whitespace around non-whitespace-delimited tokens, etc.) as the canonical serialized string.
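
A minimal Python sketch of that idea, under stated assumptions: only `AND`, `OR`, and parentheses are handled (no `WITH`, no `+`, no case folding of license IDs), `AND` binds tighter than `OR`, operands of each operator are flattened, deduplicated, and sorted lexically, and parentheses are emitted only around an `OR` nested inside an `AND`. Distributive and absorption rewrites are deliberately left out. The names `parse`, `serialize`, and `canonical` are illustrative, not taken from any existing tool.

```python
import re

TOKEN = re.compile(r"\(|\)|[\w.+-]+")


def make(op, children):
    """Build a canonical node: flatten nested `op`, dedupe, sort."""
    flat = set()
    for child in children:
        if isinstance(child, tuple) and child[0] == op:
            flat.update(child[1])   # (A AND B) AND C -> A AND B AND C
        else:
            flat.add(child)         # the set drops duplicates: A AND A -> A
    ordered = tuple(sorted(flat, key=serialize))
    return ordered[0] if len(ordered) == 1 else (op, ordered)


def serialize(node, inside_and=False):
    """Write a node back out, parenthesizing only OR nested in AND."""
    if isinstance(node, str):
        return node
    op, children = node
    text = f" {op} ".join(serialize(c, op == "AND") for c in children)
    return f"({text})" if op == "OR" and inside_and else text


def parse(text):
    tokens = TOKEN.findall(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        pos += 1
        return tokens[pos - 1]

    def atom():
        tok = take()
        if tok == "(":
            node = level("OR")
            if take() != ")":
                raise ValueError("missing closing parenthesis")
            return node
        if tok in (")", "AND", "OR"):
            raise ValueError(f"unexpected token {tok!r}")
        return tok

    def level(op):
        # OR is the loosest level; its operands are AND-level expressions.
        operand = atom if op == "AND" else (lambda: level("AND"))
        children = [operand()]
        while peek() == op:
            take()
            children.append(operand())
        return make(op, children)

    tree = level("OR")
    if pos != len(tokens):
        raise ValueError("trailing tokens after expression")
    return tree


def canonical(text):
    return serialize(parse(text))
```

Under these rules `canonical("MIT AND MIT")` and `canonical("(MIT)")` both give `MIT`, and `canonical("(MIT OR BSD-3-Clause) AND BSD-2-Clause")` gives `BSD-2-Clause AND (BSD-3-Clause OR MIT)`.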

The canonical value is hashed or signed, and inputs are canonicalized before comparing hashes or validating a signature. There should never be a reason to preserve an original string. If a reason is discovered, that is proof that the canonicalization algorithm is underspecified.
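
As a small illustration of "canonicalize, then hash", the Python sketch below uses sorted-key, whitespace-free JSON as a stand-in canonical form. That stand-in is an assumption made for the example only; it is not the SPDX canonical serialisation and does not cover every case (number formatting, for instance). The function names `canonical_bytes` and `digest` are hypothetical.

```python
import hashlib
import json


def canonical_bytes(document: str) -> bytes:
    """Parse, then re-serialize with one fixed key order and no whitespace."""
    value = json.loads(document)
    text = json.dumps(value, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False)
    return text.encode("utf-8")


def digest(document: str) -> str:
    """Hash the canonical form, never the input string itself."""
    return hashlib.sha256(canonical_bytes(document)).hexdigest()


# Two inputs that differ in key order, whitespace and unicode escaping.
a = '{"name": "caf\\u00e9", "license": "MIT"}'
b = '{ "license":"MIT",   "name":"café" }'

assert a != b
assert digest(a) == digest(b)
```

Because only the canonical bytes are hashed, neither original string needs to be kept to verify the digest later.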

davaya commented 2 years ago

For Relationship types, SPDX should unambiguously specify which pairs are inverses (have identical semantics if the type and operand order are both flipped). But because the Relationship Element is asymmetric (from 1 to many), deleting one type from each inverse pair could cause the number of Relationship instances to explode. If inverse pairs remain but canonicalizing uses only one of each pair, the canonicalized data could similarly explode from one Relationship to thousands.
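
The explosion can be sketched in a few lines of Python. Everything here is an assumption for illustration: the `Relationship` shape (one source, many targets), the type names, and in particular the `INVERSE` table, since which pairs really are inverses is exactly what the spec would need to define.

```python
from dataclasses import dataclass

# Hypothetical inverse table: maps the type to be eliminated to the
# type that is kept in canonical form.
INVERSE = {
    "CONTAINED_BY": "CONTAINS",
    "DEPENDENCY_OF": "DEPENDS_ON",
}


@dataclass(frozen=True)
class Relationship:
    source: str      # the single "from" element
    rel_type: str
    targets: tuple   # the "to" elements (one to many)


def canonicalize(rel: Relationship) -> list:
    """Rewrite a relationship so that only the kept types remain."""
    if rel.rel_type not in INVERSE:
        return [Relationship(rel.source, rel.rel_type,
                             tuple(sorted(rel.targets)))]
    # Flipping a 1-to-many relationship gives many 1-to-1 relationships,
    # because each former target becomes a source of its own.
    kept = INVERSE[rel.rel_type]
    return [Relationship(target, kept, (rel.source,))
            for target in sorted(rel.targets)]


apps = tuple(f"app-{i}" for i in range(1000))
one = Relationship("libfoo", "DEPENDENCY_OF", apps)
many = canonicalize(one)
assert len(many) == 1000
```

A single `DEPENDENCY_OF` relationship with 1000 targets becomes 1000 `DEPENDS_ON` relationships, while a relationship that already uses the kept type stays a single instance.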

maxhbr commented 2 years ago

> For Relationship types, SPDX should unambiguously specify which pairs are inverses (have identical semantics if the type and operand order are both flipped).

--> https://github.com/spdx/spdx-spec/issues/744