iipc / warc-specifications

Centralised repository for WARC usage specifications.
http://iipc.github.io/warc-specifications/

Align digest-value grammar with base16/32/64 alphabets #48

Open wumpus opened 5 years ago

wumpus commented 5 years ago

WARC 1.0 and 1.1 specify

labelled-digest = algorithm ":" digest-value

and digest-value is a token. Neither "/" nor "=" is a valid character in a token, yet "/" is part of the usual base64 alphabet and "=" is commonly used for padding.
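To make the mismatch concrete, here is a minimal sketch that checks an encoded SHA-1 digest against the token rule. It assumes the separator set that the WARC grammar borrows from RFC 2616; the names `SEPARATORS` and `is_token` are hypothetical helpers for illustration, not part of any WARC library.

```python
import base64
import hashlib

# Assumption: separators excluded from `token`, as in the RFC 2616-style
# grammar used by WARC 1.0/1.1: ( ) < > @ , ; : \ " / [ ] ? = { } SP HT
SEPARATORS = '()<>@,;:\\"/[]?={} \t'


def is_token(value: str) -> bool:
    """Return True if value is 1*<US-ASCII char except CTLs or separators>."""
    return bool(value) and all(
        0x21 <= ord(ch) <= 0x7E and ch not in SEPARATORS for ch in value
    )


payload = b"hello world"  # arbitrary sample payload
sha1 = hashlib.sha1(payload).digest()

b32 = base64.b32encode(sha1).decode("ascii")
b64 = base64.b64encode(sha1).decode("ascii")

print("base32:", b32, "token?", is_token(b32))
print("base64:", b64, "token?", is_token(b64))
```

A 20-byte SHA-1 digest always needs one "=" of base64 padding, so the base64 form fails the token check regardless of whether "/" happens to appear in it.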

ato commented 5 years ago

Good catch. While the examples and most implementations use base32 (which doesn't include "/"), the padding character for base32 is also "=", so it's indeed a problem there too.
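As a rough illustration of when base32 padding actually shows up, the sketch below encodes digests of different lengths. The helper name `b32_digest` is hypothetical, and the choice of SHA-1 and SHA-256 is only an example; the thread does not say which algorithms implementations use.

```python
import base64
import hashlib


def b32_digest(algorithm: str, payload: bytes) -> str:
    """Base32-encode the digest of payload (standard RFC 4648 alphabet, padded)."""
    digest = hashlib.new(algorithm, payload).digest()
    return base64.b32encode(digest).decode("ascii")


for algorithm in ("sha1", "sha256"):
    value = b32_digest(algorithm, b"")
    # Base32 works on 5-byte groups, so only digest lengths that are a
    # multiple of 5 bytes encode without "=" padding.
    print(algorithm, "length:", len(value), "padding chars:", value.count("="))
```

A 20-byte SHA-1 digest is a whole number of 5-byte groups and so encodes without padding, whereas a 32-byte SHA-256 digest ends in "=" characters that the token rule rejects.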

@wumpus, so that we can turn this issue into a change proposal for WARC 1.2, is there a better definition for digest-value you'd like to propose?

wumpus commented 5 years ago

RFC 4648 (https://tools.ietf.org/html/rfc4648) is kind of hand-wavy, but the union of all of the recommended schemes is

A-Za-z0-9/+-_=

Percent-encoding is mentioned once, and "~" and "." are mentioned but argued against, so it's not clear whether they are allowed. It's as if the RFC was written to be non-normative.
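For anyone wanting to experiment with the union alphabet listed above, here is a minimal sketch of a validator. It is only an illustration of that character set, not a grammar adopted by any WARC version, and the names `DIGEST_VALUE_RE` and `is_candidate_digest_value` are hypothetical. One caveat: pasted verbatim into a regex character class, `+-_` would be read as a range, so the hyphen is moved to the end.

```python
import re

# Union of the base16, base32, base64 and base64url alphabets plus "="
# padding. The "-" is placed last so it is a literal, not a range operator.
DIGEST_VALUE_RE = re.compile(r"[A-Za-z0-9/+_=-]+")


def is_candidate_digest_value(value: str) -> bool:
    """Return True if value uses only characters from the union alphabet."""
    return DIGEST_VALUE_RE.fullmatch(value) is not None


samples = [
    "Kq5sNclPz7QV2+lfQIuc6R7oRu0=",  # base64-style value with "+" and "="
    "ab-cd_ef",                      # base64url-style characters
    "sha1:ABCDEF",                   # ":" belongs to labelled-digest, not digest-value
]
for sample in samples:
    print(sample, is_candidate_digest_value(sample))
```

This deliberately leaves out "~", "." and percent-encoded forms, since it is unclear from the RFC whether those should be accepted.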

wumpus commented 5 years ago

This is also a 1.0/1.1 erratum, not just a proposal for the future.

wumpus commented 4 years ago

@ato, this issue should be labeled "WARC/1.1-possible-errata".

ato commented 4 years ago

Ah yes, good point