w3c / IFT

W3C Incremental Font Transfer
Other
21 stars 11 forks source link

Switch ID encoding to base32hex #169

Closed skef closed 4 months ago

skef commented 4 months ago

As discussed in #167

Not sure the advice about bytes to encode is sufficient or clear enough. That's supposed to match the advice on omitting leading hex zeros, but the issue is more complicated with base32hex because you wind up with some leading zeros anyway.


Preview | Diff

skef commented 4 months ago

I think I must be using a different bikeshed, as my diffs are a lot larger. I did a fresh install before running make.

garretrieger commented 4 months ago

I think I must be using a different bikeshed, as my diffs are a lot larger. I did a fresh install before running make.

Mine might be out of date, I'll try updating.

garretrieger commented 4 months ago

I think I must be using a different bikeshed, as my diffs are a lot larger. I did a fresh install before running make.

Mine might be out of date, I'll try updating.

Ok yeah it was, just made a PR to update Overview.html to be built with the latest version: https://github.com/w3c/IFT/pull/170

skef commented 4 months ago

@garretrieger It seems like another option that would address the (non-editorial) suggestions here would be to include the base32hex padding after all and specify that the ID characters be relative to the last non-padding character. (And then also add the point about UTF-8). What do you think?

garretrieger commented 4 months ago

@garretrieger It seems like another option that would address the (non-editorial) suggestions here would be to include the base32hex padding after all and specify that the ID characters be relative to the last non-padding character. (And then also add the point about UTF-8). What do you think?

The compactness of base64 is very desirable when encoding opaque binary data, especially given that there are URL length limits. Conversely in the numeric id case the padding would add several extra bytes to most of the URLs since base32 pads up to the 40 bit boundary. Ultimately we have two different use cases (file based w/ numeric ids vs dynamic with string ids) with different requirements. Each of which is best served by a different encoding so I think it makes the most sense to just have both available. We can add an explicit warning to avoid using the base64 encoding when then patches will be stored in a filesystem if that's a concern.

skef commented 4 months ago

I'll update this once #171 is sorted

skef commented 4 months ago

I've updated this, but also made it into a draft for the time being.

Issues:

skef commented 4 months ago

The endian-ness of Hexidecimal is a settled issue -- bytes are encoded as big-endian. But I'm not sure it's as clear with base32hex, so maybe we we should make that explicit when talking about the integer id encodings.

garretrieger commented 4 months ago

The endian-ness of Hexidecimal is a settled issue -- bytes are encoded as big-endian. But I'm not sure it's as clear with base32hex, so maybe we we should make that explicit when talking about the integer id encodings.

Sounds good, reading through the base32 rfc it converts an input which is a list of bytes, so we will need to explain how a numeric ID is converted to a list of bytes. Something like: Convert the numeric ID to a big endian 64 byte unsigned integer encoding. Remove all leading bytes that are equal to 0.

garretrieger commented 4 months ago

I've updated this, but also made it into a draft for the time being.

Issues:

  • What we're calling base64url, at least in the linked RFC, still uses "=" as padding, and underscore is already taken as an alternative, so we'll have to think about that.

"=" will be converted to % encoding during template substitution so default base64url will work, just use more characters than strictly needed to represent padding. I'm leaning towards just using it as is and letting encoders work around it if they want (for example by making sure the ID strings are always in multiples of 24 bits). An alternative would be to define one of the remaining unreserved characters ("." or "~") as the padding character.

  • I'm not sure the distinction between converting to UTF-8 for base32hex and encoding the raw string for base64url is right. (Not sure it's wrong either, we should talk about it.)

I suggest we always use the raw ID string bytes for both base32hex and base64url, and remove the text later on in the specification which says ID strings are UTF8 encoded.

  • Maybe if we're adding base64 we should just stick to hex for IDs after all, to remove the padding issues in those cases? I can see arguments both ways.
  • Someone should check my work in the examples.
skef commented 4 months ago

Sounds good, reading through the base32 rfc it converts an input which is a list of bytes, so we will need to explain how a numeric ID is converted to a list of bytes. Something like: Convert the numeric ID to a big endian 64 byte unsigned integer encoding. Remove all leading bytes that are equal to 0.

In that case, maybe we should have the rule for ID conversion in the URL match the rule for storage in the patches (or, at least, what that was in IFTB, I haven't checked if it's changed): If the highest ID is < 256 convert the byte, if it's > 256 convert to a network order u16 and encode that.

garretrieger commented 4 months ago

Sounds good, reading through the base32 rfc it converts an input which is a list of bytes, so we will need to explain how a numeric ID is converted to a list of bytes. Something like: Convert the numeric ID to a big endian 64 byte unsigned integer encoding. Remove all leading bytes that are equal to 0.

In that case, maybe we should have the rule for ID conversion in the URL match the rule for storage in the patches (or, at least, what that was in IFTB, I haven't checked if it's changed): If the highest ID is < 256 convert the byte, if it's > 256 convert to a network order u16 and encode that.

Format 2 doesn't have a specific byte width since it relies on deltas so I think probably best to just use a variable number of bytes. The current text you have "When the id is an unsigned integer only the significant bytes are encoded. (For example, when the integer is less than 256 only one byte is encoded.)" plus a mention that they are big endian ordered should be sufficient.

garretrieger commented 4 months ago

Sounds good, reading through the base32 rfc it converts an input which is a list of bytes, so we will need to explain how a numeric ID is converted to a list of bytes. Something like: Convert the numeric ID to a big endian 64 byte unsigned integer encoding. Remove all leading bytes that are equal to 0.

In that case, maybe we should have the rule for ID conversion in the URL match the rule for storage in the patches (or, at least, what that was in IFTB, I haven't checked if it's changed): If the highest ID is < 256 convert the byte, if it's > 256 convert to a network order u16 and encode that.

Format 2 doesn't have a specific byte width since it relies on deltas so I think probably best to just use a variable number of bytes. The current text you have "When the id is an unsigned integer only the significant bytes are encoded. (For example, when the integer is less than 256 only one byte is encoded.)" plus a mention that they are big endian ordered should be sufficient.

Would probably be good to add an example of a number that requires more than one byte in the example table.

skef commented 4 months ago

I think the last commit is up to date with the feedback.

garretrieger commented 4 months ago

I think the last commit is up to date with the feedback.

Looks good, I just added a commit to remove references to ID strings being UTF-8 encoded in the format 2 section.