Different URL ID encoding

skef commented 4 months ago

I've been thinking about id URL encoding. What do you think about 5 additional variables, corresponding to the current ones, for Base32 encodings of the ID? Like bd, b1, ... b4 or whatever? Base32 is more or less guaranteed to be compatible and collision-less on every filesystem and would seem to eliminate those problems. Encoders that use string IDs could still have the option of using the strings directly.

Or if "seeing" the ID in the filenames isn't a requirement, I suppose we could alternatively just switch to Base32 entirely, as it's more compact than hex and spreads the files out among more, but still not an overwhelming number, of directories.
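To make the compactness point concrete, here is a minimal sketch comparing the two encodings for a single ID. It assumes the ID is already available as raw bytes (a 4-byte big-endian integer here, chosen only for illustration), and the helper names `hex_id` and `base32_id` are hypothetical.

```python
import base64


def hex_id(raw: bytes) -> str:
    """Hex encoding: 4 bits per character."""
    return raw.hex().upper()


def base32_id(raw: bytes) -> str:
    """RFC 4648 Base32 with the trailing '=' padding stripped: 5 bits per character."""
    return base64.b32encode(raw).decode("ascii").rstrip("=")


# Illustrative ID only; the byte width is an assumption, not something the thread fixes.
raw = (1234567).to_bytes(4, "big")
print(len(hex_id(raw)), len(base32_id(raw)))
```

Each Base32 character can take 32 values rather than 16, which is also why a directory level keyed on one character fans out to more directories.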

skef commented 4 months ago

I suppose the lack of a proper specification is a strike against Base32, although not a definitive one.

The uuencoding standard avoided case problems but used slash, which is best avoided. I did a bit of searching around but didn't turn up an obvious choice that's already in the can.

skef commented 4 months ago

Looks like Base32 is standardized in RFC 4648: https://datatracker.ietf.org/doc/html/rfc4648 . Don't know if that's official enough.

garretrieger commented 4 months ago

> Looks like Base32 is standardized in RFC 4648: https://datatracker.ietf.org/doc/html/rfc4648 . Don't know if that's official enough.

Yeah, we can reference that document. We used it in the previous patch subset version of the spec.

> I've been thinking about id URL encoding. What do you think about 5 additional variables, corresponding to the current ones, for Base32 encodings of the ID? Like bd, b1, ... b4 or whatever? Base32 is more or less guaranteed to be compatible and collision-less on every filesystem and would seem to eliminate those problems. Encoders that use string IDs could still have the option of using the strings directly.
>
> Or if "seeing" the ID in the filenames isn't a requirement, I suppose we could alternatively just switch to Base32 entirely, as it's more compact than hex and spreads the files out among more, but still not an overwhelming number, of directories.

Base32 sounds fine to me. If we use it though, I'd be in favour of replacing the hex representation, since hex and Base32 ultimately both solve the same problem. The only thing of minor concern is that we'll want to specify that padding characters are not used, since the padding character "=" is not in the unreserved set and will get percent-encoded. We will also need to define a different URL-safe padding character that's not part of the Base32 alphabet, for use in the d1-d4 variables when a character isn't present.
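A quick sketch of the padding concern, using Python's `urllib.parse.quote` as a stand-in for the percent-encoding a URL template expansion would apply (that equivalence is an assumption for illustration; `quote` with `safe=""` leaves only letters, digits, and `-._~` untouched):

```python
import base64
from urllib.parse import quote

# One byte encodes to two Base32 characters followed by six '=' padding characters.
padded = base64.b32encode(b"\x01").decode("ascii")
unpadded = padded.rstrip("=")

# '=' is outside the unreserved set, so each padding character becomes '%3D'.
print(quote(padded, safe=""))
# The unpadded form and '_' pass through unchanged.
print(quote(unpadded, safe=""), quote("_", safe=""))
```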

garretrieger commented 4 months ago

_ would be a good choice for the alternate padding character.

skef commented 4 months ago

Yeah, I agree that just consistently using Base32 sounds good.

An encoding that is both URL-safe and filesystem-safe (e.g. on case-insensitive filesystems) seems like a desirable thing. Would it make sense to throw a modified Base32 spec into a separate doc, or a doc with similar contents, so that it can be referred to elsewhere? Or would an approved IFT spec containing such a section serve that function, at least within W3C?

skef commented 4 months ago

Actually, on reflection I think the ideal for this is "base 32 with extended hex alphabet" from that RFC, with _ substituting for = as the padding. That way chunk files will continue to appear in sort order. Let's do that.

skef commented 4 months ago

(Well, hmm, maybe approximate sort order given the lack of byte alignment. Still, it seems easier to mentally process than traditional Base32, which seems to be optimized for a lack of visual aliases (no 0 and 1).)
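A rough sketch of the sort-order difference, assuming fixed-width 4-byte big-endian IDs (an assumption made only so the comparison is clean; variable-length IDs are where the "approximate" caveat applies). The extended hex alphabet is derived here by translating the standard alphabet, and `encode` is a hypothetical helper.

```python
import base64

STANDARD = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567"
EXTENDED_HEX = "0123456789ABCDEFGHIJKLMNOPQRSTUV"
TO_HEX = str.maketrans(STANDARD, EXTENDED_HEX)


def encode(value: int, extended_hex: bool) -> str:
    """Encode a 4-byte big-endian integer as unpadded Base32."""
    text = base64.b32encode(value.to_bytes(4, "big")).decode("ascii").rstrip("=")
    return text.translate(TO_HEX) if extended_hex else text


ids = [0, 100, 104, 4096]

# Sort the IDs by their encoded names, as a directory listing would.
by_hex = sorted(ids, key=lambda v: encode(v, extended_hex=True))
by_standard = sorted(ids, key=lambda v: encode(v, extended_hex=False))

print(by_hex == ids)       # extended hex keeps numeric order for these IDs
print(by_standard == ids)  # the standard alphabet does not, since digits sort before letters
```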

skef commented 4 months ago

I'll make a PR later today that attempts to get the references right.

garretrieger commented 4 months ago

Base32 hex sounds good. We can reference that spec and then just include any changes needed (e.g. the switch of padding character) in the IFT spec.

skef commented 4 months ago

On the padding:

1) Do we need it? Seems like the padding mostly plays a role in decoding, and it's not clear we need to go backwards.
2) If we do need it, I'm presuming d1 will be the last non-padding character in the encoded string.

garretrieger commented 4 months ago

For padding what I'm thinking we should do is:

For the full base32 value ('id' variable) we don't include padding. This remains compliant since the base32 spec allows it to be omitted where it's not needed (https://datatracker.ietf.org/doc/html/rfc4648#section-3.2).
For d1-d4 if the id string is too short then we use '_' for that variable. Similar to how we specify 0 for the hex representation currently.
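Roughly, the two rules above could be sketched as follows. This is only an illustration under assumptions: the ID is taken as raw bytes, d1 is assumed to be the last character of the encoded string (d2 the one before it, and so on, as presumed earlier in the thread), and `url_variables` is a hypothetical helper name rather than anything from the spec.

```python
import base64

STANDARD = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567"
EXTENDED_HEX = "0123456789ABCDEFGHIJKLMNOPQRSTUV"
TO_HEX = str.maketrans(STANDARD, EXTENDED_HEX)


def url_variables(raw: bytes) -> dict:
    """Build the 'id' and d1-d4 values from the raw ID bytes."""
    # 'id': Base32 with the extended hex alphabet, padding omitted.
    encoded = base64.b32encode(raw).decode("ascii").rstrip("=").translate(TO_HEX)
    variables = {"id": encoded}
    # d1-d4: characters counted from the end (assumed), '_' when the string is too short.
    for n in range(1, 5):
        variables[f"d{n}"] = encoded[-n] if n <= len(encoded) else "_"
    return variables


print(url_variables(b"\x01"))
```

A one-byte ID encodes to two characters, so d3 and d4 fall back to `_` in this sketch.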

skef commented 4 months ago

Sounds good, I'll write it this way.

skef commented 3 months ago

This was merged

w3c / IFT

Different URL ID encoding #167