Closed skef closed 3 months ago
I suppose the lack of a proper specification is a strike against Base32, although not a definitive one.
The uuencoding standard avoided case problems but used slash, which is best avoided. I did a bit of searching around but didn't turn up an obvious choice that's already in the can.
Looks like Base32 is standardized in RFC 4648: https://datatracker.ietf.org/doc/html/rfc4648 . Don't know if that's official enough.
Yeah, we can reference that document. We used it in the previous patch subset version of the spec.
I've been thinking about id URL encoding. What do you think about 5 additional variables, corresponding to the current ones, for Base32 encodings of the ID? Like bd, b1, ... b4 or whatever? Base32 is more or less guaranteed to be compatible and collision-less on every filesystem and would seem to eliminate those problems. Encoders that use string IDs could still have the option of using the strings directly.
Or if "seeing" the ID in the filenames isn't a requirement, I suppose we could alternatively just switch to Base32 entirely, as it's more compact than hex and spreads the files out among more, but still not an overwhelming number, of directories.
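To make the compactness and fan-out point concrete, here is a rough sketch comparing the two encodings. It assumes IDs are non-negative integers serialized as minimal-length big-endian bytes, which is an illustrative choice and not something the spec text above states; `id_bytes`, `hex_name`, and `base32_name` are hypothetical helper names.

```python
import base64

def id_bytes(chunk_id: int) -> bytes:
    """Minimal-length big-endian bytes for a non-negative integer ID (an assumption)."""
    return chunk_id.to_bytes(max(1, (chunk_id.bit_length() + 7) // 8), "big")

def hex_name(chunk_id: int) -> str:
    # Hex carries 4 bits per character, so each path digit has 16 possible values.
    return id_bytes(chunk_id).hex()

def base32_name(chunk_id: int) -> str:
    # Base32 carries 5 bits per character, so each path digit has 32 possible values.
    # Padding is stripped here only to compare the significant characters.
    return base64.b32encode(id_bytes(chunk_id)).decode("ascii").rstrip("=")

for chunk_id in (5, 123456, 2**32 - 1):
    print(chunk_id, hex_name(chunk_id), base32_name(chunk_id))
```

For a 3-byte ID this gives 6 hex characters versus 5 Base32 characters, and a directory level keyed on one digit has at most 32 entries instead of 16.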
Base32 sounds fine to me. If we use it though, I'd be in favour of replacing the hex representation, since hex and Base32 ultimately both solve the same problem. The only thing of minor concern is that we'll want to specify that padding characters are not used, since the padding character "=" is not in the unreserved set and will get percent-encoded. We will also need to define a different URL-safe padding character that's not part of the Base32 alphabet for use in the d1-d4 variables when a character isn't present.
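A quick illustration of the padding concern, using only the Python standard library. The input `b"f"` is the RFC 4648 test vector; the `_` substitution is just the candidate discussed in this thread, not settled spec text.

```python
import base64
from urllib.parse import quote

padded = base64.b32encode(b"f").decode("ascii")  # "MY======" per the RFC 4648 test vectors
unpadded = padded.rstrip("=")                    # padding dropped entirely
underscored = padded.replace("=", "_")           # "_" standing in for "="

# "=" is outside the unreserved set, so it gets percent-encoded;
# "_" is unreserved and passes through unchanged.
print(quote(padded))
print(quote(unpadded))
print(quote(underscored))
```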
_ would be a good choice for the alternate padding character.
Yeah, I agree that just consistently using Base32 sounds good.
An encoding that is both URL and filesystem (e.g. case-insensitive) safe seems like a desirable thing. Would it make sense to throw a modified Base32 spec into a separate doc, or a doc with similar contents, so that it can be referred to elsewhere? Or would an approved IFT spec containing such a section serve that function, at least within W3C?
Actually, on reflection I think the ideal for this is "base 32 with extended hex alphabet" from that RFC, with _ substituting for = as the padding. That way chunk files will continue to appear in sort order. Let's do that.
(Well, hmm, maybe approximate sort order given the lack of byte alignment. Still, it seems easier to mentally process than traditional Base32, which seems to be optimized for a lack of visual aliases (no 0 and 1).)
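A minimal sketch of "base32hex with `_` for padding", written by translating the output of the standard-alphabet encoder so it does not depend on `base64.b32hexencode` (only available in newer Python versions). The function names and the fixed 4-byte big-endian ID width are assumptions for illustration; with variable-length IDs the ordering would only be approximate, as noted above.

```python
import base64

_STD = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567"   # RFC 4648 Base32 alphabet
_HEX = "0123456789ABCDEFGHIJKLMNOPQRSTUV"   # RFC 4648 "extended hex" alphabet
_TO_HEX = str.maketrans(_STD + "=", _HEX + "_")

def base32hex_underscore(data: bytes) -> str:
    """Base32hex encoding with '_' substituted for the '=' padding character."""
    return base64.b32encode(data).decode("ascii").translate(_TO_HEX)

def encoded_names(ids, width=4):
    """Encode integer IDs as fixed-width big-endian bytes (width is an assumption)."""
    return [base32hex_underscore(i.to_bytes(width, "big")) for i in ids]

ids = [0, 1, 31, 32, 255, 256, 70000, 26 << 27]
names = encoded_names(ids)
# With fixed-width input, lexicographic order of the names matches numeric order of the IDs.
print(names == sorted(names))
```

The standard alphabet does not have this property, because its digits 2-7 represent the highest values but sort before the letters.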
I'll make a PR later today that attempts to get the references right.
Base32hex sounds good. We can reference that spec and then just include any changes needed (e.g. the switch of padding character) in the IFT spec.
On the padding:
1) Do we need it? Seems like the padding mostly plays a role in decoding and it's not clear we need to go backwards.
2) If we do need it, I'm presuming d1 will be the last non-padding characters in the encoded string.
For padding what I'm thinking we should do is:
Sounds good, I'll write it this way.
This was merged