Closed: ramsey closed this issue 2 years ago
Without having dug too deeply into it: the whole thing has 130 bits, but you only need 128, so the first two bits should be ignored? But you are using the first 48 bits (0 through 47). Shouldn't you instead use the first 48 "usable" bits (2 through 49)?
Thanks to @heiglandreas and ar_turek on Twitter, I figured it out!
Since I'm working in PHP (and I suspect this is the same/similar in many other languages), I can only easily work in bytes and not bits, so decoding a 130-bit string to bytes doesn't quite work. It ends up giving me 16 bytes, so I lose 2 bits (I'm not sure on which end I'm losing them).
To make up for this difference, I need to pad the ULID value up to a value where it will safely convert to bytes without data loss. This appears to be 32 characters. At 5 bits per character, this brings us to 160 bits, which safely decodes to 20 bytes, without any data loss. At this point, I can strip the first 4 bytes, and the remaining 16 bytes are the proper 128 bits that I expect.
Here's some pseudo-code to illustrate (the code is real... it's just the base32decode() and base32encode() functions that are made up for example purposes):
$ulid = '7ZZZZZZZZZZZZZZZZZZZZZZZZZ';
echo bin2hex(substr(base32decode(sprintf('%032s', $ulid)), 4));
This renders:
ffffffffffffffffffffffffffffffff
Exactly as expected. 😄
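For comparison, here is the same padding trick as a small, self-contained Python sketch; crockford_b32_to_bytes is a hand-rolled helper written for this example, not a library function:

```python
CROCKFORD = '0123456789ABCDEFGHJKMNPQRSTVWXYZ'

def crockford_b32_to_bytes(text):
    # treat the string as one big base-32 number, 5 bits per character
    n = 0
    for ch in text:
        n = (n << 5) | CROCKFORD.index(ch)
    return n.to_bytes(len(text) * 5 // 8, 'big')

ulid = '7ZZZZZZZZZZZZZZZZZZZZZZZZZ'
padded = ulid.rjust(32, '0')              # 32 chars = 160 bits = 20 bytes
raw = crockford_b32_to_bytes(padded)[4:]  # strip the 4 leading zero bytes
print(raw.hex())                          # ffffffffffffffffffffffffffffffff
```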
To convert back, you need to pad the bytes similarly (40 characters gets us back to 20 bytes, so we can easily do the conversion to base32 without any data loss):
// Max UUID with dashes removed.
$uuid = 'ffffffffffffffffffffffffffffffff';
echo substr(base32encode(hex2bin(sprintf('%040s', $uuid))), 6);
This renders:
7ZZZZZZZZZZZZZZZZZZZZZZZZZ
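And the reverse direction as a Python sketch (again, bytes_to_crockford_b32 is a hand-rolled helper for illustration):

```python
CROCKFORD = '0123456789ABCDEFGHJKMNPQRSTVWXYZ'

def bytes_to_crockford_b32(raw):
    # emit one character per 5 bits, least significant first, then reverse
    n = int.from_bytes(raw, 'big')
    chars = []
    for _ in range(len(raw) * 8 // 5):
        chars.append(CROCKFORD[n % 32])
        n //= 32
    return ''.join(reversed(chars))

uuid_hex = 'ffffffffffffffffffffffffffffffff'     # Max UUID, dashes removed
padded = bytes.fromhex(uuid_hex.rjust(40, '0'))   # 40 hex chars = 20 bytes
print(bytes_to_crockford_b32(padded)[6:])         # strip 6 pad chars -> 26-char ULID
```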
So, that sorts out the discrepancy I was seeing. I hope this helps anyone else who encounters this and goes "huh?" 😆
I also had this problem when I was implementing ULID for Java.
I learned that there are two ways to encode to base-32. One is to follow the rules of RFC-4648. The other is to do a modulo operation, i.e., use the remainder operator '%'.
When we follow RFC-4648 algorithm, we need to add padding. RFC-4648 base-32 uses blocks of 40 bits. According to RFC-4648:
... When fewer than 40 input bits
are available in an input group, bits with value zero are added (on
the right) to form an integral number of 5-bit groups. ...
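You can see this right-padding in action with Python's standard RFC 4648 encoder (note that base64.b32encode uses the RFC alphabet A-Z plus 2-7, not Crockford's):

```python
import base64

# 16 bytes of all-ones: 128 bits, which is not a multiple of 5
encoded = base64.b32encode(b'\xff' * 16)
print(encoded)  # 25 '7's, a final '4', and six '=' padding characters
# The last data character is '4' (binary 11100): the final 3 data bits
# were right-padded with two zero bits, exactly as RFC 4648 describes.
```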
The ULID author uses the second strategy. See these lines of code:
let mod
let str = ""
for (; len > 0; len--) {
  mod = now % ENCODING_LEN
  str = ENCODING.charAt(mod) + str
  now = (now - mod) / ENCODING_LEN
}
return str
I think this second way is very convenient because it is easier to implement and port to other languages, although it is less efficient than the method in RFC-4648.
This example shows how to encode a ULID to UUID and vice versa using the remainder operator:
#!/usr/bin/env python3
base16 = '0123456789abcdef'
base32 = '0123456789ABCDEFGHJKMNPQRSTVWXYZ'
def encode(number, alphabet):
    text = ''
    radix = len(alphabet)
    while number:
        text += alphabet[number % radix]
        number //= radix
    return text[::-1] or '0'

def decode(text, alphabet):
    number = 0
    radix = len(alphabet)
    for i in text:
        number = number * radix + alphabet.index(i)
    return number

def to_uuid(ulid):
    # decode from base-32
    number = decode(ulid, base32)
    # encode the number to base-16, padding to 32 digits so the
    # slicing below is safe when the value has leading zeros
    text = encode(number, base16).rjust(32, '0')
    # add hyphens to the base-16 string to form the canonical string
    return text[0:8] + '-' + text[8:12] + '-' + text[12:16] + '-' + text[16:20] + '-' + text[20:32]

def to_ulid(uuid):
    # remove hyphens from the canonical
    # string to form the base-16 string
    uuid = uuid.replace('-', '')
    # decode from base-16
    number = decode(uuid, base16)
    # encode the number to base-32, padding to the canonical 26 characters
    return encode(number, base32).rjust(26, '0')

def main():
    #ulid_in = '5JRB77HQYQ8P08BWPJRF5Q7F06'
    #uuid_in = 'b2c2ce78-dfd7-4580-85f2-d2c3cb73bc06'
    ulid_in = '7ZZZZZZZZZZZZZZZZZZZZZZZZZ'
    uuid_in = 'ffffffff-ffff-ffff-ffff-ffffffffffff'
    uuid_out = to_uuid(ulid_in)
    ulid_out = to_ulid(uuid_in)
    print('ULID in: ', ulid_in)
    print('ULID out:', ulid_out)
    print('UUID in: ', uuid_in)
    print('UUID out:', uuid_out)

if __name__ == '__main__':
    main()
# Notes:
# 1. A standard UUID has 6 fixed bits: 4 in the version field and 2 in the variant field.
# 2. You must validate the UUID or ULID before using to_ulid() and to_uuid(); use a regular expression.
I've run into an odd issue when decoding ULIDs to bytes and encoding bytes back to ULIDs.
TL;DR: Is the binary layout and byte order defined in the spec correct? If so, where is my error in logic?
Where are all 128 bits?
The specification states:
This is worded in such a way that I would expect 7ZZZZZZZZZZZZZZZZZZZZZZZZZ, when decoded, to have all 128 bits set to 1, but that doesn't seem to be the case. Instead, this is what the bytes look like when converted to hexadecimal:

3f ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff

This does not have all 128 bits set to 1. To set all bits to 1, that first 3 needs to be an f, but when I do this (i.e. ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff), Crockford's base32 encodes the bytes as:

ZZZZZZZZZZZZZZZZZZZZZZZZZW

Notice the W at the end.

Data loss?
Furthermore, while 7ZZZZZZZZZZZZZZZZZZZZZZZZZ converts to 3f ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff, the reverse is not true. Encoding those same bytes back into Crockford's base32 renders 7ZZZZZZZZZZZZZZZZZZZZZZZZW.

So, it appears that Crockford's base32 encoding results in some data loss?
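A minimal Python sketch (both helpers are hand-rolled for illustration, not from any library) makes the mismatch concrete by contrasting the two encoding strategies on the same 16 bytes:

```python
CROCKFORD = '0123456789ABCDEFGHJKMNPQRSTVWXYZ'

def encode_left_aligned(raw):
    # RFC 4648 style: left-align the bits, zero-pad on the right to a
    # multiple of 5, then emit 5 bits per character
    n = int.from_bytes(raw, 'big')
    nbits = len(raw) * 8
    pad = (5 - nbits % 5) % 5
    n <<= pad
    length = (nbits + pad) // 5
    return ''.join(CROCKFORD[(n >> (5 * (length - 1 - i))) & 31] for i in range(length))

def encode_modulo(raw):
    # ULID style: treat the bytes as one big number and repeatedly take % 32
    # (26 characters for a 128-bit value)
    n = int.from_bytes(raw, 'big')
    chars = []
    for _ in range(26):
        chars.append(CROCKFORD[n % 32])
        n //= 32
    return ''.join(reversed(chars))

raw = b'\xff' * 16
print(encode_left_aligned(raw))  # 25 Z's then W: the 2 right-pad bits shift the last char
print(encode_modulo(raw))        # 7ZZZZZZZZZZZZZZZZZZZZZZZZZ
```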
Timestamp Magic?
Continuing on, the spec says:
If we assume the binary layout and byte order to be true (as specified here), then when we convert 7ZZZZZZZZZZZZZZZZZZZZZZZZZ to bytes, we should be able to take the first 48 bits (6 bytes) and convert them to an integer, which should equal 281474976710655.

Going back to my earlier point, 7ZZZZZZZZZZZZZZZZZZZZZZZZZ decodes to bytes as:

3f ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff

The first 48 bits of this are:

3f ff ff ff ff ff

When converted to a decimal integer, we get the value 70368744177663, which is not the same as 281474976710655. When converted to a date, we get 4199-11-24 01:22:57.663 +00:00.

However, all the ULID implementations I tried appear to convert the date correctly to arrive at the value expected by the specification, so I must be doing something wrong.
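As the modulo-vs-RFC discussion earlier in this thread suggests, the difference comes down to alignment. A minimal Python sketch (decode_number is a hand-rolled helper) showing how the two decoding strategies diverge on the timestamp:

```python
CROCKFORD = '0123456789ABCDEFGHJKMNPQRSTVWXYZ'

def decode_number(text):
    # right-aligned (modulo-style) decoding: plain base-32 arithmetic
    n = 0
    for ch in text:
        n = n * 32 + CROCKFORD.index(ch)
    return n

ulid = '7ZZZZZZZZZZZZZZZZZZZZZZZZZ'

# Decoding the first 10 characters as a plain base-32 number gives the
# maximum timestamp the spec expects
print(decode_number(ulid[:10]))  # 281474976710655

# Left-aligning all 130 bits instead keeps bits 129..2 as the 16 bytes,
# which is where the leading 3f byte and the wrong value come from
left_aligned = (decode_number(ulid) >> 2).to_bytes(16, 'big')
print(left_aligned[:6].hex())                   # 3fffffffffff
print(int.from_bytes(left_aligned[:6], 'big'))  # 70368744177663
```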