ulid/spec
The canonical spec for ulid

Need help with discrepancy found in decoding ULID byte strings #73

Closed: ramsey closed this 2 years ago

ramsey commented 2 years ago

I've run into an odd issue when decoding ULIDs to bytes and encoding bytes back to ULIDs.

TL;DR: Is the binary layout and byte order defined in the spec correct? If so, where is my error in logic?

Where are all 128 bits?

The specification states:

Technically, a 26-character Base32 encoded string can contain 130 bits of information, whereas a ULID must only contain 128 bits. Therefore, the largest valid ULID encoded in Base32 is 7ZZZZZZZZZZZZZZZZZZZZZZZZZ

This is worded in such a way that I would expect 7ZZZZZZZZZZZZZZZZZZZZZZZZZ, when decoded, to have all 128 bits set to 1, but that doesn't seem to be the case. Instead, this is what the bytes look like when converted to hexadecimal:

3f ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff

This does not have all 128 bits set to 1. To set all bits to 1, that first 3 needs to be an f, but when I do this (i.e. ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff), Crockford's base32 encodes the bytes as:

ZZZZZZZZZZZZZZZZZZZZZZZZZW

Notice the W at the end.

Data loss?

Furthermore, while 7ZZZZZZZZZZZZZZZZZZZZZZZZZ converts to 3f ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff, the reverse is not true. Encoding those same bytes back into Crockford's base32 renders 7ZZZZZZZZZZZZZZZZZZZZZZZZW.

So, it appears that Crockford's base32 encoding results in some data loss?
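
A minimal Python sketch of that left-aligned grouping makes the behavior concrete (the helper name crockford_encode_left is made up for illustration): it reads the bytes 5 bits at a time from the left and zero-pads on the right, as RFC 4648 does, and it reproduces both strings above.

CROCKFORD = '0123456789ABCDEFGHJKMNPQRSTVWXYZ'

def crockford_encode_left(data):
    # concatenate the bytes into one bit string, pad on the right
    # to a multiple of 5 bits, then map each 5-bit group to a character
    bits = ''.join(format(b, '08b') for b in data)
    bits += '0' * (-len(bits) % 5)
    return ''.join(CROCKFORD[int(bits[i:i + 5], 2)] for i in range(0, len(bits), 5))

print(crockford_encode_left(b'\xff' * 16))
# ZZZZZZZZZZZZZZZZZZZZZZZZZW
print(crockford_encode_left(bytes.fromhex('3f' + 'ff' * 15)))
# 7ZZZZZZZZZZZZZZZZZZZZZZZZW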

Timestamp Magic?

Continuing on, the spec says:

the largest valid ULID encoded in Base32 is 7ZZZZZZZZZZZZZZZZZZZZZZZZZ, which corresponds to an epoch time of 281474976710655 or 2 ^ 48 - 1

If we assume the binary layout and byte order are correct (as specified here), then when we convert 7ZZZZZZZZZZZZZZZZZZZZZZZZZ to bytes, we should be able to take the first 48 bits (6 bytes) and convert them to an integer, which should equal 281474976710655.

Going back to my earlier point, 7ZZZZZZZZZZZZZZZZZZZZZZZZZ decodes to bytes as:

3f ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff

The first 48 bits of this are:

3f ff ff ff ff ff

When converted to a decimal integer, we get the value 70368744177663, which is not the same as 281474976710655. When converted to a date, we get 4199-11-24 01:22:57.663 +00:00.

However, all the ULID implementations I tried appear to convert the date correctly to arrive at the value expected by the specification, so I must be doing something wrong.

heiglandreas commented 2 years ago

Without having dug too deeply into it: the whole thing has 130 bits but you only need 128. So the first two bits should be ignored? But you are using the first 48 bits (0 through 47). Shouldn't you use the first 48 "usable" bits though (2 through 49)?
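
This suggestion is easy to check with a few lines of Python (a sketch, not from the thread): treating the 26 characters as one big integer implicitly drops the 2 leading pad bits, and the top 48 bits of the resulting 128-bit value are exactly the timestamp the spec promises.

# interpret the 26 Crockford base32 characters as a single integer;
# the 2 leading pad bits vanish because the value fits in 128 bits
CROCKFORD = '0123456789ABCDEFGHJKMNPQRSTVWXYZ'

n = 0
for c in '7ZZZZZZZZZZZZZZZZZZZZZZZZZ':
    n = n * 32 + CROCKFORD.index(c)

print(n == (1 << 128) - 1)  # True: all 128 bits are set
print(n >> 80)              # 281474976710655, i.e. 2**48 - 1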

ramsey commented 2 years ago

Thanks to @heiglandreas and ar_turek on Twitter, I figured it out!

Since I'm working in PHP (and I suspect this is the same/similar in many other languages), I can only easily work in bytes and not bits, so decoding a 130-bit string to bytes doesn't quite work. It ends up giving me 16 bytes, so I lose 2 bits (I'm not sure on which end I'm losing them).

To make up for this difference, I need to pad the ULID string to a length that converts to whole bytes without data loss. That length is 32 characters: at 5 bits per character, this gives 160 bits, which decodes cleanly to 20 bytes. At that point, I can strip the first 4 bytes, and the remaining 16 bytes are the proper 128 bits that I expect.

Here's some pseudo-code to illustrate (the code is real... it's just the base32decode() and base32encode() functions that are made up for example purposes):

$ulid = '7ZZZZZZZZZZZZZZZZZZZZZZZZZ';

echo bin2hex(substr(base32decode(sprintf('%032s', $ulid)), 4));

This renders:

ffffffffffffffffffffffffffffffff

Exactly as expected. 😄

To convert back, you need to pad the hex string similarly (40 characters gets us back to 20 bytes, so we can do the conversion to base32 without any data loss, then strip the leading characters):

// Max UUID with dashes removed.
$uuid = 'ffffffffffffffffffffffffffffffff';

echo substr(base32encode(hex2bin(sprintf('%040s', $uuid))), 6);

This renders:

7ZZZZZZZZZZZZZZZZZZZZZZZZZ

So, that sorts out the discrepancy I was seeing. I hope this helps anyone else who encounters this and goes "huh?" 😆
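
For readers not working in PHP, the same pad-and-strip trick can be sketched in Python. The codec helpers below are hand-rolled for illustration, since Python's base64.b32decode uses the RFC-4648 alphabet, which assigns different values to the letters than Crockford's alphabet does.

CROCKFORD = '0123456789ABCDEFGHJKMNPQRSTVWXYZ'

def b32_to_bytes(text):
    # len(text) * 5 must be a multiple of 8 for this to be lossless
    bits = ''.join(format(CROCKFORD.index(c), '05b') for c in text)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

def bytes_to_b32(data):
    # len(data) * 8 must be a multiple of 5 for this to be lossless
    bits = ''.join(format(b, '08b') for b in data)
    return ''.join(CROCKFORD[int(bits[i:i + 5], 2)] for i in range(0, len(bits), 5))

# pad the ULID to 32 chars (160 bits = 20 bytes), decode, drop 4 bytes
ulid = '7ZZZZZZZZZZZZZZZZZZZZZZZZZ'
raw = b32_to_bytes(ulid.rjust(32, '0'))[4:]
print(raw.hex())  # ffffffffffffffffffffffffffffffff

# pad the 16 bytes to 20 (40 hex chars), encode, drop 6 characters
print(bytes_to_b32(bytes(4) + raw)[6:])  # 7ZZZZZZZZZZZZZZZZZZZZZZZZZ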

fabiolimace commented 2 years ago

I also had this problem when I was implementing ULID for Java.

I learned that there are two ways to encode to base-32. One is to follow the rules of RFC-4648. The other is to repeatedly apply a modulo operation, i.e., the remainder operator '%'.

When we follow the RFC-4648 algorithm, we need to add padding. RFC-4648 base-32 uses blocks of 40 bits. According to RFC-4648:

    ... When fewer than 40 input bits are available in an input group,
    bits with value zero are added (on the right) to form an integral
    number of 5-bit groups. ...

The ULID author uses the second strategy. See these lines of code:

  // build the string right to left: take the value modulo 32 to pick
  // the next character, then divide by 32 and repeat
  let mod
  let str = ""
  for (; len > 0; len--) {
    mod = now % ENCODING_LEN
    str = ENCODING.charAt(mod) + str
    now = (now - mod) / ENCODING_LEN
  }
  return str

I think this second way is very convenient because it is easier to implement and port to other languages, although it is less efficient than the method in RFC-4648.

This example shows how to encode a ULID to UUID and vice versa using the remainder operator:

#!/usr/bin/env python3

base16 = '0123456789abcdef'
base32 = '0123456789ABCDEFGHJKMNPQRSTVWXYZ'

def encode(number, alphabet):
    text = ''
    radix = len(alphabet)
    while number:
        text += alphabet[number % radix]
        number //= radix
    return text[::-1] or '0'

def decode(text, alphabet):
    number = 0
    radix = len(alphabet)
    for i in text:
        number = number * radix + alphabet.index(i)
    return number

def to_uuid(ulid):

    # decode from base-32
    number = decode(ulid, base32)

    # encode the number to base-16, left-padding with
    # zeros to the full 32 hex digits
    text = encode(number, base16).zfill(32)

    # add hyphens to the base-16 string to form the canonical string
    return text[0:8] + '-' + text[8:12] + '-' + text[12:16] + '-' + text[16:20] + '-' + text[20:32]

def to_ulid(uuid):

    # remove hyphens from the canonical
    # string to form the base-16 string
    uuid = uuid.replace('-', '')

    # decode from base-16
    number = decode(uuid, base16)

    # encode the number to base-32, left-padding with
    # zeros to the full 26 characters
    return encode(number, base32).zfill(26)

def main():

    #ulid_in = '5JRB77HQYQ8P08BWPJRF5Q7F06'
    #uuid_in = 'b2c2ce78-dfd7-4580-85f2-d2c3cb73bc06'

    ulid_in = '7ZZZZZZZZZZZZZZZZZZZZZZZZZ'
    uuid_in = 'ffffffff-ffff-ffff-ffff-ffffffffffff'

    uuid_out = to_uuid(ulid_in)
    ulid_out = to_ulid(uuid_in)

    print('ULID in: ', ulid_in)
    print('ULID out:', ulid_out)
    print('UUID in: ', uuid_in)
    print('UUID out:', uuid_out)

if __name__ == '__main__':
    main()

# Notes:
# 1. A standard UUID has 6 fixed bits: 4 in the version field and 2 in the variant field.
# 2. You must validate the UUID or ULID before using to_ulid() and to_uuid(). Use a regex.
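
A validation along these lines could guard the conversions (a sketch; the exact patterns are illustrative, not from the thread):

import re

# Crockford base32: digits plus letters excluding I, L, O, U;
# the first character must be 0-7 so the value fits in 128 bits
ULID_RE = re.compile(r'^[0-7][0-9A-HJKMNP-TV-Z]{25}$')
UUID_RE = re.compile(r'^[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}$')

assert ULID_RE.match('7ZZZZZZZZZZZZZZZZZZZZZZZZZ')
assert UUID_RE.match('ffffffff-ffff-ffff-ffff-ffffffffffff')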