ulid / javascript

Universally Unique Lexicographically Sortable Identifier
MIT License

Fundamental problem #15

Closed mmacedoeu closed 7 years ago

mmacedoeu commented 7 years ago

ULID is defined as a 128-bit binary value, while Base32 carries 5 bits per character.

So 128 / 5 = 25.6, which does not divide evenly. If the random component is 80 bits / 16 chars, then the time component is 10 chars / 50 bits, and 80 bits + 50 bits = 130 bits!

So when decoding 26 chars into 16 octets you will have data loss, and when encoding back from octets to chars you will have missing data.

The spec is not clear about what to do. The time is currently zero-padded, so going from octets to chars you can assume the 2 extra bits are zeros to the left of the 128, and on decoding you have to trim those 2 bits from the left before anything else.
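To make the arithmetic above concrete, here is the bit accounting for the 26-character string form (a plain sketch, just restating the numbers from the spec):

```javascript
// Bit accounting for the 26-character ULID string representation
const BITS_PER_CHAR = 5;  // Base32: each character carries 5 bits
const TIME_CHARS = 10;    // timestamp portion of the string
const RANDOM_CHARS = 16;  // randomness portion of the string

const timeBits = TIME_CHARS * BITS_PER_CHAR;     // 50 bits of string space
const randomBits = RANDOM_CHARS * BITS_PER_CHAR; // 80 bits of string space
const totalBits = timeBits + randomBits;         // 130 bits of string space

console.log(timeBits, randomBits, totalBits); // 50 80 130
console.log(totalBits - 128);                 // 2 extra bits vs the 128-bit binary ULID
```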

alizain commented 7 years ago

The spec states that the time must be a 48-bit integer only. The extra bits that the string encoding format allows for are simply extra space that will never be used.

So whenever you take the 10 time characters and convert them into bits, you will always get an integer that fits inside 48 bits (it'll be zero-padded for the next few thousand years anyway!).

Said another way, even if the entire 48 bits for time are used, the largest Base32 string you could get is 7ZZZZZZZZZ. Only if you encode a number greater than that will you have the data loss you described above. However, the spec makes it clear that this situation will only occur in approximately 9000 years. It intentionally does not specify what to do when that time comes 😄
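That 48-bit ceiling can be checked directly. Below is a minimal sketch of a right-to-left time encoder using Crockford's Base32 alphabet (the alphabet the spec uses; `encodeTime` is an illustrative name, not necessarily this library's exact implementation):

```javascript
// Crockford's Base32 alphabet (no I, L, O, U), as used by the ULID spec
const ENCODING = "0123456789ABCDEFGHJKMNPQRSTVWXYZ";

// Encode a millisecond timestamp into `len` Base32 characters,
// filling from the right so unused high bits become leading zeros.
function encodeTime(now, len) {
  let str = "";
  for (; len > 0; len--) {
    const mod = now % 32;
    str = ENCODING.charAt(mod) + str;
    now = (now - mod) / 32;
  }
  return str;
}

// The largest 48-bit timestamp (2^48 - 1) fits in 10 characters:
console.log(encodeTime(2 ** 48 - 1, 10)); // "7ZZZZZZZZZ"
```

Any value up to `2 ** 48 - 1` round-trips; only values above it would spill into the top digit and truncate.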

mmacedoeu commented 7 years ago

That's right when you encode the time from right to left: on decoding, the 2 bits lost on the left have no effect, because only 48 bits are used anyway. It just limits the maximum time you can encode, as already stated in the spec. But just by reading the spec, someone could potentially encode 50 bits: the string representation of the time requires 10 × 5 bits, and encoding from left to right means the 2 bits on the right would be lost. Most Base32 encoding algorithms encode from left to right, but for ULID, encoding the time that way will lead to data truncation, making it impossible to decode back to the original string representation. The 80-bit random part has no such problem, because it is exactly 16 × 5 bits, so you can encode/decode from either end without issues.
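The distinction can be sketched in code (a sketch, with `decodeTime` as an illustrative name): a positional decoder, which pairs with right-to-left encoding, loses nothing because leading characters are just high-order zeros, whereas left-aligned bit-stream packing pads 2 zero bits on the right of the 50-bit field, shifting the 48-bit value by a factor of 4:

```javascript
// Crockford's Base32 alphabet (no I, L, O, U)
const ENCODING = "0123456789ABCDEFGHJKMNPQRSTVWXYZ";

// Positional decode: pairs with a right-to-left encoder, so the
// 2 unused high-order bits are simply leading zeros and no data is lost.
function decodeTime(str) {
  let value = 0;
  for (const c of str) value = value * 32 + ENCODING.indexOf(c);
  return value;
}

// By contrast, a generic left-to-right bit-stream codec would left-align
// the 48 time bits in the 50-bit field, i.e. store value * 4, so the
// original value no longer survives a 48-bit round trip.
const t = 1; // any 48-bit timestamp
console.log(decodeTime("0000000001")); // 1  (positional: round-trips)
console.log(t * 4 === t);              // false (left-aligned: shifted by 2 bits)
```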

alizain commented 7 years ago

I agree, the encoding behaviour can be more explicit in the spec to avoid this confusion.

Would you like to submit a PR?


alizain commented 7 years ago

Updated the docs slightly, closing this issue. You're more than welcome to submit a PR for more clarification!