ulid / spec

The canonical spec for ulid
GNU General Public License v3.0
9.71k stars 174 forks source link

Would it make sense to extend the ULID to include an optional checksum? #88

Open mofosyne opened 1 year ago

mofosyne commented 1 year ago

This would be useful in the case that the ULID is communicated over the phone or by handwriting... Standardising the form would make it easier to recognise when it's being used and throw it away upon insertion into a database system... as well as how to regenerate it again and verify it's integrity when transmitted manually.

Looks like Crockford's Base32 which you are using does have some provision for appending an optional check symbol... however it does not provide a way to disambiguate between a checksummed data vs a non checksummed data... but for our purpose since the ULID is a fixed length... you could assume any extra data appended is the check symbol... anyway best to clarify that in the spec as an implementation recommendation

zdenek-crha commented 1 year ago

From specification:

Technically, a 26-character Base32 encoded string can contain 130 bits of information, whereas a ULID must only contain 128 bits. Therefore, the largest valid ULID encoded in Base32 is 7ZZZZZZZZZZZZZZZZZZZZZZZZZ

Seems to me that there are spare 2bits in the textual ULID representation that could be used for storing checksum. Care would need to be taken to not encode the checksum into first digit(s) to prevent it from breaking sortability.

mofosyne commented 1 year ago

Since ULID is fixed size, you likely could just use the length to figure out if there is a checksum payload. You could use the 2 bits to encode if there is 0 no checksum or if there is 1 or more digits of checksumming etc...

This post from a crockford32 implementer said that the standard itself doesn't have a way to distinguish if there is a checksum or not however. https://github.com/tinychameleon/crockford32/issues/1#issuecomment-1643168773

fabiolimace commented 1 year ago

If checksum is really needed by an application, one could use the Luhn Mod N Algorithm, which is an extension to the Luhn algorithm, which in turn "is used to validate a variety of identification numbers, such as credit card numbers, IMEI numbers, National Provider Identifier numbers in the United States" etc.

This is a Javascript example extracted from Wikipedia:


// Crockford's Base32
const codePoints = "0123456789ABCDEFGHJKMNPQRSTVWXYZ";

function numberOfValidInputCharacters() {
    return codePoints.length;
}

function codePointFromCharacter(character) {
    return codePoints.indexOf(character);
}

function characterFromCodePoint(codePoint) {
    return codePoints.charAt(codePoint);
}

function generateCheckCharacter(input) {
    let factor = 2;
    let sum = 0;
    let n = numberOfValidInputCharacters();

    // Starting from the right and working leftwards is easier since
    // the initial "factor" will always be "2".
    for (let i = input.length - 1; i >= 0; i--) {
        let codePoint = codePointFromCharacter(input.charAt(i));
        let addend = factor * codePoint;

        // Alternate the "factor" that each "codePoint" is multiplied by
        factor = (factor == 2) ? 1 : 2;

        // Sum the digits of the "addend" as expressed in base "n"
        addend = (Math.floor(addend / n)) + (addend % n);
        sum += addend;
    }

    // Calculate the number that must be added to the "sum"
    // to make it divisible by "n".
    let remainder = sum % n;
    let checkCodePoint = (n - remainder) % n;
    return characterFromCodePoint(checkCodePoint);
}

generatedULID = "0123456789ZZZZZZZZZZZZZZZZ"; // generatedULID = ulid()
truncatedULID = generatedULID.substr(0, 25); // delete the last char of the random component
checkCharacter = generateCheckCharacter(truncatedULID);
checkCharacterULID = truncatedULID + checkCharacter;

console.log("Generated ULID: " + generatedULID);
console.log("Check Character ULID: " + checkCharacterULID);

Output of the example:

// Generated ULID:       0123456789ZZZZZZZZZZZZZZZZ
// Check Character ULID: 0123456789ZZZZZZZZZZZZZZZE

Note that the last character of the original ULID, generated with ulid(), is replaced with a check character calculated by generateCheckCharacter().

Of course, by doing this, 5 bits will be "wasted". Still, it will be more advantageous than UUIDv7, which uses 6 fixed bits to mark its structure.

For Monotonic ULIDs, the first character of the random component should be deleted instead of the last one. So the statement that removes 1 character could be changed from this:

// delete the last character of the random component
truncatedULID = generatedULID.substr(0, 25);

to this:

// delete the FIRST character of the random component
truncatedULID = generatedULID.substr(0, 10) + generatedULID.substr(11, 15);

EDIT: fixed the base alphabet used in the constant codePoints.

mofosyne commented 1 year ago

Thanks for the example. I noticed that the code you provided is not actually Crockford's Base32 but is actually more of a generic base36, but you got the point across on the idea of modifying the ULID spec to add checksum support.

Did a bit of calculation and it's more like you would waste 3bits if you do a single character truncation of the last char of the textual ULID string.

# ULID binary stats
ULID_bit_size = 128
ULID_byteCount = ULID_bit_size/8 → 16

# ULID base32 stats
ULID_C32_charCount = 26
ULID_C32_bitlength = ULID_C32_charCount*5→ 130
ULID_C32_byteCount = ULID_C32_bitlength/8→ 16.25
ULID_C32_extraBits = ULID_C32_bitlength - ULID_bit_size→ 2

# if we truncate last crockford char of ULID
ULID_truncated_bits = (ULID_C32_charCount-1)*5→ 125
ULID_lost_bits = ULID_bit_size - ULID_truncated_bits→ 3

Eitherway, bit iffy of just changing the spec to store a checksum intended for humans to human communication into it, even if its only 3bits wasted. As it would increase the complexity of implementation and potential edge cases. Plus those who already uses it would be forced to figure out how to migrate their ID to the new format.

fabiolimace commented 1 year ago

Thanks for the example. I noticed that the code you provided is not actually Crockford's Base32 but is actually more of a generic base36...

Sorry. I forgot to copy and paste the correct alphabet from the spec README: 0123456789ABCDEFGHJKMNPQRSTVWXYZ. Fixed.

mofosyne commented 1 year ago

Ah not a problem. Anyway I've had a read around and I understand the checksum implementation a bit more now.

https://www.johndcook.com/blog/2018/12/28/check-sums-and-error-detection/

The check char is actually the ULID as a number but modulo by 37 and appended with a slightly extended character set (*, ~, $, =, and U) as explained in this quote below from Johnd Cook.

That is, we take the remainder when the base 32 number is divided by 37 and append the result to the original encoded number. The remainder could be larger than 31, so we need to expand our alphabet of symbols. Crockford recommends using the symbols *, ~, $, =, and U to represent remainders of 32, 33, 34, 35, and 36.

He was able to prove that this prime number choice actually has a reason in terms of detecting "wrong-symbol and transposed-symbol errors".

Anyway, I would be proposing that if we have 01ARZ3NDEKTSV4RRFFQ69G5FAV as the ID, we would simply append _<crockford mod 37 check char> e.g. 01ARZ3NDEKTSV4RRFFQ69G5FAV_6 (note check char in this example is made up). Adding _ makes it easier to double click select and also make it clear that the check char is separate from the main ID (so you can still visually cross reference the ID without getting confused.

So in short the textual format I can now propose is this (where ulid: is optional) (where _<crockford mod 37 check char> is optional). Try double clicking below to see how it reacts to selection:

01ARZ3NDEKTSV4RRFFQ69G5FAV
01ARZ3NDEKTSV4RRFFQ69G5FAV_6
ulid:01ARZ3NDEKTSV4RRFFQ69G5FAV_6

hence this format proposal

ulid:<crockford base32 encoded ULID>_<crockford mod 37 check char>

Just realised that the checksum character = makes it harder to put into url parameters... so you may want to just stick to crockford's base32 set and ignore his checksum behaviour. E.g. maybe ulid:<crockford base32 encoded ULID>_<crockford mod base32> where you would need to figure if the checksum should be a single char (5bits), two char (10bits) (or 8bits for implementation simplicty) or larger... in my opinion, a check of two char length is easy enough to speak over the phone. (e.g. 01ARZ3NDEKTSV4RRFFQ69G5FAV_XX where XX is crockford base32 symbols)