uuid6 / uuid6-ietf-draft

Next Generation UUID Formats
https://datatracker.ietf.org/doc/draft-peabody-dispatch-new-uuid-format/
187 stars 11 forks source link

Some ideas for UUIDv7 #11

Closed fabiolimace closed 3 years ago

fabiolimace commented 3 years ago

@bradleypeabody, @kyzer-davis,

This is a continuation of my comment on PR #10

I thought of another approach. Instead of treating the integer part and the fractional part separately, the entire timestamp can be encoded in a fixed-point number.

I imagined another structure that reserves the last 32 bits for random bits or application-specific data. I'm not sure if anyone will need more than 90 bits for the timestamp. A field with 38 bits for the integer part and 52 bits for the fractional part is enough store a decimal number with 15 digits after the decimal point, like this: 0.123456789012345.

The structure described below also works with 48 bits reserved for milliseconds, requiring very small changes. By the way, I think that 48 bits for milliseconds may be more convenient in contexts where the best precision you can get is milliseconds, as in Java 8, because you don't need to encode the sub-millisecond part.

Sorry for sending spam to your project. I am very happy with the progress of this RFC update. I'm just trying to contribute some ideas, and I don't expect them to be accepted.

Structure with 90 bits for timestamp

This proposed structure consists of two fields: unixtime and random. The only mandatory requirements are to fill in the first 38 bits (48 bits for ms is better?) of the unixtime field with the current seconds and fill in the 6 bits reserved for version and variant with their standardized values. The number of sub-second bits in the unixtime field is variable/arbitrary. And the random field does not have to be filled with random bits. The name of this field is just a hint that its value by default is random.

unixtime: a 90-bit field reserved for the Unix timestamp. This field is encoded as a fixed-point number in the form fixed<38,n>, where 38 is the number of bits before the binary point, and n is the number of bits after the binary point. The size n is arbitrary so that it can be defined by the system that generates the UUID. The bits not used in this field can be filled with a sequence counter and/or some random bits, for example. The timestamp can represent dates up to the year 10680 A.D. (2^38/60/60/24/365.25 + 1970).

random: a 32-bit field reserved for random bits or application-specific data. Part or all of this field can be used, for example, to identify the UUID generator. This field is opaque, so its bits have no meaning, except for the system that generates the UUID.

     0                   1                   2                   3
     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                           unixtime                            |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |            unixtime           |  ver  |        unixtime       |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |var|                       unixtime                            |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                            random                             |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Encoding and decoding the sub-second part

Say you have a fractional value between 0 and 1 represented as f, you can encode that fractional value with the precision of n bits like this:

subsec = f * 2^n

To decode a subsec with n precision you do the inverse operation:

f = subsec / 2^n

This tutorial describes an efficient method for converting floating-point to fixed-point in MATLAB: Fixed-Point.pdf.

Generic algorithm to generate a UUIDv7

  1. Get the current Unix second and sub-second;
  2. Fill the first 38 bits of the UUID with the current Unix seconds;
  3. If the sub-second is not a fraction between 0 and 1, divide it by the time precision provided by the system. If the precision is millisecond, divide the sub-second value by 1,000. If the precision is microsecond, divide it by 1,000,000 and so on.
  4. Multiply the fraction of sub-seconds by 2^n, wheren is the number of bits that fits to the system time precision. If the precision is millisecond, n is equal to 10 (2^10 = 1,024). If the precision is microsecond, n is equal to 20 (2^20 = 1.048.576) and so on;
  5. Fill in the next n bits of the UUID with the n bits of the previous multiplication product;
  6. Fill the rest of the bits with random data;
  7. Insert the version and the variant bits in their standard positions using bitwise operations;
  8. Return the UUID.

Generic algorithm to extract the unix second

  1. Extract the first 38 bits of the UUID;
  2. Return the resulting unsigned integer.

Generic algorithm to extract the sub-second fraction

  1. Extract the first n bits starting from the 38th bit position of the UUID, where n is the number of bits that fits the system time precision;
  2. Divide the integer value obtained by 2^n to get a fractional value;
  3. Return the resulting fractional value.

To obtain a floating-point representation of the timestamp you can sum the two parts.

An implementation in python

Yesterday I implemented this simple python generator to test the concept:

#!/usr/bin/python3
#
# @date: 2021-03-06
# @author: Fabio Lima
#
# Changelog:
# - 2021-03-16: add parameter `subsec_decimal_digits` to function `uuid7()`
# - 2021-03-08: use `time.time_ns()` instead of `time.time()` to fix nanosecond precision
#

from uuid import UUID
from time import time_ns
from random import randint

TOTAL_BITS=128
VERSION_BITS = 4
VARIANT_BITS = 2

# Binary digits before the binary point
SEC_BITS=38

# Binary digits after the binary point
SUBSEC_BITS_S=0
SUBSEC_BITS_MS=10
SUBSEC_BITS_US=20
SUBSEC_BITS_NS=30
SUBSEC_BITS_DEFAULT=SUBSEC_BITS_NS

# Decimal digits after the decimal point
SUBSEC_DECIMAL_DIGITS_S=0   # 0
SUBSEC_DECIMAL_DIGITS_MS=3  # 0.999
SUBSEC_DECIMAL_DIGITS_US=6  # 0.999999
SUBSEC_DECIMAL_DIGITS_NS=9  # 0.999999999
SUBSEC_DECIMAL_DIGITS_DEFAULT=SUBSEC_DECIMAL_DIGITS_NS

SLICE_MASK_0 = 0xffffffffffff00000000000000000000
SLICE_MASK_1 = 0x0000000000000fff0000000000000000
SLICE_MASK_2 = 0x00000000000000003fffffffffffffff

def uuid7(t=None, subsec_bits=SUBSEC_BITS_DEFAULT, subsec_decimal_digits=SUBSEC_DECIMAL_DIGITS_DEFAULT):

    if t == None:
        t = time_ns()

    i = get_integer_part(t)
    f = get_fractional_part(t, subsec_decimal_digits)

    sec = i
    subsec = round(f * (2 ** subsec_bits))

    node_bits = (TOTAL_BITS - VERSION_BITS - VARIANT_BITS - SEC_BITS - subsec_bits)

    uuid_sec = sec << (subsec_bits + node_bits)
    uuid_subsec = subsec << node_bits
    uuid_node = randint(0, (2 ** node_bits))

    uuid_int = uuid_sec | uuid_subsec | uuid_node # 122 bits
    uuid_int = __add_version__(uuid_int) # 128 bits

    return UUID(int=uuid_int)

def uuid7_s(t=None):
    return uuid7(t, SUBSEC_BITS_S, SUBSEC_DECIMAL_DIGITS_S)

def uuid7_ms(t=None):
    return uuid7(t, SUBSEC_BITS_MS, SUBSEC_DECIMAL_DIGITS_MS)

def uuid7_us(t=None):
    return uuid7(t, SUBSEC_BITS_US, SUBSEC_DECIMAL_DIGITS_US)

def uuid7_ns(t=None):
    return uuid7(t, SUBSEC_BITS_NS, SUBSEC_DECIMAL_DIGITS_NS)

def __add_version__(uuid_int):

    slice_mask_0 = SLICE_MASK_0 >> (VERSION_BITS + VARIANT_BITS)
    slice_mask_1 = SLICE_MASK_1 >> (VARIANT_BITS)
    slice_mask_2 = SLICE_MASK_2

    slice_0 = (uuid_int & slice_mask_0) << (VERSION_BITS + VARIANT_BITS)
    slice_1 = (uuid_int & slice_mask_1) << (VARIANT_BITS)
    slice_2 = (uuid_int & slice_mask_2)

    uuid_output = slice_0 | slice_1 | slice_2
    uuid_output = uuid_output & 0xffffffffffff0fff3fffffffffffffff # clear version
    uuid_output = uuid_output | 0x00000000000070008000000000000000 # apply version

    return uuid_output

def __rem_version__(uuid_int):

    slice_0 = (uuid_int & SLICE_MASK_0) >> (VERSION_BITS + VARIANT_BITS)
    slice_1 = (uuid_int & SLICE_MASK_1) >> (VARIANT_BITS)
    slice_2 = (uuid_int & SLICE_MASK_2)

    uuid_output = slice_0 | slice_1 | slice_2

    return uuid_output

def get_integer_part(t):
    SUBSEC_DECIMAL_DIGITS_PYTHON=9
    subsec_decimal_divisor = (10 ** SUBSEC_DECIMAL_DIGITS_PYTHON)
    return int(t / subsec_decimal_divisor)

def get_fractional_part(t, subsec_decimal_digits=SUBSEC_DECIMAL_DIGITS_DEFAULT):
    SUBSEC_DECIMAL_DIGITS_PYTHON=9
    subsec_decimal_divisor = (10 ** SUBSEC_DECIMAL_DIGITS_PYTHON)
    return round((t % subsec_decimal_divisor) / subsec_decimal_divisor, subsec_decimal_digits)

def extract_sec(uuid):
    uuid_int = __rem_version__(uuid.int)
    uuid_sec = uuid_int >> (TOTAL_BITS - VERSION_BITS - VARIANT_BITS - SEC_BITS)    
    return uuid_sec

def extract_subsec(uuid, subsec_bits=SUBSEC_BITS_DEFAULT, subsec_decimal_digits=SUBSEC_DECIMAL_DIGITS_DEFAULT):
    uuid_int = __rem_version__(uuid.int)
    node_bits = (TOTAL_BITS - VERSION_BITS - VARIANT_BITS - SEC_BITS - subsec_bits)
    uuid_subsec = (uuid_int >> node_bits) & ((1 << (subsec_bits)) - 1)
    return round(uuid_subsec / (2 ** subsec_bits), subsec_decimal_digits)

def list():

    print("UUIDv7                               sec in       sec out      subsec in    subsec out")

    for i in range(10):
        t = time_ns()
        u = uuid7(t)
        i = get_integer_part(t)
        f = get_fractional_part(t)
        sec = extract_sec(u)
        subsec = extract_subsec(u)
        print(u, str(i).ljust(12), str(sec).ljust(12), str(f).ljust(12), str(subsec).ljust(12))

list()

OUTPUT:

UUIDv7                               sec in       sec out      subsec in    subsec out
018145e0-5fc5-7a65-ae82-507fbfe22749 1615951895   1615951895   0.943017418  0.943017418 
018145e0-5fc5-7bad-b660-9ddd65978a4c 1615951895   1615951895   0.943095648  0.943095648 
018145e0-5fc5-7c84-a49c-a9797ed70dc4 1615951895   1615951895   0.943146842  0.943146842 
018145e0-5fc5-7d46-b435-4c76da9142c5 1615951895   1615951895   0.943193153  0.943193153 
018145e0-5fc5-7de2-8716-009c3130d332 1615951895   1615951895   0.943230178  0.943230178 
018145e0-5fc5-7e74-9231-05a35740fba0 1615951895   1615951895   0.943265028  0.943265028 
018145e0-5fc5-7f4a-8022-e1ba32698bc3 1615951895   1615951895   0.943315983  0.943315983 
018145e0-5fc5-7fd5-a568-b1d7bc223a91 1615951895   1615951895   0.943349262  0.943349262 
018145e0-5fc6-7062-8b99-ba3f383d9b8e 1615951895   1615951895   0.943382783  0.943382783 
018145e0-5fc6-70ed-98ce-fa5bf04836d8 1615951895   1615951895   0.943415972  0.943415972

Python time precision

The first version of the generator above used time.time(). This method appears to return floating point numbers with 22 bits after the binary point. I had to replace it with time.time_ns() to fix the nanosecond precision. If you need nanosecond precision, don't use time.time().

This loop prints the sub-seconds returned by time.time() in binary format:

from time import time

for i in range(10):

    n = 30 # system time precision
    t = time() # get the current time
    i = int(t) # integer part
    f = t - int(t) # fractional part

    # encode the fractional part to an integer
    subsec = int(f * (2**n))    

    # print the result in binary format
    print(bin(subsec))

This is the output of the previous loop:

  |---------30 bits-----------|
0b10011010010000101110100000000
0b10011010010010000100100000000
0b10011010010010100100000000000
0b10011010010011010001000000000
0b10011010010011111000100000000
0b10011010010100001100100000000
0b10011010010100011111000000000
0b10011010010100110001100000000
0b10011010010101000011100000000
0b10011010010101010110000000000
  |-------------------|-------|
                        empty

Note that the last 8 bits are zeros.

bradleypeabody commented 3 years ago

Thanks @fabiolimace and sorry for the delayed reply.

Yes, I think you are describing the idea that we're moving forward with. An important concept here is that the timestamp is the only useful thing that other systems may need to extract from the UUID, and that systems may require different precisions of sub-second timestamps.

Because of these properties, one system can encode the sub-second part with n bits of precision, and another system can decode it with n-1 or n+1 precision and they will get approximately the same resulting number - with the according loss of precision due to either truncation or reading extra random bits into the sub-second timestamp part. But the caller does not need to know or care about this - which provides for excellent interoperability.

I'm going to try to hammer out the last edits that are in my court now/very soon and coordinate with @kyzer-davis about getting this new draft submitted.

bradleypeabody commented 3 years ago

For the seconds part, I'm going to try with 36 bits. This will continue to work for over 2000 years into the future; and it also makes it so if you're decoding with millisecond precision you can read right up to the ver field for the ms part.

fabiolimace commented 3 years ago

I also think that 36 bits are enough for the seconds. It allows the maximum timestamp to be close to 4147 A.D. It is more than the theoretical limit of 3400 A.D. in RFC-4122, considering a signed timestamp.

I am happy that the project is still moving forward.

kyzer-davis commented 3 years ago

@fabiolimace Do you have cycles to review the current IETF published RFC draft v01 for UUIDv7, make any modifications to your sample python code and open a PR to add this UUIDv7 python prototype to the code over on https://github.com/uuid6/prototypes ?

If not let me know and I will transpose this code with the required changes to merge it into the python prototypes while I am doing some work in the branch draft-01-updates .

fabiolimace commented 3 years ago

@ kyzer-davis I can open a PR next weekend to add a UUIDv7 prototype to the code at https://github.com/uuid6/prototypes

fabiolimace commented 3 years ago

@kyzer-davis sorry for the delayed reply.

I opened a pull request: https://github.com/uuid6/prototypes/pull/2

fabiolimace commented 3 years ago

I'm closing this issue due to the new draft. #14

CMCDragonkai commented 2 years ago

I'm trying to implement this in JS, just a question about the 36 bits and the 12 bits for uuidv7 and big-endian order.

     0                   1                   2                   3
     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                            unixts                             |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |unixts |         msec          |  ver  |          seq          |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |var|                         rand                              |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                             rand                              |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

For unixts and msec, together they are 48 bits.

If I had unixts as an integer meaning seconds, and msec as an integer meaning milliseconds.

Would it be correct to do:

const unixts = 1633576858;
const msec = 944;
const combined = (unixts << 12) | (0x0fff & msec);

// maybe if i truncate it to the first 36 bits first?
const combined2 = ((0xFFFFFFFFF & unixts) << 12) | (0x0fff & msec);

Then with combined I could take the first 5 bytes and put it into a typed array in big-endian order?

Actually I realised that the unixts number is too big, the shifting by 12 is not going to work, unless I were to first truncate it...

It seems this could work: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/BigInt

CMCDragonkai commented 2 years ago

Just wanted to point out that I've been making good progress on js-id https://github.com/MatrixAI/js-id, and solved the above problems. For now using bit strings, although the library can be made more efficient with bitwise operators in the future.

I've also noticed that UUIDv7 can support logical timestamps as well, could be a useful usecase when you don't want to leak the wall clock time, but still preserve ordering of IDs.

bradleypeabody commented 2 years ago

Hi @CMCDragonkai - thanks for your work on this. Glad to hear it's coming along.

One thing to keep in mind as that UUIDv7 is still a draft and various ideas are being considered (and have gone back and forth), so keep that in mind that the spec will almost certainly end up changing.

I've also noticed that UUIDv7 can support logical timestamps as well, could be a useful usecase when you don't want to leak the wall clock time, but still preserve ordering of IDs. Yeah the time does not need to be super granular or have accuracy guarantees, and in fact this is one of the issues that has come up and is leaning toward only having millisecond precision in the time stamp.

CMCDragonkai commented 2 years ago

Thanks, we need to use it in our projects now, so it's going into production. Hopefully any changes won't be a big deal and our current situation will be compatible.

CMCDragonkai commented 2 years ago

BTW I encountered a problem with the fixed point algorithm that wasn't clear with the reference python implementation I was following and the spec discussed above. With 12 bits, if the fixed point conversion resulted in the number 4096, this would cause the 12 bit sector to overflow and be all zeros. This corrupted the sort order of the ids.

To solve this in my implementation:

/**
 * Converts floating point number to a fixed point tuple
 * Size is number of bits allocated for the fractional
 * Precision dictates a fixed number of decimal places for the fractional
 */
function toFixedPoint(
  floating: number,
  size: number,
  precision?: number,
): [number, number] {
  let integer = Math.trunc(floating);
  let fractional: number;
  if (precision == null) {
    fractional = floating % 1;
  } else {
    fractional = roundPrecise(floating % 1, precision);
  }
  // If the fractional is rounded to 1
  // then it should be added to the integer
  if (fractional === 1) {
    integer += fractional;
    fractional = 0;
  }
  // Floor is used to round down to a number that can be represented by the bit size
  // if ceil or round was used, it's possible to return a number that would overflow the bit size
  // for example if 12 bits is used, then 4096 would overflow to all zeros
  // the maximum for 12 bit is 4095
  const fractionalFixed = Math.floor(fractional * 2 ** size);
  return [integer, fractionalFixed];
}

Basically after acquiring the fractional, I check whether it is 1 and if so, add it to the integer. Otherwise when it goes to being converted to fixed point number, it's still essential to use Math.floor so it can never get 4096.

Details on my fix is here: https://github.com/MatrixAI/js-id/pull/8