uuid6 / uuid6-ietf-draft

Next Generation UUID Formats
https://datatracker.ietf.org/doc/draft-peabody-dispatch-new-uuid-format/
187 stars 11 forks source link

V02: Fixed Typos and added 32 to 36 bit padding clarification #21

Closed kyzer-davis closed 3 years ago

kyzer-davis commented 3 years ago

Updated V02 .xml (use for diff), .txt, and .html file

nerg4l commented 3 years ago
   In the event that 32 bit Unix Timestamp are in use; four zeros SHOULD
   be appended after the least significant (right-most) bits of the 32
   bit Unix timestamp creating the 36 bit Unix timestamp.

Why padding from right? If the goal is to be able to lexicographically sort UUIDv7s then I think padding from the left would be better.

Example:

If I would lexicographically sort UUIDv7s from both systems I would get an invalid order.

kyzer-davis commented 3 years ago

@nerg4l

If I would lexicographically sort UUIDv7s from both systems I would get an invalid order.

Wouldn't you always get a sort error in this scenario since mixing 32-bit and 64-bit Unix Epoch introduces all kinds of differences?


As for 32-bit start or end padding: Are there any current Unix epoch implementations that output 36-bit integer vs the normal 32-bit or 64-bit variants? If I had a reference, that would solidify where to pad.

That being said, I did toy around with both today while commenting on this thread: https://github.com/uuid6/prototypes/pull/2#issuecomment-895393501

I came to the following conclusions.

When using four zeros to start padding:

When using four zeros as end padding:

Python Test Code

import time

# 64-bit Unix Epoch
time64 = time.time_ns()
time32 = int(time.time())
time36_pad_a = f'{time32:032b}' + "0000"
time36_pad_b = "0000" + f'{time32:032b}'

print("Time Integers:")
print("Time64_int:           " + str(time64))
print("Time32_int:           " + str(time32))
print("Time36_pad_end_int:   " + str(int(time36_pad_a, 2)))
print("Time36_pad_start_int: " + str(int(time36_pad_b, 2)))

print("\nHex Method A:")
print("Time64_hex:           " + str(hex(time64)[2:].upper()))
print("Time32_hex:           " + str(hex(time32)[2:].upper()))
print("Time36_pad_end_hex:   " + str(hex(int(time36_pad_a, 2))[2:].upper()))
print("Time36_Pad_start_hex: " + str(hex(int(time36_pad_b, 2))[2:].upper()))

print("\nHex Method B:")
print("Time64_hex:           " + str(f'{time64:04x}'.upper()))
print("Time32_hex:           " + str(f'{time32:04x}'.upper()))
print("Time36_pad_end_hex:   " + str(f'{int(time36_pad_a, 2):04x}'.upper()))
print("Time36_Pad_start_hex: " + str(f'{int(time36_pad_b, 2):04x}'.upper()))

Sample Output:

Time Integers:
Time64_int:           1628544077388789100
Time32_int:           1628544077
Time36_pad_end_int:   26056705232
Time36_pad_start_int: 1628544077

Hex Method A:
Time64_hex:           1699C035C1D3356C
Time32_hex:           61119C4D
Time36_pad_end_hex:   61119C4D0
Time36_Pad_start_hex: 61119C4D

Hex Method B:
Time64_hex:           1699C035C1D3356C
Time32_hex:           61119C4D
Time36_pad_end_hex:   61119C4D0
Time36_Pad_start_hex: 61119C4D

I am open to changing to the alternative as long as we are consistent, it is documented and it does not introduce more issues than it fixes. @bradleypeabody, what do you think?

fabiolimace commented 3 years ago

Further this padding may lead to leading 0's in the UUID which may or may not be a problem as long as implementations treats the UUID as a string in the 8-4-4-4-12 format but when converting between hex/integer/binary this could be problematic.

Imho we shoud not add right pad because the hex() function removes the leading zeros.

RFC-4122 describes fields and operations on UUIDs in terms of integers, bits and bytes. The canonical format is just a string representation.

The RFC-4122 says regarding to padding: "Each field is treated as an integer and has its value printed as a zero-filled hexadecimal digit string with the most significant digit first." [1]

Another document I read a few days ago says: "The hexadecimal representation of a byte value is the two-character string created by expressing value in hexadecimal using ASCII lower hex digits, left-padded with '0' to reach two characters." [2] [3]

UUID3, UUID4, and UUID5 have 1/16 chance of starting with zeros. If an implementation removes leading zeros after converting from integer or binary to hexadecimal, the resulting string will not be a canonical representation. I think the same in the case of UUIDv7.

The 4 bits represented by that leading zero will be used ​​in the future after the year 2106. If we add the right pad, the timestamp will not be able to contain dates after that year (although I don't expect to be here on that date).

The leading zero is what we see today. In January 1st, 1977, we would see 2 leading zeros (00d2b0b8-0000-7000-8000-000000000000), but that doesn't mean we should pad 8 bits after the timestamp.

The right side padding recommendation creates 2 incompatible UUID7 sub-versions: one with padding and one without padding. If user needs to extract creation time from UUID7, how does he know if timestamp uses a right pad or not?

Also, 32-bit timestamp should be avoided due to the Year 2038 problem.

fabiolimace commented 3 years ago

I would like to propose the change of the time bit length to 34 bits. It solves the problem with the leading zero.

With a 34 bit timestamp, the maximum date is around 2514 AD. Far enough.

It also leaves 2 bits free to increase entropy in the UUID7.

I implemented this prototype in Java to test my proposal:

package com.github.uuid6;

import java.security.SecureRandom;
import java.time.Instant;
import java.util.UUID;

public final class Uuid7with34bits {

    private static long counter = 0;
    private static long prevTime = 0;

    private static final int VERSION = 7;

    private static final int I64_BITS = 64;
    private static final int SEC_BITS = 34; // unixts bits
    private static final int SUBSEC_BITS = 10; // millisecond precision
    private static final int COUNTER_BITS = 16;

    private static final int SEC_SHIFT = I64_BITS - SEC_BITS; // 30
    private static final int SUBSEC_SHIFT = I64_BITS - SEC_BITS - SUBSEC_BITS; // 20

    private static final long PRECISION = 1000L;
    private static final long SUBSEC_FACTOR = (1 << SUBSEC_BITS); // 2^10 = 1024
    private static final long COUNTER_LIMIT = (1 << COUNTER_BITS); // 2^16 = 65536

    protected static final SecureRandom SECURE_RANDOM = new SecureRandom();

    public static synchronized UUID next() {

        // get the current time
        final long time = System.currentTimeMillis();

        // get seconds and sub-seconds
        final long sec = time / PRECISION;
        final long subsec = (long) (((time % PRECISION) / (double) PRECISION) * SUBSEC_FACTOR);

        // increment the counter the time repeats
        if (time == prevTime) {
            if (++counter >= COUNTER_LIMIT) {
                counter = 0;
            }
        } else {
            counter = 0;
        }

        // concatenate `secs` with `subsecs` in the the most significant bits
        long msb = (sec << SEC_SHIFT) | (subsec << SUBSEC_SHIFT) //
                | (counter & 0b1111_0000_0000_0000) << 4 | (counter & 0b0000_1111_1111_1111);

        // apply the version number
        msb = (msb & 0xffffffffffff0fffL) | (VERSION << 12);

        // get random last significant bits
        long lsb = SECURE_RANDOM.nextLong();

        // set the variant number
        lsb = (lsb & 0x3fffffffffffffffL) | 0x8000000000000000L;

        // save the time
        prevTime = time;

        return new UUID(msb, lsb);
    }

    public static void main(String[] args) {

        System.out.println("List of UUID7 with 34 bits time:\n");
        for (int i = 0; i < 10; i++) {
            UUID uuid = Uuid7with34bits.next();
            System.out.println(uuid);

        }

        final long maximumUNIXTS = 0x3ffffffffL; // all 34 bits set to 1
        System.out.println("\nMaximum date with 34 bits:\n");
        System.out.println(Instant.ofEpochSecond(maximumUNIXTS));
        System.out.println(Instant.ofEpochSecond(maximumUNIXTS >> 1) + " (signed)");
    }
}

Output:

List of UUID7 with 34 bits time:

18447e50-28e0-7000-a4e1-daf2923af1ad
18447e50-28e0-7001-ac4c-e97252b04548
18447e50-28e0-7002-a4ff-6cda93a1448e
18447e50-28e0-7003-b785-07a57cc1bf92
18447e50-28e0-7004-a6cc-868e498afd74
18447e50-28e0-7005-a359-d5aebf1bd430
18447e50-28e0-7006-bade-56a5d809de63
18447e50-28e0-7007-a47a-f5c835c3baaa
18447e50-28e0-7008-96bf-2cf017506398
18447e50-28e0-7009-ac16-d9fad1a89ce6

Maximum date with 34 bits:

2514-05-30T01:53:03Z
2242-03-16T12:56:31Z (signed)
nerg4l commented 3 years ago

@fabiolimace

The right side padding recommendation creates 2 incompatible UUID7 sub-versions: one with padding and one without padding. If user needs to extract creation time from UUID7, how does he know if timestamp uses a right pad or not?

This one is also really important.


@kyzer-davis

Wouldn't you always get a sort error in this scenario since mixing 32-bit and 64-bit Unix Epoch introduces all kinds of differences?

If they represent seconds in both cases then it does not matter until 2038.

Are there any current Unix epoch implementations that output 36-bit integer vs the normal 32-bit or 64-bit variants? If I had a reference, that would solidify where to pad.

Time in programming is always problematic. I cannot provide any reference and from my experience it's all over the place. I can provide some examples in programming languages I'm familiar with.

Precision in bits:

JavaScript

Date.now() returns the number of milliseconds elapsed since Unix epoch. Source

Date.now() returns a float in IEEE-754 double-precision (64 bit) format and it doesn't uses fractional exponent part which means it can represent 53 bits (52 + sign). 53 - 20 = 33 53 - 10 = 43 which means 33 43 is the maximum bits available for unixts.

PHP

microtime(true) returns the number of microseconds elapsed since Unix epoch. Source

microtime(true) returns a float in IEEE-754 double-precision (64 bit) format and it uses fractional exponent part which means it can represent seconds in 53 bits (52 + sign). 53 - 20 = 33 which means 33 is the maximum bits available for unixts.

Go

time.Now().UnixNano() returns the number of nanoseconds since Unix epoch. Source

time.Now().UnixNano() returns int64 (63 + sign). 64 - 30 = 34 which means 34 is the maximum bits available for unixts.

Edit: Removed fractional where I meant exponent.

Edit 2: Fixed metric prefixes. Mixed up microseconds and milliseconds.

kyzer-davis commented 3 years ago

@fabiolimace @nerg4l good points all around.

I also went through some old email exchanged between Brad and I and his original goal aligns with yours. Fault here is all mine for reading this wrong. I just updated the timestamp section to appropriately convey this. (Note, not sure why the commit doesn't show up in this PR since it is on the same branch as the original PR? If it doesn't show up I will open another PR.)

Brad's Quotes in my inbox:

The first 36 bits are a big endian unsigned second-precise integer Unix timestamp.

By adding just a few bits to an unsigned integer we can add much more available time to a Unix timestamp. The maximum timestamp of a 36-bit unsigned value is in the year 4147.

36 bits is intentional, in order to avoid the year 2038 issue (unix timestamp at 2,147,483,647 is 19 Jan 2038 19:14:07 UTC). Most systems these days will return the unix timestamp as an int64 and then the lower 36 bits of this can be extracted.

The first 36 bits are a regular Unix timestamp. Most systems these days will receive this information as a 64-bit integer, and the least significant 36 bits can just be copied into the UUID verbatim.


I also fixed #22 along with a forward reference on using UUIDv8 if an implementation/application requires an untruncated 64 bit Unix timestamp vs UUIDv7's 36-bit variant.


@fabiolimace Let's continue to discuss 34 vs 36 for unixts in a separate issue. #23 @nerg4l I will also open an issue to continue discussion of the precision encoding. #24

nerg4l commented 3 years ago

@kyzer-davis I can confirm, your latest commit is not visible.

kyzer-davis commented 3 years ago

Yeah, let me close PR this and re-pull. I am not sure what happened :/