Closed rogusdev closed 2 months ago
Hi @rogusdev :wave:
I'd be keen to go catch up on some of the discussion around that. Those sections look like recommendations rather than requirements, so I think uuid
's current implementation that fills the entire 74 bits of rand_a
and rand_b
with random data is still compliant.
It seems a bit overkill to me to need a clock, a source of randomness, and a monotonic counter to generate V7 UUIDs, but we could consider adding *_monotonic
variants of the v7
methods that accept a ClockSequence
like v1
and v6
do.
The test vector for UUIDv7 appears to use a fully random value for rand_a
and rand_b
: https://datatracker.ietf.org/doc/html/draft-peabody-dispatch-new-uuid-format-04#name-example-of-a-uuidv7-value
I think we should consider supporting this, but don't think we need to block stabilizing the current APIs on it.
From my perspective, while the main point of uuid v7 is to put the clock at the front, so that db indexes can btree on them properly, counters add additional safety mechanisms that are quite to my taste: if you are generating ids in a single thread, you are guaranteed that no duplicates can happen, if there are less than the counter limit per millisecond. While the degree of randomness involved otherwise is certainly extremely unlikely to have duplicates, I am a big fan of guarantees.
Which is in fact why they put it into the RFC, as that is also a feature in many other popular id libraries.
That said, "SHOULD" is indeed a recommendation, rather than requirement, and they list multiple alternatives. So I support having a separate set of ctor functions to call for the counter version(s).
Hi Ashley,
I'm Sergey Prokhorenko, a contributor to rfc4122bis and a counter enthusiast.
Some developers implement the counter:
UUIDv7 with the counter for PostgreSQL is currently being developed in C by Andrey Borodin:
But it seems to me that Rust is also a good tool for such development.
We've currently actually got access to a counter in our Timestamp
type, because that's where it gets stashed for v1
and v6
UUIDs. How does the following sound to you:
Timestamp.counter
on v1
or v6
; just make it always available.Uuid::new_v7
, otherwise use random bytes.Builder::from_timestamp_millis_counter(millis: u64, counter: u16, random_bytes: [u8; 6])
method.An alternative for 2. would be to add Uuid::new_v7_counter
and Uuid::now_v7_counter
methods that use the counter.
We've currently actually got access to a counter in our
Timestamp
type, because that's where it gets stashed forv1
andv6
UUIDs. How does the following sound to you:
- Don't feature gate
Timestamp.counter
onv1
orv6
; just make it always available.
Like the vast majority, I am convinced that only the seventh version is worthy of attention. There is no need to waste time and effort implementing other versions.
- Use the counter value if it's non-zero in
Uuid::new_v7
, otherwise use random bytes.
I'm against. A counter initialized to zero increases the collision probability. Worse, the counter value may be the same as the random value used instead of the counter at the beginning of the millisecond.
I prefer a counter that is initialized to a random value at the beginning of each millisecond. But this may reduce the actual capacity of the counter. Therefore, the leftmost bit of the counter should be initialized to zero and/or the timestamp should be incremented when the counter overflows.
- Add a
Builder::from_timestamp_millis_counter(millis: u64, counter: u16, random_bytes: [u8; 6])
method.
The timestamp in the seventh version is shorter, and you missed the ver and var segments. My suggestion:
Segment length, bits | Field in RFC | Segment content |
---|---|---|
48 | unix_ts_ms | Timestamp |
4 | ver | Version |
1 | rand_a | Counter segment initialized to zero |
11 | rand_a | Counter segment initialized with a pseudorandom number |
2 | var | Variant |
30 | rand_b | Counter segment initialized with a pseudorandom number |
32 | rand_b | UUIDv7 segment filled with a pseudorandom number |
64 | Optional segment to the right of the UUID in key database columns |
An alternative for 2. would be to add
Uuid::new_v7_counter
andUuid::now_v7_counter
methods that use the counter.
If we are talking about initializing the counter every millisecond with a random value and incrementing within a millisecond, then I am for it.
This library already implements v1
and v6
versions and has some established API conventions that we need to remain consistency with. I think we should be able to support everything suggested with some new Uuid::new_v7_counter
and Uuid::now_v7_counter
methods that accept an impl ClockSequence
. The implementation of that ClockSequence
is what would handle time and randomness.
I highly recommend looking at the UUIDv7 implementation in PostgreSQL v17 by Andrey Borodin
Thanks @sergeyprokhorenko :+1: Having a good reference implementation to point at will definitely be helpful
Another good implementation: https://github.com/LiosK/uuid7-rs/tree/8b6362ac9bd4b0a24639ebe8365e4f6f6cacec75 https://crates.io/crates/uuid7
It is mentioned in this benchmark.
Due to the excessively long counter (initialized with a random number every millisecond), less entropy is generated, which requires a lot of resources.
But it's worth trying to replace rand::rngs::OsRng with openssl-rand for better performance.
This patch has already been successfully tested: https://ardentperf.com/2024/02/03/uuid-benchmark-war/
Thought/Question: Why not add a builder pattern with defaults? That doesn't eliminate the backwards compatibility and can improve clarity around generating the UUID.
The API in the Rust ULID crate provides monotonicity using an increment(&self) -> Option<Ulid>
method. This means you can create batches of IDs by generating a single ulid and then increment the random part to create all the batches.
If you know that you are generating a batch then an increment
method should be "sufficient logic for organizing the creation order of those one-thousand UUIDs" (quoting from the RFC).
Here's is an example:
use ulid::Ulid;
use uuid::Uuid;
fn main() {
let mut uuids = Vec::with_capacity(10);
for _ in 0..10 {
let uuid = Uuid::now_v7();
uuids.push(uuid);
}
for uuid in uuids.iter() {
println!("{:<12} {uuid}", "uuidv7:");
}
println!("");
let mut ulids = Vec::with_capacity(10);
let mut ulid = Ulid::new();
for _ in 0..10 {
ulids.push(ulid);
ulid = ulid.increment().unwrap();
}
for ulid in ulids.iter() {
println!("{:<12} {ulid}", "ulid:");
}
}
Is taking blatant inspiration from the excellent work in the ULID crate ok? Of course. ULID is the very first referred to time-based unique ID referenced in the new RFC.
Note that this also gives more control over creation of batches. If one creates a batch of 1000 UUIDs, using an internal counter you have no control over whether they will be in the same millisecond. This means you will have part of the batch spread across different timestamps and the random component will jump around for each millisecond. Using an increment you can keep the millisecond and the random component stable. It's not required in the specification but seems useful to be able to trivially see if UUIDs are part of the same batch.
If one creates a batch of 1000 UUIDs, using an internal counter you have no control over whether they will be in the same millisecond.
This is not true
New functions for generating UUIDv7 with a counter in the ClickHouse DBMS: link
The structure is exactly the same as https://crates.io/crates/uuid7 and https://www.npmjs.com/package/uuidv7
I've started working on an implementation over in #755 that supports respecting the counter value when constructing v7 UUIDs in a way that lets a caller decide how wide they want that counter to be so the guarantees about ordering and uniqueness are configurable. It's still just a draft so will ping once it's ready.
This is not true
How might you allow the caller to control whether the UUIDs are generated for the same millisecond if the timer is internal to the uuid generator?
This is not true
How might you allow the caller to control whether the UUIDs are generated for the same millisecond if the timer is internal to the uuid generator?
UUIDs generated in different milliseconds have different timestamp values. UUIDs generated at the same millisecond have the same timestamp values.
@KodrAus
I've started working on an implementation over in #755 that supports respecting the counter value when constructing v7 UUIDs in a way that lets a caller decide how wide they want that counter to be so the guarantees about ordering and uniqueness are configurable. It's still just a draft so will ping once it's ready.
RFC9562 "Universally Unique IDentifiers (UUID)" has already been published and entered into force.
Differences between the implementation for DBMS ClickHouse and the implementation for DBMS PostgreSQL:
- Counter is 42 bits, not 18. The counter have no guard bits, every bit is initialized with random number on time ticks.
- By default counter is shared between threads. Alternative function generateUUIDv7ThreadMonotonic() provides thread-local counter.
In both implementations in case the counter overflows, the timestamp field is incremented by 1 and the counter is reset to a random new start value (guard bit to zero).
I would also recommend to add the ability to shift the timestamp by the value specified by the formal parameter timestamp_shift
of the function uuidv7(timestamp_shift)
. This will provide at least two benefits:
This is allowed by RFC9562:
Implementations MAY alter the actual timestamp. Some examples include security considerations around providing a real-clock value within a UUID to 1) correct inaccurate clocks, 2) handle leap seconds, or 3) obtain a millisecond value by dividing by 1024 (or some other value) for performance reasons (instead of dividing a number of microseconds by 1000). This specification makes no requirement or guarantee about how close the clock value needs to be to the actual time
It is quite difficult to implement the exclusion (in the unpredictable future) of leap seconds required by RFC9562:
UUIDv7 features a time-ordered value field derived from the widely implemented and well-known Unix Epoch timestamp source, the number of milliseconds since midnight 1 Jan 1970 UTC, leap seconds excluded.
In both implementations in case the counter overflows, the timestamp field is incremented by 1 and the counter is reset to a random new start value (guard bit to zero).
In order to maintain the invariant that UUIDs are sortable by the order they’re generated in this would need to increment the timestamp for all subsequently generated UUIDs in that millisecond, right?
In order to maintain the invariant that UUIDs are sortable by the order they’re generated in this would need to increment the timestamp for all subsequently generated UUIDs in that millisecond, right?
Yes that's right. Moreover, this may continue for several consecutive milliseconds of high generation rate until the real time catches up with the timestamp.
But with a guard bit and a counter 18 bits or longer, a counter overflow is an impossible event (until a quantum computer came along).
UUIDv7 extension written in RUST, implemented in PostgreSQL
Long story short...
All of these calculations make the mistaken implicit assumption that all generated UUIDs end up in a common database table where they can collide.
The absurdity of these calculations can be confirmed by the following example: UUIDv7s with a counter generated by one generator will never collide, even if random parts are completely removed from them. This is just as true as integer keys generated by autoincrement do not collide. The random part is necessary for the uniqueness of UUIDs generated by different generators (when merging database tables or at distributed UUID generation).
By the way, the counter is initialized with a random number every millisecond. Therefore, it is incorrect to say that the counter reduces the random part.
@wdhwg001
The authors of RFC 9562 did not intend to reject any of the contributors' proposals, but simply to combine all of these proposals. This does not mean that all these proposals were justified.
But in real implementations Method 1 is used, which is not at all divided into two opposing strategies, but is single. Your proposal differs only in the absence of the random part generated for each individual UUID. It kills the unguessability without giving anything in return.
@wdhwg001
The authors and contributors of RFC 9562 have detailed knowledge of monotonic ULIDs. Moreover, monotonic ULIDs were the basis for the development of UUIDv7. However, authors and contributors clearly saw the disadvantages of monotonic ULIDs. To address these disadvantages, UUIDv7 added a counter that is initialized with a random number and is protected from overflow. Adding a counter didn't make anything worse. It seems to me that there is no point in continuing this discussion, because you are not making valid arguments.
Discussion about the two formal parameters of the function that generates UUIDv7: https://github.com/ClickHouse/ClickHouse/pull/62852#issuecomment-2143736912
In either case, uuid
uses the ClockSequence
trait for controlling how counters work so this behavior is configurable. If you use the NoContext
type then you'll get fully random bytes. If you use the new CounterV7
(or whatever it ends up being called) type then you'll get a 42bit counter with the rest random. You can also implement your own ClockSequence
to further tweak behavior.
I've gotten around to implementing a counter for version 7 UUIDs based on the recommendations here: https://github.com/uuid-rs/uuid/pull/755/files#diff-df69eeff48c2663ed59b31541c6c396ffecfed985b395daadafda019a524f75cR627-R674
EDIT: Looks like GitHub is collapsing the relevant part of the diff, so I'll just drop it in here:
fn generate_timestamp_sequence(
&self,
seconds: u64,
subsec_nanos: u32,
) -> (Self::Output, u64, u32) {
use std::time::Duration;
let millis = (seconds * 1000).saturating_add(subsec_nanos as u64 / 1_000_000);
let last_reseed = self.last_reseed.get();
// If the observed system time has shifted forwards then regenerate the counter
if millis > last_reseed {
// Leave the most significant bit unset
// This guarantees the counter has at least 2,199,023,255,552
// values before it will overflow, which is exceptionally unlikely
// even in the worst case
let counter = crate::rng::u64() & (u64::MAX >> 23);
self.counter.set(counter);
self.last_reseed.set(millis);
(counter, seconds, subsec_nanos)
}
// If the observed system time has not shifted forwards then increment the counter
else {
// If the incoming timestamp is earlier than the last observed one then
// use it instead. This may happen if the system clock jitters, or if the counter
// has wrapped and the timestamp is artificially incremented
let millis = last_reseed;
// Guaranteed to never overflow u64
let counter = self.counter.get() + 1;
// If the counter has not overflowed its 42-bit storage then return it
if counter <= u64::MAX >> 22 {
self.counter.set(counter);
(counter, seconds, subsec_nanos)
}
// Unlikely: If the counter has overflowed its 42-bit storage then wrap it
// and increment the timestamp. Until the observed system time shifts past
// this incremented value, all timestamps will use it to maintain monotonicity
else {
let counter = 0;
self.counter.set(counter);
self.last_reseed.set(millis + 1);
let new_ts =
Duration::new(seconds, subsec_nanos) + Duration::from_millis(1);
(counter, new_ts.as_secs(), new_ts.subsec_nanos())
}
}
}
I'd appreciate a review of the parameters to make sure they're reasonable to most users. The counter:
Uuid::now_v7()
uses thread-local instances of this counter instead of purely random bytes. Space after the counter is filled with random bytes. This is to make sure if you call let a = Uuid::now_v7(); let b = Uuid::now_v7()
you guarantee a < b
. Since counters are thread-local, there is no guaranteed ordering across threads. If this is unsuitable for most users then I can base now_v7()
off a global lock. Personally, I think thread-local ordering is ok, and if you want global ordering, you can use Uuid::new_v7()
.
@KodrAus
I would recommend:
timestamp_offset
(positive or negative, data type i64
) in milliseconds.uuidv7()
.Thanks @sergeyprokhorenko. We unfortunately can’t directly do 2. or 3. because of backwards compatibility we can’t add or remove function parameters or change function names, but I can use a global lock internally.
You should still be able to shift timestamps when constructing UUIDs yourself through Uuid::new_v7
, whether this happens before or after getting the counter value. I’ll make sure we’ve got an example of this in the repository.
I also noticed it was pointed out other implementations reset the counter to a random value instead of zero on overflow so can make that change too.
I've merged #755. It needs a bit of a write-up, but the final result should be in-line with the parameters set out earlier in this thread. We guarantee that any UUID produced by Uuid::now_v7()
will sort later than any previously generated UUID in the same process.
Per the new draft "With this method rand_a section of UUIDv7 SHOULD be utilized as fixed-length dedicated counter bits that are incremented by one for every UUID generation."
That's in the "Fixed-Length Dedicated Counter Bits (Method 1)" section under https://datatracker.ietf.org/doc/html/draft-peabody-dispatch-new-uuid-format-04#monotonicity_counters which is directly linked from https://datatracker.ietf.org/doc/html/draft-peabody-dispatch-new-uuid-format-04#section-5.2 under the rand_a explanation.
To my reading, that means it should have a counter, like v6 (and v1), but instead the current v7 builder implementation https://docs.rs/uuid/latest/src/uuid/v7.rs.html#47-53 is just random for rand_a. Am I missing something on this one? If not, and this is a desirable feature, I might look into putting up a PR to add the counter support.