Let me elaborate a bit. We have the syscalls gettimeofday() and clock_gettime(), which return real time with microsecond and nanosecond precision respectively. To get millisecond resolution we must divide microseconds by 1000 or nanoseconds by 1000000. The division instruction is expensive on many machines. Even though a CSPRNG might be expensive too, I still see this division as a source of energy waste. If we instead count the number of 1/1024-second intervals obtained from gettimeofday(), or 1/(1024*1024)-second intervals from clock_gettime(), the computation can be done with bit shifts, i.e. (seconds << 10) + (us >> 10) instead of (seconds * 1000 + us / 1000). Would such a unix_ts_ms be in line with the standard?
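For concreteness, here is a rough sketch in C of the two computations being compared (my own illustration, assuming POSIX gettimeofday(); the function names are made up):

#include <stdint.h>
#include <sys/time.h>

/* Conventional conversion: seconds * 1000 + microseconds / 1000 (uses a divide). */
static uint64_t unix_ts_ms_div(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (uint64_t)tv.tv_sec * 1000 + (uint64_t)tv.tv_usec / 1000;
}

/* Proposed shift-only variant: counts 1/1024-second intervals instead of milliseconds. */
static uint64_t unix_ts_1024ths(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return ((uint64_t)tv.tv_sec << 10) + ((uint64_t)tv.tv_usec >> 10);
}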
one unfortunate shortcoming was revealed
What was the shortcoming? I don't understand how the new verbiage improves upon what's in the spec already.
I still see this division as a source of energy waste
The cost of a divide operation is several orders of magnitude below what we should be concerned with.
DIV r32 can cost you some tens of nanoseconds [0]. On some platforms RDTSC is cheaper than DIV [1], and the total cost of gettimeofday/clock_gettime will be comparable to this avoidable arithmetic. At which point are we concerned with performance?
[0] https://www.agner.org/optimize/instruction_tables.pdf [1] https://stackoverflow.com/questions/6498972/faster-equivalent-of-gettimeofday
At which point are we concerned with performance?
When the rate at which generating UUIDs becomes an issue for people using them.
While I'll grant this is anecdotal, the NPM uuid library typically benchmarks in the 100K-1M UUIDs/second range depending on version and platform. And RFC4122 actually imposes a hard limit of 10M uuids/second. It requires implementations to throw an error if that's exceeded, which we do. In the 10+ years I've been maintaining this project, I've only had one report of that limit being hit, and that was in the context of benchmarking, not an actual production use case.
So in absolute terms I think it's safe to say that generation times on the order of 1000 nanoseconds are at the extreme limit of what's expected. In real world use cases UUIDs are generated in contexts where other more performance-intensive operations are the gating factor. Hence, why a few 10's of nanoseconds aren't a concern.
For what it's worth, the performance work we've done on uuid has always been "academic" in nature, i.e. performance for performance's sake, not because it's actually required. We've reached a point where I reject performance PRs on the basis of code readability and maintainability, because nobody cares about performance at this level of granularity.
Here is a quote from the section 2. Motivation of the renewed standard:
The UUID generation algorithm described here supports very high allocation rates of 10 million per second per machine or more if necessary, so that they could even be used as transaction IDs.
This means a generation time of 100 nanoseconds per UUID. Therefore, tens of nanoseconds for the division operation is too much.
Taking a bit more time to figure out what it is that's being asked for here, I think I understand the issue. To answer my own question ("what was the shortcoming?"), the concern seems to be that the spec doesn't explicitly allow for deliberately sloppy timestamp computations, such as dividing by 1024 instead of 1000 to convert from microseconds to milliseconds.
Assuming that's a fair synopsis, I would ask that we not take this new verbiage. The existing verbiage acknowledges that system clocks are known to be imperfect, while conveying the desire for accuracy. It's a good balance of "timestamps need to be accurate" and "timestamps are known to be inaccurate".
The proposed verbiage carries a different tone. It tacitly endorses implementations that are deliberately inaccurate, which I think is a mistake. And it does so for performance reasons that are (imho) well beyond the margins of actual user concern.
On the topic of performance ...
Here is a quote from the section 2. Motivation of the renewed standard
2. Motivation is just the original verbiage from RFC4122. It's included for (I assume) historical reference... and as it turns out, it is incorrect(!) Version 1 does not allow for "10 million per second ... or more". It allows for at most 10M/second.
[@kyzer-davis: Can you confirm the original RFC motivation is misleading in this regard? I don't see any provisions in 4122 that allow for generators to support rates greater than 1 uuid/timestamp interval. Would this be worth calling out or revising in the new spec?]
Thus, "10 million per second" isn't a requirement. It's an upper limit. And it's a theoretical limit based on the nature of the v1 (and v6 by extension) algorithms. None of the other versions have such a limit. v4, v5, v7? Those algorithms will let you crank out a trillion uuids/second if you can figure out how to do that. So should the spec concern itself with what might be needed to achieve picosecond-level optimizations?
The existing verbiage acknowledges that system clocks are known to be imperfect, while conveying the desire for accuracy.
This claim is based on nothing and is misleading. This is not in the standard. This is an irresponsible speculation.
The sole purpose of the timestamp is to order UUIDs by generation time. It does not matter how much the timestamp differs from real time. Moreover, the standard calls for distorting the timestamp when it is necessary for information security. For example, you can add or subtract several years to hide the true moment of generation. And the standard does not limit the motives for which the timestamp can be deliberately distorted.
Thus, "10 million per second" isn't a requirement. It's an upper limit.
This statement grossly contradicts the text of the standard.
This claim is based on nothing and is misleading. This is not in the standard. This is an irresponsible speculation.
Inflammatory, a little offensive, and nothing of substance to respond to.
The sole purpose of the timestamp is to order UUIDs by generation time
Then we definitely shouldn't allow non-standard time intervals. Doing so causes ordering to become a hot mess as soon as uuids from different generators start to commingle.
For example, take a uuid generated now (2023-08-06T13:55:18.285Z) from a generator using the proposed 1 µs/1024 time interval and insert it into a list of uuids generated with the standard 1 ms interval, and it will sort as though it was generated in May of last year (2022-05-04T18:39:28.637Z, to be specific).
That's going to wreak havoc with the stated goal of better DB locality.
Thus, "10 million per second" isn't a requirement. It's an upper limit.
This statement grossly contradicts the text of the standard.
Prove it.
Here's the v1 implementation for uuid. It's been written to throw a hard error if callers attempt to create more than 10M uuids/second, because that's what the spec demands. If we have that wrong then please put up a PR that corrects this. I would genuinely welcome that. You'd be doing the [very large] uuid community a pretty big service by getting rid of that throw statement.
Distinctly not holding my breath on this, however.
Folks, just my 2 cents. "DB locality" == locality in one node; it's impossible to have time-ordered locality across many nodes. The "10M upper limit" sounds like "640K ought to be enough for anybody". Even with the 10M upper limit, the division instruction could account for ~10% of the CPU cost. Yes, in JS you do not care about it, and you will hardly ever reach 1B uuids per second. But a performance-oriented system can get to these numbers easily. Also, think about low-power IoT devices: in that world, performance translates directly into energy costs. It's better to save the ~10% spent on such a frequent task as uuid generation.
Also, from a DB standpoint, mixing series of UUIDs with seriously different clocks won't have any impact on performance. The crucial part is that the working set of new data is small. When data is generated randomly across the keyspace of the whole dataset, that's a problem. If we have 13 data sources, one dividing by 1000, another by 1002, one more by 1004 ... 1024, this won't affect cache locality much.
I think the benefit of optimizing the generation of timestamps by avoiding division is of little relevance. The test below shows that the difference between division and shift (division by a power of 2) can be negligible compared to the duration to get the current time.
Measure the duration to get 1 billion milliseconds using div and shift separately
Test result:
user@computer:~$ gcc -O0 measure_duration_get_milliseconds.c && ./a.out
duration to get seconds (baseline): 16.958577 s
duration to get milliseconds using div: 17.956647 s
duration to get milliseconds using shift: 17.617555 s
Also, in the long run, the difference between the timestamps produced with division and shift will get bigger and bigger. Currently, the difference is more than 400 days, considering the test below.
Measure the distance between milliseconds using div and milliseconds using shift
Test result:
user@computer:~$ gcc -O0 measure_distance_get_milliseconds.c && ./a.out
timestamp using div: 1691404335877 ms
timestamp using shift: 1731998039876 ms
distance in seconds: 40593703 s
distance in days: 469 days
Assembly difference between division and shift in the tests:
user@computer:~$ diff get_milliseconds_with_div.s get_milliseconds_with_shift.s
23,33c23,27
< imulq $1000, %rax, %rsi
< movq -24(%rbp), %rcx
< movabsq $4835703278458516699, %rdx
< movq %rcx, %rax
< imulq %rdx
< movq %rdx, %rax
< sarq $18, %rax
< sarq $63, %rcx
< movq %rcx, %rdx
< subq %rdx, %rax
< addq %rsi, %rax
---
> salq $10, %rax
> movq %rax, %rdx
> movq -24(%rbp), %rax
> sarq $20, %rax
> orq %rdx, %rax
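In C, the two routines whose assembly is diffed above presumably correspond to something like the following (a reconstruction from the instruction sequences and the file names in the diff; the clock_gettime(CLOCK_REALTIME, ...) call is an assumption, since the source files themselves are not shown here):

#include <stdint.h>
#include <time.h>

/* Division-based conversion: imulq $1000 plus the magic-constant divide by 1000000. */
int64_t get_milliseconds_with_div(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_REALTIME, &ts);
    return ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

/* Shift-based approximation: salq $10, sarq $20, orq. */
int64_t get_milliseconds_with_shift(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_REALTIME, &ts);
    return (ts.tv_sec << 10) | (ts.tv_nsec >> 20);
}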
I have not been checking this repo so I apologize for missing this thread.
@broofa, the old doc said:
The UUID generation algorithm described here supports very high allocation rates of up to 10 million per second per machine if necessary, so that they could even be used as transaction IDs.
The latest new doc says:
The UUID generation algorithm described here supports very high allocation rates of 10 million per second per machine or more if necessary, so that they could even be used as transaction IDs.
The key point is that it is now uncapped and 10M is just an example. I don't remember exactly when this was changed, but I believe the idea is that one could create way more than this with modern computers. After all, that metric may just be something from early-2000s-era computing rather than a hard limit.
As for dividing a microsecond by 1024 vs 1000 for performance reasons: I see no issue adding something to the "Altering, Fuzzing, or Smearing:" sub-section that gives an example of another reason one might modify the timestamp.
Some proposed text:
Some examples include security considerations around providing a real clock value within a UUID, to correct inaccurate clocks, to handle leap seconds, or for performance reasons such as dividing a microsecond by 1024 (or some other value) instead of 1000 to obtain a millisecond value. This specification makes no requirement or guarantee about how close the clock value needs to be to the actual time.
Just to note, we just concluded IETF last call so I am not sure what changes I can realistically add that won't reset the AD reviews. Let me run this by the WG Chairs and get their opinion.
My take on this is that dividing by 1024 for performance reasons should be allowed by the spec, but not encouraged. From an interoperability perspective if something is actually using the timestamp or needs the ordering provided by whatever the best clock source is, then this is problematic. But, that said, one of the goals of this new spec is to allow implementations to make these kinds of tradeoffs - if the interoperability in terms of timestamp accuracy just isn't a concern for your application, then sure, use 1024ths of a second instead of milliseconds. But I don't think we should promote this idea in the spec because it's a slippery slope - why not change the timestamp entirely to some other numbering system, and if you did, is that still UUIDv7?
Just from a quick playground example (https://go.dev/play/p/3_dA3xapsaR): the date 2023-09-15 00:00:00, when converted to a unix timestamp, divided by 1024 instead of 1000, and then converted back to a date, becomes 2022-06-12 06:33:45 - over a year earlier. The timestamp becomes mostly meaningless at this point, at least in terms of interoperability.
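The same arithmetic can be checked in C (a standalone sketch of my own, rather than the Go playground; the epoch value for 2023-09-15 00:00:00 UTC is hard-coded):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* 2023-09-15 00:00:00 UTC expressed as microseconds since the Unix epoch. */
    uint64_t us = UINT64_C(1694736000000000);

    uint64_t ms_1000 = us / 1000; /* 1694736000000 -> 2023-09-15 00:00:00 */
    uint64_t ms_1024 = us / 1024; /* 1655015625000 -> 2022-06-12 06:33:45 */

    printf("div 1000: %llu ms\n", (unsigned long long)ms_1000);
    printf("div 1024: %llu ms\n", (unsigned long long)ms_1024);
    printf("offset:   %llu seconds earlier\n",
           (unsigned long long)((ms_1000 - ms_1024) / 1000));
    return 0;
}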
So again, if someone wants to do this for their system and the interoperability concerns are acceptable to them because they are sure nobody cares about how the timestamp reads, or they control all of the parsing code, great. But I don't think we should be telling people to do this as advice.
How about this text instead:
Altering, Fuzzing, or Smearing: : Implementations MAY alter the actual timestamp. Some examples include security considerations around providing a real clock value within a UUID, to correct inaccurate clocks, or to handle leap seconds, or to improve performance when converting a system timestamp to the necessary value (e.g. improving UUID generation speed but reducing timestamp value accuracy). Implementations should document such timestamp behaviors in order to improve interoperability. This specification makes no requirement or guarantee about how close the clock value needs to be to the actual time.
if the interoperability in terms of timestamp accuracy just isn't a concern for your application, then sure, use 1024ths of a second instead of milliseconds. But I don't think we should promote this idea in the spec because it's a slippery slope - why not change the timestamp entirely to some other numbering system, and if you did, is that still UUIDv7?
I would argue that you should use a UUIDv8 in this case.
e.g. improving UUID generation speed but reducing timestamp value accuracy
I prefer the wording of @kyzer-davis as more specific; otherwise the wording of @bradleypeabody needs to be clarified. The words "timestamp value accuracy" can cause controversy, as some may think it is about minutes or hours, not years. By the way, for security reasons a deviation of several years is also a good option. I want to remind everyone that the only purpose of the timestamp is to ensure the generation of monotonically increasing keys so that records can be found quickly. And from a DB standpoint, mixing series of UUIDs with seriously different clocks won't have any impact on performance. Therefore, interoperability will not be affected. See also: 6.11 Opacity
I agree with the suggestion to document intentional timestamp deviations.
Could be somewhat off topic: tv_sec * 1000 + tv_nsec >> 20 makes some sense to me, but tv_sec << 10 + tv_nsec >> 20 doesn't, because multiplication is almost as fast as a left shift. With the former approach, the resulting variance from tv_sec * 1000 + tv_nsec / 1000000 is less than 50 milliseconds, which is usually as negligible as rounding errors. Do we really need explicit wording about this technique in the spec?
Plus, from my benchmarks, under extremely high workload the syscall overhead becomes prominent, so caching the tv_sec * 1000 + tv_nsec / 1000000 result in user land should improve the throughput way more than eliminating divisions.
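One possible reading of that caching idea, purely as an illustration (the refresh policy, names, and interval below are hypothetical, not something specified in this thread):

#include <stdint.h>
#include <time.h>

/* Hypothetical user-land cache: hit the clock only once every
   REFRESH_INTERVAL calls and reuse the converted value in between.
   Not thread-safe; a real generator would need atomics or a lock,
   plus the usual UUIDv7 monotonic counter to preserve ordering. */
#define REFRESH_INTERVAL 64

static uint64_t cached_ms;
static unsigned calls_until_refresh;

uint64_t unix_ts_ms_cached(void)
{
    if (calls_until_refresh == 0) {
        struct timespec tp;
        clock_gettime(CLOCK_REALTIME, &tp);
        cached_ms = (uint64_t)tp.tv_sec * 1000 + (uint64_t)tp.tv_nsec / 1000000;
        calls_until_refresh = REFRESH_INTERVAL;
    }
    calls_until_refresh--;
    return cached_ms;
}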
Btw, consider this approach as well: (uint64_t)tv_sec * 1000 + ((uint64_t)tv_nsec * 1073 >> 30). The constant 1073 comes from (2^20 / 1000^2) * (2^10 / 2^10) ~= 1073 / 2^10, i.e. 1/10^6 ~= 1073 / 2^30. Beware of tv_nsec overflows.
After some research, I learned that (uint64_t)tp.tv_sec * 1000 + ((uint64_t)tp.tv_nsec * UINT64_C(18014398510) >> 54) is 100% compatible with (uint64_t)tp.tv_sec * 1000 + (uint64_t)tp.tv_nsec / 1000000. The two expressions produce exactly the same value for every tv_nsec ranging from 0 to 999,999,999.
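A standalone brute-force check of that claim (my own sketch, not code from the linked benchmark):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* Exhaustively compare the multiply-shift form against plain division
       for every valid tv_nsec value (0 .. 999,999,999). */
    for (uint64_t tv_nsec = 0; tv_nsec < 1000000000; tv_nsec++) {
        uint64_t by_div   = tv_nsec / 1000000;
        uint64_t by_magic = tv_nsec * UINT64_C(18014398510) >> 54;
        if (by_div != by_magic) {
            printf("mismatch at tv_nsec = %llu\n", (unsigned long long)tv_nsec);
            return 1;
        }
    }
    printf("identical for all tv_nsec in [0, 999999999]\n");
    return 0;
}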
Plus, compilers conduct this sort of optimization automatically (see this paper for details), so when I look at a quick benchmark result, the former code is much faster than the latter with the unoptimized executable, while such a performance difference is not observable with the optimized executable.
I'd say we don't need to be concerned about divisions and can just stick to:
(uint64_t)tp.tv_sec * 1000 + (uint64_t)tp.tv_nsec / 1000000
Yup, I worried a bit about -O0 too, but thought that it made sense: relying on compiler optimizations did not strike me as a good idea. However, this optimization seems straightforward enough, now that I've seen the code in C :)
I have committed the text change under https://github.com/ietf-wg-uuidrev/rfc4122bis/commit/26f38c94afda735b42ab290b5afc84a12306634d for draft-10
Merged, and closing. Thanks group.
@kyzer-davis, is it possible to change "such as dividing a microsecond by 1024 (or some other value) instead of 1000 to obtain a millisecond value" to "such as dividing a number of microseconds by 1024 (or some other value) instead of 1000 to obtain a millisecond value"?
As a result of the attempt to implement the standard, one unfortunate shortcoming was revealed. I propose to add the following text to the section "6.1. Timestamp Considerations" in the next version of the standard:
I believe this is now in line with the standard, as the standard says:
However, developers would like reasonable computational errors to be allowed explicitly.