KSUIDs do not sort properly.

segmentio / ksuid

K-Sortable Globally Unique IDs

MIT License

4.82k stars 174 forks source link

On the Linux command line and in PostgreSQL, ksuids do not sort properly.

2SmasRGkif9qETAHwssB4GGYwdican be converted to the timestamp 2023-07-19 04:13:36 -0400 EDT
2STcKMQ7MGgoDxpX9A5WuaJlx8V can be converted to the timestamp 2023-07-12 10:59:06 -0400 EDT

However, PostgreSQL and the Linux sort command both sort 2SmasRGkif9qETAHwssB4GGYwdi lower than 2STcKMQ7MGgoDxpX9A5WuaJlx8V, because evidently, all lower-case letters sort before capital letters.

I'm assuming that the ksuid algorithm uses the logic that capital letters have a lower ASCII value:

Python 3.11.2 (main, May 30 2023, 17:45:26) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> ord('A')
65
>>> ord('a')
97
>>>

I love the project and use ksuids as much as possible, but this issue has bitten me at work and been the cause of some subtle bugs. Is it impossible to fix in a backwards-compatible way?

There is an interesting Gist document related the problem: PostgreSQL collation is a massive footgun.

Also check this Gist comment: Functions for generating Segment's KSUIDs on PostgreSQL. They had to change the collation on their databases from en_US.UTF-8 to C.UTF-8.

The README says that "running a set of KSUIDs through the UNIX sort command will result in a list ordered by generation time". However, the UNIX sort behaviour is affected by the locale variable, according to the sort man page:

WARNING The locale specified by the environment affects sort order. Set LC_ALL=C to get the traditional sort order that uses native byte values.

I think adding a sentence in the README that specifies the correct collation can avoid this pitfall. Perhaps a warning is more effective, like the one above from the sort manual page.

segmentio / ksuid

KSUIDs do not sort properly. #77