Open ShawnMilo opened 11 months ago
There is an interesting Gist document related the problem: PostgreSQL collation is a massive footgun.
Also check this Gist comment: Functions for generating Segment's KSUIDs on PostgreSQL. They had to change the collation on their databases from en_US.UTF-8
to C.UTF-8
.
The README says that "running a set of KSUIDs through the UNIX sort
command will result in a list ordered by generation time". However, the UNIX sort
behaviour is affected by the locale variable, according to the sort
man page:
WARNING The locale specified by the environment affects sort order. Set LC_ALL=C to get the traditional sort order that uses native byte values.
I think adding a sentence in the README that specifies the correct collation can avoid this pitfall. Perhaps a warning is more effective, like the one above from the sort
manual page.
On the Linux command line and in PostgreSQL, ksuids do not sort properly.
2SmasRGkif9qETAHwssB4GGYwdi
can be converted to the timestamp2023-07-19 04:13:36 -0400 EDT
2STcKMQ7MGgoDxpX9A5WuaJlx8V
can be converted to the timestamp2023-07-12 10:59:06 -0400 EDT
However, PostgreSQL and the Linux
sort
command both sort2SmasRGkif9qETAHwssB4GGYwdi
lower than2STcKMQ7MGgoDxpX9A5WuaJlx8V
, because evidently, all lower-case letters sort before capital letters.I'm assuming that the ksuid algorithm uses the logic that capital letters have a lower ASCII value:
I love the project and use
ksuid
s as much as possible, but this issue has bitten me at work and been the cause of some subtle bugs. Is it impossible to fix in a backwards-compatible way?