Comparing `__uint128_t`s generates significantly better code than comparing pairs of `uint64_t`s

llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.

http://llvm.org

Other

28k stars 11.56k forks source link

Comparing `__uint128_t`s generates significantly better code than comparing pairs of `uint64_t`s #108418

Open pkasting opened 5 days ago

pkasting commented 5 days ago

See https://godbolt.org/z/1ajsxs8n5 for examples.

On x86-64 -O2, any implementation of comparing pairs of uint64_ts seems to generate noticeably worse code (along all axes) compared to using single __uint128_ts. Seems like the former could be optimized into a form close to the latter in at least many cases.

llvmbot commented 5 days ago

@llvm/issue-subscribers-backend-x86

Author: None (pkasting)

See https://godbolt.org/z/1ajsxs8n5 for examples. On x86-64 -O2, any implementation of comparing pairs of `uint64_t`s seems to generate noticeably worse code (along all axes) compared to using single `__uint128_t`s. Seems like the former could be optimized into a form close to the latter in at least many cases.

sesse commented 5 days ago

Same thing for Aarch64, so I believe this isn't architecture-specific.

Basically what it's testing is a.x < b.x + (a.y < b.y), except that if you write that in C yourself, you'll be hit by integer wraparound if b.x == UINT64_MAX. (It's really surprising to me that this works; you would think that you can't just randomly add 1 sometimes to the right side of the equation, but the only case where it actually matters is indeed when a.x == b.x and that's precisely when you want it.)

pkasting commented 1 day ago

Clarity: The godbolt link above assumes the most significant 64 bit word is stored in the first member of the pair/struct, regardless of CPU endianness, so that lexicographic comparison order on the pair still gives the equivalent result to arithmetic comparison on a corresponding __uint128_t. That is, this code is not suitable for e.g. "type pun this __uint128_t in memory to the struct and then run the relevant code on it". You could write code that was suitable for that, of course.