Pulled from most recent Valkey, removed Valkey-specific parts
53-73% faster for crc64_jones vs crc64_jones1 on Xeon 2670 v0 @ 2.6 GHz
2-2.5x faster for crc64_jones vs crc64_jones1 on Core i3 8130U @ 2.2 GHz
1.6-2.46 bytes/cycle on i3 8130U
likely >2x faster than crcspeed on newer CPUs with more resources than a 2012-era Xeon 2670
crc64 combine function typically runs in <50 nanoseconds with vector + cache optimizations (~8 microseconds without vector optimizations, ~80 microseconds without cache; the combination is extra effective)
Tried to make as few changes as possible to both the upstream code and smhasher. This seemed to be a good balance.
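
For context on the combine function mentioned above, here is a minimal sketch of what a CRC combine computes. It is not the code from this change: it uses the well-known zlib-style GF(2) matrix-squaring approach with no vector or cache optimizations. The names `crc64_bitwise` and `crc64_combine_sketch` are hypothetical, and the parameters are assumed to be the usual CRC-64/Jones ones (reflected polynomial 0x95ac9329ac4bc9b5, initial value 0, no final XOR).

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed: bit-reflected form of the Jones polynomial 0xad93d23594c935a9. */
#define POLY_REFLECTED UINT64_C(0x95ac9329ac4bc9b5)

/* Slow bit-at-a-time reference CRC (reflected, init 0, no final XOR). */
static uint64_t crc64_bitwise(uint64_t crc, const void *data, size_t len) {
    const unsigned char *p = data;
    while (len--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ ((crc & 1) ? POLY_REFLECTED : 0);
    }
    return crc;
}

/* Multiply a 64x64 GF(2) matrix (one column per entry) by a bit vector. */
static uint64_t gf2_times(const uint64_t mat[64], uint64_t vec) {
    uint64_t sum = 0;
    for (int i = 0; vec != 0; vec >>= 1, i++)
        if (vec & 1)
            sum ^= mat[i];
    return sum;
}

/* sq = mat * mat over GF(2). */
static void gf2_square(uint64_t sq[64], const uint64_t mat[64]) {
    for (int i = 0; i < 64; i++)
        sq[i] = gf2_times(mat, mat[i]);
}

/* Given crc1 = CRC(A), crc2 = CRC(B), and len2 = length of B in bytes,
 * return CRC(A || B) without touching the data again. */
static uint64_t crc64_combine_sketch(uint64_t crc1, uint64_t crc2, uint64_t len2) {
    uint64_t even[64], odd[64];

    if (len2 == 0)
        return crc1;

    /* odd = operator that advances the CRC state by one zero bit. */
    odd[0] = POLY_REFLECTED;
    for (int i = 1; i < 64; i++)
        odd[i] = UINT64_C(1) << (i - 1);

    gf2_square(even, odd); /* two zero bits */
    gf2_square(odd, even); /* four zero bits */

    /* Apply len2 zero bytes to crc1 by repeated squaring, one bit of len2
     * per step; the first square below yields the one-zero-byte operator. */
    do {
        gf2_square(even, odd);
        if (len2 & 1)
            crc1 = gf2_times(even, crc1);
        len2 >>= 1;
        if (len2 == 0)
            break;
        gf2_square(odd, even);
        if (len2 & 1)
            crc1 = gf2_times(odd, crc1);
        len2 >>= 1;
    } while (len2 != 0);

    return crc1 ^ crc2;
}
```

Under these assumptions, `crc64_combine_sketch(crc64_bitwise(0, A, lenA), crc64_bitwise(0, B, lenB), lenB)` should equal the CRC of A followed by B. Rebuilding the matrices on every call is the kind of repeated work a cache would avoid.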