CRC64 Jones polynomial with improvements

Pulled from most recent Valkey, removed Valkey-specific parts
53-73% faster for crc64_jones vs crc64_jones1 on Xeon 2670 v0 @ 2.6 GHz
2-2.5x faster for crc64_jones vs crc64_jones1 on Core i3 8130U @ 2.2 GHz
1.6-2.46 bytes/cycle on i3 8130U
likely >2x faster than crcspeed on newer CPUs with more resources than a 2012-era Xeon 2670
crc64 combine function typically runs in <50 nanoseconds with vector + cache optimizations (~8 microseconds without vector optimizations, ~80 microseconds without cache; the two together are especially effective)
still single-threaded
Variations of crccombine.c are available as starting points for non-Intel architectures: https://github.com/josiahcarlson/redis/commit/55642fea796b14a7f58b923f0900447d2cf00968#diff-046412072aa4e87484754f261116ea6501d2350747ac1379530da9584834efdd
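
For reviewers unfamiliar with what the combine function computes, here is a minimal reference sketch. It is not the code in this PR: the function names `crc64_jones_bitwise` and `crc64_combine_naive` are hypothetical, the CRC is computed one bit at a time, and the combine step simply feeds zero bytes instead of using the vector + cache approach. It assumes the Redis/Valkey convention for this CRC (reflected Jones polynomial, initial value 0, no final XOR), under which the CRC is linear and two CRCs can be merged without re-reading the data.

```c
#include <stddef.h>
#include <stdint.h>

/* Bit-reflected form of the Jones polynomial 0xad93d23594c935a9. */
#define CRC64_JONES_POLY_REFLECTED UINT64_C(0x95ac9329ac4bc9b5)

/* Hypothetical name. Bit-at-a-time reference CRC64 (slow, for checking only).
 * Assumes initial value 0 and no final XOR; pass 0 as crc for a fresh CRC,
 * or a previous result to continue over more data. */
static uint64_t crc64_jones_bitwise(uint64_t crc, const void *buf, size_t len) {
    const unsigned char *p = buf;
    while (len--) {
        crc ^= *p++;
        for (int bit = 0; bit < 8; bit++) {
            /* mask is all ones when the low bit is set, else zero */
            uint64_t mask = (uint64_t)0 - (crc & 1);
            crc = (crc >> 1) ^ (CRC64_JONES_POLY_REFLECTED & mask);
        }
    }
    return crc;
}

/* Hypothetical name. Naive combine: given crc1 = CRC(A), crc2 = CRC(B) and
 * len2 = length of B in bytes, returns CRC(A || B).
 * Runs in O(len2); an optimized combine replaces the zero-feeding loop
 * with a much faster equivalent. */
static uint64_t crc64_combine_naive(uint64_t crc1, uint64_t crc2, size_t len2) {
    static const unsigned char zeros[64] = {0};
    while (len2 > 0) {
        size_t n = len2 < sizeof zeros ? len2 : sizeof zeros;
        crc1 = crc64_jones_bitwise(crc1, zeros, n); /* advance CRC(A) past B's length */
        len2 -= n;
    }
    return crc1 ^ crc2; /* linearity: shifted CRC(A) XOR CRC(B) */
}
```

The zero-feeding loop is the part that makes a naive combine scale with the length of the second buffer; removing that dependence is what makes a fast combine worthwhile.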

Tried to make as few changes as possible to both the upstream code and smhasher. This seemed to be a good balance.
