Optimize PFCOUNT, PFMERGE command by SIMD acceleration

Nugine commented 1 week ago

WIP: unit tests

This PR optimizes the performance of HyperLogLog commands (PFCOUNT, PFMERGE) by adding AVX2 fast paths.

Two AVX2 functions are added for conversion between raw representation and dense representation. They are 15 ~ 30 times faster than scalar implementaion. Note that sparse representation is not accelerated.

AVX2 fast paths are enabled when the CPU supports AVX2 (checked at runtime) and the hyperloglog configuration is default (HLL_REGISTERS == 16384 && HLL_BITS == 6).

When merging 3 dense hll structures, the benchmark shows a 12x speedup compared to the scalar version.

pfcount key1 key2 key3
pfmerge keyall key1 key2 key3

======================================================================================================
Type             Ops/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec 
------------------------------------------------------------------------------------------------------
PFCOUNT-scalar    5665.56        35.29839        32.25500        63.99900        67.58300       608.60
PFCOUNT-avx2     72377.83         2.75834         2.67100         5.34300         6.81500      7774.96
------------------------------------------------------------------------------------------------------
PFMERGE-scalar    9851.29        20.28806        20.09500        36.86300        39.16700       615.71
PFMERGE-avx2    125621.89         1.59126         1.55100         3.11900         4.70300     15702.74
------------------------------------------------------------------------------------------------------

scalar: valkey:unstable  2df56d87c0ebe802f38e8922bb2ea1e4ca9cfa76
avx2:   Nugine:hll-simd  8f9adc34021080d96e60bd0abe06b043f3ed0275

CPU:    13th Gen Intel® Core™ i9-13900H × 20
Memory: 32.0 GiB
OS:     Ubuntu 22.04.5 LTS

Experiment repo: https://github.com/Nugine/redis-hyperloglog Benchmark script: https://github.com/Nugine/redis-hyperloglog/blob/main/scripts/memtier.sh Algorithm: https://github.com/Nugine/redis-hyperloglog/blob/main/cpp/bench.cpp

codecov[bot] commented 1 week ago

Codecov Report

Attention: Patch coverage is 52.80899% with 42 lines in your changes missing coverage. Please review.

Project coverage is 70.70%. Comparing base (2df56d8) to head (a6136ca). Report is 10 commits behind head on unstable.

Files with missing lines	Patch %	Lines
src/hyperloglog.c	52.80%	42 Missing :warning:

Additional details and impacted files

```diff @@ Coverage Diff @@ ## unstable #1293 +/- ## ============================================ + Coverage 70.69% 70.70% +0.01% ============================================ Files 114 115 +1 Lines 63161 63233 +72 ============================================ + Hits 44650 44710 +60 - Misses 18511 18523 +12 ``` | [Files with missing lines](https://app.codecov.io/gh/valkey-io/valkey/pull/1293?dropdown=coverage&src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=valkey-io) | Coverage Δ | | |---|---|---| | [src/hyperloglog.c](https://app.codecov.io/gh/valkey-io/valkey/pull/1293?src=pr&el=tree&filepath=src%2Fhyperloglog.c&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=valkey-io#diff-c3JjL2h5cGVybG9nbG9nLmM=) | `86.35% <52.80%> (-4.89%)` | :arrow_down: | ... and [22 files with indirect coverage changes](https://app.codecov.io/gh/valkey-io/valkey/pull/1293/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=valkey-io)

xbasel commented 1 week ago

How was this change tested, particularly to confirm that the non-AVX2 and AVX2 implementations produce the same results?

Nugine commented 1 week ago

How was this change tested, particularly to confirm that the non-AVX2 and AVX2 implementations produce the same results?

The algorithms are verified by comparing the results between scalar code and simd code with random input. https://github.com/Nugine/redis-hyperloglog/blob/main/cpp/bench.cpp

zuiderkwast commented 4 days ago

This is so cool! :sunglasses:

How was this change tested, particularly to confirm that the non-AVX2 and AVX2 implementations produce the same results?

The algorithms are verified by comparing the results between scalar code and simd code with random input. https://github.com/Nugine/redis-hyperloglog/blob/main/cpp/bench.cpp

I think we need to test this in our repo in some way. The binary representation can't change, because things like replicas will not understand it, so we should verify the binary representation.

A hyperloglog key is actually a string so we can use the GET command to get the binary representation. In a TCL test case, we can use GET and compare the reply to the binary data we have stored earlier. Alternatively we can use DUMP. Can you add it?

zuiderkwast commented 3 days ago

@lipzhu You're the performance expert. Do you want to review this?

Nugine commented 2 days ago

WIP: unit tests for verifying the results produced by non-AVX2 and AVX2 implementations.

I have other things to do these days. So it may take a while.

valkey-io / valkey

Optimize PFCOUNT, PFMERGE command by SIMD acceleration #1293

Codecov Report