microsoft / ALEX

A library for building an in-memory, Adaptive Learned indEX
MIT License
657 stars 111 forks source link

BUGs in Zipfian generator #34

Open Losk-x opened 1 week ago

Losk-x commented 1 week ago

ZETAN is not a constant

https://github.com/microsoft/ALEX/blob/4370da6aa8b509fdc9b0d2c49faa0624b0078589/src/benchmark/zipf.h#L10

Though this line is directly copied from ycsb, it neglects the fact that ZETAN is determined by the numkeys and ZIPFIAN_CONSTANT. In ycsb's implementation, the numkeys is also fixed, so the ZETAN can be a constant. See:

https://github.com/brianfrankcooper/YCSB/blob/dbe92a1fbd24be061fcf01e56838afb83e6f9c58/core/src/main/java/site/ycsb/generator/ScrambledZipfianGenerator.java#L35

https://github.com/brianfrankcooper/YCSB/blob/dbe92a1fbd24be061fcf01e56838afb83e6f9c58/core/src/main/java/site/ycsb/generator/ZipfianGenerator.java#L244

Hash function's keyspace may be too small

https://github.com/microsoft/ALEX/blob/4370da6aa8b509fdc9b0d2c49faa0624b0078589/src/benchmark/zipf.h#L60

Using a fnv1a-32 is not consistent with yscb (fnv1a-64). It may cause hash collision and make keyspace not fully-covered when numkeys is closed with 2^32-1.

ZEMINGMA commented 1 week ago

已收到,感谢来信。祝好!

Losk-x commented 1 week ago

Test result of the wrong generator in 200M dataset: 10B keys are generated, and only 62.5% can be covered.