matsui528 / nanopq

Pure python implementation of product quantization for nearest neighbor search
MIT License
323 stars 43 forks source link

Number of codewords for quantization #20

Closed samanemami closed 2 years ago

samanemami commented 2 years ago

How can we determine the ideal value for the Ks parameter?

matsui528 commented 2 years ago

Short answer: Ks=256.

Long answer: There are two hyperparameters: M and Ks. The best practice is to fix Ks=256 and change M If you need more precise encoding. For example, if you have 100-dim vectors, you can first try M=4 with Ks=256. Each code will take M * log2(Ks) = 4 * 8 = 32 bits. If the result is not satisfactory, you can try M=5 with Ks=256, resulting in 5 * 8 = 40 bits.

Note that, if you set Ks=256, each code is represented by uchar8, meaning that you can fully make use of bits without waste. If you set Ks=512, for example, then each code is represented by uchar16. Here, you're wasting bits because we can represent 2**16=65536 possibilities if we use uchar16. In that sense, if you don't want to waste your bits, there could be only three options for Ks: Ks=2**8 (uchar8), Ks=2**16 (uchar16), and Ks=2**32 (uchar32).

In advanced literature in an academic field, one can use Ks != 256, e.g., https://openaccess.thecvf.com/content_cvpr_2018/html/Douze_Link_and_Code_CVPR_2018_paper.html. But in a real-world use case, I can suggest that Ks=256 always works the best.

samanemami commented 2 years ago

Thank you so much for the complete and comprehensive explanation.