timescale / pgvectorscale

A complement to pgvector for high performance, cost efficient vector search on large workloads.
PostgreSQL License
610 stars 23 forks source link

Why 2 bit mode SBQ divides into 3 regions instead of 4 ? #90

Open xfalcox opened 3 weeks ago

xfalcox commented 3 weeks ago

I was reading https://www.timescale.com/blog/how-we-made-postgresql-as-fast-as-pinecone-for-vector-data/ and your work on Statistical binary quantization is great, kudos for it!

One thing that caught my attention is that, from my understanding of the article, you are dividing the space on each dimension as follows:

|            00                 |               01               |             11                |
|-------------------------------|------------ median ------------|-------------------------------|
                        (median - stdev)                 (median + stdev)

Since storing it on two bits means you have four possible states, storing just 3 seems like we are missing an opportunity to store an extra state "for free" and further increase recall.

Have you considered making it work like:

|            00                 |       01      |       10       |             11                |
|-------------------------------|------------ median ------------|-------------------------------|
                        (median - stdev)                 (median + stdev)

Anyway, thanks for your contribution to the field. At @discourse we are using embeddings to power "Related Topics" feature on every forum we host, and I'm super interested on the possible gains on disk space saving when storing massive amounts of embeddings and binary embeddings showed great potential for us so far, see https://github.com/pgvector/pgvector/issues/521#issuecomment-2061634539.

cevian commented 3 weeks ago

hi @xfalcox, great question. The problem with the embedding you suggest is tha 01 XOR 10 = 11 which means a distance of 2 between regions that are next to each other. I'd want a distance of 1 between each region next to each other and I couldn't figure out how to do that in 2 bits with 4 regions.

But, maybe I am missing something. Open to suggestions.

VoVAllen commented 3 weeks ago

If use 2 bit for each dimension, the compression ratio is 16x instead of 32x?

cduk commented 3 weeks ago

@cevian what happens in your 3 region split when you have 00 and 11, i.e. the first and last regions, the xor is then 11. Do you count this as a distance of 3 or 2 or something else?

cduk commented 3 weeks ago

@cevian to split into 3, you can convert 00, 01, 10, 11 to 000, 001, 011, 111 before doing the XOR and then do the xor and count bits as usual to calculate the distance. Or store these as 3 bit values from the beginning and use more space to save on computations.

cevian commented 3 weeks ago

@cduk yes the 3 region encoding has 00 and 11 as the first and last region and the distance there is 2. This is by design, as we want regions further away to have greater distance. So all adjacent regions have a distance of 1, with the distance increasing as the regions spread out.