Open xfalcox opened 3 weeks ago
hi @xfalcox, great question. The problem with the embedding you suggest is tha 01
XOR 10
= 11
which means a distance of 2 between regions that are next to each other. I'd want a distance of 1 between each region next to each other and I couldn't figure out how to do that in 2 bits with 4 regions.
But, maybe I am missing something. Open to suggestions.
If use 2 bit for each dimension, the compression ratio is 16x instead of 32x?
@cevian what happens in your 3 region split when you have 00 and 11, i.e. the first and last regions, the xor is then 11. Do you count this as a distance of 3 or 2 or something else?
@cevian to split into 3, you can convert 00, 01, 10, 11 to 000, 001, 011, 111 before doing the XOR and then do the xor and count bits as usual to calculate the distance. Or store these as 3 bit values from the beginning and use more space to save on computations.
@cduk yes the 3 region encoding has 00 and 11 as the first and last region and the distance there is 2. This is by design, as we want regions further away to have greater distance. So all adjacent regions have a distance of 1, with the distance increasing as the regions spread out.
I was reading https://www.timescale.com/blog/how-we-made-postgresql-as-fast-as-pinecone-for-vector-data/ and your work on Statistical binary quantization is great, kudos for it!
One thing that caught my attention is that, from my understanding of the article, you are dividing the space on each dimension as follows:
Since storing it on two bits means you have four possible states, storing just 3 seems like we are missing an opportunity to store an extra state "for free" and further increase recall.
Have you considered making it work like:
Anyway, thanks for your contribution to the field. At @discourse we are using embeddings to power "Related Topics" feature on every forum we host, and I'm super interested on the possible gains on disk space saving when storing massive amounts of embeddings and binary embeddings showed great potential for us so far, see https://github.com/pgvector/pgvector/issues/521#issuecomment-2061634539.