Closed — kaancfidan closed this pull request 11 months ago
I just realized the changes in the C implementations broke the qdrant Windows build; I'm trying to understand why cl.exe can't build them.
I had seen the `l1_avx` is not used warning, but it is being used in the benchmarks. I could remove the benchmarks and the implementation, of course, but do you have a better idea?
I have realized that this L1 distance is currently not compatible with binary quantization. I mostly followed what had been done for L2, but I suspect L2 is also incompatible with it.
Edit: On second thought, I think it may already be compatible, since the current code uses XOR when `invert = true`, and the truth tables match perfectly (for binary vectors, L1 is exactly the Hamming distance):
A | B | Dot product | L1 |
---|---|---|---|
-0.5 | -0.5 | 0.25 | 0 |
-0.5 | 0.5 | -0.25 | 1 |
0.5 | -0.5 | -0.25 | 1 |
0.5 | 0.5 | 0.25 | 0 |
A | B | NXOR | XOR |
---|---|---|---|
0 | 0 | 1 | 0 |
0 | 1 | 0 | 1 |
1 | 0 | 0 | 1 |
1 | 1 | 1 | 0 |
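The XOR column above can be sketched in code. This is an illustrative snippet (not the qdrant implementation): for vectors quantized to one bit per dimension, the L1/Hamming distance between two bit-packed words is just the popcount of their XOR, matching the truth table.

```rust
// Hamming distance over bit-packed vectors: XOR flags the differing bits,
// count_ones (popcount) counts them. For binary-quantized vectors this
// equals the L1 distance, per the truth table above.
fn hamming(a: &[u8], b: &[u8]) -> u32 {
    a.iter()
        .zip(b)
        .map(|(&x, &y)| (x ^ y).count_ones())
        .sum()
}

fn main() {
    // 0b1010 vs 0b0110 differ in two bit positions -> distance 2
    assert_eq!(hamming(&[0b1010], &[0b0110]), 2);
    // identical vectors -> distance 0
    assert_eq!(hamming(&[0xFF, 0x00], &[0xFF, 0x00]), 0);
}
```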
I have fixed both L1 and L2 distances and added tests for both in case of binary quantization.
I have run a benchmark with real data using Euclid + binary quantization (using qdrant itself). It did not work well. I need to investigate and understand the difference between the toy test data and the real one.
I'll also check L1 tomorrow. Meanwhile, I have set the status to draft again.
I have been banging my head on this all day.
I am pretty sure the L1 and L2 distances and their inverses (similarities) are calculated properly. I have gone through the unit tests, added integration tests that use binary and scalar quantization on qdrant, etc. They all work as I expect.
Nevertheless, when I run our benchmark image vector dataset with this implementation, I still get 0% accuracy with L1 and L2 using binary quantization, while dot product gets about 84%.
I will let it go if I can justify it as "our problem is just not compatible with binary quantization" but the fact that dot product works annoys me to no end. I am open to suggestions.
OK, I had an epiphany: the binary `calculate_metric` method does not need to calculate the correct metric (dot, L1, or L2); it just has to approximate its behavior and rank the vectors in the same order. The actual score is calculated during rescoring anyway.
I have relaxed the L1 and L2 binary tests to only check the similarity order, not the actual score values. This way I could see that the previous dot-product-based implementation still ranks correctly for both L1 and L2; it was just the inverse of what it was supposed to be. The many layers of inversion confused me too.
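The insight above can be sketched as follows. This is a hypothetical illustration (not the crate's actual code): the approximate binary score, here a Hamming distance over sign bits, does not match the true L1 values, but it ranks candidates in the same order, which is all the pre-rescoring pass needs.

```rust
// True L1 (Manhattan) distance between float vectors.
fn l1(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y).abs()).sum()
}

// One-bit quantization: keep only the sign of each component.
fn sign_bits(v: &[f32]) -> Vec<u8> {
    v.iter().map(|&x| (x > 0.0) as u8).collect()
}

// Approximate score: Hamming distance between sign-bit vectors.
fn hamming(a: &[u8], b: &[u8]) -> u32 {
    a.iter().zip(b).filter(|(x, y)| x != y).count() as u32
}

fn main() {
    let query = [0.9_f32, -0.3, 0.4];
    let points = [
        [1.0_f32, -0.5, 0.5], // closest to the query
        [-1.0, 0.5, 0.5],     // farthest
        [1.0, 0.5, 0.5],      // in between
    ];
    let qb = sign_bits(&query);

    // Rank candidates by true L1 distance and by the approximate score.
    let mut by_l1: Vec<usize> = (0..points.len()).collect();
    by_l1.sort_by(|&i, &j| {
        l1(&points[i], &query)
            .partial_cmp(&l1(&points[j], &query))
            .unwrap()
    });
    let mut by_ham: Vec<usize> = (0..points.len()).collect();
    by_ham.sort_by_key(|&i| hamming(&sign_bits(&points[i]), &qb));

    // Same order, even though the raw score values differ.
    assert_eq!(by_l1, by_ham);
}
```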
I apologize to the reviewers that they had to go through my learning experience with me. :)
I will run the large benchmark with this latest version as soon as I can build a docker image.
edit: aaaand we have a winner! 🍾
Great work, I'm really impressed! The scalar quantization `alpha` and `multiplier` values are correct, and the fixes for the non-L1-related bugs are welcome. I just added some requests for the C code. Everything else looks perfect!
This pull request is related to https://github.com/qdrant/qdrant/issues/3052.
It implements the L1 distance type, to be used by the matching Manhattan distance implementation in qdrant/qdrant.
I have tried to make the changes as non-disruptive as possible, but since the L1 distance cannot be reduced to the dot product as easily as L2, some branching between the scoring implementations was necessary.
I have added tests and benchmarks and verified that the implementation works for the simple, AVX, SSE, and Neon instruction sets.
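For readers unfamiliar with the metric being added: a minimal sketch of a portable (non-SIMD) L1 kernel is shown below. The function name is illustrative and is not the actual crate API; the real implementation also provides AVX, SSE, and Neon variants of the same loop.

```rust
// Plain L1 (Manhattan) distance: sum of absolute component differences.
// Simple autovectorizable loop; SIMD variants compute the same result.
pub fn l1_distance(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y).abs()).sum()
}

fn main() {
    let a = [1.0_f32, 2.0, 3.0];
    let b = [2.0_f32, 0.0, 3.5];
    // |1-2| + |2-0| + |3-3.5| = 1 + 2 + 0.5 = 3.5
    assert_eq!(l1_distance(&a, &b), 3.5);
}
```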