Hi @keshusharmamrt! Can you please try the Cosine distance and share the dtype and shape of the embeddings?
Hi @ashvardanian, thanks for the prompt reply. I tried the Cosine distance too, but it also doesn't seem to work correctly; I was only getting the expected keys via MetricKind.L2sq. Also, using this approach, all the distances I get are nearly zero.
Sharing a screenshot of the same. The details of the embeddings are as follows:
I think the issue might be in the distribution that produces those embeddings. USearch has no learned quantization; all it does is linear scaling from [-1.0, 1.0] to integers in [-100, 100].
If you know that your values won't scale properly into that range, you may want to normalize them somehow, or just convert them to int8 directly and pass that to USearch. That should work.
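For illustration, one way the suggested pre-scaling could look is sketched below. This is a minimal example, not a USearch utility; the sizes, the symmetric max-abs scale factor, and the variable names are all assumptions:

```python
import numpy as np
from usearch.index import Index

embeddings = np.random.randn(10_000, 256).astype(np.float32)  # stand-in for real embeddings

# Scale symmetrically so every component lands in [-1, 1] before the
# library's linear mapping onto the integer range [-100, 100].
scale = np.abs(embeddings).max()
embeddings_scaled = embeddings / scale

index = Index(ndim=embeddings.shape[1], metric="cos", dtype="i8")
index.add(np.arange(len(embeddings_scaled)), embeddings_scaled)
matches = index.search(embeddings_scaled[:5], 10)
print(matches.keys)
```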
PS: I'd appreciate a PR for the main README, if you think it makes sense to list it there ๐ค
Hi, thanks for your suggestion, but it doesn't seem to work either.
Sharing an example code snippet that demonstrates the issue (using random uniform embeddings):
```python
import numpy as np
from usearch.index import Index, MetricKind

np.set_printoptions(formatter={"float": "{: 0.4f}".format})
np.random.seed(42)

num_centroids = 200
num_examples = 100
num_embeddings = num_examples * num_centroids
dim = 1024

centroids = np.random.uniform(low=-1, high=1, size=(num_centroids, dim))
offsets = np.random.uniform(low=-0.05, high=0.05, size=(num_embeddings, dim))
centroids = np.tile(centroids, (num_examples, 1))
embeddings = centroids + offsets
embeddings = embeddings / np.linalg.norm(embeddings, axis=1)[:, None]

for dtype in ["f32", "i8"]:
    print("Running test for dtype = ", dtype)
    for metric in [MetricKind.IP, MetricKind.Cosine, MetricKind.L2sq]:
        print("Running test ", dtype, metric)
        searcher = Index(
            ndim=dim,
            metric=metric,
            dtype=dtype,
            connectivity=32,
            expansion_add=128,
            expansion_search=64,
        )
        searcher.add(range(len(embeddings)), embeddings, log=True)
        matches = searcher.search(embeddings[:5], 4)
        print("Keys = \n", matches.keys)
        print("Distances = \n", matches.distances)

        exact_distances = 1 - (embeddings[:5] @ embeddings.T)
        exact_distances.sort(axis=1)
        if metric == MetricKind.L2sq:
            exact_distances = 2 * exact_distances
        print("Exact Distances = \n", exact_distances[:, :4])
```
I get the following observations from running the above code:
Is this expected when quantizing? If one is quantizing the embeddings, shouldn't one prefer the Hamming distance?

> Binarization is to compress a full-precision weight into a binary code that only occupy 1 bit. It is widely used in the embedding-based similarity search of recommender systems [28, 54], since the binary embeddings have two distinct advantages compared to the full-precision ones: (1) less memory or disk cost for storing embeddings; (2) higher inference efficiency as the similarity (i.e., inner product) between binary embeddings can be calculated more efficiently through the Hamming distance, which has been proved in [73]

Reference: Li, S., Guo, H., Tang, X., Tang, R., Hou, L., Li, R., & Zhang, R. (2024). Embedding Compression in Recommender Systems: A Survey. ACM Computing Surveys, 56(5), 1-21.
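To make the Hamming idea concrete, here is a small self-contained sketch (plain NumPy, not tied to any particular library) of how the distance between two bit-packed binary codes can be computed by XOR-ing the bytes and counting the set bits:

```python
import numpy as np

def hamming_packed(a: np.ndarray, b: np.ndarray) -> int:
    """Hamming distance between two bit-packed uint8 vectors."""
    xor = np.bitwise_xor(a, b)
    # Count the set bits per byte by unpacking; popcount-style tricks work too.
    return int(np.unpackbits(xor).sum())

# Binarize two float vectors by thresholding at zero, then pack 8 bits per byte.
x = np.random.randn(1024)
y = np.random.randn(1024)
x_bits = np.packbits(x > 0)
y_bits = np.packbits(y > 0)

print(hamming_packed(x_bits, y_bits))  # number of dimensions where the signs differ
```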
Expanding on @keshusharmamrt's example code snippet:
```python
from sentence_transformers.quantization import quantize_embeddings

embeddings_f32 = embeddings.astype(np.float32)
embeddings_i8 = quantize_embeddings(embeddings_f32, precision="binary")

searcher1 = Index(
    ndim=embeddings_f32.shape[1],
    metric="hamming",
    dtype="i8",
    connectivity=32,
    expansion_add=128,
    expansion_search=64,
)
print("dim: ", embeddings_f32.shape[1])
searcher1.add(range(len(embeddings_f32)), embeddings_f32)

searcher2 = Index(
    ndim=embeddings_i8.shape[1],
    metric="hamming",
    dtype="i8",
    connectivity=32,
    expansion_add=128,
    expansion_search=64,
)
print("dim: ", embeddings_i8.shape[1])
searcher2.add(range(len(embeddings_i8)), embeddings_i8)

query = embeddings_f32[:5]
query_i8 = quantize_embeddings(query, precision="binary")

matches1 = searcher1.search(query, 4)
print("int8 quantization:")
print("Keys = \n", matches1.keys)
print("Distances = \n", matches1.distances)

matches2 = searcher2.search(query_i8, 4)
print("binary quantization:")
print("Keys = \n", matches2.keys)
print("Distances = \n", matches2.distances)
```
Results:

```
dim:  1024
dim:  128
int8 quantization:
Keys =
 [[    0 16000 12400   400]
  [    1 11801 14801  4801]
  [    2  2402  7002 19002]
  [    3  1403  1003  8603]
  [    4 17604  6404 14004]]
Distances =
 [[ 0.0000  309.0000  329.0000  343.0000]
  [ 0.0000  346.0000  377.0000  382.0000]
  [ 0.0000  328.0000  332.0000  341.0000]
  [ 0.0000  362.0000  366.0000  371.0000]
  [ 0.0000  334.0000  340.0000  354.0000]]
binary quantization:
Keys =
 [[    0 18000  1800 16600]
  [    1 17601 10001 18201]
  [    2  3002  8002 12402]
  [    3  7603 13003 16403]
  [    4  8804 12604  2604]]
Distances =
 [[ 0.0000  9.0000  10.0000  12.0000]
  [ 0.0000  12.0000  12.0000  12.0000]
  [ 0.0000  13.0000  14.0000  14.0000]
  [ 0.0000  5.0000  7.0000  8.0000]
  [ 0.0000  7.0000  8.0000  8.0000]]
```
Note that each query element is included in the corresponding matches and that $d(x,x) = 0$.
@keshusharmamrt, your points are too close to each other to differentiate them with a simple scale-up. You'd need a custom quantization scheme. I'd recommend sticking to `f16`. I've cross-validated the output keys with a slightly extended script:
```python
import numpy as np
from usearch.index import Index, MetricKind

np.set_printoptions(formatter={"float": "{: 0.4f}".format})
np.random.seed(42)

num_centroids = 200
num_examples = 100
num_embeddings = num_examples * num_centroids
dim = 1024

centroids = np.random.uniform(low=-1, high=1, size=(num_centroids, dim))
offsets = np.random.uniform(low=-0.05, high=0.05, size=(num_embeddings, dim))
centroids = np.tile(centroids, (num_examples, 1))
embeddings = centroids + offsets
embeddings = embeddings / np.linalg.norm(embeddings, axis=1)[:, None]

for dtype in ["f32", "f16", "i8"]:
    print("Running test for dtype = ", dtype)
    for metric in [MetricKind.IP, MetricKind.Cosine, MetricKind.L2sq]:
        searcher = Index(
            ndim=dim,
            metric=metric,
            dtype=dtype,
            connectivity=32,
            expansion_add=128,
            expansion_search=64,
        )
        print("Running test ", dtype, metric, searcher.hardware_acceleration)
        searcher.add(range(len(embeddings)), embeddings, log=True)
        matches = searcher.search(embeddings[:5], 4)
        print("Keys = \n", matches.keys)
        print("Distances = \n", matches.distances)

        exact_distances = 1 - (embeddings[:5] @ embeddings.T)
        exact_keys = exact_distances.argsort(axis=1)
        exact_distances.sort(axis=1)
        if metric == MetricKind.L2sq:
            exact_distances = 2 * exact_distances
        print("Exact Keys = \n", exact_keys[:, :4])
        print("Exact Distances = \n", exact_distances[:, :4])
```
@adolfogc, sadly the functionality of `quantize_embeddings` isn't enough to account for such small perturbations. It would work very poorly.

> To quantize float32 embeddings to binary, we simply threshold normalized embeddings at 0: if the value is larger than 0, we make it 1, otherwise we convert it to 0.
I also believe you may have wanted to use a different `precision` argument for binary indexes. It's not mentioned on the main page, but it is listed in the docs: `b1`. Here is the updated snippet:
```python
import numpy as np
from usearch.index import Index, MetricKind

np.set_printoptions(formatter={"float": "{: 0.4f}".format})
np.random.seed(42)

num_centroids = 200  # assumed, as in the earlier snippets; needed for `num_embeddings`
num_examples = 100
num_embeddings = num_examples * num_centroids
dim = 1024
embeddings = np.random.uniform(low=-1, high=1, size=(num_embeddings, dim))

from sentence_transformers.quantization import quantize_embeddings

embeddings_f32 = embeddings.astype(np.float32)
embeddings_b1 = quantize_embeddings(embeddings_f32, precision="binary").astype(np.uint8)

searcher_f32 = Index(
    ndim=dim,
    metric="cos",
    dtype="f32",
)
print("dim: ", embeddings_f32.shape[1])
searcher_f32.add(range(len(embeddings_f32)), embeddings_f32)

searcher_b1 = Index(
    ndim=dim,
    metric="hamming",
    dtype="b1",
)
print("dim: ", embeddings_b1.shape[1])
searcher_b1.add(range(len(embeddings_b1)), embeddings_b1)

queries_f32 = embeddings_f32[:5]
queries_b1 = embeddings_b1[:5]

matches_f32 = searcher_f32.search(queries_f32, 4)
print("Original:")
print("Keys = \n", matches_f32.keys)
print("Distances = \n", matches_f32.distances)

matches_b1 = searcher_b1.search(queries_b1, 4)
print("Binary:")
print("Keys = \n", matches_b1.keys)
print("Distances = \n", matches_b1.distances)
```
I think I should update the main README snippet. Will mark you as a co-author.
> I also believe you may have wanted to use a different `precision` argument for binary indexes. It's not mentioned on the main page, but it is listed in the docs: `b1`. Here is the updated snippet:
@ashvardanian thank you for pointing that out! In that case, I think the snippet could use the `ubinary` precision provided by `sentence_transformers`:

`quantize_embeddings(embeddings_f32, precision="ubinary")`
Defined like so: https://github.com/UKPLab/sentence-transformers/blob/de3b29890344139d7b0e582cc7e00f18b98dc3d3/sentence_transformers/quantization.py#L380
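For reference, my understanding of the two precisions is roughly the following; this is a paraphrased sketch, so please double-check it against the linked source rather than treating it as the library's exact code:

```python
import numpy as np

def quantize_ubinary(embeddings: np.ndarray) -> np.ndarray:
    # Threshold at zero and pack 8 dimensions per byte -> unsigned uint8 codes.
    return np.packbits(embeddings > 0, axis=-1)

def quantize_binary(embeddings: np.ndarray) -> np.ndarray:
    # Same packed bits, shifted into the signed int8 range.
    return (np.packbits(embeddings > 0, axis=-1) - 128).astype(np.int8)
```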
I thought that since the `int8` dimension was "lower", the saved index would be smaller in size, but it turned out that using `b1` also compressed the serialized index.

Also, perhaps this (using `b1` and `ubinary`) could be suggested as an improvement to the `semantic_search_usearch` function of `sentence_transformers`, here:

What do you think?
BTW, superb library, thank you for making it available!
Hey @ashvardanian, @keshusharmamrt has prepared a very simple example which shows that, for the same data, we get unexpected results when using different metrics with int8 quantization. It looks like there is a bug in the int8 quantization (maybe only in the Python client). You can see that for the `dot` metric the order of keys is random, while for `l2sq` the order is fine but the distance values are too large. Everything works as expected with f16 and f32 quantization.
We also compared usearch with faiss and qdrant using the same HNSW parameters and the `dot` metric on normalized vectors.
Connectivity | Expansion Add | Expansion Search |
---|---|---|
32 | 128 | 64 |
Our results for 1M embeddings of size 1024:
Method | Recall@1 |
---|---|
usearch f16 | 0.98672 |
usearch int8 | 0.91227 |
qdrant int8 | 0.96034 |
qdrant int8 (with rescore) | 0.98555 |
faiss int8 | 0.98278 |
For faiss, we used the index setup from the usearch benchmarks repository.
As you can see, the results for usearch int8 are far from the faiss and qdrant equivalents.
What do you think?
@adolfogc, sure, that makes a lot of sense! I've proposed the patch.
@kmkolasinski and @keshusharmamrt thank you for the benchmarks and comparisons, those are very helpful! Can I also ask you to report the `index.hardware_acceleration` output, the USearch version, as well as the OS and the CPU model? Depending on those parameters, I'll know if one of the following kernels from SimSIMD is used.
On older versions, I had made this mistake:
```c
// Unfortunately we can't use the `_mm512_dpbusd_epi32` intrinsics here either,
// as it's asymmetric with respect to the sign of the input arguments:
//      Signed(ZeroExtend16(a.byte[4*j]) * SignExtend16(b.byte[4*j]))
// So we have to use the `_mm512_dpwssd_epi32` intrinsics instead, upcasting
// to 16-bit beforehand.
ab_i32s_vec = _mm512_dpwssd_epi32(ab_i32s_vec, a_vec, b_vec);
a2_i32s_vec = _mm512_dpwssd_epi32(a2_i32s_vec, a_vec, a_vec);
b2_i32s_vec = _mm512_dpwssd_epi32(b2_i32s_vec, b_vec, b_vec);
```
Hi @ashvardanian, thanks for your support.
Sharing the details as you asked.

`index.hardware_acceleration` output:

- USearch version: 2.12.0
- OS: Ubuntu 22.04.4 LTS
- CPU model: 13th Gen Intel® Core™ i5-13600HX × 20
That means this kernel is used. I've just tried implementing a few tests for it on random data, but I can't yet see a noticeable difference in distance calculations between that and the serial code. Any chance you have an example of two arrays for which you get a big error when calling:

```python
from simsimd import cos

cos(a, b)
```
Hey, we will check it later. I don't know how it is used later in usearch, but we see the problem with the `dot` distance, not `cos`. However, I don't see the dot implementation there. The reason we use dot is that our vectors are already normalized, so cos and dot should give the same results.
On integer representations like `i8` the conventional notion of normalization doesn't apply, as there are no values between 0 and 1. So anything that computes a dot product of integer arrays would still need to renormalize them. USearch will swap `ip` with `cos` under the hood.
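To illustrate why the renormalization is needed (this is a plain NumPy sketch, not USearch's internal code path, and the scale factor of 100 is only the linear mapping discussed above):

```python
import numpy as np

# Two unit-norm float vectors: their dot product is already the cosine similarity.
a = np.random.randn(1024); a /= np.linalg.norm(a)
b = np.random.randn(1024); b /= np.linalg.norm(b)
print(float(a @ b))  # a value in [-1, 1]

# After linear scaling to integer buckets, the vectors are no longer unit-norm,
# so a raw integer dot product must be divided by the integer norms again.
a_i8 = np.round(a * 100).astype(np.int8).astype(np.int32)
b_i8 = np.round(b * 100).astype(np.int8).astype(np.int32)
dot_i8 = int(a_i8 @ b_i8)                                  # large integer, not a similarity yet
cos_i8 = dot_i8 / (np.linalg.norm(a_i8) * np.linalg.norm(b_i8))
print(dot_i8, float(cos_i8))                               # cos_i8 approximates the float cosine
```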
Hi, I also tried to use `cos` for benchmarking usearch with f16 and i8 (here too, the `i8` performance is bad).
HNSW config:
Connectivity | Expansion Add | Expansion Search |
---|---|---|
32 | 128 | 64 |
Sharing the entire comparison we did for 1M embeddings of size 1024:
Method | Recall@1 |
---|---|
usearch f16(ip) | 0.98672 |
usearch i8(l2sq) | 0.91227 |
usearch f16(cos) | 0.98244 |
usearch i8(cos) | 0.76728 |
qdrant int8 | 0.96034 |
qdrant int8 (with rescore) | 0.98555 |
faiss int8 | 0.98278 |
Clearly, with `i8` the usearch metrics are poor; only `l2sq` gives somewhat reasonable results, and even those are lower than the other approaches.
Hi @ashvardanian, if int8 requires some special treatment on the user's side, could you provide an example of how to do it, or point us to docs for it? All libraries we have tried so far do the quantization under the hood, so things like bucketing, scaling, etc. are handled internally by the library.
In other words, I wonder what we can do on our side to get results similar to other frameworks for int8. There are two options: either we are doing something wrong, or there is indeed some problem with the int8 quantization.
hi @ashvardanian, do you have any update regarding this?
A bit overwhelmed by other USearch-related projects, and haven't had the chance to investigate deeper, @keshusharmamrt.
> All libraries we have tried so far do the quantization under the hood, so things like bucketing, scaling, etc. are handled internally by the library.
It's true, most other libraries come with a lot of options for different quantization schemes. We don't currently aim to enforce any of them or implement them in the C++ layer. But maybe we could add a couple of simple quantization schemes as utilities in the Python layer for now. Any chance you can contribute, @kmkolasinski?
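As an illustration of what such a Python-layer utility might look like, here is a hypothetical sketch (not an existing USearch API; the function name, the percentile-based clipping, and the return type are all assumptions) of a symmetric rescaler that prepares floats for an `i8` index:

```python
import numpy as np

def rescale_for_i8(embeddings: np.ndarray, clip_percentile: float = 100.0) -> np.ndarray:
    """Scale float embeddings into [-1, 1], optionally clipping outliers first,
    so that a downstream i8 index can bucket them without overflow."""
    limit = np.percentile(np.abs(embeddings), clip_percentile)
    scaled = np.clip(embeddings, -limit, limit) / limit
    return scaled.astype(np.float32)  # hand the rescaled floats to the index
```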
Hi, I can't promise it. The results are weird and at this moment it looks like there is some bug.
We can check for bugs in the SimSIMD kernels, but they may already be sufficiently covered with tests.
OK, last question: let's say I would like to use int8 quantization and the dot product metric. Are there some requirements on the vectors to make it work properly? Do you have some example code which we could use and benchmark on our data for such a scenario?
Sure, the most obvious benchmark/example would be to generate random integers (instead of quantizing random floats), construct the index, and measure self-recall. That should work fine.
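For concreteness, a minimal self-recall check along those lines might look as follows; the sizes and metric here are arbitrary choices for the sketch, not an official benchmark:

```python
import numpy as np
from usearch.index import Index

ndim, count = 256, 100_000
vectors = np.random.randint(-100, 101, size=(count, ndim), dtype=np.int8)

index = Index(ndim=ndim, metric="cos", dtype="i8")
index.add(np.arange(count), vectors)

# Query the index with its own vectors: each should find itself as the top hit.
matches = index.search(vectors, 1)
self_recall = float((matches.keys[:, 0] == np.arange(count)).mean())
print(f"self-recall@1 = {self_recall:.5f}")
```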
Hi @ashvardanian @kimihailv, we tried one last thing. We took the benchmarking script from https://github.com/unum-cloud/usearch-benchmarks/tree/main and tried to compare the metrics on a smaller dataset, `base.10M.fbin`. Here is what we got:

We thought we could use `groundtruth.10K.ibin` to verify the metrics together with `query.10K.fbin` and `base.10M.fbin`, but it seems `groundtruth.10K.ibin` is not compatible with these query and train vectors.

The repository only points to the 1B dataset, which is too large for us to process. Unfortunately, we couldn't find the 100M dataset mentioned in the usearch benchmarking blog post.
The vectors in `query.10K.fbin` are different from the official `query.public.10K.fbin` from https://research.yandex.com/blog/benchmarks-for-billion-scale-similarity-search, which means the query vectors have been selected somehow, and that raises various questions in our minds. For example, how can we reproduce the metrics from this blog post (https://www.unum.cloud/blog/2023-11-07-scaling-vector-search-with-intel)?
Also, there is a bug in the `recall` computation in the benchmarking code. In run.py the `groundtruth` array has shape [10_000, 1] (https://github.com/unum-cloud/usearch-benchmarks/blob/main/run.py#L49):

```python
groundtruth = load_matrix(groundtruth_path)[:, :1]
```

Later you use these `groundtruth` indices to compute recall (https://github.com/unum-cloud/usearch-benchmarks/blob/main/run.py#L82):

```python
recall_at_one = recall(nn_ids, groundtruth, ats=(1,))
```

but this should be replaced with

```python
recall_at_one = recall(nn_ids, groundtruth[:, 0], ats=(1,))
```

or the `recall` function should be fixed, as it adds a new axis to the tensor, causing weird behaviour. A small example which demonstrates the issue:
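The original example isn't reproduced here, but the broadcasting pitfall can be sketched with plain NumPy, assuming the recall helper inserts a trailing axis on the ground-truth array before comparing it with the retrieved keys:

```python
import numpy as np

nn_ids = np.array([[0, 7], [1, 9], [5, 2]])  # top-2 retrieved keys for 3 queries
gt_1d = np.array([0, 1, 2])                  # expected ground-truth shape: (n_queries,)
gt_2d = gt_1d[:, None]                       # shape (n_queries, 1), as loaded in run.py

# With a 1-D ground truth, adding an axis gives (3, 1) and compares row-wise: correct.
print((nn_ids == gt_1d[:, None]).shape)      # (3, 2)

# With the 2-D ground truth, the same step gives (3, 1, 1), which broadcasts
# against (3, 2) into a (3, 3, 2) all-pairs comparison: every query's ground
# truth is matched against every other query's neighbors.
print((nn_ids == gt_2d[:, None]).shape)      # (3, 3, 2)
```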
We tried to reproduce the results using the `base.10M.fbin` vectors, but we reach the same conclusions as with our own data: int8 quantization is about 10% worse than f16. Here is the code you can use to verify it yourself:
```python
from tqdm import tqdm
import numpy as np
from usearch.index import Index
from usearch.io import load_matrix
from utils.metrics import recall

train_embeddings = load_matrix("data/base.10M.fbin")[:500_000]
query_embeddings = load_matrix("data/query.10K.fbin")


def best_match(query):
    return (train_embeddings @ query[:, None]).ravel().argmax()


groundtruths = []
for query in tqdm(query_embeddings, desc="Computing ground truths"):
    best_index = best_match(query)
    groundtruths.append(best_index)
groundtruths = np.array(groundtruths)

print("Benchmark f16 index")
i_half = Index(ndim=96, metric="cos", dtype="f16")
i_half.add(None, train_embeddings, log=True)
result = i_half.search(query_embeddings, 1, log=True)
recall_at_one = recall(result.keys, groundtruths, ats=(1,))
print(f" >> recall@1 = {recall_at_one[0]:.5f}")

print("Benchmark i8 index")
i_quarter = Index(ndim=96, metric="cos", dtype="i8")
i_quarter.add(None, train_embeddings, log=True)
result = i_quarter.search(query_embeddings, 1, log=True)
recall_at_one = recall(result.keys, groundtruths, ats=(1,))
print(f" >> recall@1 = {recall_at_one[0]:.5f}")
```
After running this code you will see:

```
Benchmark f16 index
Add: 100%|██████████| 500000/500000 [00:17<00:00, 28269.71vector/s]
Search: 100%|██████████| 10000/10000 [00:00<00:00, 33128.90vector/s]
 >> recall@1 = 0.97220
Benchmark i8 index
Add: 100%|██████████| 500000/500000 [00:11<00:00, 42828.76vector/s]
Search: 100%|██████████| 10000/10000 [00:00<00:00, 51415.34vector/s]
 >> recall@1 = 0.86640
```
I think this is clear evidence that there is a bug in the int8 index, and that we cannot simply solve it by adding some custom wrapper around the Index class. Hope this helps somehow.
@ashvardanian final remarks and conclusions after spending more time on this issue:
- Indeed, scaling the vector values to the range [-1, 1] improves recall for the test vectors of dim=96. Something that bothers me is that with this approach we are losing some potential information: the int8 range is (-128, 127) but usearch only uses (-100, 100), so with this truncation we drop about 20% of the potential resolution while bucketing. This 20% may be important in some cases.
- I think this [-1, 1] scaling should be explicitly mentioned in the docs or README; maybe usearch could also validate the input vectors to warn the user when they are not scaled optimally.
- There is also one potential problem with int8 quantization which can be surprising for unaware users: if vector values are larger than +/-1, the quantization will create overflow issues. A sketch of the effect is shown below; you can imagine that such a value flip can introduce dramatic changes in the distance values.
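A self-contained sketch of the wrap-around effect referenced in the last point (plain NumPy, not USearch's actual code path, assuming the linear mapping onto [-100, 100] described earlier in the thread):

```python
import numpy as np

def to_buckets(x, factor=100):
    # Mimic a linear mapping of [-1, 1] onto integer buckets in [-100, 100].
    return np.round(np.asarray(x) * factor).astype(np.int64)

in_range = to_buckets([0.99, -0.50, 0.25])
out_of_range = to_buckets([1.40, -1.30, 2.00])   # values the scheme does not expect

print(in_range)                       # [ 99 -50  25] -> fit into int8
print(out_of_range)                   # [ 140 -130  200] -> outside int8's [-128, 127]
print(out_of_range.astype(np.int8))   # [-116  126  -56] -> signs flip after the wrap
```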
It looks like there is indeed a bug in the "Inner Product" metric: according to the documentation and the code, IP should compute `1 - dot(a, b)`, but when testing it locally it returns different values; here is an example, and different results for f16. From this you can see that IP is actually computing `dot(a, b) / (||a|| * ||b||)`. This also explains why the IP metric produces terrible recall, as the code expects a distance, not a similarity measure. However, I don't see where this is defined in the code; I found this, which seems to be correct: https://github.com/unum-cloud/usearch/blob/5ea48c87c56a25ab57634a8f207f80ae675ed58e/include/usearch/index_plugins.hpp#L977. I use usearch version 2.12.0.
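The screenshots aren't reproduced here, but the difference between the two formulas mentioned above can be checked independently of USearch with a few lines of NumPy; this only contrasts `1 - dot(a, b)` with the normalized quantity on non-normalized inputs and makes no claim about the library's internals:

```python
import numpy as np

a = np.array([0.3, 0.4, 0.5], dtype=np.float32)
b = np.array([0.1, 0.9, 0.2], dtype=np.float32)

documented_ip = 1.0 - float(a @ b)                       # 1 - dot(a, b), as documented
normalized_sim = float(a @ b) / float(np.linalg.norm(a) * np.linalg.norm(b))

print(documented_ip, normalized_sim)  # the two only agree for specially scaled inputs
```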
@keshusharmamrt @adolfogc @kmkolasinski, I believe this issue is resolved now. I've extended the docs to highlight the improved behavior. Let me know if the issue persists.
Okay, sure, and thanks. Let's close it; in case of problems we will create a new ticket :+1:
Describe the bug
Hi, I am using USearch with the Integer 8 (i8) index dtype. I have 928K embeddings which I try to add to both an i8 and a float16 index with the Inner Product metric, and then I search the first 10 embeddings to get the top-10 matches. With float16, each query vector is recognized among its own top-10, but with i8 it is not. Also, when I change the metric from MetricKind.InnerProduct to MetricKind.L2sq, those vectors start appearing, but the distances are scaled very differently (distances for IP are on the order of 0.61365813 and for l2sq around 4517; maybe this is also because of i8), which I am not able to understand. Also, adding embeddings with MetricKind.InnerProduct and the i8 dtype is too slow.
Steps to reproduce
I use the following code to create and populate the index with the i8 dtype:
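(The original snippet was attached as an image. A rough reconstruction based on the parameters discussed later in the thread would look something like the following; the file name, dimensions, and variable names are assumptions, not the reporter's actual code.)

```python
import numpy as np
from usearch.index import Index, MetricKind

embeddings = np.load("embeddings.npy")  # assumed: ~928K float vectors

index = Index(
    ndim=embeddings.shape[1],
    metric=MetricKind.IP,
    dtype="i8",
    connectivity=32,
    expansion_add=128,
    expansion_search=64,
)
index.add(np.arange(len(embeddings)), embeddings, log=True)
matches = index.search(embeddings[:10], 10)
print(matches.keys)
print(matches.distances)
```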
I get the following results:
When I only change `metric=MetricKind.L2sq`, I start getting the expected embeddings (the same index as the first key). Note: with other dtypes like f16 and f32, MetricKind.InnerProduct works completely fine.

Expected behavior

Ideally, I should get something like this with i8 + MetricKind.InnerProduct. The following is the result I get with MetricKind.L2sq:
USearch version
2.11.7
Operating System
Ubuntu 22.04
Hardware architecture
x86
Which interface are you using?
Python bindings
Contact Details
keshusharmamrt@gmail.com
Are you open to being tagged as a contributor?

I am open to being tagged in the `.git` history as a contributor

Is there an existing issue for this?
Code of Conduct