nmslib / hnswlib

Header-only C++/python library for fast approximate nearest neighbors
https://github.com/nmslib/hnswlib
Apache License 2.0

Cosine normalized vectors and recall benchmarking #353

Open · leoplusx opened this issue 2 years ago

leoplusx commented 2 years ago

Two questions regarding hnswlib's normalized vectors when using cosine:

Question 1:

When I benchmark my hnswlib indices for recall, I'm getting near-zero scores -- values like 3.41130717333766e-08.

The way I'm measuring recall is that I'm extracting ids and vectors from the index itself:

ids = hnsw.get_ids_list()
data = hnsw.get_items(ids)

That seemed convenient, especially for indices I'm loading from files, because then I don't also need to load the original data file.

However, with indices using the cosine space I have noticed that the vectors stored inside the index look very different from those that went in during the indexing process. These are, as I learned here, normalized vectors.

Is that the reason I'm getting these near-zero recall rates?

Do I, when using cosine indices, need to benchmark using the original, un-normalized vectors?
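
For what it's worth, here is how I could check that the stored vectors are just normalized copies (a sketch, assuming hnsw is the loaded cosine index, ids its labels, and original_vectors the exact array I passed to add_items):

import numpy as np

# vectors stored in a cosine-space index are L2-normalized copies of the input
stored = np.asarray(hnsw.get_items(ids))
norms = np.linalg.norm(original_vectors, axis=1, keepdims=True)
print(np.allclose(stored, original_vectors / norms, atol=1e-6))  # expect: True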

Question 2:

When loading a saved index, it seems that I need to initialize an index first. One of your code snippets says:

the space can be changed - keeps the data, alters the distance function

So what if I build an index using space="l2", save it to file, and then load this index file into an empty index that I initialize using space="cosine"?

Would that mean that,

  1. the vectors will remain non-normalized, i.e. exactly as I added them, and
  2. I'd still be able to perform nearest neighbor search using the cosine function?

If so, that would essentially amount to the feature of doing cosine search with un-normalized vectors that has been discussed in some of the GitHub issues. Is that correct?
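
In other words, something like this (a sketch of what I have in mind; save_index and load_index are the standard hnswlib calls, and dim, num_elements, vectors, ids are placeholders):

import hnswlib

# build and save an index in L2 space
p = hnswlib.Index(space='l2', dim=dim)
p.init_index(max_elements=num_elements)
p.add_items(vectors, ids)
p.save_index("index_l2.bin")

# re-open the same file, but declare the space as cosine
q = hnswlib.Index(space='cosine', dim=dim)
q.load_index("index_l2.bin", max_elements=num_elements)
# the stored vectors stay exactly as added -- does knn_query now rank by cosine?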

Question 3:

Actually, one more question...

If I measure recall for a large index with tens of millions of items, could I just query for, say, 10% of the items?

In my testing, I got slightly different results: when querying 100% of the vectors, I got 0.93, and when querying only 10%, I got 0.87, for instance. But if all I'm trying to do is optimize my parameters (M and ef_construction), I'd still be able to compare the numbers as long as I'm always querying the same fraction of items, wouldn't I?

I am asking because querying only a subset of items saves a lot of time and compute during benchmarking.
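
For clarity, the kind of shortcut I mean (a sketch; query_fraction and n_queries are just illustrative, and hnsw, ids, vectors are as above):

import numpy as np

query_fraction = 0.10
n_queries = int(len(ids) * query_fraction)

# benchmark with only a fixed subset of the items as queries
sample_ids = np.asarray(ids)[:n_queries]
sample_vectors = np.asarray(vectors)[:n_queries]

labels, distances = hnsw.knn_query(sample_vectors, k=1)
recall = np.mean(labels.reshape(-1) == sample_ids)
print(f"Recall : {recall:,.2f}")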

Thanks!

yurymalkov commented 2 years ago

Hi @leobaumgardt

Q1: Yes, the vectors are normalized during insertion inside the Python bindings, to avoid doing it in the actual distance calculation; in C++, cosine then uses the same calculation as inner product with normalized queries. It should not affect the recall, though, if you calculate the brute-force ground truth for cosine distance (which is invariant to normalization). Can you share a code snippet that reproduces the problem (e.g. on random data)?

Q2: Yeah, I hadn't thought about that. The transfer only works between l2 and inner product; I need to update the description. When you open the index with cosine, it does not normalize the stored vectors (which I guess could be fixed), so it basically does inner product with a normalized query, which yields the same ranking as inner product, but the distances will be off. I guess we could add a separate unnormalized cosine space (it is not hard to implement, at least without manual vectorization), though the performance would not match. Let me know if this is of interest.

Q3: I am not totally sure I understand "query for, say, 10% of the items". If you mean reducing the number of test queries, then yes, you should get a good estimate of the recall as long as 1) the subset of queries is random (meaning you subsample randomly), and 2) the set is large enough to give a stable estimate (though I guess 1000 queries would be enough to distinguish between 0.87 and 0.93).
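
Coming back to Q1, a small sketch (plain numpy, not library code) of what "invariant to normalization" means for the brute-force ground truth; the helper cosine_nn is purely illustrative:

import numpy as np

vectors = np.random.random((1000, 16)).astype(np.float32)
normalized = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

def cosine_nn(queries, base):
    # cosine similarity = dot product of L2-normalized vectors
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    b = base / np.linalg.norm(base, axis=1, keepdims=True)
    return np.argmax(q @ b.T, axis=1)

# the ground-truth neighbors are the same for raw and pre-normalized queries
print(np.array_equal(cosine_nn(vectors, vectors),
                     cosine_nn(normalized, vectors)))  # True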

leoplusx commented 2 years ago

Hi @yurymalkov!

Thanks for your response.

Re: Q1:

You're saying that it shouldn't matter whether I query with the non-normalized vectors (i.e. those I provided during import) or with the normalized ones (i.e. those stored in the index after importing with space="cosine"). So these two code snippets should yield the same results:

Snippet 1

ids = list_of_original_ids_used_during_import
vectors = list_of_original_vectors_used_during_import
labels, distances = hnsw.knn_query(vectors, k=1)
recall = np.mean(labels.reshape(-1) == ids)
print(f"Recall : {recall:,.2f}")

Snippet 2

ids = hnsw.get_ids_list()
vectors = hnsw.get_items(ids)
labels, distances = hnsw.knn_query(vectors, k=1)
recall = np.mean(labels.reshape(-1) == ids)
print(f"Recall : {recall:,.2f}")

Recall should be the same in both instances, even though the actual vectors do look different. Is that correct?

Re: Q2:

Cool, thanks! No, I don't think I'd need this separate unnormalized cosine space. But what you're saying helps me understand how things work under the hood a little better!

Re: Q3:

Yes, I meant reducing the number of test queries.

The way I've been doing it now is this:

# import 39M data points
ids = list_of_original_ids_used_during_import
vectors = list_of_original_vectors_used_during_import

# limit to 1M
ids = ids[:1_000_000]
vectors = vectors[:1_000_000]

# query vectors for themselves and measure recall
hnsw.set_ef(ef_construction)
labels, distances = hnsw.knn_query(vectors, k=1)
recall = np.mean(labels.reshape(-1) == ids)
print(f"Recall : {recall:,.2f}")

I built multiple hnsw indices on this same set of 39M labels and vectors to test different values for ef_construction and M, and then ran the above code to compare their recall.

Since all indices are built with the same 39M items, and I'm always querying for the same first 1M of those, it would seem to me that I would not need to randomize.

For some reason, though, I'm always getting recall = 0.78. Even if I re-run the benchmark using different values for ef, I keep getting 0.78.

I find this odd. I would have expected, for instance, that recall should be higher on indices built with higher values for ef_construction/M, and also higher when running queries with a higher ef setting. So I'm wondering if I'm doing the benchmarking wrong here.

yurymalkov commented 2 years ago

Q1: Yes, it is easy to test:

import hnswlib
import numpy as np

dim = 64
num_elements = 10000
ids = np.arange(num_elements)
vectors = np.random.random((num_elements, dim))

p = hnswlib.Index(space='cosine', dim=dim)  # possible options are l2, cosine or ip
p.init_index(max_elements=num_elements, ef_construction=200, M=16)
p.add_items(vectors, ids)

p.set_ef(2)  # deliberately low ef, so recall is visibly below 1.0

# query with the original (un-normalized) vectors
labels, distances = p.knn_query(vectors, k=1)
recall = np.mean(labels.reshape(-1) == ids)
print(f"Recall : {recall:,.5f}")

# query with the normalized vectors stored inside the index
vectors1 = p.get_items(ids)
labels, distances = p.knn_query(vectors1, k=1)
recall = np.mean(labels.reshape(-1) == ids)
print(f"Recall : {recall:,.5f}")

Q3: One thing mentioned above is that you need to shuffle the vectors list before you take [:n]. As for why ef does not affect the recall, that seems like a bug (it could be anything).
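
A sketch of the shuffle (assuming ids and vectors are numpy arrays covering the full 39M set, and hnsw is the index under test):

import numpy as np

# sample 1M random items instead of the first 1M
perm = np.random.permutation(len(ids))[:1_000_000]
sample_ids = ids[perm]
sample_vectors = vectors[perm]

labels, distances = hnsw.knn_query(sample_vectors, k=1)
recall = np.mean(labels.reshape(-1) == sample_ids)
print(f"Recall : {recall:,.2f}")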