nmslib / hnswlib

Header-only C++/python library for fast approximate nearest neighbors
https://github.com/nmslib/hnswlib
Apache License 2.0

bad recall performance #341

Open · tangchen2 opened this issue 2 years ago

tangchen2 commented 2 years ago

I built an HNSW index (cosine distance) on 10 million vectors; however, recall is bad: some true matches (with very small cosine distance) are not recalled, while some other unrelated data is recalled. Would num_threads, M, efConstruction, or ef affect HNSW recall? (BTW, in my experiment M=72, efConstruction=200, ef=2000.)
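
For reference, a minimal sketch of how an index with these parameters might be built and queried through hnswlib's Python bindings (random data stands in for the real embeddings, and the element count is scaled down here for runnability):

```python
import numpy as np
import hnswlib

dim = 128                 # placeholder; use the real embedding dimension
num_elements = 100_000    # the issue uses ~10 million elements

# Stand-in data; in practice these are the real embeddings and ids.
data = np.float32(np.random.random((num_elements, dim)))
ids = np.arange(num_elements)

index = hnswlib.Index(space='cosine', dim=dim)                           # cosine distance
index.init_index(max_elements=num_elements, ef_construction=200, M=72)   # "cef" = ef_construction
index.add_items(data, ids)

index.set_ef(2000)                                                        # "sef" = query-time ef
labels, distances = index.knn_query(data[:5], k=10)
```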

yurymalkov commented 2 years ago

@tangchen2 Typically efConstruction should be bigger than or equal to ef, so that might be one of the issues. I wonder, does your data have duplicates (elements with almost exact values)? If so, deduping should bring a huge performance/accuracy gain.
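
As a hedged illustration of the deduplication idea (not part of hnswlib itself): one cheap first pass is to normalize the vectors and drop elements whose rounded values collide with an earlier element. This only catches exact or near-exact duplicates, not all near-duplicates.

```python
import numpy as np

def drop_near_exact_duplicates(vectors: np.ndarray, decimals: int = 4) -> np.ndarray:
    """Return indices of rows to keep, dropping later copies of (near-)identical rows.

    Rounding normalized vectors and hashing their bytes only catches exact/near-exact
    duplicates; it is a cheap first pass, not a full near-duplicate search.
    """
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)
    seen = set()
    keep = []
    for i, row in enumerate(unit):
        key = np.round(row, decimals).tobytes()
        if key not in seen:
            seen.add(key)
            keep.append(i)
    return np.asarray(keep)

# usage: keep_idx = drop_near_exact_duplicates(data); index.add_items(data[keep_idx], ids[keep_idx])
```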

Jowekk commented 2 years ago

I also encountered this problem; in my experience HNSW's recall drops significantly on large datasets.

I have 5000 pairs of face-matching features. I mix one set of 5000 into a completely unrelated face gallery dataset and use the other 5000 as queries to retrieve them.

M = 48, efConstruction = 200, feature dim = 384.

Here are the results of HNSW:

| gallery size | top[1] | top[5] | top[10] | top[20] | top[50] | top[100] | top[256] | top[512] | top[1024] |
|---|---|---|---|---|---|---|---|---|---|
| 1 million | 0.772950 | 0.818441 | 0.836393 | 0.850265 | 0.866585 | 0.880661 | 0.894941 | 0.906365 | 0.915749 |
| 5 million | 0.672787 | 0.714810 | 0.728070 | 0.741942 | 0.758262 | 0.767238 | 0.777642 | 0.785190 | 0.790698 |

Here is the GPU flat (brute-force) search result for comparison:

| gallery size | top[1] | top[5] | top[10] | top[20] | top[50] | top[100] | top[256] | top[512] | top[1024] |
|---|---|---|---|---|---|---|---|---|---|
| 5 million | 0.725622 | 0.780090 | 0.801102 | 0.821297 | 0.847001 | 0.861485 | 0.879845 | 0.893921 | 0.907181 |

Am I doing something wrong? If necessary, I can share my datasets.
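
As an aside, if a GPU brute-force index is not at hand, hnswlib also ships a CPU brute-force index (hnswlib.BFIndex) that can serve as the exact baseline on a subset of the gallery; a minimal sketch, assuming data, ids, and queries are float32 numpy arrays:

```python
import hnswlib

dim = 384
bf = hnswlib.BFIndex(space='cosine', dim=dim)    # exact linear-scan index
bf.init_index(max_elements=len(data))
bf.add_items(data, ids)

true_labels, true_distances = bf.knn_query(queries, k=100)   # ground truth for recall checks
```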

Jowekk commented 2 years ago

I checked the discussion in https://github.com/nmslib/hnswlib/issues/164. To check whether there is duplicate data, I used the 5-million gallery dataset to query itself:

| top k | hits / total | rate |
|---|---|---|
| top 1 | 4861229 / 5000000 | 0.972246 |
| top 5 | 4999996 / 5000000 | 0.999999 |
| top 10 | 5000000 / 5000000 | 1 |
| top 20 | 5000000 / 5000000 | 1 |
| top 50 | 5000000 / 5000000 | 1 |

So there is a small amount of duplicate data in my gallery dataset, but most of it does not appear to be duplicated. If other experimental details need to be supplemented, I can provide the corresponding experiments.
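
A sketch of that self-retrieval check (variable names are hypothetical; index is the HNSW index built over the gallery). Note that a top-1 miss here can mean either a near-duplicate or a search failure, so running the same check against a brute-force index helps separate the two:

```python
import numpy as np

labels, _ = index.knn_query(gallery_vecs, k=1)          # query the gallery against itself
self_hit_rate = np.mean(labels[:, 0] == gallery_ids)    # fraction whose top-1 result is themselves
print(f"top-1 self-retrieval rate: {self_hit_rate:.6f}")
```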

yurymalkov commented 2 years ago

@Jowekk I am not sure I understand. Is the recall you measure the recall of the full ML pipeline for the same person?

Jowekk commented 2 years ago

5000 pairs: paired images taken of 5000 different people. Gallery data: completely unrelated to those 5000 people.

I used one set of 5000 images as batch queries to recall the other 5000 paired images. If the top-1 result hits for 1000 of them, then top[1] = 1000 / 5000 = 0.2. Did I describe it clearly?
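
A minimal sketch of that metric, assuming query_vecs and paired_ids line up so that the i-th query's correct match has gallery id paired_ids[i]:

```python
import numpy as np

def topk_hit_rate(index, query_vecs, paired_ids, ks=(1, 5, 10, 20, 50, 100)):
    """Fraction of queries whose paired gallery id appears among the top-k results."""
    labels, _ = index.knn_query(query_vecs, k=max(ks))
    n = len(paired_ids)
    return {k: float(np.mean([paired_ids[i] in labels[i, :k] for i in range(n)])) for k in ks}
```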

yurymalkov commented 2 years ago

@Jowekk I see, I didn't notice you have brute-force results. I wonder, did you try to alter the ef parameter (and what happens if you set it to, say, 10000)?

Jowekk commented 2 years ago

ef = 10000 is too big to meet our latency requirements. When ef and the gallery size increase, a large number of random memory accesses occur, because neighbors are stored at random locations in memory; this random-access bottleneck makes QPS drop quickly. Do you think the number of neighbors scanned is too small? Here is a small experiment showing the number of neighbors scanned at different gallery sizes:

| gallery size | 10,000 | 100,000 | 1,000,000 |
|---|---|---|---|
| neighbors scanned | 9049 | 29881 | 49351 |

yurymalkov commented 2 years ago

Looking at how recall changes with ef, while keeping other parameters fixed, provides diagnostic information. E.g., if ef is extremely large and recall is still low, that hints that the graph has disconnected components, etc.
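
A sketch of such a sweep, reusing the hypothetical index, query_vecs, and paired_ids from above:

```python
import numpy as np

for ef in (200, 500, 1000, 2000, 5000, 10000, 50000):
    index.set_ef(ef)                                    # query-time search width
    labels, _ = index.knn_query(query_vecs, k=1)
    top1 = float(np.mean(labels[:, 0] == paired_ids))   # top-1 recall of the paired images
    print(f"ef={ef}  top[1]={top1:.6f}")

# If recall plateaus well below the brute-force numbers even at very large ef,
# that points at graph problems (e.g., disconnected components) rather than ef being too small.
```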

Jowekk commented 2 years ago

ok, this is a good idea worth trying. thank u