Same code but poor recall in different machine

sujigrena commented 3 years ago

Hi,

I tried to create an index for around 60k embeddings of size 128 in a highend machine with 36 cores (with parameters ef_construction=300, M=32, ef=10). The index file generated was around 42 MB. The recall was around 95%. When I used the same code in a low end machine with 4 cores, the recall dropped to 80%. Changing hnsw parameters didn't help much. The size of the index file generated was just 5MB. Can you please help me understand this difference and how to stabilize the index creation across different hardware.

yurymalkov commented 3 years ago

Hm. That sounds strange. The only thing I can imagine is that floating point/integer numbers are represented differently in the memory as hnswlib saves the memory. Can you provide the details on the used machines (architecture)?

sujigrena commented 3 years ago

Thanks for reply @yurymalkov . Adding the architecture from both machines: high end machine: Linux user 4.15.0-135-generic #139-Ubuntu SMP Mon Jan 18 17:38:24 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

low end machine: Linux user 4.19.167+ #1 SMP Tue Feb 2 07:59:55 PST 2021 x86_64 GNU/Linux

scott-weitze commented 3 years ago

This behavior is to be expected as an artifact of how the graph is constructed in parallel. It is important to note that the two graphs will be different when constructing each of them in parallel. The way to guarantee that the two graphs are exactly the same is to construct them on both machines using a singled thread for construction.

The worker threads that build the graph each individually add different elements to the graph. But for various reasons in the algorithm (maximal level of insertion, varying search complexity), the threads don't have equal workload, which in turn results in a change in the order in which the vertices are inserted.

I would be happy to elaborate more on why this is if you need a more technical description of why this is happening, or if my description was not clear.

nmslib / hnswlib

Same code but poor recall in different machine #317