Duplicate vector handling

fjsj commented 3 years ago

Hi, thanks for open sourcing this great library! This is more a question than an actual issue: does n2 handles duplicate vectors without performance degradation or recall issues?

Other ANN libraries tend to suffer with that. See:

gony-noreply commented 3 years ago

does n2 handles duplicate vectors without performance degradation or recall issues?

Yes, since version 0.1.7

The HNSW algorithm doesn't work efficiently on duplicate vectors. We thought this was because the heuristic neighbor selection algorithm focused only on navigation. With the heuristic neighbor selection, duplicate or near-duplicate vectors are hidden and search becomes difficult, resulting in a low recall.

To solve this, we modified the heuristic neighbor selection algorithm and improved it in a form that has some nearest neighbors but does not degrade navigation performance.

Below is one of the benchmarks measured for the 0.1.7 release, and GIST has duplicate vectors(about 2% of train vectors are duplicated)

You can see a high recall compared to N2 version 0.1.6

Handling duplicate vectors have a tradeoff relationship with navigation performance, the way we handled it may not be optimal. So we are continuing to work to find if there is a better way.

fjsj commented 3 years ago

It's awesome that you're tackling this problem, thank you very much for the detailed response. Please feel free to close the issue if you wish.

gony-noreply commented 3 years ago

If we found another achievement for that problem, I'll comment here.

kakao / n2

Duplicate vector handling #39