kakao / n2

TOROS N2 - lightweight approximate Nearest Neighbor library which runs fast even with large datasets
Apache License 2.0
569 stars 70 forks source link

How dose n2 handle zero vectors? #38

Closed huhk-sysu closed 3 years ago

huhk-sysu commented 3 years ago

Hi, I may have some zero vectors in my data, and I'm using the "angular" distance metric. According to the formula (1 - {sum(p i · q i) / sqrt(sum(p i · p i) · sum(q i · q i))}), the denominator will become zero when any of (p, q) is a zero vector. I have done some experiments and found n2 doesn't report any warning or error when using zero vectors with "angular". Could you let me know how does n2 treat the zero vectors? Like, what's the definition of angular distance in such situation, and where would a zero vector rank when querying the nearest neighbors? Thank you!

gony-noreply commented 3 years ago

Hi @huhk-sysu. The formula you mentioned can be divided into two parts, sum(p i · q i) and sqrt(sum(p i · p i) · sum(q i · q i)). If you apply L2 normalization to the origin vector P / Q, the value of the second part becomes 1 and can be ignored. With the normalized vector P' and Q', the angular distance is calculated as 1 - sum(p' i · q' i) and equals with 1 - dot(P', Q'). Indexed vector P is normalized once at build time, query vector Q can be normalized once just before search according to exact distance need. Since L2 normalization can reduce the amount of computation, most ANN library uses a similar approach. This can be easily found in the ANN benchmark as well, see here

Because of the above, I think that if I answer the zero vector processing method in L2 normalization, It will be a sufficient answer to your question. Since the zero vector can't be normalized, It becomes a zero vector as it is. The scikit-learn handles similarly, see here.

So the angular distance calculates with zero vector is 1.0, for angular distance, 1.0 is a very large value, so there is usually no problem. However, if the zero vector is a query vector, all distances are 1.0, which is a problem. In this case, it may be better to use a different metric.

huhk-sysu commented 3 years ago

Thank you!

Indexed vector P is normalized once at build time

You have mentioned L2 normalization here. Should I manually apply normalization to the vectors before building index and searching, or n2 would handle it for me?

gony-noreply commented 3 years ago

n2 would handle it for me?

Yes, n2 handles it internally. You can just use n2 with the usual angular distance meaning!