Closed huhk-sysu closed 3 years ago
Hi @huhk-sysu.
The formula you mentioned can be divided into two parts, sum(p i · q i)
and sqrt(sum(p i · p i) · sum(q i · q i))
. If you apply L2 normalization to the origin vector P
/ Q
, the value of the second part becomes 1
and can be ignored. With the normalized vector P'
and Q'
, the angular distance is calculated as 1 - sum(p' i · q' i)
and equals with 1 - dot(P', Q')
. Indexed vector P is normalized once at build time, query vector Q can be normalized once just before search according to exact distance need. Since L2 normalization can reduce the amount of computation, most ANN library uses a similar approach. This can be easily found in the ANN benchmark as well, see here
Because of the above, I think that if I answer the zero vector processing method in L2 normalization, It will be a sufficient answer to your question. Since the zero vector can't be normalized, It becomes a zero vector as it is. The scikit-learn handles similarly, see here.
So the angular distance calculates with zero vector is 1.0
, for angular distance, 1.0 is a very large value, so there is usually no problem. However, if the zero vector is a query vector, all distances are 1.0, which is a problem. In this case, it may be better to use a different metric.
Thank you!
Indexed vector P is normalized once at build time
You have mentioned L2 normalization here. Should I manually apply normalization to the vectors before building index and searching, or n2
would handle it for me?
n2 would handle it for me?
Yes, n2 handles it internally. You can just use n2 with the usual angular distance meaning!
Hi, I may have some zero vectors in my data, and I'm using the "angular" distance metric. According to the formula (1 - {sum(p i · q i) / sqrt(sum(p i · p i) · sum(q i · q i))}), the denominator will become zero when any of (p, q) is a zero vector. I have done some experiments and found
n2
doesn't report any warning or error when using zero vectors with "angular". Could you let me know how doesn2
treat the zero vectors? Like, what's the definition of angular distance in such situation, and where would a zero vector rank when querying the nearest neighbors? Thank you!