nmslib / hnswlib

Header-only C++/python library for fast approximate nearest neighbors
https://github.com/nmslib/hnswlib
Apache License 2.0
4.12k stars 609 forks source link

To avoid incorrect usage that leads to duplicate elements. #487

Closed LIBA-S closed 11 months ago

LIBA-S commented 11 months ago

If the user needs to update an element, but the error enables replace_deleted when calling the addPoint interface, it may result in multiple elements with the same label. By adding additional restrictions within the interface, we can effectively help users use it correctly at a small cost.

yurymalkov commented 11 months ago

Hi @LIBA-S,

Thanks for the pull request! I wonder, maybe instead of throwing an error, use the default update the element (replace_deleted=false)?

I thought that replace_deleted is used to save RAM, which is also the case of updating the element. If the RAM logic applies, I guess there should not be a need to throw an exception.

LIBA-S commented 11 months ago

replace_deleted is a good option as it can reuse the memory space that has been marked as deleted when inserting new elements. However, if the user's intention is to update an existing element, but the replace_deleted is enabled at the same time, it may result in two nodes in the graph having the same user label. If the user relies on the label to retrieve documents, this could lead to returning duplicate data.

Consider the following sequence of operations:

  1. addPoint(vec_a_1, label_a, true)
  2. addPoint(vec_b_1, label_b, true)
  3. makeDelete(label_b)
  4. addPoint(vec_a_2, label_a, true) // In fact, the replace_deleted should not be enabled.

In this example, what the user actually intended to do was to update vec_a_1 with vec_a_2. However, due to incorrect use of replace_deleted, there are now two nodes in the graph with the same label_a. If the user searches for the top 3, two label_a nodes will be returned. If the user does not deduplicate label_a, this could lead to duplicate results. The purpose of this submission is to ensure a unique correspondence between user labels and valid nodes in the graph, making it easier for users to troubleshoot issues.

yurymalkov commented 11 months ago

Yeah, I see your point about the current incorrect behavior. I wonder, if we should alter the code, so it would ignore replace_deleted in case the label exists (similar to the situation when we don't have any deleted element), instead of throwing an exception?

LIBA-S commented 11 months ago

Sorry, I didn't get the point at first. Ignoring the user's mistake may be a good option from a user experience perspective. However, I believe that if the user makes a mistake, it should be corrected. At the same time, this approach may be more controllable and cost-effective.

yurymalkov commented 11 months ago

Can you explain in more details why it should be considered a user mistake and user should be alerted about it?

LIBA-S commented 11 months ago

In fact, when we used this interface in the project, we encountered the execution sequence in the above example, which caused the search to return two data with the same label. Choosing to help users ignore this option is a good choice and can avoid the above problems. Our starting point for choosing to report an error is to consider code stability and performance. The current change only affects the path of adding new nodes, and the impact is small and the execution is fast. Once the user corrects the usage by throwing an exception, the performance loss brought by this change can be almost negligible. If we choose to ignore it, there will inevitably be more pre-judgment logic, which will lead to a larger code change.

From the perspective of code design, I think your choice is more correct, and this should not be considered an exception. I plan to revoke this MR, thank you for your patient response.

yurymalkov commented 11 months ago

Hi @LIBA-S,

If we choose to ignore it, there will inevitably be more pre-judgment logic, which will lead to a larger code change. Can you please elaborate on that?

Note that I do not argue that there is bug that needs to be fixed. I want to understand what is the best bug solution in terms of long-term API (which will be set in stone after this change). I see at least two solutions: 1) treat replace_deleted=false for the elements that are updated as a user mishandling the API as proposed in the PR. 2) ignore replace_deleted=false for the elements that are updated. Basically calling addPoint(data_point, label, -1); in the condition https://github.com/nmslib/hnswlib/pull/487/files#diff-ab27cbb27975c68cb0c6da824871058623f7f76a761c3c8365ef2e1395cf7cd9R898-R903 similar to the case when there is no vacancy.

I thought that 2 is the better option which also (I think) fixes the stability, but I might be not understanding side effects of that decision (so asking for clarifications).