spotify / voyager

🛰️ An approximate nearest-neighbor search library for Python and Java with a focus on ease of use, simplicity, and deployability.
https://spotify.github.io/voyager/
Apache License 2.0
1.27k stars 52 forks source link

Support proper updates in StringIndex #3

Open dylanrb123 opened 1 year ago

dylanrb123 commented 1 year ago

When you call index.add with an existing ID but a new vector, Voyager will correctly locate the correct spot in the graph to place the new vector and update all of the existing connections. However, the StringIndex abstraction does not support this behavior. As written, the string index will add a new vector and new ID if index.add is called with a name that is already present in the index, instead of updating the existing one.

Instead of adding a new item and keeping the old one around, Voyager should check the existing items list and update the underlying index accordingly with the new vector value so we don't end up with duplicate items in the index.

karchx commented 1 year ago

Hi @dylanrb123 Just as a doubt, does that imply doing a search algorithm on the HNSW algorithm to find the elements that already exist?

dylanrb123 commented 11 months ago

Hi @karchx, I updated the description of the issue to make it more clear. I'm not sure I fully understand your question but let me know if anything is still unclear.

samek commented 10 months ago

@dylanrb123 you mean like checking If we already have this name in it and if use index of that name ? eg.

public void addItem(String name, float[] vector) {
    int updateIndex = names.indexOf(name);
    if (updateIndex==-1) {
      int nextIndex = names.size();
      index.addItem(vector, nextIndex);
      names.add(name);
    } else {
      index.addItem(vector, updateIndex);
    }
}
markkohdev commented 5 months ago

Moving discussion about this from #64 to here. Relevant comments:

From @markkohdev

In regards to upserting, after chatting with @dylanrb123 about this change, we don't think that this approach is how we would like to see this implemented. The implementation you have here results in traversal of the entire names list upon every addition in the worst case.

In order to support upserting functionality I think we would need a slightly more involved solution. I'll be implementing StringIndex in the C++ core soon enough and will implement this functionality there (and deprecate this class).

I would recommend removing the upsert functionality for the moment and reducing the scope of this PR. However, if you are very motivated to enable this functionality before StringIndex support in voyager core comes along we can discuss how we might support this using a naive implementation in Java (effectively utilizing parallel HashMaps instead of an ArrayList for managing the strings)

From @samek

Ok, I've removed the functionality. But I'm keen on making this until we have whole string index in C++ core. What did you have in mind should we use ConcurrentHashMap ?

markkohdev commented 5 months ago

As mentioned above, how we would probably want to see this implemented is a bit complicated as there are performance/memory implications for this to be enabled.

Let's start with how this would work in its basic form:

Some details for consideration

^ cc @dylanrb123

dylanrb123 commented 5 months ago

I have a WIP branch with this partially implemented using two ConcurrentHashMaps https://github.com/spotify/voyager/tree/upsert-string-index. As @markkohdev mentioned, that approach will substantially increase memory usage to support a feature that most people probably don't care about which is why I didn't finish the implementation. IMO it would be better to just wait until we pull this into the C++ code or make it configurable as Mark suggested.

samek commented 5 months ago

@dylanrb123 for me it's important to be able to update existing record. Now question is will you finish this the way @markkohdev suggested or should I give it a go?