jbellis / jvector

JVector: the most advanced embedded vector search engine
Apache License 2.0
1.47k stars 110 forks source link

Bad interaction between GraphSearcher pooling in GraphIndexBuilder and GraphIndexBuilder closing GraphSearcher #272

Open jkni opened 5 months ago

jkni commented 5 months ago

GraphIndexBuilder uses explicit thread-locals to pool GraphSearchers. It closes these GraphSearchers, which closes their views, which causes active view tracking to malfunction for OnHeapGraphIndex.

If we update GraphIndexBuilder to not close these views, they stay active indefinitely. We need to manage these views correctly, ideally while retaining most of the benefits of pooling.

jbellis commented 5 months ago

If/when we add back concurrent removeDeletedNodes, this is the PR where it got removed: https://github.com/jbellis/jvector/pull/273

jkni commented 5 months ago

The tension here is between the need to shorten the lifetime of OnHeapGraphIndex views and the GraphIndexBuilder pooling of GraphSearchers, which keep a view for their lifetime. We want GraphIndexBuilders to pool GraphSearchers to reduce allocations of associated data structure, but because the OHGI view is used as a reference point for concurrent deletions, it can't be retained indefinitely. Here are a few ideas I've tried, none of which are elegant:

There are a couple ways to approach this, but none neatly resolve the tension of wanting to keep OHGI views short-lived and keep searchers pooled. We could just special-case something.

jbellis commented 5 months ago

cc @mdogan since I think you're potentially interested in having a concurrent removeDeletedNodes

mdogan commented 5 months ago

I think GraphSearcher taking a view parameter externally but closing it looks a bit off, asymmetric. If GraphSearcher closes the view then it should create it upfront, not expect it to be created outside.

So in this sense, 1st option sounds fine to me, making GraphSearcher to take a graph. We can just add a new flag (resumable?) for search method or have two different methods for resumable and non-resumable searches.