opensearch-project / k-NN

🆕 Find the k-nearest neighbors (k-NN) for your vector data
https://opensearch.org/docs/latest/search-plugins/knn/index/
Apache License 2.0

[BUG] Race condition with approximate query leading to Index has already been closed exception #2262

Open kotwanikunal opened 7 hours ago

kotwanikunal commented 7 hours ago

What is the bug?

There is a race condition in the current code path for the native memory cache manager. An entry from the NativeMemoryCacheManager can be evicted before the query reads it, resulting in an "Index has already been closed" exception.

The issue stems from the current code flow, which acquires the read lock only after the entry has been loaded into the cache. This gap between the load and the lock acquisition creates a window in which the cache can evict the entry before the read lock is taken.

Code references:

    • Load to cache: KNNWeight.java#L286-L302
    • Read lock: KNNWeight.java#L313
    • Exception: KNNWeight.java#L316-L318
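
To make the window concrete, here is a simplified sketch of the query-side flow described above (the get()/readLock()/isClosed() calls follow the referenced KNNWeight lines, but the exact signatures may differ slightly): if force eviction runs between the load and the lock, the allocation is already closed by the time the lock is finally taken.

// Query thread (KNNWeight):
NativeMemoryAllocation indexAllocation = nativeMemoryCacheManager.get(entryContext, true);

// <-- window: with force eviction enabled, another thread's get() can evict this
//     exact entry here and close the underlying native allocation.

indexAllocation.readLock();          // the lock is taken only after the load completes
try {
    if (indexAllocation.isClosed()) {
        // the failure path reported in this issue
        throw new RuntimeException("Index has already been closed");
    }
    // ... run the query against the native index ...
} finally {
    indexAllocation.readUnlock();
}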

How can one reproduce the bug?

The error is reproducible with force eviction turned on while running an OSB vectorsearch benchmark with 5+ clients.

What is the expected behavior?

What is your host/environment?

kotwanikunal commented 7 hours ago

Proposed Solutions

  1. Evictable Flag

    • Add an evictable flag to the allocation object
    • Acts as a lock, released only when the read lock is acquired
    • Prevents deletion and indicates if the object was used after loading
    • Requires warmup to perform a separate cleanup (potentially a new API)
  2. Synchronized Block

    • Move NativeMemoryCacheManager.get() behind a synchronized block
    • Pros: Simple implementation
    • Cons: Reduces concurrency and may create a bottleneck for loading
  3. Reference Counter

    • Implement a reference counter mechanism
    • Increment the counter when the object is created
    • Requires explicit decrement by the user of the allocation object
    • Similar to solution 4 without introducing new constructs (a rough sketch follows this list)
  4. Modified Get API

    • Update the get API with a new parameter
    • Preemptively acquire the read lock as soon as the load is complete
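
For option 3, a rough sketch of how a reference counter could sit on the allocation object (retain()/release() and the initial count of 1 are illustrative choices, not existing plugin APIs):

import java.util.concurrent.atomic.AtomicInteger;

public class RefCountedAllocation {
    // 1 = the reference held by the cache itself
    private final AtomicInteger refCount = new AtomicInteger(1);

    // Called by NativeMemoryCacheManager.get() before handing the allocation out.
    public void retain() {
        refCount.incrementAndGet();
    }

    // Called by the user of the allocation when the query finishes, and by the
    // cache when it evicts the entry; native memory is freed only at zero.
    public void release() {
        if (refCount.decrementAndGet() == 0) {
            freeNativeMemory();
        }
    }

    private void freeNativeMemory() {
        // existing close/free logic would go here
    }
}
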
kotwanikunal commented 7 hours ago

Sample solution:

public NativeMemoryAllocation get(
        NativeMemoryEntryContext<?> nativeMemoryEntryContext,
        boolean isAbleToTriggerEviction,
        boolean acquirePreemptiveReadlock
) {

    ...

    boolean lockAcquired = false;
    try (nativeMemoryEntryContext) {
        nativeMemoryEntryContext.ensureAvailabilityOfAllocationObject();
        synchronized (this) {
            // Evict least-recently-used entries until the new entry fits.
            if (getCacheSizeInKilobytes() + nativeMemoryEntryContext.calculateSizeInKB() >= maxWeight) {
                Iterator<String> lruIterator = accessRecencyQueue.iterator();
                while (lruIterator.hasNext()
                        && (getCacheSizeInKilobytes() + nativeMemoryEntryContext.calculateSizeInKB() >= maxWeight)) {

                    String keyToRemove = lruIterator.next();
                    NativeMemoryAllocation allocationToRemove = cache.getIfPresent(keyToRemove);
                    if (allocationToRemove != null) {
                        allocationToRemove.close();
                        cache.invalidate(keyToRemove);
                    }
                    lruIterator.remove();
                }
            }

            result = cache.get(key, nativeMemoryEntryContext::load);
            // Take the read lock on the allocation while still inside the synchronized
            // block, so the entry cannot be evicted before the caller starts using it.
            if (acquirePreemptiveReadlock) {
                result.incRef();
                lockAcquired = true;
            }
            accessRecencyQueue.addLast(key);

            return result;
        }
    }
}
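
For completeness, the caller side would then look roughly like this (assuming incRef()/decRef() wrap the allocation's existing read lock; the parameter order matches the sample above):

NativeMemoryAllocation indexAllocation = nativeMemoryCacheManager.get(
        nativeMemoryEntryContext,
        isAbleToTriggerEviction,
        true // acquirePreemptiveReadlock
);
try {
    // The entry cannot be evicted or closed here: the read lock was acquired
    // inside get(), before any other thread could select it for eviction.
    // ... run the query against the native index ...
} finally {
    indexAllocation.decRef(); // release the preemptively acquired read lock
}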