opensearch-project / k-NN

🆕 Find the k-nearest neighbors (k-NN) for your vector data
https://opensearch.org/docs/latest/search-plugins/knn/index/
Apache License 2.0

[BUG] Race condition with approximate query leading to Index has already been closed exception #2262

Open kotwanikunal opened 7 hours ago

kotwanikunal commented 7 hours ago

What is the bug?

There is a race condition in the current code path for the native memory cache manager. An entry from the NativeMemoryCacheManager can be evicted before the query reads it, resulting in an "Index has already been closed" exception.

The issue stems from the current code flow, which acquires the read lock only after the entry has been loaded into the cache. This gap between the load and the lock acquisition creates a window in which the cache can evict the entry before the read lock is taken.

Code references:

    • Load to cache: KNNWeight.java#L286-L302
    • Read lock: KNNWeight.java#L313
    • Exception: KNNWeight.java#L316-L318
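
To make the window concrete, here is a simplified sketch of the query-side flow described above (the get()/readLock()/isClosed() calls follow the referenced KNNWeight lines, but the exact signatures may differ slightly): if force eviction runs between the load and the lock, the allocation is already closed by the time the lock is finally taken.

// Query thread (KNNWeight):
NativeMemoryAllocation indexAllocation = nativeMemoryCacheManager.get(entryContext, true);

// <-- window: with force eviction enabled, another thread's get() can evict this
//     exact entry here and close the underlying native allocation.

indexAllocation.readLock();          // the lock is taken only after the load completes
try {
    if (indexAllocation.isClosed()) {
        // the failure path reported in this issue
        throw new RuntimeException("Index has already been closed");
    }
    // ... run the query against the native index ...
} finally {
    indexAllocation.readUnlock();
}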

How can one reproduce the bug?

The error is reproducible with force eviction turned on while running an OSB vectorsearch benchmark with 5+ clients.

What is the expected behavior?

What is your host/environment?

kotwanikunal commented 7 hours ago

Proposed Solutions

  1. Evictable Flag

    • Add an evictable flag to the allocation object
    • Acts as a lock, released only when the read lock is acquired
    • Prevents deletion and indicates if the object was used after loading
    • Requires warmup to perform a separate cleanup (potentially a new API)
  2. Synchronized Block

    • Move NativeMemoryCacheManager.get() behind a synchronized block
    • Pros: Simple implementation
    • Cons: Reduces concurrency and may create a bottleneck for loading
  3. Reference Counter

    • Implement a reference counter mechanism
    • Increment the counter when the object is created
    • Requires explicit decrement by the user of the allocation object
    • Similar to solution 4 without introducing new constructs (a rough sketch follows this list)
  4. Modified Get API

    • Update the get API with a new parameter
    • Preemptively acquire the read lock as soon as the load is complete
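
For option 3, a rough sketch of how a reference counter could sit on the allocation object (retain()/release() and the initial count of 1 are illustrative choices, not existing plugin APIs):

import java.util.concurrent.atomic.AtomicInteger;

public class RefCountedAllocation {
    // 1 = the reference held by the cache itself
    private final AtomicInteger refCount = new AtomicInteger(1);

    // Called by NativeMemoryCacheManager.get() before handing the allocation out.
    public void retain() {
        refCount.incrementAndGet();
    }

    // Called by the user of the allocation when the query finishes, and by the
    // cache when it evicts the entry; native memory is freed only at zero.
    public void release() {
        if (refCount.decrementAndGet() == 0) {
            freeNativeMemory();
        }
    }

    private void freeNativeMemory() {
        // existing close/free logic would go here
    }
}
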
kotwanikunal commented 7 hours ago

Sample solution:

public NativeMemoryAllocation get(
        NativeMemoryEntryContext<?> nativeMemoryEntryContext,
        boolean isAbleToTriggerEviction,
        boolean acquirePreemptiveReadlock
) {

    ...

    boolean lockAcquired = false;
    try (nativeMemoryEntryContext) {
        nativeMemoryEntryContext.ensureAvailabilityOfAllocationObject();
        synchronized (this) {
            // Evict least-recently-used entries until the new entry fits.
            if (getCacheSizeInKilobytes() + nativeMemoryEntryContext.calculateSizeInKB() >= maxWeight) {
                Iterator<String> lruIterator = accessRecencyQueue.iterator();
                while (lruIterator.hasNext()
                        && (getCacheSizeInKilobytes() + nativeMemoryEntryContext.calculateSizeInKB() >= maxWeight)) {

                    String keyToRemove = lruIterator.next();
                    NativeMemoryAllocation allocationToRemove = cache.getIfPresent(keyToRemove);
                    if (allocationToRemove != null) {
                        allocationToRemove.close();
                        cache.invalidate(keyToRemove);
                    }
                    lruIterator.remove();
                }
            }

            result = cache.get(key, nativeMemoryEntryContext::load);
            // Take the read lock on the allocation while still inside the synchronized
            // block, so the entry cannot be evicted before the caller starts using it.
            if (acquirePreemptiveReadlock) {
                result.incRef();
                lockAcquired = true;
            }
            accessRecencyQueue.addLast(key);

            return result;
        }
    }
}
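
For completeness, the caller side would then look roughly like this (assuming incRef()/decRef() wrap the allocation's existing read lock; the parameter order matches the sample above):

NativeMemoryAllocation indexAllocation = nativeMemoryCacheManager.get(
        nativeMemoryEntryContext,
        isAbleToTriggerEviction,
        true // acquirePreemptiveReadlock
);
try {
    // The entry cannot be evicted or closed here: the read lock was acquired
    // inside get(), before any other thread could select it for eviction.
    // ... run the query against the native index ...
} finally {
    indexAllocation.decRef(); // release the preemptively acquired read lock
}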