Open kotwanikunal opened 1 week ago
Evictable Flag
Synchronized Block
Reference Counter
Modified Get API
Sample solution:
public NativeMemoryAllocation get(
    NativeMemoryEntryContext<?> nativeMemoryEntryContext,
    boolean isAbleToTriggerEviction,
    boolean acquirePreemptiveReadlock
) {
    ...
    boolean lockAcquired = false;
    try (nativeMemoryEntryContext) {
        nativeMemoryEntryContext.ensureAvailabilityOfAllocationObject();
        synchronized (this) {
            // Evict least recently used entries until the new entry fits under maxWeight.
            if (getCacheSizeInKilobytes() + nativeMemoryEntryContext.calculateSizeInKB() >= maxWeight) {
                Iterator<String> lruIterator = accessRecencyQueue.iterator();
                while (lruIterator.hasNext()
                    && (getCacheSizeInKilobytes() + nativeMemoryEntryContext.calculateSizeInKB() >= maxWeight)) {
                    String keyToRemove = lruIterator.next();
                    NativeMemoryAllocation allocationToRemove = cache.getIfPresent(keyToRemove);
                    if (allocationToRemove != null) {
                        allocationToRemove.close();
                        cache.invalidate(keyToRemove);
                    }
                    lruIterator.remove();
                }
            }
            result = cache.get(key, nativeMemoryEntryContext::load);
            // Take the read lock on the allocation as soon as it is loaded, while still inside
            // the synchronized block, so it cannot be evicted before we use it.
            if (acquirePreemptiveReadlock) {
                result.incRef();
                lockAcquired = true;
            }
            accessRecencyQueue.addLast(key);
            return result;
        }
    }
}
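With a preemptive read lock, the caller becomes responsible for releasing the reference once the query finishes, including when the query fails. The following is a minimal sketch of that caller-side pairing. `CountedAllocation`, `getPinned`, and `search` are hypothetical stand-ins written for this example; they are not the k-NN plugin classes, and the real release call may be named differently.

```java
import java.util.concurrent.atomic.AtomicInteger;

class CallerSketch {
    /** Hypothetical stand-in that exposes only reference counting. */
    static final class CountedAllocation {
        private final AtomicInteger refs = new AtomicInteger();
        void incRef() { refs.incrementAndGet(); }
        void decRef() { refs.decrementAndGet(); }
        int refCount() { return refs.get(); }
    }

    /** Hypothetical stand-in for get(..., acquirePreemptiveReadlock = true): returns an already pinned allocation. */
    static CountedAllocation getPinned() {
        CountedAllocation allocation = new CountedAllocation();
        allocation.incRef();
        return allocation;
    }

    /** Hypothetical query; fails on demand to exercise the release path. */
    static void search(CountedAllocation allocation, boolean fail) {
        if (fail) {
            throw new IllegalStateException("simulated query failure");
        }
    }

    /** Runs one query and returns the reference count left behind. */
    static int refsAfterQuery(boolean fail) {
        CountedAllocation allocation = getPinned();
        try {
            search(allocation, fail);
        } catch (IllegalStateException e) {
            // Swallowed only so the sketch can report the count afterwards.
        } finally {
            // Always release the pin taken inside get(), or the entry can never be evicted.
            allocation.decRef();
        }
        return allocation.refCount();
    }

    public static void main(String[] args) {
        System.out.println(refsAfterQuery(false));
        System.out.println(refsAfterQuery(true));
    }
}
```

Releasing in `finally` keeps a failed query from leaving the allocation pinned.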
What is the bug?
There is a race condition in the current code path of the native memory cache manager. An entry in the NativeMemoryCacheManager can be evicted before the query reads it, resulting in an "Index has already been closed" exception.
The issue stems from the current code flow, which acquires the read lock only after the entry has been loaded into the cache. Because loading and locking are separate steps, there is a window in which the cache can evict the entry before the read lock is acquired.
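The window can be illustrated without threads by replaying the problematic interleaving step by step. The sketch below is not the plugin code: `PinnedAllocation` and `TinyCache` are hypothetical stand-ins, and deferring the close until the last reference is released is an assumption made for this example rather than the documented behavior of `NativeMemoryAllocation`.

```java
import java.util.HashMap;
import java.util.Map;

class RaceSketch {
    /** Hypothetical allocation with a reference count; closing is deferred while pinned. */
    static final class PinnedAllocation {
        private int refCount = 0;
        private boolean closed = false;
        private boolean evictionRequested = false;

        synchronized void incRef() {
            if (closed) {
                throw new IllegalStateException("Index has already been closed");
            }
            refCount++;
        }

        synchronized void decRef() {
            refCount--;
            if (refCount == 0 && evictionRequested) {
                closed = true;
            }
        }

        /** Called by eviction; closes immediately only if nobody holds a reference. */
        synchronized void requestClose() {
            evictionRequested = true;
            if (refCount == 0) {
                closed = true;
            }
        }

        synchronized boolean isClosed() {
            return closed;
        }
    }

    /** Hypothetical cache in which eviction and loading share one monitor. */
    static final class TinyCache {
        private final Map<String, PinnedAllocation> entries = new HashMap<>();

        /** Racy shape: load only, the caller pins later. */
        synchronized PinnedAllocation load(String key) {
            return entries.computeIfAbsent(key, k -> new PinnedAllocation());
        }

        /** Load and pin in the same critical section that eviction uses. */
        synchronized PinnedAllocation loadAndPin(String key) {
            PinnedAllocation allocation = entries.computeIfAbsent(key, k -> new PinnedAllocation());
            allocation.incRef();
            return allocation;
        }

        synchronized void evict(String key) {
            PinnedAllocation allocation = entries.remove(key);
            if (allocation != null) {
                allocation.requestClose();
            }
        }
    }

    /** Replays load, evict, lock: returns true if the late lock attempt fails. */
    static boolean racyReadFails() {
        TinyCache cache = new TinyCache();
        PinnedAllocation allocation = cache.load("segment-0");
        cache.evict("segment-0"); // stands in for another thread evicting inside the window
        try {
            allocation.incRef();
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    /** Replays load-and-pin, evict: returns true if the entry stays usable until released. */
    static boolean pinnedReadSurvives() {
        TinyCache cache = new TinyCache();
        PinnedAllocation allocation = cache.loadAndPin("segment-0");
        cache.evict("segment-0");
        boolean usableWhilePinned = !allocation.isClosed();
        allocation.decRef();
        return usableWhilePinned && allocation.isClosed();
    }

    public static void main(String[] args) {
        System.out.println("racy read fails: " + racyReadFails());
        System.out.println("pinned read survives: " + pinnedReadSurvives());
    }
}
```

In the first replay the lock attempt arrives after eviction and fails with the same message as the reported exception; in the second, the reference is taken inside the critical section shared with eviction, so the window no longer exists.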
Code References
Load to cache: KNNWeight.java#L286-L302
Read Lock: KNNWeight.java#L313
Exception: KNNWeight.java#L316-L318
How can one reproduce the bug?
The error is reproducible by turning on force eviction and running an OSB vectorsearch benchmark with 5+ clients.
What is the expected behavior?
What is your host/environment?