unum-cloud / usearch

Fast Open-Source Search & Clustering engine × for Vectors & 🔜 Strings × in C++, C, Python, JavaScript, Rust, Java, Objective-C, Swift, C#, GoLang, and Wolfram 🔍
https://unum-cloud.github.io/usearch/
Apache License 2.0

Low index performance after `clear()` #417

Open mz1979 opened 6 months ago

mz1979 commented 6 months ago

Describe the bug

Inserting vectors is extremely slow when using non-contiguous keys (Python SDK).

Steps to reproduce

The following script times index insertion first with contiguous keys, then with non-contiguous keys:

from usearch.index import Index
from random import random
import numpy as np

vectors = np.random.rand(600000, 256)
keys = np.arange(len(vectors))
offset = 1_000_000

keys_non_contiguous = []

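# Build one block of 50,000 keys per iteration (12 blocks in total).
# Keys are contiguous within a block, but each block gets a random batch ID
# in the bits above bit 32 and a random file index, so blocks are scattered.
for _ in range(0, len(vectors), 50000):
    fileIndex = int(random()*10)
    batch = int(random()*256)
    batchIndex = int('0b' + bin(batch).removeprefix('0b').zfill(8) + '0'*32, 2)
    keys_non_contiguous.extend([batchIndex + fileIndex * offset + u for u in range(50000)])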

keys_non_contiguous = np.array(keys_non_contiguous)

index = Index(
    ndim=256, # Define the number of dimensions in input vectors
    metric='cos', # Choose 'l2sq', 'haversine' or other metric, default = 'ip'
    dtype='f32', # Quantize to 'f16' or 'i8' if needed, default = 'f32'
    connectivity=16, # How frequent should the connections in the graph be, optional
    expansion_add=128, # Control the recall of indexing, optional
    expansion_search=64 # Control the quality of search, optional
  )

# This takes about 20 sec on a 32 vCPU machine
index.add(keys, vectors, log=True, copy=False)

index.clear()

# This takes about 1min15sec on a 32 vCPU machine
index.add(keys_non_contiguous, vectors, log=True, copy=False)
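
To make the key layout in the script easier to see, here is a minimal sketch of the same construction written with a bit shift. The helper names `block_keys`, `string_batch_index`, and `has_duplicates` are hypothetical and not part of USearch or the original report. The duplicate check is included because two blocks that happen to draw the same batch ID and file index would produce identical keys.

```python
def block_keys(batch, file_index, count=50_000, offset=1_000_000):
    """Keys for one block: batch ID shifted left by 32 bits, plus file offset, plus position."""
    base = (batch << 32) + file_index * offset
    return [base + i for i in range(count)]


def string_batch_index(batch):
    """The string-based construction from the script above, kept for comparison."""
    return int('0b' + bin(batch).removeprefix('0b').zfill(8) + '0' * 32, 2)


def has_duplicates(keys):
    """True if any key occurs more than once."""
    return len(set(keys)) != len(keys)


# The string construction and the shift agree for every 8-bit batch ID.
assert all(string_batch_index(b) == b << 32 for b in range(256))

# A tiny block, to show how far the keys sit from zero.
print(block_keys(batch=3, file_index=7, count=4))
```

Running `has_duplicates(keys_non_contiguous.tolist())` before the second `add()` call would rule out repeated keys as a factor in the timing.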

Expected behavior

Insertion performance should be the same for contiguous and non-contiguous keys.

USearch version

Build from source branch main-dev

Operating System

Ubuntu 24.04 LTS

Hardware architecture

x86

Which interface are you using?

Python bindings

Contact Details

No response

ashvardanian commented 6 months ago

The problem is in `clear()`! If you reinitialize the index variable with a new constructor instead of calling `clear()`, it works just as fast. Neat finding! Will investigate.
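
A minimal sketch of that reinitialization, assuming the same constructor arguments as in the report. `make_index` is a hypothetical helper, not a USearch function; the optional `index_cls` parameter exists only so the helper can be tried with a stand-in class when USearch is not installed.

```python
def make_index(index_cls=None):
    """Return a brand-new index configured like the one in the report."""
    if index_cls is None:
        # Imported lazily; by default this is the real USearch Index class.
        from usearch.index import Index as index_cls
    return index_cls(
        ndim=256,
        metric='cos',
        dtype='f32',
        connectivity=16,
        expansion_add=128,
        expansion_search=64,
    )


# Intended use with the arrays from the report (not executed here):
#   index = make_index()
#   index.add(keys, vectors, log=True, copy=False)
#   index = make_index()          # replaces index.clear()
#   index.add(keys_non_contiguous, vectors, log=True, copy=False)
```

Rebinding `index` drops the old object, so the second `add()` starts from an index that has never held data.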

mz1979 commented 6 months ago

I get the same performance issue if I skip `clear()` and instead redefine my index variable (screenshot attached).
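
Since the two comments above disagree, one way to separate the two suspects (key layout versus index reuse) is to time every combination. The following is only a sketch: `timed_add` and `compare` are hypothetical helpers, and they assume nothing about the index beyond the `add(keys, vectors)` and `clear()` methods already used in the report. Unlike the original script, the reused index is filled and refilled with the same key set.

```python
import time


def timed_add(index, keys, vectors):
    """Time one batch add() call and return the elapsed seconds."""
    start = time.perf_counter()
    index.add(keys, vectors)
    return time.perf_counter() - start


def compare(make_index, key_sets, vectors):
    """Time add() per key set on a fresh index and on a cleared, reused one.

    make_index: zero-argument callable returning a new index.
    key_sets:   dict mapping a label to an array of keys.
    """
    results = {}
    for label, keys in key_sets.items():
        # Case 1: a brand-new index that has never held data.
        results[(label, "fresh")] = timed_add(make_index(), keys, vectors)

        # Case 2: an index that was filled once, cleared, then refilled.
        reused = make_index()
        reused.add(keys, vectors)
        reused.clear()
        results[(label, "cleared")] = timed_add(reused, keys, vectors)
    return results
```

With the arrays from the report, a call could look like `compare(lambda: Index(ndim=256, metric='cos'), {"contiguous": keys, "non-contiguous": keys_non_contiguous}, vectors)`. If only the "cleared" rows are slow, `clear()` is the likely cause; if the "non-contiguous" rows are slow even on a fresh index, the key layout matters on its own.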