nmslib / hnswlib

Header-only C++/python library for fast approximate nearest neighbors
https://github.com/nmslib/hnswlib
Apache License 2.0

mark_delete leads to exception at query time #321

Open siminchengithub opened 3 years ago

siminchengithub commented 3 years ago

I have an index containing 8 million items. Retrieval works fine for 1 million queries with k=60. However, after I deleted 18k items using mark_deleted(), 8 queries raised the exception "Cannot return the results in a contigious 2D array. Probably ef or M is too small".

I think marking an item as deleted leads to unexpected bugs. Could you also explain how exactly a deletion happens? Is it guaranteed to return k results even when half of the index is marked deleted?

yurymalkov commented 3 years ago

Hi @siminchengithub Deletion happens by marking the elements as deleted and skipping them when forming the final output (though the deleted elements are still traversed during the search). It should work even if almost all elements are deleted. E.g. the following example works despite deleting almost all elements:

import hnswlib
import numpy as np

dim = 16
num_elements = 1000000

# Generating sample data
data = np.float32(np.random.random((num_elements, dim)))
data_labels = np.arange(num_elements)

# Declaring index
p = hnswlib.Index(space = 'l2', dim = dim) # possible options are l2, cosine or ip

# Initializing index - the maximum number of elements should be known beforehand
p.init_index(max_elements = num_elements, ef_construction = 100, M = 16)

# Element insertion (can be called several times):
p.add_items(data, data_labels)

# Controlling the recall by setting ef:
p.set_ef(100) # ef should always be > k

# Query dataset, k - number of closest elements (returns 2 numpy arrays)
labels, distances = p.knn_query(data[0], k = 989000)
print("closest before deletion (_, label, distance):")
for idx, l in enumerate(labels[0]):
    if idx<5:
        print(idx, l, distances[0][idx])
    p.mark_deleted(l)  # mark every returned element as deleted

labels, distances = p.knn_query(data[0:2], k = 10000)
print("closest after deletion  (_, label, distance):")
for idx, l in enumerate(labels[0]):
    if idx<5:
        print(idx, l, distances[0][idx])

I wonder if your data has duplicates? Can you please share the data?

siminchengithub commented 3 years ago

Hi Yury, thanks for the sample code. Will try to grab the data, it's quite big (10+gb). Could you tell me the implication of having duplicates? duplicates as in items having different index id but same vector, or having the same index id regardless of the vector?

siminchengithub commented 3 years ago


also a small correction. The process was: indexing of 8 million -> addition of 30k -> deletion of 18k -> querying

yurymalkov commented 3 years ago

duplicates as in items having different index id but same vector, or having the same index id regardless of the vector?

Duplicates are the same vector with different ids. They can sometimes make the graph go crazy (the graph might lose connectivity), though I am not sure whether it can affect deleted elements.
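A quick way to check whether a dataset contains such duplicates is to count exact duplicate rows with NumPy; `count_duplicate_vectors` is a hypothetical helper, not part of hnswlib:

```python
import numpy as np

def count_duplicate_vectors(data: np.ndarray) -> int:
    """Return how many rows are exact duplicates of another row."""
    unique_rows = np.unique(data, axis=0)
    return len(data) - len(unique_rows)

data = np.float32([[0.1, 0.2], [0.1, 0.2], [0.3, 0.4]])
print(count_duplicate_vectors(data))  # 1
```

Note this only catches bit-exact duplicates; near-duplicates (vectors differing by float noise) would need a tolerance-based check instead.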

siminchengithub commented 3 years ago

Duplicates are the same vector with different ids. They might make the graph go crazy sometime (graph might lose connectivity), though not sure if it can affect deleted elements.

ah check. This type of duplication can happen rather often though, especially in an advertisement use case where different items can have the same description. How would you suggest handling duplicates in this case?

yurymalkov commented 3 years ago

@siminchengithub Deduplication is supposed to happen outside the library, though it could have been implemented inside (the downside is some slowdown during construction and additional structures). If a user has duplicates, they can deduplicate, merge the labels of the duplicates (as the elements are exactly the same) and then use the library.
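The suggested pre-processing can be sketched with NumPy; `merged` here is an illustrative mapping (not an hnswlib feature) from each deduplicated index id back to all original labels that shared the vector:

```python
import numpy as np

data = np.float32([[0.1, 0.2], [0.3, 0.4], [0.1, 0.2], [0.5, 0.6]])
original_labels = np.array([10, 11, 12, 13])

# np.unique collapses duplicate rows; `inverse` maps each original row
# to the index of its unique representative.
unique_vecs, inverse = np.unique(data, axis=0, return_inverse=True)
inverse = np.asarray(inverse).ravel()

merged = {}  # deduplicated id -> all original labels sharing that vector
for label, idx in zip(original_labels, inverse):
    merged.setdefault(int(idx), []).append(int(label))

# unique_vecs can now be added to the index with ids 0..len(unique_vecs)-1,
# and merged[id] recovers every original label at query time.
print(merged)  # {0: [10, 12], 1: [11], 2: [13]}
```

At query time the index returns deduplicated ids, and `merged` expands each hit back into the full set of original item labels.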