nmslib / hnswlib

Header-only C++/python library for fast approximate nearest neighbors
https://github.com/nmslib/hnswlib
Apache License 2.0

RuntimeError: Cannot return the results in a contigious 2D array. Probably ef or M is too small #465

Open AmanKishore opened 1 year ago

AmanKishore commented 1 year ago

Getting this error with the following code:

vectordb = Chroma.from_documents(results, embeddings)
relevant_docs = vectordb.similarity_search(query=item.question, k=min(len(vectordb.get()["ids"]), num_search_results))

Any ideas how to fix?

yurymalkov commented 1 year ago

Hi @AmanKishore,

Can you provide more details on the dataset? Are you using filtering?

aramperes commented 1 year ago

I can reproduce this when filtering out enough items that k > filtered_element_count.

Say I have an index with 10 documents, but only 2 evaluate to true in _predicate(id):

# idx: ef_construction=200, M=64
docs, distance = idx.knn_query(vector, k=2, filter=_predicate, num_threads=1)
# works OK, returns 2 docs

# delete the first result
idx.mark_deleted(docs[0])

# do the same search (expect 1 result because of filtering)
docs, distance = idx.knn_query(vector, k=2, filter=_predicate, num_threads=1)
# RuntimeError: Cannot return the results in a contigious 2D array. Probably ef or M is too small

It would be nice for the knn_query() function contract to allow returning fewer than k items when the list of valid elements is exhausted.
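As a caller-side workaround, the "return fewer than k" behaviour can be approximated by retrying with a smaller k whenever this specific error is raised. This is only a sketch: knn_query_at_most_k is a hypothetical helper, not part of hnswlib, and it assumes the error means that fewer than k elements are left after filtering and deletion. It matches on the start of the error message shown above and re-raises anything else.

```python
def knn_query_at_most_k(index, vector, k, **query_kwargs):
    """Query `index` for up to `k` neighbours, shrinking k when it cannot be filled.

    Hypothetical helper (not part of hnswlib). `index` is any object with a
    knn_query(vector, k=..., **kwargs) method, such as an hnswlib index.
    """
    while k > 0:
        try:
            return index.knn_query(vector, k=k, **query_kwargs)
        except RuntimeError as err:
            # Assumption: this message means fewer than k valid elements exist.
            if "Cannot return the results" not in str(err):
                raise  # unrelated failure, do not hide it
            k -= 1  # retry with a smaller k
    # No valid element at all: return empty labels and distances.
    return [], []
```

With the index from the example above, the failing call would become knn_query_at_most_k(idx, vector, 2, filter=_predicate, num_threads=1). Each retry runs a full search, so this costs up to k queries in the worst case.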

yurymalkov commented 1 year ago

Hi @aramperes,

Thanks for the example. We will fix it soon for non-batched queries.

I wonder whether this is also a problem for batched queries. There, different queries could return different numbers of nearest neighbors, which would require a flag that switches the output to lists, pads the results, or returns the number of items found per query. I am not sure which would work best.
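To make the trade-off concrete, here is a small plain-Python illustration of those output shapes. It is not hnswlib code: pad_ragged and the -1 fill value are hypothetical choices made for this sketch. The input is a list of per-query label lists of different lengths (the "lists" option); the function produces the padded rectangular form together with the per-query counts.

```python
def pad_ragged(rows, k, fill=-1):
    """Turn per-query result lists of varying length into a k-wide table.

    Hypothetical helper for illustration only. Returns (padded, counts):
    `padded` has exactly k entries per query, with `fill` marking missing
    neighbours; `counts` holds how many real results each query produced.
    """
    padded = []
    counts = []
    for row in rows:
        found = list(row[:k])  # never keep more than k results
        counts.append(len(found))
        padded.append(found + [fill] * (k - len(found)))
    return padded, counts


# Two queries with k=2: the first found two neighbours, the second only one.
ragged = [[3, 7], [5]]
padded, counts = pad_ragged(ragged, k=2)
```

Padding keeps a rectangular result that fits a 2D array, but callers must know the sentinel or consult the counts. Lists need no sentinel but give up the array layout.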