opensearch-project / k-NN

🆕 Find the k-nearest neighbors (k-NN) for your vector data
https://opensearch.org/docs/latest/search-plugins/knn/index/
Apache License 2.0
156 stars, 123 forks

[BUG] Knn Search Fails When Repeatedly Deleting and Inserting Vectors. #1080

Closed: danyilq closed this issue 3 months ago

danyilq commented 1 year ago

Describe the bug

When k-NN search queries are run repeatedly on an index while documents are being deleted and inserted, the search occasionally returns no hits.

To Reproduce

  1. Create a Knn index.
  2. Generate a vector to be used during tests.
  3. Add a document with the vector and refresh the index.
  4. Search for that vector and retrieve the document ID.
  5. Delete the document with the retrieved ID.
  6. Repeat steps 3-5 until the search returns no hits.

Expected behavior

The search query should consistently return hits as long as there are documents in the index.

Additional context

Python script that reproduces the issue:

import random

import requests

# Disable insecure request warning
requests.packages.urllib3.disable_warnings(requests.packages.urllib3.exceptions.InsecureRequestWarning)

# OpenSearch cluster configuration
opensearch_url = 'https://admin:admin@localhost:9200'
opensearch_index = ''.join(random.choice('abcdefghijklmnopqrstuvwxyz') for _ in range(10))
vec = [random.random() for _ in range(384)]

# Function to interact with OpenSearch
def opensearch_request(method, endpoint, data=None):
    url = f'{opensearch_url}/{opensearch_index}/{endpoint}'
    headers = {'Content-Type': 'application/json'}
    verify = False
    response = requests.request(method, url, json=data, headers=headers, verify=verify)
    return response

# Create the OpenSearch index

index_mapping = {...}  # Your index mapping here

print(opensearch_request('PUT', '', index_mapping).text)

# Main loop
for i in range(1000000):
    doc = {
        '__chunks': {
            '__field_name': f'field_{random.randint(1, 100)}',
            '__field_content': f'content_{random.randint(1, 100)}',
            '__vector_marqo_knn_field': vec
        }
    }

    opensearch_request('POST', '_doc', doc)
    opensearch_request('POST', '_refresh')
    knn_query = {
        "knn": {
            "__chunks.__vector_marqo_knn_field": {
                "vector": vec,
                "k": 100
            }
        }
    }

    full_knn_query = {
        "size": 100,
        "from": 0,
        "_source": {  # Exclude the vector field from the snippet
            "exclude": ["__chunks.__vector_marqo_knn_field"]
        },
        "query": {
            "nested": {
                "path": "__chunks",
                "inner_hits": {
                    "_source": {
                        "include": ["__chunks.__field_content", "__chunks.__field_name"]
                    }
                },
                "query": knn_query
            }
        }
    }

    search_results = opensearch_request('POST', '_search', full_knn_query).json()
    doc_ids = [hit['_id'] for hit in search_results.get('hits', {}).get('hits', [])]

    if doc_ids:
        print(f'Iteration {i}: {len(doc_ids)} results found. Deleting them.')
        opensearch_request('DELETE', f'_doc/{",".join(doc_ids)}')
    else:
        print(f'Iteration {i}: No results found.')
        break

to_delete_index = input("Delete the index? (y/n): ")
if to_delete_index.lower() == "y":
    opensearch_request('DELETE', '')

print("Script completed.")
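
Side note on the delete call in the script above: `DELETE _doc/<id>` addresses a single document ID, so joining several IDs with commas only behaves as intended when the search returns exactly one hit (which is the normal case here, since one document is added per iteration). If several IDs ever needed removing at once, the `_bulk` API would be one option. A minimal sketch, assuming the request is posted to `/<index>/_bulk` with `Content-Type: application/x-ndjson`; the helper name `bulk_delete_body` is hypothetical and not part of the original script:

```python
import json


def bulk_delete_body(doc_ids):
    """Build an NDJSON body with one delete action per document ID.

    Hypothetical helper for illustration. The index name is omitted from
    each action because the body is assumed to be sent to /<index>/_bulk.
    """
    if not doc_ids:
        return ""
    lines = [json.dumps({"delete": {"_id": doc_id}}) for doc_id in doc_ids]
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


print(bulk_delete_body(["a1", "b2"]), end="")
```

The resulting string would be sent as the raw request body (for example via the `data=` argument of `requests.request`) rather than as `json=`.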

danyilq commented 1 year ago

The iteration at which it fails is consistent regardless of which vector is used, but the dimensionality of the vector changes which iteration that is:

  - 384 dimensions: 145th iteration
  - 512 dimensions: 109th iteration
  - 121 dimensions: 69th iteration
  - 728 dimensions: 109th iteration
  - 1024 dimensions: 82nd iteration

Same results are produced with ViT-L/14 and hf/all_datasets_v4_MiniLM-L6.

pandu-k commented 1 year ago

Does force merging after each deletion help?

danyilq commented 1 year ago

Unfortunately, force merging didn't help.
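
For anyone retrying this, a force merge after each deletion could be issued roughly as follows. This is a minimal sketch, assuming the `_forcemerge` endpoint with its `only_expunge_deletes` parameter and the same base URL and index name as the reproduction script; the helper name `forcemerge_url` is hypothetical, and the thread does not state which parameters were actually used in the test:

```python
def forcemerge_url(base_url, index, only_expunge_deletes=True):
    """Build the URL for a force merge request on one index.

    Hypothetical helper for illustration. With only_expunge_deletes=True,
    the merge is limited to segments that contain deleted documents.
    """
    query = "?only_expunge_deletes=true" if only_expunge_deletes else ""
    return f"{base_url}/{index}/_forcemerge{query}"


print(forcemerge_url("https://localhost:9200", "my-index"))
```

In the reproduction loop this would be a `POST` to the returned URL right after the delete call, for example `requests.post(url, verify=False)`.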

dblock commented 1 year ago

Moved this to the k-nn repo.

danyilq commented 1 year ago

Index mapping that was used

{
    "settings": {
        "index": {
            "knn": True,
            "knn.algo_param.ef_search": 100,
            "refresh_interval": "1s",
            "store.hybrid.mmap.extensions": [
                "nvd", "dvd", "tim", "tip", "dim", "kdd", "kdi", "cfs", "doc", "vec", "vex"
            ]
        },
        "number_of_shards": 1,
        "number_of_replicas": 0
    },
    "mappings": {
        "_meta": {
            "media_type": "text",
            "index_settings": {
                "index_defaults": {
                    "treat_urls_and_pointers_as_images": False,
                    "model": "hf/all_datasets_v4_MiniLM-L6",
                    "normalize_embeddings": True,
                    "text_preprocessing": {
                        "split_length": 2,
                        "split_overlap": 0,
                        "split_method": "sentence"
                    },
                    "image_preprocessing": {
                        "patch_method": None
                    },
                    "ann_parameters": {
                        "name": "hnsw",
                        "space_type": "cosinesimil",
                        "engine": "lucene",
                        "parameters": {
                            "ef_construction": 128,
                            "m": 16
                        }
                    }
                },
                "number_of_shards": 1,
                "number_of_replicas": 0
            },
            "model": "hf/all_datasets_v4_MiniLM-L6"
        },
        "dynamic_templates": [
            {
                "strings": {
                    "match_mapping_type": "string",
                    "mapping": {
                        "type": "text"
                    }
                }
            }
        ],
        "properties": {
            "__chunks": {
                "type": "nested",
                "properties": {
                    "__field_name": {
                        "type": "keyword"
                    },
                    "__field_content": {
                        "type": "text"
                    },
                    "__vector_marqo_knn_field": {
                        "type": "knn_vector",
                        "dimension": 384,
                        "method": {
                            "name": "hnsw",
                            "space_type": "cosinesimil",
                            "engine": "lucene",
                            "parameters": {
                                "ef_construction": 128,
                                "m": 16
                            }
                        }
                    }
                }
            }
        }
    }
}

navneet1v commented 1 year ago

@danyilq can you also add details on the number of nodes and the RAM of each node, to help us better understand the issue?
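
One way to collect those details is from the cluster itself. A minimal sketch, assuming the `_cat/nodes` API with JSON output and the `name`, `node.role`, `heap.max`, and `ram.max` columns; both helper names are hypothetical, and the sample rows are made up for illustration and are not data from this issue:

```python
def cat_nodes_url(base_url):
    """Build a _cat/nodes URL that returns per-node memory columns as JSON."""
    columns = "name,node.role,heap.max,ram.max"
    return f"{base_url}/_cat/nodes?format=json&h={columns}"


def summarize_nodes(rows):
    """Summarize the parsed JSON list returned by the URL above."""
    lines = [f"nodes: {len(rows)}"]
    for row in rows:
        lines.append(
            f"{row['name']}: heap.max={row['heap.max']}, ram.max={row['ram.max']}"
        )
    return "\n".join(lines)


# Made-up sample rows for illustration only.
sample = [
    {"name": "node-1", "node.role": "dim", "heap.max": "1gb", "ram.max": "8gb"},
]
print(cat_nodes_url("https://localhost:9200"))
print(summarize_nodes(sample))
```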