run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Bug]: Edge case for Metadata Filter, using `document_id` as metadata key doesn't work #17025

Open AMasetti opened 1 day ago

AMasetti commented 1 day ago

Bug Description

Using document_id as a metadata key leads to unexpected behavior: the key appears to be reserved, and queries to the index return an empty list of nodes. This is not mentioned in the Postgres Vector Store documentation, which makes the problem difficult to debug.

Version

0.11.23

Steps to Reproduce

1. Create a Postgres index with metadata that includes the key document_id.
2. Attempt to query the index.
3. Observe that the query returns an empty list of nodes.
4. Change the metadata key from document_id to a different key (e.g., camelCase or snake_case).
5. Observe that the query now works as expected.

Expected Behavior

Using document_id as a metadata key should either:

- Work like any other metadata key, or
- Be explicitly documented as reserved, with a warning to the user to prevent its use.

Actual Behavior

Queries fail silently when document_id is used as a metadata key, returning an empty list of nodes.

Suggested Fix

Clearly document any reserved metadata keys in the Postgres Vector Store documentation. Optionally, add a warning or error when reserved keys are used.
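As a rough illustration of the optional warning, here is a minimal sketch of a pre-ingestion check. It is not part of LlamaIndex: the helper names and the `RESERVED_METADATA_KEYS` set are hypothetical, and the set only contains key names reported in this thread, not an official list.

```python
import warnings

# Assumption: key names reported in this thread, not an official list.
RESERVED_METADATA_KEYS = frozenset({"document_id", "doc_id", "response"})


def find_reserved_keys(metadata, reserved=RESERVED_METADATA_KEYS):
    """Return the sorted metadata keys that collide with `reserved`."""
    return sorted(set(metadata) & set(reserved))


def warn_on_reserved_keys(metadata, reserved=RESERVED_METADATA_KEYS):
    """Emit a UserWarning if any metadata key is in `reserved`.

    Returns the list of offending keys so callers can also raise if they prefer.
    """
    offenders = find_reserved_keys(metadata, reserved)
    if offenders:
        warnings.warn(
            f"Metadata keys {offenders} may be reserved and could break filtering",
            UserWarning,
            stacklevel=2,
        )
    return offenders
```

Calling `warn_on_reserved_keys(doc.metadata)` on each document before building the index would surface the problem at ingestion time instead of as an empty result at query time.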

Environment

Library version: 0.11.23
Database: Postgres
Language: Python

from llama_index.core import VectorStoreIndex, Document
from llama_index.core.vector_stores import MetadataFilters, ExactMatchFilter

# Example documents with metadata
documents = [
    Document(text="Example text 1", metadata={"document_id": "value1"}),
    Document(text="Example text 2", metadata={"key": "value2"}),
]

# Define filters to use only nodes with a specific metadata key
filters = MetadataFilters(filters=[ExactMatchFilter(key="document_id", value="value1")])

# Create index and query engine
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine(filters=filters)

# Run query and inspect the nodes that passed the metadata filter
response = query_engine.query("Your query here")
print(response.source_nodes)

Relevant Logs/Tracebacks

No response

dosubot[bot] commented 1 day ago

Hey @AMasetti! 👋 I'm here to help you with your bug and answer any questions you have. Let's squash this bug together!

The issue you're encountering with using document_id as a metadata key in the Postgres Vector Store of LlamaIndex seems to be related to how certain metadata keys are handled internally. While there isn't specific documentation or comments in the codebase that mention document_id as a reserved key, the library does have special handling for certain metadata keys, which might be causing this behavior [1][2][3].

The suggested fix to document any reserved metadata keys and potentially add warnings or errors when such keys are used is a good approach. This would help prevent silent failures and make debugging easier. In the meantime, using a different key name for your metadata should resolve the issue, as you've already observed.

If you need further assistance or a workaround, consider renaming the document_id key to something else that doesn't conflict with the library's internal handling.
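For the renaming workaround, a minimal sketch in plain Python might look like the following. The helper name and the replacement key `source_document_id` are hypothetical; any key that does not collide with the library's internal handling should do.

```python
def rename_metadata_key(metadata, old_key="document_id", new_key="source_document_id"):
    """Return a copy of `metadata` with `old_key` renamed to `new_key`.

    The input dict is left untouched. `new_key` is an illustrative name,
    not something LlamaIndex defines.
    """
    renamed = dict(metadata)
    if old_key not in renamed:
        return renamed
    if new_key in renamed:
        raise ValueError(f"Cannot rename {old_key!r}: {new_key!r} already exists")
    renamed[new_key] = renamed.pop(old_key)
    return renamed
```

The renamed dict would then be passed as `metadata` when constructing each `Document`, and the same new key used in `ExactMatchFilter(key=...)` so that the filter and the stored metadata agree.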

To continue talking to Dosu, mention @dosu.

AMasetti commented 1 day ago

@dosu Good catch. Reference [3] also points to other reserved keys that fail silently: doc_id and response.

dosubot[bot] commented 1 day ago

It's great that you found the reference to other reserved keys like doc_id and response. This highlights the importance of documenting these reserved keys to avoid silent failures. If you encounter any more issues or need further clarification, feel free to ask. Meanwhile, using alternative key names should help you avoid these conflicts. If there's anything else you need, just let me know!
