run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License
35.45k stars 5k forks source link

[Bug]: PGVectorStore incorrectly handles string metadata that resembles numbers #15962

Open nicolas-werner opened 6 days ago

nicolas-werner commented 6 days ago

Bug Description

The PGVectorStore is incorrectly handling string metadata that resembles numbers, causing a DataError: invalid input syntax for type double precision when filtering. This occurs presumably in the _build_filter_clause:458 method, which attempts to cast string values to floats if they can be interpreted as numbers.

try:
    return text(
        f"(metadata_->>'{filter_.key}')::float "
        f"{self._to_postgres_operator(filter_.operator)} "
        f"{float(filter_.value)}"
    )
except ValueError:
    # If not a number, then treat it as a string
    return text(
        f"metadata_->>'{filter_.key}' "
        f"{self._to_postgres_operator(filter_.operator)} "
        f"'{filter_.value}'"
    )

Workaround As a temporary workaround, you can add non-numeric text to the file names (e.g., "file_2024_123" instead of "2024_123"). This prevents the value from being interpreted as a number and avoids the casting issue.

Version

0.11.8

Steps to Reproduce

  1. Create a PGVectorStore instance
  2. Add documents with metadata where f.e 'file_name' is a string that looks like a number (e.g., "2024_123")
  3. Attempt to filter using this 'file_name'

Relevant Logs/Tracbacks

The filter operation fails with the error:
DataError: (psycopg2.errors.InvalidTextRepresentation) invalid input syntax for type double precision: "2024_123"
[SQL: SELECT public.data_contract_hybrid_vector_store.id, public.data_contract_hybrid_vector_store.node_id, public.data_contract_hybrid_vector_store.text, public.data_contract_hybrid_vector_store.metadata_, public.data_contract_hybrid_vector_store.embedding <=> %(embedding_1)s AS distance FROM public.data_contract_hybrid_vector_store WHERE (metadata_->>'file_name')::float = 20240717150625.0 ORDER BY distance asc LIMIT %(param_1)s]
dosubot[bot] commented 6 days ago

Hello @nicolas-werner! I'm here to help you with any bugs, questions, or contributions you may have. Let's work together to solve the issue you've encountered.

To fix the PGVectorStore to correctly handle string metadata that resembles numbers without causing a DataError: invalid input syntax for type double precision when filtering, you need to ensure that the metadata is treated as text in the SQL queries. This can be done by explicitly casting the metadata fields to text in the filter conditions.

Here is a modified version of the query method to handle this:

def query(self, query: VectorStoreQuery, **kwargs: Any) -> VectorStoreQueryResult:
    def cast_to_text_filter(filters):
        return {key: f"CAST({value} AS TEXT)" for key, value in filters.items()}

    results = self._client.search(
        embedding=query.query_embedding,
        top_k=query.similarity_top_k,
        filter=(
            meta_contains(
                cast_to_text_filter({pair.key: pair.value for pair in query.filters.legacy_filters()})
            )
            if query.filters is not None
            else None
        ),
    )

    nodes = [
        metadata_dict_to_node(record.meta, text=record.text)
        for record, _ in results
    ]

    return VectorStoreQueryResult(
        nodes=nodes,
        similarities=[score for _, score in results],
        ids=[str(record.id) for record, _ in results],
    )

In this code, the cast_to_text_filter function ensures that all metadata values are cast to text, preventing the DataError caused by invalid input syntax for type double precision [1].

To continue talking to Dosu, mention @dosu.

CharbelAD commented 2 days ago

Couldn't this issue be fixed by catching DataError in the except?