run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Bug]: Weaviate MetadataFilters break on "numeric" strings #10252

Open KristianMischke opened 8 months ago

KristianMischke commented 8 months ago

Bug Description

After updating LlamaIndex from 0.8.50 to 0.9.35, my vector query broke. MetadataFilters whose values are numeric strings now raise an error complaining that valueNumber cannot be used on a text field and that valueText should be used instead.

Version

0.9.35

Steps to Reproduce

Use a numeric string (for example "1234") as the value of a metadata filter on a text field, and the error occurs.

This seems to be caused by https://github.com/run-llama/llama_index/blob/851399a303a47972fb62b9bb8880434842e23dc3/llama_index/vector_stores/weaviate.py#L78 but I'm not sure why that line was added in the first place.

import weaviate
from llama_index.schema import TextNode
from llama_index.vector_stores.weaviate import WeaviateVectorStore
from llama_index.vector_stores.types import MetadataFilters, MetadataFilter, VectorStoreQuery

client = weaviate.Client(url="http://localhost:8080")
vector_store = WeaviateVectorStore(weaviate_client=client, index_name="TestIndex", text_key="text")

vector_store.add(
    [
        TextNode(
            text="test 1",
            metadata={"article_id": "aaff"},
            embedding=[0, 0, 1],
        ),
        TextNode(
            text="test 2",
            metadata={"article_id": "1234"},
            embedding=[0, 1, 0],
        ),
        TextNode(
            text="test 3",
            metadata={"article_id": "3ff3"},
            embedding=[1, 0, 0],
        )
    ]
)

# -- working query
query = VectorStoreQuery(
    query_embedding=[0, 0, 0],
    similarity_top_k=2,
    filters=MetadataFilters(
        filters=[
            MetadataFilter(key="article_id", value="3ff3")  # NOTE: works fine because the value is not numeric
        ]
    )
)
results = vector_store.query(query)
print(results)

# -- below query breaks
query = VectorStoreQuery(
    query_embedding=[0, 0, 0],
    similarity_top_k=2,
    filters=MetadataFilters(
        filters=[
            MetadataFilter(key="article_id", value="1234")  # NOTE: breaks because the value looks like a number, but the metadata field is a text field
        ]
    )
)
results = vector_store.query(query)
print(results)

Relevant Logs/Tracebacks

results = vector_store.query(query)
  File "/usr/local/lib/python3.10/site-packages/llama_index/vector_stores/weaviate.py", line 338, in query
    parsed_result = parse_get_response(query_result)
  File "/usr/local/lib/python3.10/site-packages/llama_index/vector_stores/weaviate_utils.py", line 65, in parse_get_response
    raise ValueError("Invalid query, got errors: {}".format(response["errors"]))
ValueError: Invalid query, got errors: [{'locations': [{'column': 6, 'line': 1}], 'message': 'invalid \'where\' filter: child operand at position 0: data type filter cannot use "valueNumber" on type "text", use "valueText" instead', 'path': ['Get', 'User91609c28065a470e96c78465a7f40d6dReport1']}]
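
The error message is consistent with the filter value being coerced to a number before the Weaviate where operand is built. Below is a minimal, self-contained sketch of that suspected behaviour. It is not the library's actual code: build_operand_with_coercion is a hypothetical helper, and the use of str.isnumeric() is an assumption about how such a coercion might be written.

```python
from typing import Any, Dict, Union


def build_operand_with_coercion(key: str, value: Union[str, int, float]) -> Dict[str, Any]:
    """Hypothetical reconstruction of the suspected behaviour (not library code)."""
    value_key = "valueText"
    if isinstance(value, (int, float)):
        value_key = "valueNumber"
    elif isinstance(value, str) and value.isnumeric():
        # Assumed problem: a numeric-looking string is silently turned into a number.
        value = float(value)
        value_key = "valueNumber"
    return {"path": [key], "operator": "Equal", value_key: value}


print(build_operand_with_coercion("article_id", "3ff3"))  # keeps valueText
print(build_operand_with_coercion("article_id", "1234"))  # switches to valueNumber
```

Under these assumptions, "3ff3" is sent as valueText while "1234" is sent as valueNumber, which Weaviate rejects for a text property.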

dosubot[bot] commented 8 months ago

🤖

Hello again @KristianMischke! Nice to see you around. Hope your day's going well. Let's dive into this issue you're facing.

Thank you for providing a detailed description of the issue you're facing. It seems like the problem is related to how the WeaviateVectorStore class in LlamaIndex is handling metadata filters with numeric strings.

From the traceback you provided, Weaviate rejects the filter because it uses valueNumber against a text property. The most likely cause is that WeaviateVectorStore treats the filter value as a number because it consists of numeric characters, even though it was passed as a string.

The line of code you pointed out in the weaviate.py file is part of the _to_weaviate_filter function, which is used to transform standard filters to Weaviate specific filters. This function is then used in the query method of the WeaviateVectorStore class to apply these filters to the query.

As for why this line was added, without more context it's hard to say for sure, but it's possible that it was intended to handle cases where the metadata filter value is actually a number, and not a string containing numeric characters.

Unfortunately, I wasn't able to find any changes in the handling of MetadataFilters in the WeaviateVectorStore class between LlamaIndex versions 0.8.50 and 0.9.35, or any changes in the latest version 0.9.36, based on the provided context.

To resolve this issue, a possible solution could be to update the _to_weaviate_filter function to correctly identify the type of the metadata filter value and use the appropriate field (valueNumber for numbers and valueText for strings) based on its type.
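
A rough sketch of what type-based selection could look like, assuming the operand key is chosen from the Python type alone and strings are never coerced. build_operand_by_type is a hypothetical name, not an existing LlamaIndex function, and the int/float split shown here is an assumption.

```python
from typing import Any, Dict, Union


def build_operand_by_type(key: str, value: Union[str, int, float, bool]) -> Dict[str, Any]:
    """Hypothetical helper: choose the operand key from the Python type only."""
    # bool is checked first because bool is a subclass of int in Python.
    if isinstance(value, bool):
        value_key = "valueBoolean"
    elif isinstance(value, int):
        value_key = "valueInt"
    elif isinstance(value, float):
        value_key = "valueNumber"
    else:
        # Strings are passed through untouched, even if they look numeric.
        value_key = "valueText"
    return {"path": [key], "operator": "Equal", value_key: value}


print(build_operand_by_type("article_id", "1234"))
```

Note that the Python type alone cannot tell whether a Weaviate property is declared as int or number, so this would only be a partial fix.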

I hope this helps! If you have any more questions or need further clarification, please don't hesitate to ask.


nerdai commented 8 months ago

Thanks for raising @KristianMischke. It looks like you're right about where in the code this bug comes from. Just thinking of how best we can resolve this. We may need to expose another parameter for MetadataFilters that stipulates how to deal with strings that can be made numeric.

MetadataFilter(key="article_id", value="1234", value_type="str")

where value_type is an Enum.

@logan-markewich what do you think?

logan-markewich commented 8 months ago

@nerdai I wonder if there's a way to use Weaviate to check the type of a field, and coerce to the proper type when creating the Weaviate filters?

nerdai commented 8 months ago

Good point. I agree -- that should be what we consider first here.

nerdai commented 8 months ago

Assigned it P1 priority @logan-markewich. Please do change this if you feel the need to do so.

KristianMischke commented 8 months ago

Yeah, Weaviate schema properties have data types: https://weaviate.io/developers/weaviate/config-refs/datatypes that could be used for conversion.
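
As a sketch of that idea, assuming the class schema is available as a dict with a "properties" list whose entries carry "name" and "dataType": the filter value can be coerced to whatever the schema declares for that property. build_operand_from_schema and the lookup tables are hypothetical names, and the schema dict below is a hand-written stand-in rather than output from a real instance.

```python
from typing import Any, Dict

# Weaviate data type -> where-filter operand key.
_VALUE_KEY_BY_DATA_TYPE = {
    "text": "valueText",
    "int": "valueInt",
    "number": "valueNumber",
    "boolean": "valueBoolean",
    "date": "valueDate",
}

# Only these operand keys are cast; booleans and dates are passed through as-is.
_CASTS = {"valueText": str, "valueInt": int, "valueNumber": float}


def build_operand_from_schema(class_schema: Dict[str, Any], key: str, value: Any) -> Dict[str, Any]:
    """Hypothetical helper: coerce `value` to the data type declared for `key`."""
    data_types = {
        prop["name"]: prop["dataType"][0] for prop in class_schema.get("properties", [])
    }
    # Unknown properties or data types fall back to valueText in this sketch.
    value_key = _VALUE_KEY_BY_DATA_TYPE.get(data_types.get(key, "text"), "valueText")
    cast = _CASTS.get(value_key)
    if cast is not None:
        value = cast(value)
    return {"path": [key], "operator": "Equal", value_key: value}


# Stand-in schema; with the v3 Python client it might come from
# client.schema.get("TestIndex") (assumption, check your client version).
schema = {
    "class": "TestIndex",
    "properties": [
        {"name": "article_id", "dataType": ["text"]},
        {"name": "score", "dataType": ["number"]},
    ],
}
print(build_operand_from_schema(schema, "article_id", "1234"))
print(build_operand_from_schema(schema, "score", 3))
```

With this approach the caller does not need to know the declared type, at the cost of a schema lookup (which could be cached per index).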

udayvakalapudi commented 5 months ago

I am also facing the same issue with the llama-index-packs-raptor pack.

vector_store = WeaviateVectorStore(weaviate_client=vdb_client, index_name="RaptorIndex", text_key="text")

retriever = RaptorRetriever(
    [],
    embed_model=embed_model,  # used for embedding clusters
    llm=llm_model,  # used for generating summaries
    vector_store=vector_store,  # used for storage
    similarity_top_k=2,  # top k for each layer, or overall top-k for collapsed
    mode="tree_traversal",  # sets default mode
)

query_engine = RetrieverQueryEngine.from_args(retriever, llm=llm_model)

response = query_engine.query("What baselines was RAPTOR compared against?")

Error:

{'data': {'Get': {'RaptorIndex': None}}, 'errors': [{'locations': [{'column': 6, 'line': 1}], 'message': 'invalid \'where\' filter: data type filter cannot use "valueInt" on type "number", use "valueNumber" instead', 'path': ['Get', 'RaptorIndex']}]}

Any issue with the above code?

Packages used:

llama-index-vector-stores-weaviate = "^0.1.4"
llama-index-packs-raptor = "^0.1.3"
llama-index-llms-ollama = "^0.1.2"
llama-index-embeddings-ollama = "^0.1.2"
umap-learn = "^0.5.6"

akshayshende129 commented 4 months ago

I've implemented a workaround for this issue after running into a similar problem. It still needs more work, as it is only a temporary solution. The following files have been updated to reflect the changes:

from llama_index.core.vector_stores.simple import SimpleVectorStore
from llama_index.core.vector_stores.types import (
    ExactMatchFilter,
    FilterCondition,
    FilterOperator,
    MetadataFilter,
    MetadataFilters,
    MetadataInfo,
    VectorStoreQuery,
    VectorStoreQueryResult,
    VectorStoreInfo,
    ValueDataType
)

__all__ = [
    "VectorStoreQuery",
    "VectorStoreQueryResult",
    "MetadataFilters",
    "MetadataFilter",
    "MetadataInfo",
    "ExactMatchFilter",
    "FilterCondition",
    "FilterOperator",
    "SimpleVectorStore",
    "VectorStoreInfo",
    "ValueDataType"
]

class MetadataFilter(BaseModel):
    """Comprehensive metadata filter for vector stores to support more operators.

    Value uses Strict* types, as int, float and str are compatible types and were all
    converted to string before.

    See: https://docs.pydantic.dev/latest/usage/types/#strict-types
    """

    key: str
    value: Union[
        StrictInt,
        StrictFloat,
        StrictStr,
        List[Union[StrictInt, StrictFloat, StrictStr]],
    ]
    value_type: ValueDataType = ValueDataType.STRING
    operator: FilterOperator = FilterOperator.EQ

    @classmethod
    def from_dict(
        cls,
        filter_dict: Dict,
    ) -> "MetadataFilter":
        """Create MetadataFilter from dictionary.

        Args:
            filter_dict: Dict with key, value and operator.

        """
        return MetadataFilter.parse_obj(filter_dict)

Usage:

MetadataFilters(
    filters=[
        MetadataFilter(
            key="field_name",
            value=field_value,
            value_type=ValueDataType.STRING,
        )
        for field_value in fields
    ],
    condition=FilterCondition.AND,
)
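
The patch above references ValueDataType but does not show its definition. For readers following along, here is one way it might look, purely as a sketch: only the STRING member appears in the thread, so the other members, the enum values, the SketchFilter stand-in (a plain dataclass instead of the pydantic model) and the to_operand helper are all assumptions.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Dict, Union


class ValueDataType(str, Enum):
    """Hypothetical definition; the thread does not show the real one."""

    STRING = "valueText"
    INT = "valueInt"  # assumed member
    NUMBER = "valueNumber"  # assumed member


@dataclass
class SketchFilter:
    """Plain-dataclass stand-in for MetadataFilter, to keep the example dependency-free."""

    key: str
    value: Union[int, float, str]
    value_type: ValueDataType = ValueDataType.STRING


def to_operand(f: SketchFilter) -> Dict[str, Any]:
    # The declared value_type decides the operand key; the value itself is not inspected.
    return {"path": [f.key], "operator": "Equal", f.value_type.value: f.value}


print(to_operand(SketchFilter(key="article_id", value="1234")))
print(to_operand(SketchFilter(key="score", value=0.5, value_type=ValueDataType.NUMBER)))
```

Defaulting to STRING keeps numeric-looking strings as text unless the caller says otherwise, but it also means numeric fields need an explicit value_type.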


I hope this helps! Any suggestions or feedback are appreciated.