In Azure vector store, metadata is kept as a string and can't be used in a filter

langchain-ai / langchain

In Azure vector store, metadata is kept as a string and can't be used in a filter #9261

Open eburnette opened 1 year ago

eburnette commented 1 year ago

System Info

Langchain version 0.0.265 Python 3.11.4

Who can help?

@hwchase17

Information

[ ] The official example notebooks/scripts
[X] My own modified scripts

Related Components

[ ] LLMs/Chat Models
[ ] Embedding Models
[ ] Prompts / Prompt Templates / Prompt Selectors
[ ] Output Parsers
[ ] Document Loaders
[X] Vector Stores / Retrievers
[ ] Memory
[ ] Agents / Agent Executors
[ ] Tools / Toolkits
[ ] Chains
[ ] Callbacks/Tracing
[ ] Async

Reproduction

from dotenv import load_dotenv
from langchain.vectorstores.azuresearch import AzureSearch
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.docstore.document import Document

import os

load_dotenv()
index_name = 'ebb-test-1'
vector_store_address = os.environ.get('AZURE_VECTOR_STORE_ADDRESS')
vector_store_password = os.environ.get('AZURE_VECTOR_STORE_PASSWORD')

embeddings: OpenAIEmbeddings = OpenAIEmbeddings(model='text-embedding-ada-002', chunk_size=1,
                                                deployment=os.environ.get('AZURE_VECTOR_STORE_DEPLOYMENT'))  

vector_store: AzureSearch = AzureSearch(azure_search_endpoint=vector_store_address,  
                                    azure_search_key=vector_store_password,  
                                    index_name=index_name,  
                                    embedding_function=embeddings.embed_query)  
texts = [
    'Tulips are pretty',
    'Roses have thorns'
]
metas = [
    {'name': 'tulip',
     'nested': {'color': 'purple'}},
    {'name': 'rose',
     'nested': {'color': 'red'}}
]
docs = [Document(page_content=text, metadata=meta) for text, meta in zip(texts, metas)]
vector_store.add_documents(docs)

try:
    # Prints Message: Invalid expression: 'metadata' is not a filterable field. Only filterable fields can be used in filter expressions.
    result = vector_store.vector_search_with_score(
        'things that have thorns', k=3,
        filters="metadata eq 'invalid'")
    print(result)
except Exception as e:
    print(e)
    print('Should give an error about not being able to convert the string')

try:
    # Prints Message: Invalid expression: Could not find a property named 'name' on type 'Edm.String'.
    result = vector_store.vector_search_with_score(
        'things that have thorns', k=3,
        filters="metadata/name eq 'tulip'")
    print(result)
except Exception as e:
    print(e)
    print('Should just return the tulip (even though it has no thorns)')

try:
    # Prints Message: Invalid expression: Could not find a property named 'nested' on type 'Edm.String'.
    result = vector_store.vector_search_with_score(
        'things that have thorns', k=3,
        filters="metadata/nested/color eq 'red'")
    print(result)
except Exception as e:
    print(e)
    print('Should just return the rose')

Expected behavior

The first block should not give an error about metadata being filterable. Instead, it should give some error about the expression being invalid.

The second block should just return the tulip document, and the third one should just return the rose document.

dosubot[bot] commented 1 year ago

🤖

Hello,

The error you're encountering when trying to use the metadata field in a filter within the Azure vector store in LangChain version 0.0.265 is likely due to the fact that the metadata field is stored as a string in the Azure Search index. This is evident from the following line of code in the add_texts method:

FIELDS_METADATA: json.dumps(metadata),

Here, the metadata is being converted to a JSON string before being stored in the Azure Search index. This means that you cannot directly filter on the metadata field as if it were a dictionary or other complex object.

To correctly filter by metadata to return specific documents, you would need to parse the metadata field back into a dictionary in your application code after retrieving the documents from the Azure Search index. However, this would not allow you to filter on the metadata field within the Azure Search query itself.
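A minimal sketch of that client-side approach, assuming results shaped like the (document, score) pairs that vector_search_with_score returns. The helpers get_path and filter_by_metadata and the dotted-path convention are hypothetical, not part of LangChain, and the Doc class is only a stand-in for langchain's Document so the example runs on its own. The scores are made-up illustration values.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class Doc:
    """Stand-in for langchain's Document (page_content plus metadata)."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


_MISSING = object()  # sentinel so that a stored None is not confused with "absent"


def get_path(metadata: Dict[str, Any], path: str) -> Any:
    """Follow a dotted path such as 'nested.color'; return _MISSING if absent."""
    current: Any = metadata
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return _MISSING
        current = current[part]
    return current


def filter_by_metadata(
    results: List[Tuple[Doc, float]], path: str, expected: Any
) -> List[Tuple[Doc, float]]:
    """Keep only the (doc, score) pairs whose metadata value at `path` equals `expected`."""
    return [
        (doc, score)
        for doc, score in results
        if get_path(doc.metadata, path) == expected
    ]


# Illustrative results, mirroring the documents from the reproduction.
results = [
    (Doc("Tulips are pretty", {"name": "tulip", "nested": {"color": "purple"}}), 0.81),
    (Doc("Roses have thorns", {"name": "rose", "nested": {"color": "red"}}), 0.93),
]
roses = filter_by_metadata(results, "nested.color", "red")
```

Because the filter runs after retrieval, it can return fewer than k documents, so you may need to request a larger k than you ultimately want.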

If you need to filter on specific metadata fields within the Azure Search query, you would need to store these fields as separate fields in the Azure Search index. This could be done by modifying the add_texts method to add additional fields to the document based on the metadata, as shown in the following code:

doc = {
    "@search.action": "upload",
    FIELDS_ID: key,
    FIELDS_CONTENT: text,
    FIELDS_CONTENT_VECTOR: np.array(
        self.embedding_function(text), dtype=np.float32
    ).tolist(),
    FIELDS_METADATA: json.dumps(metadata),
}
if metadata:
    additional_fields = {
        k: v
        for k, v in metadata.items()
        if k in [x.name for x in self.fields]
    }
    doc.update(additional_fields)

In this code, each key-value pair in the metadata dictionary is added as a separate field in the document if the key is also a field in the Azure Search index. You would need to ensure that these additional fields are included in the Azure Search index configuration.
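The copying step can be tried in isolation with a small sketch. The function build_index_document is hypothetical, the field names "id", "content" and "metadata" are assumed defaults, and the embedding vector is left out to keep the example self-contained.

```python
import json
from typing import Any, Dict, Iterable


def build_index_document(
    key: str,
    text: str,
    metadata: Dict[str, Any],
    index_field_names: Iterable[str],
) -> Dict[str, Any]:
    """Build an upload document, promoting metadata keys that match index fields."""
    names = set(index_field_names)
    doc: Dict[str, Any] = {
        "id": key,
        "content": text,
        # The full metadata is still stored as a JSON string.
        "metadata": json.dumps(metadata),
    }
    # Only keys that exist as fields in the index are copied to the top level.
    doc.update({k: v for k, v in metadata.items() if k in names})
    return doc


meta = {"name": "tulip", "nested": {"color": "purple"}}
doc = build_index_document(
    "doc-1", "Tulips are pretty", meta, ["id", "content", "metadata", "name"]
)
```

Here "name" is promoted to a top-level field, so a filter such as name eq 'tulip' becomes possible once that field is marked filterable in the index, while "nested" stays only inside the JSON string.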

I hope this helps! Let me know if you have any other questions.

Sources

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

eburnette commented 1 year ago

@dosu-bot I could modify add_texts, but I think it would be cleaner to make the metadata field a complex type instead of a string and mark it as filterable, so then I could use it in filter expressions.

dosubot[bot] commented 1 year ago

🤖

Hello,

Thank you for reaching out and providing a detailed explanation of your issue. I understand that you're trying to filter the metadata in the Azure vector store, but you're encountering issues because the metadata is stored as a string.

In the current implementation of the LangChain framework, the metadata is indeed stored as a string in the Azure vector store. This is done in the add_texts method of the AzureSearch class, where the metadata is first converted to a JSON string using json.dumps(metadata), and then stored in the Azure vector store under the field specified by FIELDS_METADATA.

To make the metadata field a complex type and mark it as filterable, you would need to change both the index field definition and the add_texts method in the AzureSearch class. Please note that in Azure Search the filterable attribute applies to sub-fields of simple types (Edm.String, Edm.Boolean, Edm.Double, Edm.Int32, Edm.Int64, and Edm.DateTimeOffset), not to the complex field as a whole. A complex type also has a fixed schema, so every metadata key you want to filter on must be declared as a sub-field in the index in advance and is then referenced by path, for example metadata/name eq 'tulip'. Arbitrary metadata dictionaries do not fit that model directly.

One possible workaround could be to flatten the metadata dictionary and store each key-value pair as a separate field in the document. This way, you can filter on these fields. Here's an example of how you could modify the add_texts method to achieve this:

def add_texts(
        self,
        texts: Iterable[str],
        metadatas: Optional[List[dict]] = None,
        **kwargs: Any,
    ) -> List[str]:
        """Add texts data to an existing index."""
        keys = kwargs.get("keys")
        ids = []
        # Write data to index
        data = []
        for i, text in enumerate(texts):
            # Use provided key otherwise use default key
            key = keys[i] if keys else str(uuid.uuid4())
            # Encoding key for Azure Search valid characters
            key = base64.urlsafe_b64encode(bytes(key, "utf-8")).decode("ascii")
            metadata = metadatas[i] if metadatas else {}
            # Flatten the metadata dictionary
            flat_metadata = flatten_dict(metadata)
            # Add data to index
            doc = {
                "@search.action": "upload",
                FIELDS_ID: key,
                FIELDS_CONTENT: text,
                FIELDS_CONTENT_VECTOR: np.array(
                    self.embedding_function(text), dtype=np.float32
                ).tolist(),
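                # Keep the JSON copy too: the search methods read FIELDS_METADATA back with json.loads
                FIELDS_METADATA: json.dumps(metadata),
                **flat_metadata,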
            }
            data.append(doc)
            ids.append(key)
            # Upload data in batches
            if len(data) == MAX_UPLOAD_BATCH_SIZE:
                response = self.client.upload_documents(documents=data)
                # Check if all documents were successfully uploaded
                if not all([r.succeeded for r in response]):
                    raise Exception(response)
                # Reset data
                data = []
        # Considering case where data is an exact multiple of batch-size entries
        if len(data) == 0:
            return ids
        # Upload data to index
        response = self.client.upload_documents(documents=data)
        # Check if all documents were successfully uploaded
        if all([r.succeeded for r in response]):
            return ids
        else:
            raise Exception(response)

In this code, flatten_dict is a function that takes a nested dictionary and returns a flat dictionary with compound keys. You would need to implement this function according to your needs.
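One possible flatten_dict, as a minimal sketch rather than anything shipped with LangChain. The underscore separator is an assumption: it is chosen because "/" already means a sub-field path in OData filter expressions, and the resulting compound names would still have to be declared as fields in the index.

```python
from typing import Any, Dict


def flatten_dict(
    nested: Dict[str, Any], parent_key: str = "", sep: str = "_"
) -> Dict[str, Any]:
    """Flatten nested dictionaries into one level with compound keys."""
    flat: Dict[str, Any] = {}
    for key, value in nested.items():
        compound = f"{parent_key}{sep}{key}" if parent_key else str(key)
        if isinstance(value, dict):
            # Recurse into nested dictionaries, carrying the compound prefix along.
            flat.update(flatten_dict(value, compound, sep))
        else:
            flat[compound] = value
    return flat


flat = flatten_dict({"name": "rose", "nested": {"color": "red"}})
```

With the metadata from the reproduction this yields the keys name and nested_color, so the third query would become nested_color eq 'red'. Keys that collide after flattening (for example a top-level nested_color next to nested/color) overwrite each other, which this sketch does not guard against.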

I hope this helps! If you have any further questions, please don't hesitate to ask.

Best, Dosu

eburnette commented 1 year ago

@dosu-bot According to https://learn.microsoft.com/en-us/azure/search/query-odata-filter-orderby-syntax, you would use a lambda expression with a range variable, like this:

metadata/any(m: m/name eq 'rose')

In this example, the range variable m appears in the m/name field path. That way, m/name refers to the name of the current metadata in the current document. This is a single instance of the name subfield, so it can be used directly in the filter.

I have updated the original example to reflect the range syntax here:

from dotenv import load_dotenv
from langchain.vectorstores.azuresearch import AzureSearch
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.docstore.document import Document

import os

load_dotenv()
index_name = 'ebb-test-1'
vector_store_address = os.environ.get('AZURE_VECTOR_STORE_ADDRESS')
vector_store_password = os.environ.get('AZURE_VECTOR_STORE_PASSWORD')

embeddings: OpenAIEmbeddings = OpenAIEmbeddings(model='text-embedding-ada-002', chunk_size=1,
                                                deployment=os.environ.get('AZURE_VECTOR_STORE_DEPLOYMENT'))  

vector_store: AzureSearch = AzureSearch(azure_search_endpoint=vector_store_address,  
                                    azure_search_key=vector_store_password,  
                                    index_name=index_name,  
                                    embedding_function=embeddings.embed_query)  
texts = [
    'Tulips are pretty',
    'Roses have thorns'
]
metas = [
    {'name': 'tulip',
     'nested': {'color': 'purple'}},
    {'name': 'rose',
     'nested': {'color': 'red'}}
]
docs = [Document(page_content=text, metadata=meta) for text, meta in zip(texts, metas)]
vector_store.add_documents(docs)

try:
    # Prints Message: Invalid expression: 'metadata' is not a filterable field. Only filterable fields can be used in filter expressions.
    result = vector_store.vector_search_with_score(
        'things that have thorns', k=3,
        filters="metadata eq 'invalid'")
    print(result)
except Exception as e:
    print(e)
    print('Should give an error about not being able to convert the string')

try:
    # Prints Message: Invalid expression: Any/All may only be used following a collection.
    result = vector_store.vector_search_with_score(
        'things that have thorns', k=3,
        filters="metadata/any(m: m/name eq 'rose')")
    print(result)
except Exception as e:
    print(e)
    print('Should just return the rose')

try:
    # Prints Message: Invalid expression: Any/All may only be used following a collection.
    result = vector_store.vector_search_with_score(
        'things that have thorns', k=3,
        filters="metadata/any(m: m/nested/color eq 'red')")
    print(result)
except Exception as e:
    print(e)
    print('Should just return the rose')

dosubot[bot] commented 1 year ago

🤖

Hello,

The error message "Invalid expression: Any/All may only be used following a collection" indicates that the filtering operation is being applied to a non-collection data type. In the context of Azure Search, this usually means that the filtering operation is being applied to a field that is not a collection.

In the provided code, the metadata is stored as a string:

SearchableField(
    name=FIELDS_METADATA,
    type=SearchFieldDataType.String,
)

And it is later loaded as a dictionary:

metadata=json.loads(result[FIELDS_METADATA]),

If you want to filter on the metadata, you need to store it as a collection (i.e., an array of strings) instead of a string. This can be done by changing the field type to Collection(SearchFieldDataType.String) and storing each metadata item as a separate string in the array.

Here is how you can modify the field definition:

SearchField(
    name=FIELDS_METADATA,
    type=SearchFieldDataType.Collection(SearchFieldDataType.String),
    searchable=True,
    filterable=True,  # required before the field can appear in a filter expression
)

And here is how you can modify the document upload:

doc = {
    "@search.action": "upload",
    FIELDS_ID: key,
    FIELDS_CONTENT: text,
    FIELDS_CONTENT_VECTOR: np.array(
        self.embedding_function(text), dtype=np.float32
    ).tolist(),
    FIELDS_METADATA: [f"{k}:{v}" for k, v in metadata.items()],
}

This will store each metadata item as a separate string in the format "key:value". You can then filter on these strings using lambda expressions.
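As a rough sketch of how such a filter could be built, assuming the collection field is named "metadata" and is marked filterable. The helpers metadata_to_tags, odata_quote and tag_filter are hypothetical names introduced here for illustration.

```python
from typing import Any, Dict, List


def metadata_to_tags(metadata: Dict[str, Any]) -> List[str]:
    """Render each metadata item as a 'key:value' string, as in the upload code."""
    return [f"{k}:{v}" for k, v in metadata.items()]


def odata_quote(value: str) -> str:
    """Quote an OData string literal; embedded single quotes are doubled."""
    return "'" + value.replace("'", "''") + "'"


def tag_filter(field_name: str, key: str, value: Any) -> str:
    """Build an any() lambda that matches one 'key:value' entry in a string collection."""
    tag = f"{key}:{value}"
    return f"{field_name}/any(t: t eq {odata_quote(tag)})"


tags = metadata_to_tags({"name": "rose"})
rose_filter = tag_filter("metadata", "name", "rose")
```

The resulting string could be passed as the filters argument of vector_search_with_score. Nested dictionaries are stringified wholesale by this format (str() of the inner dict), so they would need to be flattened first to be matched reliably.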

Please note that this change will require you to re-index your data, as the field type change is not backward compatible.

I hope this helps! Let me know if you have any other questions.

Sources

libs/langchain/langchain/vectorstores/azuresearch.py

darioxz commented 11 months ago

Hi @eburnette, I had to deal with the same issue, and the following works in my case. The idea is to set the fields argument in the AzureSearch constructor. To filter by an attribute contained in the metadata dictionary, declare an index field with the same name and mark it as filterable. If the metadata contains a key with that name, its value is copied into the field.

Something like:

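from azure.search.documents.indexes.models import (
    SearchableField,
    SearchField,
    SearchFieldDataType,
    SimpleField,
)

fields = [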
    SimpleField(
        name="id",
        type=SearchFieldDataType.String,
        key=True,
        filterable=True,
    ),
    SearchableField(
        name="content",
        type=SearchFieldDataType.String,
        searchable=True,
    ),
    SearchField(
        name="content_vector",
        type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
        searchable=True,
        vector_search_dimensions=len(embeddings.embed_query("Text")),
        vector_search_configuration="default",
    ),
    SearchableField(
        name="metadata",
        type=SearchFieldDataType.String,
        searchable=True,
    ),
    SearchableField(
        name="name",
        type=SearchFieldDataType.String,
        searchable=True,
        filterable=True,
    )
]
vector_store: AzureSearch = AzureSearch(azure_search_endpoint=vector_store_address,  
                                    azure_search_key=vector_store_password,  
                                    index_name=index_name,  
                                    embedding_function=embeddings.embed_query,
                                    fields=fields)

Not sure how it works with nested attributes in metadata. Hope this helps!

adrienohana commented 11 months ago

While there are some decent workarounds here, I agree it would make far more sense to store metadata as a complex type instead of a string.

adrienohana commented 8 months ago

Coming back to this: I believe that in Azure AI Search one can only search or filter on level 0 fields (so no nested metadata). We can indeed simply add some fields and pass them to the constructor, as shown by @darioxz above. But if you look carefully, and please correct me if I'm wrong, those fields are not passed to the search functions, so they will be retrievable but not searchable. I have a local version that I plan to push eventually.

def hybrid_search_with_score(
    self, query: str, k: int = 4, filters: Optional[str] = None
) -> List[Tuple[Document, float]]:
    """Return docs most similar to query with a hybrid query.

    Args:
        query: Text to look up documents similar to.
        k: Number of Documents to return. Defaults to 4.
        filters: Optional OData filter expression applied to the search.

    Returns:
        List of Documents most similar to the query and score for each
    """
    from azure.search.documents.models import Vector

    # Define fields to select (default fields + any additional fields from self.fields)
    select_fields = [FIELDS_ID, FIELDS_CONTENT, FIELDS_METADATA]
    additional_field_names = [
        field.name for field in self.fields if field.name not in select_fields
    ]
    select_fields.extend(additional_field_names)

    # Perform the search
    results = self.client.search(
        search_text=query,
        vectors=[
            Vector(
                value=np.array(
                    self.embedding_function(query), dtype=np.float32
                ).tolist(),
                k=k,
                fields=FIELDS_CONTENT_VECTOR,
            )
        ],
        select=select_fields,
        filter=filters,
        top=k,
    )

    # Convert results to Document objects
    docs = [
        (
            Document(
                page_content=result[FIELDS_CONTENT],
                metadata=json.loads(result[FIELDS_METADATA]),
                additional_fields={
                    field: result[field] for field in additional_field_names
                },
            ),
            float(result["@search.score"]),
        )
        for result in results
    ]
    return docs

dosubot[bot] commented 5 months ago

Hi, @eburnette,

I'm helping the LangChain team manage their backlog and am marking this issue as stale. From what I understand, the issue was opened to address a problem with the Azure vector store where metadata was stored as a string and couldn't be used in a filter. There have been discussions and suggestions from multiple users, including yourself, on potential workarounds and solutions. The issue has been resolved by modifying the metadata field type to a complex type and marking it as filterable, allowing the metadata to be used in filters as intended.

Could you please confirm if this issue is still relevant to the latest version of the LangChain repository? If it is, please let the LangChain team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days. Thank you!

eburnette commented 5 months ago

I'm no longer using Azure search with langchain so somebody else will have to comment.

aradhanachaturvedi commented 2 months ago

@darioxz, we can store these as separate fields, but I don't think the langchain Document supports any fields other than page_content and metadata. So how are you uploading these values from documents?