InvalidVectorQuery error when using AzureSearch with vector db

DSgUY commented 1 year ago

System Info

langchain==0.0.234, windows 10, azure-identity==1.13.0, Python 3.11.4

Who can help?

I managed to create an index in Azure Cognitive Search with id, content, _vectorcontent and metadata fields. I checked that the docs and chunks are not null. I'm getting an error when querying the vector store.

docs: azuresearch-langchain-example

Any fix for this? @hwchase17 @agola11 Regards

Information

[X] The official example notebooks/scripts
[ ] My own modified scripts

Related Components

[ ] LLMs/Chat Models
[ ] Embedding Models
[ ] Prompts / Prompt Templates / Prompt Selectors
[ ] Output Parsers
[ ] Document Loaders
[X] Vector Stores / Retrievers
[ ] Memory
[ ] Agents / Agent Executors
[ ] Tools / Toolkits
[ ] Chains
[ ] Callbacks/Tracing
[ ] Async

Reproduction

Embedding is working, as I tested:

# Check that embedding is working
input_text = "This is for demonstration."
outcome = embeddings.embed_query(input_text)

When I try to query with:

# Perform a similarity search
docs = vector_store.similarity_search(
    query="What did the president say about Ketanji Brown Jackson",
    k=3,
    search_type="similarity",
)
print(docs[0].page_content)

Error:

HttpResponseError: (InvalidRequestParameter) The 'value' property of the vector query can't be null or an empty array. Make sure to enclose the vector within a "value" property: '{"vector": { "value": [  ] } }'
Parameter name: vector
Code: InvalidRequestParameter
Message: The 'value' property of the vector query can't be null or an empty array. Make sure to enclose the vector within a "value" property: '{"vector": { "value": [  ] } }'
Parameter name: vector
Exception Details:  (InvalidVectorQuery) The 'value' property of the vector query can't be null or an empty array. Make sure to enclose the vector within a "value" property: '{"vector": { "value": [  ] } }'
    Code: InvalidVectorQuery
    Message: The 'value' property of the vector query can't be null or an empty array. Make sure to enclose the vector within a "value" property: '{"vector": { "value": [  ] } }'

Expected behavior

I can't tell whether the problem is the Azure Cognitive Search index configuration that I added manually or a bug in the code. Splitting the documents and adding the chunks to the vector store (Azure Cognitive Search) were all done without any warning.

dosubot[bot] commented 1 year ago

Answer generated by a 🤖

Answer

From your description, it appears that you're encountering an InvalidVectorQuery error when trying to perform a similarity search using the AzureSearch vector store in the LangChain framework. The error message suggests that the 'value' property of the vector query is either null or an empty array, which is causing the issue.
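
To make the error message concrete, here is a minimal sketch of the request shape it describes, written as a plain Python dictionary. The helper name build_vector_query and the field name "content_vector" are hypothetical, and the "fields" and "k" keys are assumptions about the preview REST API rather than something stated in this thread; only the nested "value" property comes from the error message itself.

```python
from typing import Any, Dict, List


def build_vector_query(embedding: List[float], field: str, k: int) -> Dict[str, Any]:
    """Build a search request body with the vector nested under "value".

    Hypothetical helper for illustration; not part of LangChain or the Azure SDK.
    """
    if not embedding:
        # Mirrors the condition the service reports as InvalidVectorQuery.
        raise ValueError("the vector 'value' can't be null or an empty array")
    return {
        "vector": {
            "value": list(embedding),  # shape named in the error message
            "fields": field,           # assumed key name
            "k": k,                    # assumed key name
        }
    }


body = build_vector_query([0.1, 0.2, 0.3], field="content_vector", k=3)
print(body["vector"]["value"])
```

If the list that ends up under "value" is empty or missing, the service rejects the request with the error shown in the report.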

To address this, let's first try to identify where the problem is originating from. The similarity_search method in LangChain's AzureSearch class uses the embed_query function to convert your query text into a vector. If this function isn't returning a valid vector, it could be the root cause of the error.

Here's a way to check this:

# Generate the query vector
query_vector = embeddings.embed_query("What did the president say about Ketanji Brown Jackson")
print(query_vector)

# Perform a similarity search
docs = vector_store.similarity_search(
    query="What did the president say about Ketanji Brown Jackson",
    k=3,
    search_type="similarity",
)
print(docs[0].page_content)

If the query_vector is null or an empty array, then the issue might be with the embed_query function. In this case, you might need to review the configuration of your OpenAIEmbeddings instance or the model you're using for embeddings.
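
Printing a full embedding can be noisy, so the check above can also be expressed as a small predicate. This is a minimal sketch: is_valid_query_vector is a hypothetical helper, and it assumes the embedding comes back as a plain Python list of floats, which is what embed_query is expected to return.

```python
import math
from typing import Any


def is_valid_query_vector(vector: Any) -> bool:
    """Return True if `vector` is a non-empty sequence of finite numbers."""
    if vector is None:
        return False
    try:
        values = list(vector)
    except TypeError:
        # Not iterable, so it cannot be a vector.
        return False
    if not values:
        return False
    return all(
        isinstance(v, (int, float)) and not isinstance(v, bool) and math.isfinite(v)
        for v in values
    )


print(is_valid_query_vector([0.12, -0.5, 0.33]))  # True
print(is_valid_query_vector([]))                  # False
print(is_valid_query_vector(None))                # False
```

With the objects from this thread, the call would look like is_valid_query_vector(embeddings.embed_query("...")).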

On the other hand, if query_vector is a non-empty list of numbers, the embedding step is working and the problem is more likely in how the AzureSearch class passes that vector to the Azure Cognitive Search client. In that case, review how similarity_search builds the vector query in the LangChain framework.

Please note that this is just one potential solution based on the information provided. If you find that this solution is correct and it's a bug that could impact other users, we encourage you to make a pull request to help improve the LangChain framework. If this solution doesn't align with your experience, it might be helpful to ensure you're using the latest version of LangChain.

This response is meant to be useful, save you time, and share context. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

DSgUY commented 1 year ago

Well, the problem is in the vector parameter in each function that uses the embedding_function. I managed to get working code using:

from typing import List, Optional, Tuple

from langchain.docstore.document import Document
from langchain.vectorstores.azuresearch import AzureSearch

class MyAzureSearch(AzureSearch):
    def vector_search_with_score(
        self, query: str, k: int = 4, filters: Optional[str] = None
    ) -> List[Tuple[Document, float]]:
        results = self.client.search(
            search_text=None,
            vector=self.embedding_function(query),
            top_k=k,
            vector_fields=FIELDS_CONTENT_VECTOR,
            select=[f"{FIELDS_ID},{FIELDS_CONTENT},{FIELDS_METADATA}"],
            filter=filters,
        )

        # Convert results to Document objects
        docs = [
            (
                Document(
                    page_content=result[FIELDS_CONTENT],
                    metadata=json.loads(result[FIELDS_METADATA]),
                ),
                float(result["@search.score"]),
            )
            for result in results
        ]

        return docs

vector_store: MyAzureSearch = MyAzureSearch(
    azure_search_endpoint=AZURE_SEARCH_SERVICE_ENDPOINT,
    azure_search_key=AZURE_SEARCH_API_KEY,
    index_name="langchain-vector-demo",
    embedding_function=embeddings.embed_query
)

# Perform a similarity search
docs = vector_store.similarity_search(
    query="What did the president say about Ketanji Brown Jackson",
    k=3,
    search_type="similarity",
)

dosubot[bot] commented 1 year ago

Answer generated by a 🤖

Answer

From your description, it appears that you're experiencing an InvalidVectorQuery error when using the similarity_search function in the AzureSearch vector store of the LangChain framework. This issue seems to be related to how the vector parameter is being passed in each function that uses the embedding_function.

Your workaround, where you've created a new class MyAzureSearch that inherits from AzureSearch and overrides the vector_search_with_score function, is a clever solution. You've altered the way the vector parameter is passed to the search method, which seems to have resolved the issue.

As I understand it, you're directly passing the result of the embedding_function to the vector parameter, instead of creating a Vector instance. This approach appears to be effective in your case.

However, please note that while this solution works for you, it might not be the best or only solution. There could be other factors at play that we're not aware of, and this solution might not work in all scenarios or for all users.

If you believe this is a bug that could impact other users and your solution could be a potential fix, you're encouraged to contribute to the LangChain project. You could consider making a pull request with your changes. This would allow the community to review and potentially incorporate your solution into the framework.

If you find that your experience contradicts this advice, it might be worth checking if you're using the latest version of LangChain. Sometimes, issues are resolved in newer versions, so upgrading could potentially solve your problem.

Sources

langchain/vectorstores/azuresearch.py

finnless commented 1 year ago

@DSgUY Don't you also need to import the fields for your example fix to work?

from langchain.vectorstores.azuresearch import FIELDS_ID, FIELDS_CONTENT, FIELDS_METADATA, FIELDS_CONTENT_VECTOR
import json

DSgUY commented 1 year ago

Yes, sorry. I was using:

import os

FIELDS_ID = os.getenv('FIELDS_ID')
FIELDS_CONTENT = os.getenv('FIELDS_CONTENT')
FIELDS_CONTENT_VECTOR = os.getenv('FIELDS_CONTENT_VECTOR')
FIELDS_METADATA = os.getenv('FIELDS_METADATA')

but your import seems better. Thanks!
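
One caveat with the os.getenv approach is that an unset variable yields None, which would then be passed along as a field name. A minimal sketch of a safer variant follows; the field_name helper is hypothetical, and the default names are assumptions chosen for illustration, so replace them with the field names of your own index.

```python
import os


def field_name(env_key: str, default: str) -> str:
    """Read an index field name from the environment, never returning None."""
    value = os.getenv(env_key)
    # Fall back to the default when the variable is unset or empty.
    return value if value else default


# Default values below are illustrative assumptions, not confirmed by the thread.
FIELDS_ID = field_name("FIELDS_ID", "id")
FIELDS_CONTENT = field_name("FIELDS_CONTENT", "content")
FIELDS_CONTENT_VECTOR = field_name("FIELDS_CONTENT_VECTOR", "content_vector")
FIELDS_METADATA = field_name("FIELDS_METADATA", "metadata")

print(FIELDS_ID, FIELDS_CONTENT, FIELDS_CONTENT_VECTOR, FIELDS_METADATA)
```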

JasonIsaac commented 1 year ago

Hi @DSgUY, thanks for this solution. It works when the search type is similarity, but it does not work for hybrid. Please let me know if something needs to be added to the implementation.

dosubot[bot] commented 1 year ago

Hi, @DSgUY! I'm Dosu, and I'm here to help the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.

Based on my understanding, you encountered an "InvalidVectorQuery" error when querying the vector store in Azure Cognitive Search. It was suggested to check the embed_query function and the embedding_function in the AzureSearch class to ensure they are returning valid vectors. You provided a workaround by creating a new class MyAzureSearch that overrides the vector_search_with_score function. However, it was mentioned by JasonIsaac that the solution works for similarity search type but not for hybrid.

Before we close this issue, we wanted to check if it is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself or it will be automatically closed in 7 days.

Thank you for your contribution and we appreciate your understanding. Let us know if you have any further questions or concerns!

langchain-ai / langchain