RequestError: RequestError(400, 'search_phase_execution_exception...) with ElasticSearch as Vector store when querying

lauradang commented 10 months ago

System Info

Langchain version: 0.0.254
Python version: 3.10.2
Elasticsearch version: 7.17.0
System Version: macOS 13.4 (22F66)
Model Name: MacBook Pro
Model Identifier: Mac14,10
Chip: Apple M2 Pro
Total Number of Cores: 12 (8 performance and 4 efficiency)
Memory: 32 GB

Who can help?

@agola11 @hwchase17

Information

[ ] The official example notebooks/scripts
[X] My own modified scripts

Related Components

[X] LLMs/Chat Models
[ ] Embedding Models
[ ] Prompts / Prompt Templates / Prompt Selectors
[ ] Output Parsers
[X] Document Loaders
[X] Vector Stores / Retrievers
[ ] Memory
[ ] Agents / Agent Executors
[ ] Tools / Toolkits
[X] Chains
[ ] Callbacks/Tracing
[ ] Async

Reproduction

Run this using python3 script.py

script.py:

import os

from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import TextLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import ElasticVectorSearch

def main():
    text_path = "some-test.txt"
    loader = TextLoader(text_path)
    data = loader.load()

    text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
        chunk_size=1000, chunk_overlap=0
    ) # I have also tried various chunk sizes, but still have the same error

    documents = text_splitter.split_documents(data)

    api_key = "..."
    embeddings = OpenAIEmbeddings(openai_api_key=api_key)

    os.environ["ELASTICSEARCH_URL"] = "..."
    db = ElasticVectorSearch.from_documents(
        documents,
        embeddings,
        index_name="laurad-test",
    )
    print(db.client.info())

    db = ElasticVectorSearch(
        index_name="laurad-test",
        embedding=embeddings,
        elasticsearch_url="..."
    )

    qa = RetrievalQA.from_chain_type(
        llm=ChatOpenAI(temperature=0, openai_api_key=api_key),
        chain_type="stuff",
        retriever=db.as_retriever(),
    )

    query = "Hi"
    qa.run(query) # Error here

if __name__ == "__main__":
    main()

Error traceback:

RequestError                              Traceback (most recent call last)
Cell In[8], line 2
      1 query = "What is ARB in NVBugs?"
----> 2 qa.run(query)

File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/langchain/chains/base.py:451, in Chain.run(self, callbacks, tags, metadata, *args, **kwargs)
    449     if len(args) != 1:
    450         raise ValueError("`run` supports only one positional argument.")
--> 451     return self(args[0], callbacks=callbacks, tags=tags, metadata=metadata)[
    452         _output_key
    453     ]
    455 if kwargs and not args:
    456     return self(kwargs, callbacks=callbacks, tags=tags, metadata=metadata)[
    457         _output_key
    458     ]

File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/langchain/chains/base.py:258, in Chain.__call__(self, inputs, return_only_outputs, callbacks, tags, metadata, include_run_info)
    256 except (KeyboardInterrupt, Exception) as e:
    257     run_manager.on_chain_error(e)
--> 258     raise e
    259 run_manager.on_chain_end(outputs)
    260 final_outputs: Dict[str, Any] = self.prep_outputs(
    261     inputs, outputs, return_only_outputs
    262 )
...
--> 315 raise HTTP_EXCEPTIONS.get(status_code, TransportError)(
    316     status_code, error_message, additional_info
    317 )

RequestError: RequestError(400, 'search_phase_execution_exception', "class_cast_exception: class org.elasticsearch.index.fielddata.ScriptDocValues$Doubles cannot be cast to class org.elasticsearch.xpack.vectors.query.VectorScriptDocValues$DenseVectorScriptDocValues (org.elasticsearch.index.fielddata.ScriptDocValues$Doubles is in unnamed module of loader 'app'; org.elasticsearch.xpack.vectors.query.VectorScriptDocValues$DenseVectorScriptDocValues is in unnamed module of loader java.net.FactoryURLClassLoader @7808fb9)")

Expected behavior

I expect to ask questions and have answers provided back using the langchain.chains.retrieval_qa.base.RetrievalQA class. However, I am getting an error 400 when trying to query the LLM.

Note: I do not get the same error when using ChromaDB or OpenSearch as the retriever.

dosubot[bot] commented 10 months ago

Answer generated by a 🤖

Answer

Hello,

Based on the error message you provided, it seems like there's a type mismatch between the vectors stored in your Elasticsearch index and the vectors you're trying to query with. The error message suggests that the system is trying to cast ScriptDocValues$Doubles to VectorScriptDocValues$DenseVectorScriptDocValues, which is causing the error.

This could be due to a few reasons:

1. The Elasticsearch index you're querying doesn't exist or is not properly set up. You can check this by using the client.indices.get(index=self.index_name) method in the ElasticVectorSearch class.
2. The query vector you're using to search the Elasticsearch index is not correctly formatted or is not compatible with the vectors stored in the index. The _default_knn_query function in the ElasticVectorSearch class is used to create the query for Elasticsearch. Make sure that the query_vector argument passed to this function is a list of floats and that its dimension matches the dimension of the vectors stored in the Elasticsearch index.
3. The filter you're using in the _default_script_query function is not correctly formatted. If a filter is provided, it should be a dictionary with one key-value pair. The key should be the name of a field in the Elasticsearch index and the value should be the value to filter by for that field.
4. If you're using the knn_search or knn_hybrid_search methods, ensure that the fields parameter includes the page_content field if source=False. If source=False and page_content is not included in fields, a ValueError will be raised.
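The first two checks above can be sketched as a small helper that inspects an index mapping. This is only an illustrative sketch, not part of LangChain: the function name vector_field_problems is hypothetical, the field name "vector" is an assumption about how the store names its embedding field, and the input is assumed to be a plain dict shaped like the response of client.indices.get_mapping(index=...).

```python
from typing import Any, Dict, List, Sequence


def vector_field_problems(
    mapping_response: Dict[str, Any],
    index_name: str,
    query_vector: Sequence[float],
    field: str = "vector",  # assumed field name; adjust to match your index
) -> List[str]:
    """Return a list of problems found with the vector field (empty if none)."""
    index_entry = mapping_response.get(index_name)
    if index_entry is None:
        return [f"index {index_name!r} is missing from the mapping response"]

    properties = index_entry.get("mappings", {}).get("properties", {})
    field_mapping = properties.get(field)
    if field_mapping is None:
        return [f"field {field!r} is not mapped in index {index_name!r}"]

    problems: List[str] = []

    # Vector queries need the field to be mapped as dense_vector.
    field_type = field_mapping.get("type")
    if field_type != "dense_vector":
        problems.append(
            f"field {field!r} has type {field_type!r}, expected 'dense_vector'"
        )

    # If the mapping declares a dimension, the query vector must match it.
    dims = field_mapping.get("dims")
    if dims is not None and dims != len(query_vector):
        problems.append(
            f"query vector has {len(query_vector)} dims, mapping expects {dims}"
        )

    # The query vector itself should be a sequence of floats.
    if not all(isinstance(x, float) for x in query_vector):
        problems.append("query vector must contain only floats")

    return problems


# Example with hand-written mappings (hypothetical data, not real output).
good = {"laurad-test": {"mappings": {"properties": {
    "vector": {"type": "dense_vector", "dims": 3}}}}}
bad = {"laurad-test": {"mappings": {"properties": {
    "vector": {"type": "float"}}}}}

print(vector_field_problems(good, "laurad-test", [0.1, 0.2, 0.3]))
print(vector_field_problems(bad, "laurad-test", [0.1, 0.2, 0.3]))
```

Against a live cluster, the mapping would come from client.indices.get_mapping(index="laurad-test"). A field reported as float rather than dense_vector would be consistent with the class_cast_exception in the traceback, though that is an inference from the error text and not something confirmed in this thread.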

You can find more details about these methods in the ElasticVectorSearch class.

Also, there was a similar issue reported in the past: vectorstores error: "search_phase_execution_exceptionm" after using elastic search. This issue was resolved by a pull request: fix: elasticsearch. Unfortunately, I don't have the specific details of the changes made in this pull request, but it might be worth looking into.

Please let me know if this helps or if you need further assistance.

Sources

libs/langchain/langchain/vectorstores/elastic_vector_search.py

This response is meant to be useful, save you time, and share context. It is not meant to be a precise solution, but rather a starting point for your own research.


lauradang commented 10 months ago

@agola11 @hwchase17 Could I get some help on this issue?

joemcelroy commented 10 months ago

Hi @lauradang

Could you try using ElasticsearchStore instead? It's the new vector store implementation for Elasticsearch. https://python.langchain.com/docs/integrations/vectorstores/elasticsearch#whats-new

Let me know how you get on.

Thanks!

DrSebastianK commented 9 months ago

Hey! Unfortunately, the issue still persists with the new ElasticsearchStore implementation. Any new ideas?

joemcelroy commented 9 months ago

hey @DrSebastianK,

Could you share which version of Elasticsearch you're running? Also, how have you initialised ElasticsearchStore? If possible, could you share an example Colab notebook where it doesn't work for you?

Joe

DrSebastianK commented 9 months ago

Thanks for the fast reply. I am running it on v8.9.2. The code:

db = ElasticsearchStore(
    es_cloud_id="MY-ID",
    index_name="search-tmd",
    embedding=hf,
    es_user="elastic",
    es_password="mypassword",
    strategy=ElasticsearchStore.ApproxRetrievalStrategy(hybrid=True),
)

Query:

db.similarity_search(query="textofmyquery", k=4)

joemcelroy commented 9 months ago

Does it work without hybrid? Could you provide the full stack trace?

DrSebastianK commented 9 months ago

No, it doesn't work without it. Same error message:

BadRequestError                           Traceback (most recent call last)
in <cell line: 1>()
----> 1 db.similarity_search(query="what is tmd?", k=10)

4 frames
/usr/local/lib/python3.10/dist-packages/elasticsearch/_sync/client/_base.py in perform_request(self, method, path, params, headers, body)
    318     pass
    319
--> 320 raise HTTP_EXCEPTIONS.get(meta.status, ApiError)(
    321     message=message, meta=meta, body=resp_body
    322 )

BadRequestError: BadRequestError(400, 'search_phase_execution_exception', 'failed to create query: [knn] queries are only supported on [dense_vector] fields')

joemcelroy commented 9 months ago

So there's an issue with the index that has been set up. The error is complaining that the field used to search vectors isn't a dense_vector field.

Could you delete the search-tmd index and let the LangChain ElasticsearchStore re-create it and index the documents?

disoardi commented 7 months ago

Any news about this issue?

joemcelroy commented 7 months ago

@disoardi search_phase_execution_exception is a very general error. Could you give more details on your issue?
