run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

Elasticsearch embedding retrieval issue #13971

Open chirantandatta opened 2 months ago

chirantandatta commented 2 months ago

Question Validation

Question

I created Elasticsearch persistent embeddings using the following code:

vector_store = ElasticsearchStore(
    index_name="dense_index_new",
    es_url="http://localhost:9200"
)

if not os.path.exists("./elastic_db"):
    storage_context = StorageContext.from_defaults(vector_store=vector_store)
    vector_index = VectorStoreIndex(
        documents,
        service_context=service_context,
        show_progress=True,
        storage_context=storage_context,
    )
    vector_index.storage_context.persist(persist_dir="./elastic_db")
else:
    storage_context = StorageContext.from_defaults(persist_dir="./elastic_db")
    vector_index = load_index_from_storage(storage_context)

I am able to create the persistent embeddings in the folder named elastic_db, but for the else condition I get this error:

    return self.vector_stores[DEFAULT_VECTOR_STORE]
KeyError: 'default'

My aim is to create the embeddings folder once and share it with another person, so that they can load the embeddings via the else condition. How do I resolve the error?

dosubot[bot] commented 2 months ago

Hey there, @chirantandatta! I'm here to help you with any bugs, questions, or contributions you have. Let's tackle this together!

To resolve the KeyError: 'default' in your else condition, you need to ensure that the DEFAULT_VECTOR_STORE key is properly handled when loading the storage context. Here’s how you can modify your code:

  1. Ensure the DEFAULT_VECTOR_STORE key is set when loading the storage context: When you load the storage context from the persisted directory, you need to make sure that the vector_store is correctly initialized.

  2. Guard the dictionary access: check for the key (or use dict.get) before accessing it, so a missing 'default' entry does not raise a KeyError.

Here’s an updated version of your code:

import os

from llama_index.core import StorageContext, VectorStoreIndex, load_index_from_storage
from llama_index.vector_stores.elasticsearch import ElasticsearchStore

vector_store = ElasticsearchStore(
    index_name="dense_index_new",
    es_url="http://localhost:9200"
)

if not os.path.exists("./elastic_db"):
    storage_context = StorageContext.from_defaults(vector_store=vector_store)
    vector_index = VectorStoreIndex.from_documents(
        documents, service_context=service_context, show_progress=True, storage_context=storage_context
    )
    vector_index.storage_context.persist(persist_dir="./elastic_db")
else:
    storage_context = StorageContext.from_defaults(persist_dir="./elastic_db")
    # Ensure the vector store is set correctly
    if 'default' not in storage_context.vector_stores:
        storage_context.vector_stores['default'] = vector_store
    vector_index = load_index_from_storage(storage_context)

Explanation:

This approach ensures that the DEFAULT_VECTOR_STORE key is always available, preventing the KeyError [1][2][3].

To continue talking to Dosu, mention @dosu.

chirantandatta commented 2 months ago

@dosu

query_engine = vector_index.as_query_engine()
response = query_engine.query(question)

For response generation, I am getting this error:

--> 352 raise HTTP_EXCEPTIONS.get(meta.status, ApiError)(
    353     message=message, meta=meta, body=resp_body
    354 )
    356 # 'X-Elastic-Product: Elasticsearch' should be on every 2XX response.
    357 if not self._verified_elasticsearch:
    358     # If the header is set we mark the server as verified.

BadRequestError: BadRequestError(400, 'parsing_exception', 'Unknown key for a START_OBJECT in [knn].')

How do I solve this?

dosubot[bot] commented 2 months ago

To resolve the BadRequestError(400, 'parsing_exception', 'Unknown key for a START_OBJECT in [knn].') error, you need to ensure that the knn query is correctly formatted. The error suggests that there is an issue with the structure of the knn query object.

The knn query should be part of the query object, not a top-level key. Here is the corrected structure:

if query.mode in (
    VectorStoreQueryMode.DEFAULT,
    VectorStoreQueryMode.HYBRID,
):
    es_query["query"] = {
        "knn": {
            "filter": filter,
            "field": self.vector_field,
            "query_vector": query_embedding,
            "k": query.similarity_top_k,
            "num_candidates": query.similarity_top_k * 10,
        }
    }

Make sure to update the aquery method to use this corrected structure:

async def aquery(
    self,
    query: VectorStoreQuery,
    custom_query: Optional[
        Callable[[Dict, Union[VectorStoreQuery, None]], Dict]
    ] = None,
    es_filter: Optional[List[Dict]] = None,
    **kwargs: Any,
) -> VectorStoreQueryResult:
    query_embedding = cast(List[float], query.query_embedding)

    es_query = {}
    if query.filters is not None and len(query.filters.legacy_filters()) > 0:
        filter = [_to_elasticsearch_filter(query.filters)]
    else:
        filter = es_filter or []

    if query.mode in (
        VectorStoreQueryMode.DEFAULT,
        VectorStoreQueryMode.HYBRID,
    ):
        es_query["query"] = {
            "knn": {
                "filter": filter,
                "field": self.vector_field,
                "query_vector": query_embedding,
                "k": query.similarity_top_k,
                "num_candidates": query.similarity_top_k * 10,
            }
        }

    if query.mode in (
        VectorStoreQueryMode.TEXT_SEARCH,
        VectorStoreQueryMode.HYBRID,
    ):
        es_query["query"] = {
            "bool": {
                "must": {"match": {self.text_field: {"query": query.query_str}}},
                "filter": filter,
            }
        }

    if query.mode == VectorStoreQueryMode.HYBRID:
        es_query["rank"] = {"rrf": {}}

    if custom_query is not None:
        es_query = custom_query(es_query, query)
        logger.debug(f"Calling custom_query, Query body now: {es_query}")

    async with self.client as client:
        response = await client.search(
            index=self.index_name,
            **es_query,
            size=query.similarity_top_k,
            _source={"excludes": [self.vector_field]},
        )

    top_k_nodes = []
    top_k_ids = []
    top_k_scores = []
    hits = response["hits"]["hits"]
    for hit in hits:
        source = hit["_source"]
        metadata = source.get("metadata", None)
        text = source.get(self.text_field, None)
        node_id = hit["_id"]
        try:
            node = metadata_dict_to_node(metadata)
            node.text = text
        except Exception:
            logger.warning(
                f"Could not parse metadata from hit {hit['_source']['metadata']}"
            )
            node_info = source.get("node_info")
            relationships = source.get("relationships") or {}
            start_char_idx = None
            end_char_idx = None
            if isinstance(node_info, dict):
                start_char_idx = node_info.get("start", None)
                end_char_idx = node_info.get("end", None)

            node = TextNode(
                text=text,
                metadata=metadata,
                id_=node_id,
                start_char_idx=start_char_idx,
                end_char_idx=end_char_idx,
                relationships=relationships,
            )
        top_k_nodes.append(node)
        top_k_ids.append(node_id)
        top_k_scores.append(hit.get("_rank", hit["_score"]))

    if query.mode == VectorStoreQueryMode.HYBRID:
        total_rank = sum(top_k_scores)
        top_k_scores = [total_rank - rank / total_rank for rank in top_k_scores]

    return VectorStoreQueryResult(
        nodes=top_k_nodes,
        ids=top_k_ids,
        similarities=_to_llama_similarities(top_k_scores),
    )

This should resolve the BadRequestError(400, 'parsing_exception', 'Unknown key for a START_OBJECT in [knn].') error [1].

To continue talking to Dosu, mention @dosu.

chirantandatta commented 2 months ago

@dosu This is an internal Python file; I can't change it. Is there any other way to solve the problem?

dosubot[bot] commented 2 months ago

To resolve the BadRequestError(400, 'parsing_exception', 'Unknown key for a START_OBJECT in [knn].') error without modifying the internal Python file, you can ensure that the request being sent to Elasticsearch is correctly formatted and that the Elasticsearch instance is properly configured to handle KNN queries.

Here are the steps you can take:

  1. Check the Elasticsearch Version: Ensure that your Elasticsearch instance supports kNN queries. kNN search is built into Elasticsearch 8.x, so no separate plugin is needed, but a sufficiently recent server is (version ^8.12.0 is required) [1].

  2. Correct Query Format: Ensure that the query being sent to Elasticsearch is correctly formatted for KNN. This might involve checking the structure of the query and ensuring it matches the expected format for KNN queries in Elasticsearch.

  3. Update Elasticsearch Configuration: Make sure that the Elasticsearch instance is configured to handle KNN queries. This might involve updating the Elasticsearch configuration files or settings.

Here is an example of how you might set up the ElasticsearchEmbedding and ensure the query is correctly formatted:

import pytest
from llama_index.embeddings.elasticsearch import ElasticsearchEmbedding

try:
    import elasticsearch
except ImportError:
    elasticsearch = None  # type: ignore

@pytest.fixture()
def model_id() -> str:
    # Replace with your actual model_id
    return "your_model_id"

@pytest.fixture()
def es_url() -> str:
    # Replace with your actual Elasticsearch URL
    return "http://localhost:9200"

@pytest.fixture()
def es_username() -> str:
    # Replace with your actual Elasticsearch username
    return "foo"

@pytest.fixture()
def es_password() -> str:
    # Replace with your actual Elasticsearch password
    return "bar"

@pytest.mark.skipif(elasticsearch is None, reason="elasticsearch not installed")
def test_elasticsearch_embedding_constructor(
    model_id: str, es_url: str, es_username: str, es_password: str
) -> None:
    """Test Elasticsearch embedding query."""
    es_embedding = ElasticsearchEmbedding.from_credentials(
        model_id=model_id,
        es_url=es_url,
        es_username=es_username,
        es_password=es_password,
    )

    # Example of a correctly formatted kNN query (Elasticsearch 8.12+ knn query type)
    query = {
        "size": 10,
        "query": {
            "knn": {
                "field": "embedding",             # your dense_vector field name
                "query_vector": [0.1, 0.2, 0.3],  # example vector
                "num_candidates": 100,
            }
        },
    }

    response = es_embedding.client.search(index="your_index", body=query)
    print(response)

Ensure that the field and query_vector in the query match the actual vector field and embedding data you are using. Additionally, make sure that the Elasticsearch instance has the necessary embedding model loaded and deployed as described in the ElasticsearchEmbedding class documentation. This should help resolve the BadRequestError(400, 'parsing_exception', 'Unknown key for a START_OBJECT in [knn].') error [2][3].
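For reference, Elasticsearch 8.x also accepts knn as a top-level parameter of the search API, which avoids the query-DSL wrapping entirely. Below is a minimal sketch using the official elasticsearch Python client; the index name, field name, and vector are placeholders, not values from this issue:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Top-level kNN search (Elasticsearch 8.x); all names are illustrative.
response = es.search(
    index="your_index",
    knn={
        "field": "embedding",             # dense_vector field to search
        "query_vector": [0.1, 0.2, 0.3],  # query embedding
        "k": 10,                          # nearest neighbours to return
        "num_candidates": 100,            # candidates examined per shard
    },
)
print(response["hits"]["hits"])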

To continue talking to Dosu, mention @dosu.

chirantandatta commented 2 months ago

@dosu I am not using ElasticsearchEmbedding or any Elasticsearch model for response generation. I am using a different model and different embeddings; Elasticsearch is only used for the storage context. How do I resolve the error under these conditions?

dosubot[bot] commented 2 months ago

To resolve the BadRequestError(400, 'parsing_exception', 'Unknown key for a START_OBJECT in [knn].') error when using Elasticsearch for storage context, ensure that your Elasticsearch version supports the knn query. The knn query is available starting from Elasticsearch version 8.0.0.

Here is the relevant part of the code where the knn query is being used:

if query.mode in (
    VectorStoreQueryMode.DEFAULT,
    VectorStoreQueryMode.HYBRID,
):
    es_query["knn"] = {
        "filter": filter,
        "field": self.vector_field,
        "query_vector": query_embedding,
        "k": query.similarity_top_k,
        "num_candidates": query.similarity_top_k * 10,
    }

Ensure that your Elasticsearch instance is version 8.0.0 or later. If you are using an older version, you will need to upgrade to a version that supports the knn query.

If you are already using a compatible version and still encountering the error, double-check the structure of the knn query to ensure it matches the expected format for your Elasticsearch version [1].
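As a quick sanity check before querying, you can read the server version from Python. A minimal sketch using the official elasticsearch client, assuming the URL from the original question:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# info() returns cluster metadata, including the server version string.
version = es.info()["version"]["number"]
print(f"Elasticsearch server version: {version}")

# knn queries require Elasticsearch 8.0.0 or later.
if int(version.split(".")[0]) < 8:
    print("This server does not support knn queries; an upgrade is required.")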

To continue talking to Dosu, mention @dosu.

chirantandatta commented 2 months ago

@dosu That problem was solved by upgrading the Elasticsearch version. Now I am getting this error:

ValueError: Query id f53e054e-57f1-418e-9b2e-a9b22050d6b8 not found in either retriever_dict or query_engine_dict.

I am using RetrieverQueryEngine for response generation. How do I solve this error?

dosubot[bot] commented 2 months ago

To resolve the ValueError: Query id f53e054e-57f1-418e-9b2e-a9b22050d6b8 not found in either 'retriever_dict' or 'query_engine_dict' error when using RetrieverQueryEngine for response generation, ensure that the query ID is correctly registered in both retriever_dict and query_engine_dict. This error typically occurs when the query ID is not properly tracked or initialized.

Here is a possible approach to ensure the query ID is correctly managed:

  1. Check Initialization: Ensure that the RetrieverQueryEngine is properly initialized with the correct retriever and response_synthesizer.

  2. Register Query ID: Make sure the query ID is registered in both retriever_dict and query_engine_dict.

Here is an example of how you might initialize and use the RetrieverQueryEngine:

from llama_index.core import Document, ServiceContext, TreeIndex
from llama_index.core.indices.tree.select_leaf_retriever import TreeSelectLeafRetriever
from llama_index.core.query_engine.retriever_query_engine import RetrieverQueryEngine
from llama_index.core.schema import QueryBundle
from llama_index.llms.openai import OpenAI  # Ensure you have this installed

# Initialize the necessary components
documents = [Document(text="Hi")]
gpt35turbo_predictor = OpenAI(
    temperature=0,
    model="gpt-3.5-turbo-0613",
    api_key="your-openai-api-key",
)
gpt35_sc = ServiceContext.from_defaults(
    llm=gpt35turbo_predictor,
    chunk_size=512,
)

# Create the TreeIndex and Retriever
gpt35_tree_index = TreeIndex.from_documents(documents, service_context=gpt35_sc)
retriever = TreeSelectLeafRetriever(index=gpt35_tree_index, child_branch_factor=2)

# Initialize the RetrieverQueryEngine
query_engine = RetrieverQueryEngine(retriever=retriever)

# Register the query ID in the appropriate dictionaries
# (plain dicts here for illustration; use whatever mapping your application keeps)
retriever_dict = {}
query_engine_dict = {}
query_id = "f53e054e-57f1-418e-9b2e-a9b22050d6b8"
retriever_dict[query_id] = retriever
query_engine_dict[query_id] = query_engine

# Now you can use the query engine to process queries
query_bundle = QueryBundle(query_str="Your query here")
response = query_engine.query(query_bundle)

Ensure that the query_id is consistent and correctly registered in both dictionaries before invoking the query engine. This should resolve the ValueError related to the missing query ID [1][2][3].

To continue talking to Dosu, mention @dosu.

logan-markewich commented 2 months ago

@chirantandatta if you are using Elasticsearch, everything is saved inside Elasticsearch. In fact, you'll probably find the files in the saved folder rather empty.

No need to call persist. Just create the vector store object, and do VectorStoreIndex.from_vector_store(vector_store).
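A minimal sketch of that pattern, reusing the index name and URL from the original question:

from llama_index.core import VectorStoreIndex
from llama_index.vector_stores.elasticsearch import ElasticsearchStore

# Reconnect to the existing index; the embeddings already live in Elasticsearch,
# so there is nothing to load from the local persist directory.
vector_store = ElasticsearchStore(
    index_name="dense_index_new",
    es_url="http://localhost:9200",
)

# Rebuild the index object directly from the vector store; no persist()/load step.
vector_index = VectorStoreIndex.from_vector_store(vector_store)
query_engine = vector_index.as_query_engine()

With this approach, the other person only needs access to the same Elasticsearch instance, not the elastic_db folder.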