run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License
36.02k stars 5.13k forks

(blocking) KnowledgeGraphRAGRetriever: when will hybrid search or keyword_embedding be supported? #9560

Closed · younes-io closed this issue 6 months ago

younes-io commented 10 months ago

Question

As I've been looking through the docs, I realized that the KnowledgeGraphRAGRetriever doesn't yet support the keyword_embedding option, which is a blocker for my project. I'd like to know how to leverage my NebulaGraph DB together with embeddings to get the best possible results from my RAG.

dosubot[bot] commented 10 months ago

🤖

Hello again, @younes-io! It's good to see you back in our LlamaIndex repository. Let's dive into your issue.

Based on the information provided, the current version of LlamaIndex (v0.9.15.post2) does support hybrid search for its Postgres vector store. This is evident from the get_data_model function in the postgres.py file, where a check is performed on the hybrid_search parameter: if hybrid_search is set to True, a HybridAbstractData class is created with additional attributes to support hybrid search. You can find the relevant code in the postgres.py file.
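For intuition on what "hybrid search" usually means here: a keyword-based ranking and an embedding-based ranking are produced separately and then fused into one result list. A common fusion method is reciprocal rank fusion (RRF). The sketch below is self-contained and illustrative only; the function and document names are hypothetical and not LlamaIndex API:

```python
from typing import Dict, List

def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document ids via reciprocal rank fusion.

    Each document's fused score is the sum over rankings of 1 / (k + rank),
    where rank is its 1-based position in that ranking.
    """
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical keyword and embedding result lists for the same query
keyword_hits = ["doc_a", "doc_b", "doc_c"]
embedding_hits = ["doc_b", "doc_d", "doc_a"]
fused = reciprocal_rank_fusion([keyword_hits, embedding_hits])
```

Documents that appear high in both rankings (doc_b here) float to the top, which is the behavior hybrid search is after.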

As for the keyword_embedding option, it is supported by the KnowledgeGraphRAGRetriever module in LlamaIndex: the retriever_mode argument of the KnowledgeGraphRAGRetriever class's __init__ method can take the value "keyword_embedding", among other options. You can find the relevant code in the retrievers.py file.

However, I couldn't find any evidence in the provided context to suggest that LlamaIndex supports keyword_embedding in any of its other modules. If you could provide more specific details about the module or feature where you want to use keyword_embedding, I might be able to provide a more accurate answer.

I hope this helps! If you have any other questions, feel free to ask.

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

younes-io commented 10 months ago

@dosu-bot I'm using NebulaGraph, and by the way, I found this in KnowledgeGraphRAGRetriever:

def _retrieve_embedding(self, query_bundle: QueryBundle) -> List[NodeWithScore]:
    """Retrieve in embedding mode."""
    if self._retriever_mode not in ["embedding", "keyword_embedding"]:
        return []
    # TBD: will implement this later with vector store.
    raise NotImplementedError

async def _aretrieve_embedding(
    self, query_bundle: QueryBundle
) -> List[NodeWithScore]:
    """Retrieve in embedding mode."""
    if self._retriever_mode not in ["embedding", "keyword_embedding"]:
        return []
    # TBD: will implement this later with vector store.
    raise NotImplementedError

I need hybrid search for NebulaGraph, or failing that, another workaround; either way, I still need to use NebulaGraph for queries about relationships between entities, etc.

dosubot[bot] commented 10 months ago

🤖

Based on the current state of the LlamaIndex codebase, the _retrieve_embedding and _aretrieve_embedding methods in the KnowledgeGraphRAGRetriever class are indeed not implemented yet. As you've noticed, these methods are designed to support the "embedding" and "keyword_embedding" modes, but currently raise a NotImplementedError.

Unfortunately, I don't have any information on when these features will be implemented as it's not specified in the repository. However, I can suggest a potential workaround for your use case.

You could consider implementing these methods yourself. The comments in the code suggest that the implementation will involve a vector store. This could mean that the methods are intended to retrieve embeddings from a store of precomputed vector representations of the nodes in the knowledge graph. These embeddings could then be used to perform similarity searches to find nodes that are relevant to a given query.

Here's a rough sketch of how you might implement the _retrieve_embedding method:

def _retrieve_embedding(self, query_bundle: QueryBundle) -> List[NodeWithScore]:
    """Retrieve in embedding mode."""
    if self._retriever_mode not in ["embedding", "keyword_embedding"]:
        return []

    # Retrieve the query from the query bundle
    query = query_bundle.query

    # Compute the query's embedding using your chosen method
    query_embedding = compute_query_embedding(query)

    # Retrieve the top k most similar nodes from the vector store
    top_k_nodes = self.vector_store.most_similar(query_embedding, k=self.top_k)

    # Convert the results into NodeWithScore objects
    nodes_with_scores = [NodeWithScore(node=node, score=score) for node, score in top_k_nodes]

    return nodes_with_scores

This is a very simplified example; the actual implementation would depend on the specifics of your use case and the capabilities of your vector store. You would also need an analogous implementation for _aretrieve_embedding, the asynchronous version of _retrieve_embedding.
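To make the hypothetical vector_store.most_similar call above concrete, here is a self-contained cosine-similarity top-k sketch in plain Python, with no external store. All names here are illustrative, not LlamaIndex or NebulaGraph API:

```python
import math
from typing import List, Tuple

def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def most_similar(
    query_embedding: List[float],
    store: List[Tuple[str, List[float]]],
    k: int = 2,
) -> List[Tuple[str, float]]:
    """Return the k (node_id, score) pairs whose embeddings are closest to the query."""
    scored = [
        (node_id, cosine_similarity(query_embedding, emb)) for node_id, emb in store
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# A toy "vector store": node ids paired with precomputed embeddings
store = [("n1", [1.0, 0.0]), ("n2", [0.0, 1.0]), ("n3", [0.7, 0.7])]
top = most_similar([1.0, 0.1], store, k=2)
```

A real implementation would delegate the similarity search to the vector store itself rather than scanning in Python, but the ranking logic is the same.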

I hope this helps! If you have any further questions or need more clarification, feel free to ask. Dosu


younes-io commented 10 months ago

@dosu-bot Then, provide a proposal for the NebulaGraphStore

dosubot[bot] commented 10 months ago

🤖

To implement the _retrieve_embedding and _aretrieve_embedding methods in the KnowledgeGraphRAGRetriever class for a NebulaGraphStore to support "embedding" and "keyword_embedding" modes in NebulaGraph with LlamaIndex, you can follow the steps below:

  1. _retrieve_embedding method: This method will be used to retrieve the embeddings for a given query. You can use the get_top_k_embeddings function from llama_index.indices.query.embedding_utils to get the top K embeddings for the query. The method can be implemented as follows:
def _retrieve_embedding(self, query_bundle: QueryBundle) -> List[NodeWithScore]:
    """Get nodes for response using embeddings."""
    embeddings = get_top_k_embeddings(query_bundle.query_str, self.similarity_top_k)
    node_ids = self._index_struct.search_node_by_embedding(embeddings)
    nodes = self._docstore.get_nodes(node_ids)
    nodes_with_scores = [NodeWithScore(node=node, score=DEFAULT_NODE_SCORE) for node in nodes]
    return nodes_with_scores
  2. _aretrieve_embedding method: This is the asynchronous counterpart, again using the get_top_k_embeddings function from llama_index.indices.query.embedding_utils. It can be implemented as follows:
async def _aretrieve_embedding(self, query_bundle: QueryBundle) -> List[NodeWithScore]:
    """Get nodes for response using embeddings asynchronously."""
    embeddings = get_top_k_embeddings(query_bundle.query_str, self.similarity_top_k)
    node_ids = await self._index_struct.search_node_by_embedding(embeddings)
    nodes = await self._docstore.get_nodes(node_ids)
    nodes_with_scores = [NodeWithScore(node=node, score=DEFAULT_NODE_SCORE) for node in nodes]
    return nodes_with_scores

Please note that the get_top_k_embeddings function and the search_node_by_embedding method are hypothetical and need to be implemented according to your specific requirements and the structure of your data. The get_top_k_embeddings function should return the top K embeddings for a given query, and the search_node_by_embedding method should return the node IDs that have the closest embeddings to the given embeddings.

Regarding the QueryBundle object: it is a data structure that encapsulates the information needed to perform a query in the LlamaIndex system. Although the exact structure of QueryBundle is not spelled out in the given context, we can infer some of its properties from how it is used in several methods.

The _retrieve_embedding and _aretrieve_embedding methods are not present in the given context, but the _retrieve and _aretrieve methods are; they use the QueryBundle object to perform synchronous and asynchronous retrieval of nodes, respectively. The QueryBundle object is passed as an argument to these methods, and its properties are used to retrieve the appropriate embeddings and perform the query.

You can find more details in the llama_index/indices/knowledge_graph/retrievers.py and llama_index/indices/multi_modal/retriever.py files.

Please let me know if you need further clarification or assistance.


younes-io commented 10 months ago

@dosu-bot Okay, what about keywords? Are they case-sensitive? If the user says "hello", is "Hello" part of the retrieved results or not? If not, what can I do to configure NebulaGraph to be case-insensitive?

dosubot[bot] commented 10 months ago

🤖

In the LlamaIndex framework, the case sensitivity of keywords largely depends on the configuration of the underlying database, in this case, NebulaGraph. However, I couldn't find specific information about this in the LlamaIndex repository.

To better assist you, could you please provide more details about your current setup? Specifically, it would be helpful to know the version of NebulaGraph you're using and any relevant configuration settings you've applied. Additionally, if you could share a sample of the query you're using, that would be very beneficial.

Once I have this information, I'll be able to provide a more accurate and helpful response.
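In the meantime, one common workaround is to normalize case on the application side: lowercase keywords at both index time and query time, so that "hello" and "Hello" match regardless of how the underlying graph store compares strings. A minimal sketch of that idea (illustrative only, not NebulaGraph configuration):

```python
from typing import List, Set

def normalize_keywords(keywords: List[str]) -> Set[str]:
    """Casefold and strip keywords so matching is case-insensitive.

    str.casefold() is a more aggressive lower() that also handles
    locale-specific cases (e.g. German sharp s).
    """
    return {kw.casefold().strip() for kw in keywords}

# Keywords normalized once when stored, and again at query time
indexed = normalize_keywords(["Hello", "NebulaGraph", "Entity"])
query = normalize_keywords(["hello"])
matches = query & indexed
```

Applied in a custom retriever, this makes keyword matching case-insensitive without touching the database configuration at all.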

