run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Question]: Knowledge Graph search using embeddings gives an empty result #14818

Open shalinanvai opened 1 month ago

shalinanvai commented 1 month ago


Question

I am trying to use retriever_mode="embedding" in the query sent to the query engine. This works when I first create the knowledge graph. But when I load the graph in a later run (after the knowledge graph has already been created), the queries return an empty response: no triples are retrieved.

Here is how I am creating the database and then querying it:

import kuzu
from llama_index.core import KnowledgeGraphIndex, StorageContext
from llama_index.graph_stores.kuzu import KuzuGraphStore

# llm_openai is an LLM instance defined elsewhere (not shown here).
db = kuzu.Database("llamaindex_crypto_10")
graph_store = KuzuGraphStore(db)
storage_context = StorageContext.from_defaults(graph_store=graph_store)
index = KnowledgeGraphIndex(
    [],
    storage_context=storage_context,
    include_embeddings=True,
    max_triplets_per_chunk=25,
    llm=llm_openai
)

query_engine = index.as_query_engine(
    response_mode="tree_summarize",
    verbose=False,
    llm=llm_openai,
    include_text=False,
    max_keywords_per_query=20,
    retriever_mode="embedding",
    use_global_node_triplets=True,
    max_knowledge_sequence=15,
    num_chunks_per_query=10,
    similarity_top_k=10,
)

I am inserting the knowledge graph triples individually using the following code: index.upsert_triplet_and_node(tuple1, node1, include_embeddings=True)

I tried Kuzu, Neo4j, and NebulaGraph, and all of them have this problem. Do I have to pass in a vector store index somewhere, or am I missing some other parameter?

dosubot[bot] commented 1 month ago

Hey @shalinanvai! I'm here to help you with any bugs, questions, or contributions. Let's work together to solve this issue!

The issue you're encountering is due to the _retrieve_embedding method not being implemented in the KnowledgeGraphIndex class. This method is essential for embedding-based retrieval and currently raises a NotImplementedError, indicating that the functionality is not yet available.

Here is the relevant part of the code showing the method is not implemented:

def _retrieve_embedding(self, query_bundle: QueryBundle) -> List[NodeWithScore]:
    """Retrieve in embedding mode."""
    if self._retriever_mode not in ["embedding", "keyword_embedding"]:
        return []
    # TBD: will implement this later with vector store.
    raise NotImplementedError

To resolve this issue, you need to implement the _retrieve_embedding method to handle embedding-based retrieval. Until this method is implemented, the retriever_mode="embedding" will not function correctly when loading an already created knowledge graph [1][2].
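For orientation, here is a minimal, standard-library-only sketch of what embedding-based retrieval over stored triplet embeddings generally involves: score every stored embedding against the query embedding and keep the best matches. It is an illustration under stated assumptions, not LlamaIndex code. The names cosine_similarity, top_k_triplets, and stored are hypothetical, and the vectors are made up.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_triplets(
    query_embedding: List[float],
    embedding_dict: Dict[str, List[float]],
    similarity_top_k: int = 2,
) -> List[Tuple[str, float]]:
    """Return up to similarity_top_k (triplet_text, score) pairs, best first."""
    scored = [
        (text, cosine_similarity(query_embedding, emb))
        for text, emb in embedding_dict.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:similarity_top_k]


# Hypothetical triplet embeddings, made up for illustration only.
stored = {
    "('bitcoin', 'is a', 'cryptocurrency')": [1.0, 0.0],
    "('ethereum', 'supports', 'smart contracts')": [0.0, 1.0],
    "('bitcoin', 'uses', 'proof of work')": [0.8, 0.6],
}
hits = top_k_triplets([1.0, 0.0], stored, similarity_top_k=2)
```

Note that when the embedding dictionary is empty, the function returns an empty list, so a retriever built this way has nothing to return if no embeddings are available at query time.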

To continue talking to Dosu, mention @dosu.

shalinanvai commented 1 month ago

@dosu It appears that the embedding based retrieval is correctly implemented in the retriever_mode="hybrid" mode. Is that accurate?

dosubot[bot] commented 1 month ago

Yes, embedding-based retrieval is correctly implemented in the retriever_mode="hybrid" mode for the KnowledgeGraphIndex class.

Here is the relevant code snippet, which supports this:

if (
    self._retriever_mode != KGRetrieverMode.KEYWORD
    and len(self._index_struct.embedding_dict) > 0
):
    query_embedding = self._service_context.embed_model.get_text_embedding(
        query_bundle.query_str
    )
    all_rel_texts = list(self._index_struct.embedding_dict.keys())

    rel_text_embeddings = [
        self._index_struct.embedding_dict[_id] for _id in all_rel_texts
    ]
    similarities, top_rel_texts = get_top_k_embeddings(
        query_embedding,
        rel_text_embeddings,
        similarity_top_k=self.similarity_top_k,
        embedding_ids=all_rel_texts,
    )
    logger.debug(
        f"Found the following rel_texts+query similarites: {similarities!s}"
    )
    logger.debug(f"Found the following top_k rel_texts: {rel_texts!s}")
    rel_texts.extend(top_rel_texts)

elif len(self._index_struct.embedding_dict) == 0:
    logger.warning(
        "Index was not constructed with embeddings, skipping embedding usage..."
    )

# remove any duplicates from keyword + embedding queries
if self._retriever_mode == KGRetrieverMode.HYBRID:
    rel_texts = list(set(rel_texts))
    # remove shorter rel_texts that are substrings of longer rel_texts
    rel_texts.sort(key=len, reverse=True)
    for i in range(len(rel_texts)):
        for j in range(i + 1, len(rel_texts)):
            if rel_texts[j] in rel_texts[i]:
                rel_texts[j] = ""
    rel_texts = [rel_text for rel_text in rel_texts if rel_text != ""]
    # truncate rel_texts
    rel_texts = rel_texts[: self.max_knowledge_sequence]

Additionally, the KGTableRetriever class is designed to support different retrieval modes, including HYBRID, as specified in its initialization parameters and the KGRetrieverMode enum. This ensures that the hybrid retrieval mode is correctly implemented and functional [1][2][3].
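The hybrid post-processing step in the snippet above (deduplicate, drop texts that are substrings of longer ones, then truncate) can be tried in isolation. The following is a standalone sketch of that logic, not the library's code: dedupe_rel_texts is a hypothetical helper name, and unlike the original it breaks length ties alphabetically so the output order is deterministic.

```python
from typing import List


def dedupe_rel_texts(rel_texts: List[str], max_knowledge_sequence: int) -> List[str]:
    """Drop duplicates and substrings of longer texts, then truncate."""
    # Longest first, so any text that could contain another is seen earlier.
    # Ties are broken alphabetically (an assumption added for determinism).
    unique = sorted(set(rel_texts), key=lambda t: (-len(t), t))
    kept: List[str] = []
    for text in unique:
        # Keep a text only if it is non-empty and not contained in a kept one.
        if text and not any(text in longer for longer in kept):
            kept.append(text)
    return kept[:max_knowledge_sequence]


texts = [
    "alice knows bob",
    "alice knows",
    "alice knows bob",
    "bob likes tea",
    "tea",
]
result = dedupe_rel_texts(texts, max_knowledge_sequence=10)
```

Here "alice knows" and "tea" are dropped because each is contained in a longer text, and the duplicate "alice knows bob" is collapsed to a single entry.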
