run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Bug]: Cannot construct neo4j knowledge graph with embeddings #10767

Open j0sephsasson opened 4 months ago

j0sephsasson commented 4 months ago

Bug Description

There seems to be a disconnect between the knowledge graph construction and using embeddings when using an external graph store (such as neo4j in this case).

I am trying to construct a neo4j knowledge graph; when calling 'from_documents', I set include_embeddings=True.

When I use the index to query, the logs will print 'index not constructed with embeddings', even though it was.

However, when I use a 'SimpleGraphStore' and persist the storage context locally it works fine, and utilizes the embeddings.

I have many thousands of documents, so persisting locally and loading the index for each request is not feasible. How can I use a neo4j knowledge graph and also utilize embeddings?
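For reference, the locally-persisted flow that does pick up the embeddings looks roughly like this (a sketch, not the exact code from the report; ./kg_storage is a hypothetical directory, and docs / service_context are defined as in the steps below):

from llama_index import KnowledgeGraphIndex, StorageContext, load_index_from_storage
from llama_index.graph_stores import SimpleGraphStore

# Build the index against a local SimpleGraphStore and persist it
storage_context = StorageContext.from_defaults(graph_store=SimpleGraphStore())
index = KnowledgeGraphIndex.from_documents(
    docs,
    storage_context=storage_context,
    service_context=service_context,
    include_embeddings=True,
)
storage_context.persist(persist_dir="./kg_storage")  # hypothetical path

# In a later session, reload the persisted index (embeddings are found in this flow)
storage_context = StorageContext.from_defaults(persist_dir="./kg_storage")
index = load_index_from_storage(storage_context, service_context=service_context)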

Version

0.9.5

Steps to Reproduce

import os

# llama-index 0.9.x imports
from llama_index import KnowledgeGraphIndex, ServiceContext, StorageContext
from llama_index.embeddings import AzureOpenAIEmbedding
from llama_index.graph_stores import Neo4jGraphStore
from llama_index.llms import AzureOpenAI

username = "neo4j"
password = "<pw>"
url = "<url>"
database = "neo4j"

graph_store = Neo4jGraphStore(
    username=username,
    password=password,
    url=url,
    database=database,
)

storage_context = StorageContext.from_defaults(graph_store=graph_store)

llm = AzureOpenAI(engine="GPT4x32k", model="gpt-4-32k")

embedding_llm = AzureOpenAIEmbedding(
    model="text-embedding-ada-002",
    deployment_name="Embedding",
    api_key=os.getenv('OPENAI_API_KEY'),
    azure_endpoint=os.getenv('AZURE_OPENAI_ENDPOINT'),
    api_version=os.getenv('OPENAI_API_VERSION'),
)

service_context = ServiceContext.from_defaults(embed_model=embedding_llm, llm=llm)

# docs: list of Document objects loaded elsewhere
index = KnowledgeGraphIndex.from_documents(
    docs,
    storage_context=storage_context,
    service_context=service_context,
    include_embeddings=True,
)

NEW SESSION

index = KnowledgeGraphIndex.from_documents(
    [],
    storage_context=storage_context,
    service_context=service_context,
    include_embeddings=True
)

query_engine = index.as_query_engine(
    include_text=True,
    response_mode="tree_summarize",
    embedding_mode="hybrid",
    similarity_top_k=5,
)

response = query_engine.query(
    "<query>"
)

**LOGS:**
WARNING:llama_index.indices.knowledge_graph.retrievers:Index was not constructed with embeddings, skipping embedding usage...

Relevant Logs/Tracebacks

No response

dosubot[bot] commented 4 months ago

🤖

Hey there @j0sephsasson! Great to see you back. Hope you're doing well today.

Based on the information you've provided, you're encountering a warning that says 'index not constructed with embeddings' when querying, even though the index was constructed with embeddings. This warning is likely raised because the vector stores for text and images are empty, which can happen if no nodes are passed to the _add_nodes_to_index or _async_add_nodes_to_index methods, or if the nodes passed do not contain any text or images.

To resolve this issue, ensure that the nodes you are passing to the index contain either text or images. You can check the nodes by printing them out before passing them to the index. If the nodes do not contain any text or images, you may need to preprocess your data to ensure that it is in the correct format.
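A minimal check along those lines (a sketch, assuming docs is the list of Document objects being passed to from_documents):

for doc in docs:
    # Confirm each document actually carries text before indexing
    print(doc.doc_id, len(doc.text))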

Regarding the use of a neo4j knowledge graph with embeddings without persisting locally, LlamaIndex does not seem to have built-in support for neo4j. However, you could potentially use the neo4j Python driver to fetch data from your neo4j database, convert it into the BaseNode or ImageNode format expected by LlamaIndex, and then pass these nodes to the index.

Here is a simplified example of how you might do this:

from neo4j import GraphDatabase
from llama_index.schema import TextNode  # TextNode is a concrete BaseNode

# Initialize the neo4j driver
driver = GraphDatabase.driver("<neo4j_url>", auth=("<username>", "<password>"))

# Fetch data from neo4j and convert it into node format
# (the result must be consumed while the session is still open;
# this assumes each graph node has a "text" property)
with driver.session() as session:
    result = session.run("MATCH (n) RETURN n")
    nodes = [
        TextNode(id_=str(record["n"].id), text=record["n"]["text"])
        for record in result
    ]

# Pass the nodes directly to the index (from_documents expects Document objects)
index = KnowledgeGraphIndex(
    nodes,
    storage_context=storage_context,
    service_context=service_context,
    include_embeddings=True,
)

Please note that this requires a good understanding of both the structure of your neo4j database and the LlamaIndex codebase; the above is a simplified example and may not work as-is depending on how your data is modeled.

I hope this helps! If you have any further questions, please don't hesitate to ask.


j0sephsasson commented 4 months ago

@dosu-bot

If I specify an external vector store (cog search, for example), and pass the vector & graph store to storage context, then run this code:

index = KnowledgeGraphIndex.from_documents(
    docs,
    storage_context=storage_context,
    service_context=service_context,
    include_embeddings=True
)

Only the graph store is created. Is it possible to construct a graph with embeddings and use external stores (not local, like neo4j and cog search)?
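For context, "pass the vector & graph store to storage context" corresponds roughly to the following sketch (vector_store here stands in for an Azure Cognitive Search vector store instance, not code from the report):

storage_context = StorageContext.from_defaults(
    graph_store=graph_store,    # e.g. the Neo4jGraphStore from above
    vector_store=vector_store,  # e.g. an Azure Cognitive Search vector store
)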

dosubot[bot] commented 4 months ago

🤖

In version 0.9.5 of LlamaIndex, constructing a knowledge graph with embeddings using external stores like neo4j and Azure Cognitive Search is not directly supported. The current code only supports creating a graph store using the GraphStore class, which stores the graph data in memory.

However, you can extend the functionality of LlamaIndex to support external stores like neo4j and Azure Cognitive Search by creating custom classes that inherit from the GraphStore class and override its methods to interact with the external stores.

Here is a basic example of how you might create a custom class for neo4j:

from llama_index.legacy.graph_stores.types import GraphStore
from neo4j import GraphDatabase

class Neo4jGraphStore(GraphStore):
    def __init__(self, uri, user, password):
        self._driver = GraphDatabase.driver(uri, auth=(user, password))

    def close(self):
        self._driver.close()

    def upsert_triplet(self, subj, rel, obj):
        # MERGE the subject and object nodes, then the relationship between them
        with self._driver.session() as session:
            session.run(
                "MERGE (a:Subject {name: $subj}) "
                "MERGE (b:Object {name: $obj}) "
                "MERGE (a)-[r:RELATIONSHIP {name: $rel}]->(b)",
                subj=subj, rel=rel, obj=obj,
            )

This Neo4jGraphStore class connects to a neo4j database and overrides the upsert_triplet method to insert a triplet into the neo4j database. You would need to implement the other methods of the GraphStore class in a similar way.
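For instance, a rough sketch of the get method, which the GraphStore interface expects to return the outgoing [rel, obj] pairs for a subject (the labels and property names follow the simplified schema above and are assumptions):

    def get(self, subj):
        # Return outgoing [relationship, object] pairs for the given subject
        with self._driver.session() as session:
            result = session.run(
                "MATCH (a:Subject {name: $subj})-[r:RELATIONSHIP]->(b:Object) "
                "RETURN r.name AS rel, b.name AS obj",
                subj=subj,
            )
            return [[record["rel"], record["obj"]] for record in result]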

Please note that this is a simplified example and might not cover all your needs. You would need to handle errors, manage transactions, and possibly optimize the database queries for your specific use case. Also, you would need to do a similar process for Azure Cognitive Search.

As for the changes made in the LlamaIndex codebase from version 0.9.5 to the latest version v0.10.3, I wasn't able to find specific details within the repository. It's possible that the changes are documented elsewhere or I may have missed them. If you have more specific questions about the changes or need further assistance, please provide more details.

You can find more information about the GraphStore class and its methods in the llama-index-legacy/llama_index/legacy/indices/knowledge_graph/base.py file.


muddassir-lateef commented 2 months ago

@j0sephsasson I got the same problem. Were you able to solve it somehow?

SatheeshJM commented 2 months ago

Running into the same problem.

Trying to store embeddings along with the nodes in neo4j. I can see in the logs that the embeddings are generated, but they don't seem to be stored.

Here is my code

from llama_index.core import KnowledgeGraphIndex, StorageContext, ServiceContext
from llama_index.readers.web import SimpleWebPageReader
from llama_index.graph_stores.neo4j import Neo4jGraphStore
from langchain_openai import ChatOpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

documentUrls = ["http://some-random-url"]
openAIApiKey = "MY_OPEN_AI_API_KEY"
LLAMA_GRAPH_STORE = Neo4jGraphStore(
    username="myUsername",
    password="myPassword",
    url="neo4jUrl",
)

llm = ChatOpenAI(model="gpt-3.5-turbo-16k", openai_api_key=openAIApiKey)
embedLLM = OpenAIEmbedding(model="text-embedding-3-large", api_key=openAIApiKey)

storage_context = StorageContext.from_defaults(graph_store=LLAMA_GRAPH_STORE)
service_context = ServiceContext.from_defaults(llm=llm, embed_model=embedLLM)
documents = SimpleWebPageReader(html_to_text=True).load_data(documentUrls)

KnowledgeGraphIndex.from_documents(
    documents=documents,
    show_progress=True,
    storage_context=storage_context,
    service_context=service_context,
    include_embeddings=True,
)

I can see embeddings being generated in the output (screenshot omitted), but I can't find them in my local neo4j.