run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai

[Bug]: Vector embedding not getting stored in Neo4j when passing include_embeddings=True #13185

Open ritzvik opened 2 months ago

ritzvik commented 2 months ago

Bug Description

As I build the knowledge graph, I am expecting the nodes in Neo4j to contain the embedding vector as a property.

But the nodes with label "Entity" contain only an id property. I'm following this example: https://docs.llamaindex.ai/en/stable/examples/index_structs/knowledge_graph/KnowledgeGraphDemo/

Instead of the OpenAI model suggested in the example, I'm using "TheBloke/Mistral-7B-Instruct-v0.2-GGUF" from huggingface and "thenlper/gte-large" as the embedding model.

Here are a few sample nodes retrieved from Neo4j with a Cypher query. The embedding vector is nowhere to be seen.

MATCH (n) RETURN n LIMIT 2;

{
  "identity": 0,
  "labels": [
    "Entity"
  ],
  "properties": {
    "id": "Paul graham"
  },
  "elementId": "4:7dcf7873-019c-40e4-bcdc-9b50ab418257:0"
}
{
  "identity": 1,
  "labels": [
    "Entity"
  ],
  "properties": {
    "id": "Writing"
  },
  "elementId": "4:7dcf7873-019c-40e4-bcdc-9b50ab418257:1"
}
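As an extra check (this query is my own addition, and the property name embedding is only my guess at what should have been written), counting "Entity" nodes that carry an embedding property can be done through the same graph store used in the reproduction code below:

result = graph_store.query(
    "MATCH (n:Entity) WHERE n.embedding IS NOT NULL "
    "RETURN count(n) AS nodes_with_embedding"
)
print(result)  # in my case this shows that no node carries an embedding property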

Version

0.1.4

Steps to Reproduce

Here is the sample code I used

# https://docs.llamaindex.ai/en/stable/examples/index_structs/knowledge_graph/KnowledgeGraphDemo/
import time
import torch
from llama_index.core import Settings
from huggingface_hub import hf_hub_download, snapshot_download
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.llama_cpp import LlamaCPP
from llama_index.llms.llama_cpp.llama_utils import (
    messages_to_prompt,
    completion_to_prompt,
)

from llama_index.core import KnowledgeGraphIndex, SimpleDirectoryReader
from llama_index.core import StorageContext
from llama_index.graph_stores.neo4j import Neo4jGraphStore
from IPython.display import Markdown, display

supported_embed_models = ["thenlper/gte-large"]

supported_llm_models = {
    "TheBloke/Mistral-7B-Instruct-v0.2-GGUF": "mistral-7b-instruct-v0.2.Q5_K_M.gguf",
    "microsoft/Phi-3-mini-4k-instruct-gguf": "Phi-3-mini-4k-instruct-q4.gguf",
}

model_name="TheBloke/Mistral-7B-Instruct-v0.2-GGUF"
embed_model_name="thenlper/gte-large"
temperature=0.0
max_new_tokens=256
context_window=4096
gpu_layers=20
dim=1024
memory_token_limit=4096
sentence_embedding_percentile_cutoff=0.8
similarity_top_k=2
hf_token="<Hugging-face-token>"

MODELS_PATH = "./models"
EMBED_PATH = "./embed_models"

n_gpu_layers = 0
if torch.cuda.is_available():
    print("It is a GPU node, setup GPU.")
    n_gpu_layers = gpu_layers

def get_model_path(model_name):
    filename = supported_llm_models[model_name]
    model_path = hf_hub_download(
        repo_id=model_name,
        filename=filename,
        resume_download=True,
        cache_dir=MODELS_PATH,
        local_files_only=False,
        token=hf_token,
    )
    return model_path

def get_embed_model_path(embed_model):
    embed_model_path = snapshot_download(
        repo_id=embed_model,
        resume_download=True,
        cache_dir=EMBED_PATH,
        local_files_only=False,
        token=hf_token,
    )
    return embed_model_path

llm = LlamaCPP(
    model_path=get_model_path(model_name),
    temperature=temperature,
    max_new_tokens=max_new_tokens,
    # context window passed to llama.cpp (4096 tokens here)
    context_window=context_window,
    # kwargs to pass to __call__()
    # generate_kwargs={"temperature": 0.0, "top_k": 5, "top_p": 0.95},
    generate_kwargs={"temperature": temperature},
    # kwargs to pass to __init__()
    # set to at least 1 to use GPU
    model_kwargs={"n_gpu_layers": n_gpu_layers},
    # transform inputs into Llama2 format
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    verbose=True,
)

embed_model = HuggingFaceEmbedding(
    model_name=embed_model_name,
    cache_folder=EMBED_PATH,
)

Settings.llm = llm
Settings.embed_model = embed_model
Settings.chunk_size = 1024

# copy the essay into the directory: https://github.com/run-llama/llama_index/blob/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt 
documents = SimpleDirectoryReader("./example_data").load_data()

graph_store = Neo4jGraphStore(
    username="<Neo4j-username>",
    password="<Neo4j-password>",
    url="<Neo4j-uri>",
    database="<Neo4j-database>",
)
graph_store.query("MATCH (n) DETACH DELETE n")

storage_context = StorageContext.from_defaults(graph_store=graph_store)

# NOTE: can take a while!
index = KnowledgeGraphIndex.from_documents(
    documents,
    storage_context=storage_context,
    max_triplets_per_chunk=2,
    include_embeddings=True,
    show_progress=True,    
)

query_engine1 = index.as_query_engine(
    include_text=False, response_mode="tree_summarize"
)
query_engine2 = index.as_query_engine(
    include_text=True, response_mode="tree_summarize"
)
query_engine3 = index.as_query_engine(
    include_text=True,
    response_mode="tree_summarize",
    embedding_mode="hybrid",
    similarity_top_k=5,
)

response = query_engine1.query("Tell me more about Interleaf")
display(Markdown(f"<b>{response}</b>"))

response = query_engine2.query(
    "Tell me more about what the author worked on at Interleaf"
)
display(Markdown(f"<b>{response}</b>"))

response = query_engine3.query(
    "Tell me more about what the author worked on at Interleaf"
)
display(Markdown(f"<b>{response}</b>"))
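A quick sanity check I can add here (a sketch, not part of my original script): confirm that the embedding model itself produces vectors, so the missing properties are not caused by the embedding step failing silently.

# Sanity check (sketch): thenlper/gte-large should return a plain Python list
# of 1024 floats, which Neo4j could store as a node property.
vec = embed_model.get_text_embedding("Interleaf")
print(type(vec), len(vec))  # expected: <class 'list'> 1024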

Relevant Logs/Tracebacks

No response

dosubot[bot] commented 2 months ago

The issue you're encountering, where embedding vectors are not stored on Neo4j nodes despite setting include_embeddings=True, could stem from a few areas. Here are some avenues to troubleshoot and resolve it:

  1. Implementation Details: Ensure that the method responsible for storing data in Neo4j (likely within the Neo4jGraphStore class or similar) correctly handles embedding vectors. The code should explicitly include logic to store embeddings as properties of nodes. If this logic is missing or incorrect, the embeddings won't be saved.

  2. Embedding Format: Verify the format of the embeddings generated by your embedding model (thenlper/gte-large). The storage mechanism might expect a specific format (e.g., a list of floats) that differs from what the model produces.

  3. Model Compatibility: There could be compatibility issues between the LlamaIndex framework and the specific models you're using (TheBloke/Mistral-7B-Instruct-v0.2-GGUF for LLM and thenlper/gte-large for embeddings). Ensure that these models are fully compatible with the framework, especially regarding how embeddings are generated and handled.

  4. Neo4j Configuration: Neo4j has certain limitations and configurations that might affect the storage of complex data types like vectors. Check Neo4j's documentation for any constraints on property types and sizes. You might need to adjust your Neo4j schema or configuration to accommodate embedding vectors.

  5. Debugging Steps: Consider adding logging or debugging statements in the code that handles the embedding storage process. This can help identify where the process might be failing or whether the embeddings are being processed as expected (see the sketch after this list).
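A minimal sketch for points 1 and 5 (the interface described in the comments reflects the GraphStore API at the time of writing and should be re-checked against your installed version):

import inspect
import logging
from llama_index.graph_stores.neo4j import Neo4jGraphStore

# Point 5: surface warnings/debug output emitted while the index is built.
logging.basicConfig(level=logging.DEBUG)

# Point 1: check what the graph store's write path actually accepts. At the
# time of writing, upsert_triplet(subj, rel, obj) takes no embedding argument,
# so embeddings cannot be written to Neo4j through this interface.
print(inspect.signature(Neo4jGraphStore.upsert_triplet))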

By systematically checking these areas, you should be able to pinpoint the cause of the issue and implement a solution. If the problem persists, reviewing the documentation of the LlamaIndex framework or reaching out to its community for support might offer additional insights.

#### Details

**Sources**

- [Visualising Embeddings](https://github.com/run-llama/llama_index/discussions/10173)
- [[Bug]: Cannot construct neo4j knowledge graph with embeddings](https://github.com/run-llama/llama_index/issues/10767)
- [llama-index-integrations/vector_stores/llama-index-vector-stores-neo4jvector/llama_index/vector_stores/neo4jvector/base.py](https://github.com/run-llama/llama_index/blob/main/llama-index-integrations/vector_stores/llama-index-vector-stores-neo4jvector/llama_index/vector_stores/neo4jvector/base.py)
- [llama-index-legacy/llama_index/legacy/vector_stores/neo4jvector.py](https://github.com/run-llama/llama_index/blob/main/llama-index-legacy/llama_index/legacy/vector_stores/neo4jvector.py)
- [docs/docs/examples/vector_stores/Neo4jVectorDemo.ipynb](https://github.com/run-llama/llama_index/blob/main/docs/docs/examples/vector_stores/Neo4jVectorDemo.ipynb)
- [llama-index-legacy/llama_index/legacy/embeddings/llm_rails.py](https://github.com/run-llama/llama_index/blob/main/llama-index-legacy/llama_index/legacy/embeddings/llm_rails.py)
- [llama-index-legacy/llama_index/legacy/VERSION](https://github.com/run-llama/llama_index/blob/main/llama-index-legacy/llama_index/legacy/VERSION)
- [llama-index-integrations/embeddings/llama-index-embeddings-huggingface/llama_index/embeddings/huggingface/base.py](https://github.com/run-llama/llama_index/blob/main/llama-index-integrations/embeddings/llama-index-embeddings-huggingface/llama_index/embeddings/huggingface/base.py)
- [docs/docs/examples/embeddings/huggingface.ipynb](https://github.com/run-llama/llama_index/blob/main/docs/docs/examples/embeddings/huggingface.ipynb)
- [llama-index-integrations/embeddings/llama-index-embeddings-ipex-llm/llama_index/embeddings/ipex_llm/base.py](https://github.com/run-llama/llama_index/blob/main/llama-index-integrations/embeddings/llama-index-embeddings-ipex-llm/llama_index/embeddings/ipex_llm/base.py)
- [llama-index-legacy/llama_index/legacy/embeddings/huggingface.py](https://github.com/run-llama/llama_index/blob/main/llama-index-legacy/llama_index/legacy/embeddings/huggingface.py)
- [llama-index-legacy/llama_index/legacy/indices/knowledge_graph/base.py](https://github.com/run-llama/llama_index/blob/main/llama-index-legacy/llama_index/legacy/indices/knowledge_graph/base.py)
**About Dosu**

This response is meant to be useful and save you time. It is **not meant to be a precise solution**, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

To continue the conversation, mention @dosu.

logan-markewich commented 2 months ago

They don't get stored in Neo4j; they get stored in the index struct.

It's a janky design, but that's how it works, and it won't be changed anytime soon.

I'm working on a completely new index to overhaul this, though. The current knowledge graph index design is not great.
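For reference, a sketch of where the vectors end up when include_embeddings=True (the embedding_dict attribute name comes from the KG index struct and is worth verifying against your installed version):

# The embeddings live on the in-memory index struct, keyed by triplet, rather
# than on the Neo4j nodes.
kg_struct = index.index_struct
print(len(kg_struct.embedding_dict))            # number of embedded triplets
key = next(iter(kg_struct.embedding_dict))
print(key, len(kg_struct.embedding_dict[key]))  # e.g. a 1024-dim vector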