run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Question]: No embedding in query_engine result #16786

Open JohnTaco93 opened 2 weeks ago

JohnTaco93 commented 2 weeks ago


Question

I have the code below and it runs well. However, when I inspect the `response` variable, the `embedding` field is `None`. This is strange, since when I inspect the knowledge graph (the `index` variable) there are embedding vectors. Also, the `response` variable shows a `score` of 1000, which seems wrong since a cosine similarity should never be that high. I want to know the scores used for the top 5 nodes.
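For context, a genuine cosine similarity is always bounded to [-1, 1], which is why 1000 looks wrong. A minimal pure-Python check, independent of llama_index:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# By Cauchy-Schwarz the result lies in [-1.0, 1.0],
# so a "score" of 1000.0 cannot be a cosine similarity.
print(cosine_similarity([3.0, 4.0], [3.0, 4.0]))  # identical vectors -> 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors -> 0.0
```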

Code

```python
from llama_index.core import Document, VectorStoreIndex
from llama_index.core import StorageContext
from llama_index.core import Settings
from llama_index.core import SimpleDirectoryReader, KnowledgeGraphIndex
from llama_index.core.graph_stores import SimpleGraphStore
from llama_index.core.response_synthesizers import ResponseMode
from llama_index.llms.openai import OpenAI
from IPython.display import Markdown, display

import openai
# Aliased so it does not shadow llama_index's OpenAI LLM class imported above
from openai import OpenAI as OpenAIClient

import os
import sys
from collections import Counter
import json

sys.path.append('../..')

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())  # read local .env file

openai.api_key = os.environ['OPENAI_API_KEY']
client = OpenAIClient()

# bitbucket/ja-chatbot-resources/John/john_personal_development/KG/small_docs
documents = SimpleDirectoryReader("d:/KG/small_docs").load_data()

llm = OpenAI(temperature=0, model="gpt-4o-mini")
Settings.llm = llm
Settings.chunk_size = 512

graph_store = SimpleGraphStore()
storage_context = StorageContext.from_defaults(graph_store=graph_store)

from llama_index.embeddings.openai import OpenAIEmbedding
embedding_model = OpenAIEmbedding(model='text-embedding-3-small')  # or another embedding model

index = KnowledgeGraphIndex.from_documents(
    documents,
    max_triplets_per_chunk=4,
    storage_context=storage_context,
    embedding_model=embedding_model,  # ensure embeddings are generated
    include_embeddings=True,
)

query_engine = index.as_query_engine(
    include_text=True,
    response_mode="tree_summarize",
    embedding_mode="dense",
    similarity_top_k=5,
    include_embeddings=True,
    return_source_nodes=True,
)

response = query_engine.query('Who are the people?')
print(response.response)
```

`response` variable (excerpt):

```
[NodeWithScore(node=TextNode(id_='222c69a6-266f-4b04-85bc-1801377ce8e9', embedding=None, metadata={'file_path': 'd:\bitbucket\ja-chatbot-resources\John\john_personal_development\KG\small_docs\document2.txt', 'file_name': 'document2.txt', 'file_type': 'text/plain', 'file_size': 88, 'creation_date': '2024-07-25', 'last_modified_date': '2024-07-16'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='d0a85289-2b8d-4c00-953a-f764761288ad', node_type=<ObjectType.DOCUMENT: '4'>, metadata={'file_path': 'd:\bitbucket\ja-chatbot-resources\John\john_personal_development\KG\small_docs\document2.txt', 'file_name': 'document2.txt', 'file_type': 'text/plain', 'file_size': 88, 'creation_date': '2024-07-25', 'last_modified_date': '2024-07-16'}, hash='060b4b2ded6ec94ca755735f78cd4cfa3b18eca7e2eea124677198d0ca75a1b4')}, text='Bob works at Acme Corp, which has an office in New York. Bob is also friends with Alice.', mimetype='text/plain', start_char_idx=0, end_char_idx=88, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n'), score=1000.0),
```

dosubot[bot] commented 2 weeks ago

I found a similar open issue that might be relevant to your problem: No embeddings from DocumentSummaryIndexEmbeddingRetriever?. This issue is still open and was last updated on October 29, 2024.

The problem with missing embeddings and scores could be due to the embeddings being generated in a lazy fashion. If the nodes do not have precomputed embeddings, they are generated on-the-fly during retrieval. If there is an issue with the embedding model or the embedding generation process, it could result in missing embeddings and scores. The embeddings are generated using the `_get_embeddings` method, which checks if a node's embedding is `None` and then generates it using the embedding model. If this process fails or is not executed, embeddings and scores will be missing.
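The lazy pattern described above can be sketched in plain Python. This is only an illustration: `fake_embed` is a stub standing in for a real embedding model, and `get_embeddings` mimics (not reproduces) what llama_index's internal `_get_embeddings` does.

```python
def fake_embed(text):
    # Stand-in for a real embedding model call (illustrative only).
    return [float(len(text)), float(sum(map(ord, text)) % 97)]

class Node:
    """Minimal stand-in for a llama_index node."""
    def __init__(self, text):
        self.text = text
        self.embedding = None  # not precomputed

def get_embeddings(nodes, embed_fn):
    """Lazy behaviour: embed only nodes whose embedding is still None."""
    for node in nodes:
        if node.embedding is None:
            node.embedding = embed_fn(node.text)
    return [node.embedding for node in nodes]

nodes = [Node("Bob works at Acme Corp."), Node("Alice lives in New York.")]
embeddings = get_embeddings(nodes, fake_embed)
print(all(e is not None for e in embeddings))  # -> True
```

If the embedding step fails or is skipped, `node.embedding` simply stays `None`, which matches the `embedding=None` seen in the response dump.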

Additionally, the DocumentSummaryIndexEmbeddingRetriever class does not directly return similarity scores. To obtain these scores, you would need to calculate the similarity between the query and each node and then pass this score when creating the NodeWithScore objects. This involves modifying the _retrieve method and potentially other parts of the class.
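Concretely, "calculate the similarity and pass it when creating the NodeWithScore objects" might look like the sketch below. `ScoredNode` is a minimal stand-in for llama_index's `NodeWithScore`, and the vectors are made up for illustration:

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

class ScoredNode:
    """Minimal stand-in for llama_index's NodeWithScore."""
    def __init__(self, node_id, embedding, score):
        self.node_id = node_id
        self.embedding = embedding
        self.score = score

# Made-up query and candidate embeddings.
query_embedding = [0.1, 0.9, 0.2]
candidates = {"node-a": [0.1, 0.8, 0.3], "node-b": [0.9, 0.1, 0.1]}

# Score each candidate against the query and rank by similarity.
scored = [
    ScoredNode(nid, emb, cosine_similarity(query_embedding, emb))
    for nid, emb in candidates.items()
]
scored.sort(key=lambda n: n.score, reverse=True)
print([(n.node_id, round(n.score, 3)) for n in scored])
```

With real scores computed this way, every `score` lands in [-1, 1] instead of a fixed 1000.0.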

Regarding the score of 1000, I couldn't find specific information on how this score is calculated in the context of the query engine. It might be a default or placeholder value used when the actual similarity score cannot be computed; the knowledge graph retriever appears to assign a fixed default node score of 1000.0 rather than a computed cosine similarity. You might want to check the implementation details of the scoring mechanism in your setup.


logan-markewich commented 2 weeks ago

Embeddings are typically removed from the response to save time and memory.

However, the knowledge graph index is deprecated/unmaintained at the moment. I would use the property graph index instead: https://docs.llamaindex.ai/en/stable/module_guides/indexing/lpg_index_guide/#using-a-property-graph-index

A quick example, in addition to the many on the docs page above https://colab.research.google.com/drive/1QPUjhFwZ-6azdhfteqcznhx5uIBxTMsf?usp=sharing
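For reference, a minimal migration sketch along the lines of that example, assuming an OpenAI API key is configured in the environment (not run here; see the linked guide for the full API):

```python
from llama_index.core import PropertyGraphIndex, SimpleDirectoryReader
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI

documents = SimpleDirectoryReader("d:/KG/small_docs").load_data()

# PropertyGraphIndex is the maintained replacement for KnowledgeGraphIndex.
index = PropertyGraphIndex.from_documents(
    documents,
    llm=OpenAI(model="gpt-4o-mini", temperature=0),
    embed_model=OpenAIEmbedding(model="text-embedding-3-small"),
    show_progress=True,
)

query_engine = index.as_query_engine(include_text=True, similarity_top_k=5)
response = query_engine.query("Who are the people?")
print(response.response)

# Per-node retrieval scores for the top-k source nodes.
for node_with_score in response.source_nodes:
    print(node_with_score.node.node_id, node_with_score.score)
```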