Open JohnTaco93 opened 2 weeks ago
I found a similar open issue that might be relevant to your problem: "No embeddings from DocumentSummaryIndexEmbeddingRetriever?". It is still open and was last updated on October 29, 2024.
The missing embeddings and scores could be due to embeddings being generated lazily. If the nodes do not have precomputed embeddings, they are generated on the fly during retrieval, so a failure in the embedding model or the generation process results in missing embeddings and scores. The embeddings are produced by the `_get_embeddings` method, which checks whether a node's embedding is `None` and, if so, generates it with the embedding model. If this step fails or is never executed, embeddings and scores will be missing.
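The lazy-generation pattern described above can be sketched in plain Python. Note that `Node`, `embed`, and `get_embeddings` here are simplified stand-ins for illustration, not the actual LlamaIndex classes or method signatures:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Node:
    """Simplified stand-in for a LlamaIndex text node."""
    text: str
    embedding: Optional[List[float]] = None

def embed(text: str) -> List[float]:
    # Stand-in for a real embedding-model call (e.g. OpenAIEmbedding).
    return [float(len(text)), float(text.count(" "))]

def get_embeddings(nodes: List[Node]) -> List[List[float]]:
    """Generate embeddings lazily: only for nodes that lack one."""
    for node in nodes:
        if node.embedding is None:             # no precomputed embedding
            node.embedding = embed(node.text)  # generated on the fly
    return [node.embedding for node in nodes]
```

If the `embed` step raises (for example, the embedding model is unreachable), nodes are left with `embedding=None`, which matches the symptom described in the question.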
Additionally, the `DocumentSummaryIndexEmbeddingRetriever` class does not directly return similarity scores. To obtain them, you would need to calculate the similarity between the query and each node and pass that score when creating the `NodeWithScore` objects. This involves modifying the `_retrieve` method and potentially other parts of the class.
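As a hedged sketch of that modification, the scoring step amounts to computing cosine similarity between the query embedding and each node embedding, then keeping the top-k. This is pure Python for illustration; `NodeWithScore` here is a simplified stand-in for the LlamaIndex class, and the real `_retrieve` operates on node objects rather than `(text, embedding)` pairs:

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class NodeWithScore:  # simplified stand-in for llama_index's NodeWithScore
    text: str
    score: float

def cosine_similarity(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query_emb: List[float],
             nodes: List[Tuple[str, List[float]]],
             top_k: int = 5) -> List[NodeWithScore]:
    """Score every node against the query and return the top_k by similarity."""
    scored = [NodeWithScore(text, cosine_similarity(query_emb, emb))
              for text, emb in nodes]
    scored.sort(key=lambda n: n.score, reverse=True)
    return scored[:top_k]
```

With this pattern, each returned node carries a real cosine similarity in `[-1, 1]` instead of a placeholder value.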
Regarding the score of 1000, I couldn't find specific information on how this score is calculated in the context of the query engine. It might be a default or placeholder value used when the actual similarity score cannot be computed. You might want to check the implementation details of the scoring mechanism in your setup.
To continue talking to Dosu, mention @dosu.
Embeddings are typically removed from the response to save time and memory.
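That trimming step might look like the following sketch. This is a generic pattern, not the exact LlamaIndex code: after retrieval, the bulky embedding vector is dropped from each source node before it is attached to the response, which is why `embedding=None` shows up even though the index itself holds vectors:

```python
from dataclasses import dataclass, replace
from typing import List, Optional

@dataclass(frozen=True)
class SourceNode:
    """Simplified stand-in for a retrieved node attached to a response."""
    text: str
    embedding: Optional[List[float]] = None

def strip_embeddings(nodes: List[SourceNode]) -> List[SourceNode]:
    """Drop embedding vectors before building the response object."""
    return [replace(n, embedding=None) for n in nodes]
```

The originals in the index are untouched; only the copies placed on the response lose their vectors.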
However, the knowledge graph index is deprecated/unmaintained at the moment. I would use the property graph index instead: https://docs.llamaindex.ai/en/stable/module_guides/indexing/lpg_index_guide/#using-a-property-graph-index
A quick example, in addition to the many on the docs page above: https://colab.research.google.com/drive/1QPUjhFwZ-6azdhfteqcznhx5uIBxTMsf?usp=sharing
Question
I have the code below and it runs well. However, when I inspect the 'response' variable, the 'embedding' field is 'None'. This is weird, since when I inspect the knowledge graph (the 'index' variable) there are embedding vectors. Also, the 'response' variable shows a 'score' of 1000, which seems odd since a cosine similarity should not be that high. I want to know the score used for the top 5 nodes.
Code
from llama_index.core import Document, VectorStoreIndex
from llama_index.core import StorageContext
from llama_index.core import Settings
from llama_index.core import SimpleDirectoryReader, KnowledgeGraphIndex
from llama_index.core.graph_stores import SimpleGraphStore
from llama_index.core.response_synthesizers import ResponseMode
from llama_index.llms.openai import OpenAI
from IPython.display import Markdown, display

import openai
from openai import OpenAI as OpenAIClient  # aliased so it does not shadow the llama_index OpenAI LLM

import os
import sys
import json
from collections import Counter

sys.path.append('../..')

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())  # read local .env file

openai.api_key = os.environ['OPENAI_API_KEY']
client = OpenAIClient()

# bitbucket/ja-chatbot-resources/John/john_personal_development/KG/small_docs
documents = SimpleDirectoryReader("d:/KG/small_docs").load_data()

llm = OpenAI(temperature=0, model="gpt-4o-mini")
Settings.llm = llm
Settings.chunk_size = 512

graph_store = SimpleGraphStore()
storage_context = StorageContext.from_defaults(graph_store=graph_store)

from llama_index.embeddings.openai import OpenAIEmbedding
embedding_model = OpenAIEmbedding(model='text-embedding-3-small')  # or another embedding model

index = KnowledgeGraphIndex.from_documents(
    documents,
    max_triplets_per_chunk=4,
    storage_context=storage_context,
    embedding_model=embedding_model,  # ensure embeddings are generated
    include_embeddings=True,
)

query_engine = index.as_query_engine(
    include_text=True,
    response_mode="tree_summarize",
    embedding_mode="dense",
    similarity_top_k=5,
    include_embeddings=True,
    return_source_nodes=True,
)

response = query_engine.query('Who are the people?')
print(response.response)
response variable (only one part shown)
[NodeWithScore(node=TextNode(id_='222c69a6-266f-4b04-85bc-1801377ce8e9', embedding=None, metadata={'file_path': 'd:\bitbucket\ja-chatbot-resources\John\john_personal_development\KG\small_docs\document2.txt', 'file_name': 'document2.txt', 'file_type': 'text/plain', 'file_size': 88, 'creation_date': '2024-07-25', 'last_modified_date': '2024-07-16'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='d0a85289-2b8d-4c00-953a-f764761288ad', node_type=<ObjectType.DOCUMENT: '4'>, metadata={'file_path': 'd:\bitbucket\ja-chatbot-resources\John\john_personal_development\KG\small_docs\document2.txt', 'file_name': 'document2.txt', 'file_type': 'text/plain', 'file_size': 88, 'creation_date': '2024-07-25', 'last_modified_date': '2024-07-16'}, hash='060b4b2ded6ec94ca755735f78cd4cfa3b18eca7e2eea124677198d0ca75a1b4')}, text='Bob works at Acme Corp, which has an office in New York. Bob is also friends with Alice.', mimetype='text/plain', start_char_idx=0, end_char_idx=88, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n'), score=1000.0),