run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai

[Question]: The embedding didn't give me better source nodes. #16752

Open xiaomomo opened 3 weeks ago

xiaomomo commented 3 weeks ago

Question

I used OllamaEmbedding with llama3:8b for local testing. After building the index for a PDF file, I asked questions using text copied verbatim from the PDF, but the returned source_nodes did not include any nodes containing that original text.

Here is my code:

from llama_index.core import SimpleDirectoryReader
from llama_index.core import VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import Settings
from llama_index.core import StorageContext, load_index_from_storage
import os.path
from llama_index.llms.ollama import Ollama
import logging
import sys
# from llama_parse import LlamaParse
from llama_index.embeddings.ollama import OllamaEmbedding

logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

# DATA_BOOK = "./data/book"
DATA_BOOK = "./data/book_v2"
# DATA_BOOK_V2 = "./data/book/Go Away Big Green Monster.pdf"

# PERSIST_DIR = "./data/persist_index"
PERSIST_DIR = "./data/persist_index_v3"
# PERSIST_DIR = "./data/persist_index_v2"

Settings.llm = Ollama(model="llama3:8b", request_timeout=120.0)
Settings.embed_model = OllamaEmbedding(
    model_name="llama3:8b",
    base_url="http://localhost:11434",
    ollama_additional_kwargs={"mirostat": 0},
)
# Settings.embed_model = HuggingFaceEmbedding(
#     model_name="BAAI/bge-small-en-v1.5"
# )

# check if exists
if os.path.exists(PERSIST_DIR):
    print("Index already exists")
    # load the existing index
    storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
    index = load_index_from_storage(
        storage_context
    )

else:
    print("create new index")
    documents = SimpleDirectoryReader(DATA_BOOK).load_data()
    # parser = LlamaParse(api_key="llx-FH52ORx3RmYA6C5hbM8GiNUB91Czg4DlkvOuZEwDogkeqbO2") network error
    # documents = parser.load_data(DATA_BOOK_V2)
    index = VectorStoreIndex.from_documents(documents)
    index.storage_context.persist(persist_dir=PERSIST_DIR)

print("finish index")
# create a query engine
query_engine = index.as_query_engine()
response = query_engine.query("how to use adverbs of sequence to describe a process")
print("----llm rag response-----")
print(response)
for node in response.source_nodes:
    print("-----")
    text_fmt = node.node.get_content().strip().replace("\n", " ")[:1000]
    print(f"Text:\t {text_fmt} ...")
    print(f"Metadata:\t {node.node.metadata}")
    print(f"Score:\t {node.score:.3f}")

Below are the returned source_nodes; the node.score values are only 0.445 and 0.435.

Why doesn't llama-index return the nodes containing the original PDF text as source_nodes?

dosubot[bot] commented 3 weeks ago

Hey there, @xiaomomo! I'm here to assist you with any bugs, questions, or contributions. Let's tackle this issue together.

The issue you're experiencing, where the returned source_nodes do not include nodes with the original text of the PDF, could be related to several factors in the embedding and retrieval process:

  1. Embedding Quality: The OllamaEmbedding class is responsible for generating embeddings for the text. If the embeddings do not accurately capture the semantic meaning of the original text, the retrieval process might not identify them as relevant. Ensure that the OllamaEmbedding is correctly configured and that the model is suitable for your data type and query.

  2. Indexing Process: During the indexing process, ensure that the text from the PDF is being correctly parsed and embedded. The SimpleDirectoryReader should correctly read and process the PDF content. If there are issues with how the text is extracted, it might affect the embeddings and, consequently, the retrieval results.

  3. Query and Retrieval: The VectorStoreIndex uses the embeddings to find the nodes most similar to the query. If the query is not well aligned with the indexed content, or if the similarity settings are not optimal, it might not return the expected nodes. You might want to experiment with different query formulations, retrieve more candidates, or filter by score (see the sketch after this list).

  4. Scoring and Ranking: The node.score represents the relevance of a node to the query. If the scores are low, it might indicate that the embeddings or the query are not well-matched. Consider reviewing the scoring mechanism or using a different model to see if it improves the results.

  5. Model Configuration: Double-check the configuration of the OllamaEmbedding and ensure that the model is running correctly on your local setup. Any misconfiguration could lead to suboptimal embeddings.

By addressing these areas, you might be able to improve the relevance of the source_nodes returned by your queries.
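
As an illustrative sketch of points 3 and 4 (not from the original thread), you can ask the retriever for more candidate nodes and drop weak matches by score; the top-k value and the 0.5 cutoff below are assumptions to tune for your data:

from llama_index.core.postprocessor import SimilarityPostprocessor

# retrieve more candidate nodes (the default retriever returns 2)
# and filter out anything below the chosen similarity cutoff
query_engine = index.as_query_engine(
    similarity_top_k=5,
    node_postprocessors=[SimilarityPostprocessor(similarity_cutoff=0.5)],
)
response = query_engine.query("how to use adverbs of sequence to describe a process")
for node in response.source_nodes:
    print(node.score, node.node.get_content()[:200])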

To continue talking to Dosu, mention @dosu.

logan-markewich commented 3 weeks ago

@xiaomomo I'm not sure what you mean by "why the llama-index don't use original text of the PDF node as source_nodes?"

It's using the data you indexed. If it's returning something else, what is it returning? The only nodes it has access to are the ones you gave it.

I wouldn't recommend using llama3 as an embedding model, though; it's not trained for text retrieval. If you are using Ollama, use nomic embeddings, for example.
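
A minimal sketch of that switch, assuming the model has been pulled locally with `ollama pull nomic-embed-text` (the index must be rebuilt after changing the embedding model):

from llama_index.core import Settings
from llama_index.embeddings.ollama import OllamaEmbedding

# use a dedicated embedding model served by Ollama instead of llama3
Settings.embed_model = OllamaEmbedding(
    model_name="nomic-embed-text",
    base_url="http://localhost:11434",
)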