xiaomomo opened this issue 3 weeks ago
Hey there, @xiaomomo! I'm here to assist you with any bugs, questions, or contributions. Let's tackle this issue together.
The issue you're experiencing, where the returned `source_nodes` do not include nodes with the original text of the PDF, could be related to several factors in the embedding and retrieval process:

1. **Embedding Quality**: The `OllamaEmbedding` class is responsible for generating embeddings for the text. If the embeddings do not accurately capture the semantic meaning of the original text, the retrieval process might not identify those nodes as relevant. Ensure that `OllamaEmbedding` is correctly configured and that the model is suitable for your data type and query (see the first sketch below).
2. **Indexing Process**: During indexing, make sure the text from the PDF is being correctly parsed and embedded. The `SimpleDirectoryReader` should correctly read and process the PDF content. If the text is not extracted properly, that will affect the embeddings and, consequently, the retrieval results.
3. **Query and Retrieval**: The `VectorStoreIndex` uses the embeddings to find the nodes most similar to the query. If the query is not well aligned with the indexed content, or if the similarity threshold is not optimal, the expected nodes might not be returned. You might want to experiment with different query formulations or adjust the similarity threshold (see the second sketch below).
4. **Scoring and Ranking**: `node.score` represents the relevance of a node to the query. Low scores might indicate that the embeddings or the query are not well matched. Consider reviewing the scoring mechanism or using a different model to see if it improves the results.
5. **Model Configuration**: Double-check the configuration of `OllamaEmbedding` and ensure that the model is running correctly on your local setup. Any misconfiguration could lead to suboptimal embeddings.
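For example, you can sanity-check the embedding setup and inspect the raw retrieval results directly. This is a minimal sketch; the file path and model name are placeholders for your own setup:

```python
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.embeddings.ollama import OllamaEmbedding

# Placeholder model name: use whichever embedding model you have pulled in Ollama.
Settings.embed_model = OllamaEmbedding(model_name="llama3:8b")

# Load and index the PDF (placeholder path).
documents = SimpleDirectoryReader(input_files=["./data/your_file.pdf"]).load_data()
index = VectorStoreIndex.from_documents(documents)

# Inspect the raw retrieval step: this shows exactly which nodes (and scores)
# the vector store returns for a query, before any response synthesis.
retriever = index.as_retriever(similarity_top_k=5)
results = retriever.retrieve("a sentence copied verbatim from the PDF")
for node_with_score in results:
    print(node_with_score.score, node_with_score.get_content()[:100])
```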
By addressing these areas, you might be able to improve the relevance of the `source_nodes` returned by your queries.
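If you want to make the relevance threshold from points 3 and 4 explicit, you can also filter out weak matches with a similarity cutoff. The cutoff value below is only an illustration; tune it against the scores you actually observe:

```python
from llama_index.core.postprocessor import SimilarityPostprocessor

# `index` is the VectorStoreIndex built in the previous sketch.
# Note: the query engine needs an LLM for response synthesis; configure
# Settings.llm (e.g. an Ollama LLM) if you are running fully locally.
query_engine = index.as_query_engine(
    similarity_top_k=5,
    node_postprocessors=[SimilarityPostprocessor(similarity_cutoff=0.4)],
)
response = query_engine.query("a sentence copied verbatim from the PDF")
for node in response.source_nodes:
    print(node.score, node.get_content()[:100])
```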
To continue talking to Dosu, mention @dosu.
@xiaomomo I'm not sure what you mean by "why doesn't llama-index use the original text of the PDF node as source_nodes?"
It's using the data you indexed. If it's returning something else, what is it returning? The only nodes it has access to are the ones you gave it.
I wouldn't recommend using llama3.1 as an embedding model, though; it's not trained for text retrieval. If you are using Ollama, use nomic embeddings, for example.
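Something along these lines (assuming you've pulled the model locally with `ollama pull nomic-embed-text`):

```python
from llama_index.core import Settings
from llama_index.embeddings.ollama import OllamaEmbedding

# Dedicated text-retrieval embedding model served by Ollama.
Settings.embed_model = OllamaEmbedding(model_name="nomic-embed-text")
```

Then rebuild the index so your documents are re-embedded with the new model.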
Question Validation
Question
I used `OllamaEmbedding` with llama3:8b for local testing. After building the index for the PDF file, I asked questions using the original text of the PDF. The returned source_nodes did not include any nodes containing the original text.
Here is my code:
Below are the returned source_nodes; the node.score values are only 0.445 and 0.435.
Why doesn't llama-index use the original text of the PDF node as source_nodes?