run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Question]: SentenceWindowNodeParser obtaining different results for retrieval depending on the window size. #14906

Open JoaoPauloAires opened 2 months ago

JoaoPauloAires commented 2 months ago

Question

I am testing SentenceWindowNodeParser with different window sizes. Depending on the window size, I get a different retrieval accuracy (that is, how often it finds the correct context source for a given question). As far as I understand, this should not happen: the comparison is between the query and a single sentence, so regardless of the window size it should always find the same sentence, and the only thing that changes is the size of the context (which depends on the window size). I am currently using Faiss as the vector store and inner product as the similarity metric. I am also using SentenceSplitter with default parameters. While trying to identify the problem, I noticed that the content of the "original_text" metadata is a single sentence, while node.text from source_nodes is the combination of that sentence and its surrounding ones. I am following the instructions from this page, but using a chat_engine.

I would like to know if this is expected or some kind of bug.

Thanks in advance.

dosubot[bot] commented 2 months ago

Hello @JoaoPauloAires! I'm a bot that can help you solve bugs, answer questions, and become a contributor. I'm here to help you while we wait for a human maintainer.

Yes, it is expected that retrieval accuracy varies with the window size in SentenceWindowNodeParser. The parser splits a document into one node per sentence, and each node also carries a "window" of surrounding sentences. The window size changes the granularity and context of the retrieved information, which can lead to variations in retrieval accuracy [1][2][3].

The SentenceSplitter ensures that the text is split at sentence boundaries, which is crucial for the SentenceWindowNodeParser to accurately capture and process individual sentences along with their surrounding context. The SentenceWindowNodeParser uses a default sentence splitter to break the text into sentences and then adds a window of surrounding sentences to each node's metadata, ensuring that each node contains context from its neighboring sentences [4][5].
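To make the node layout described above concrete, here is a minimal pure-Python sketch. It is not the library's implementation: `split_sentences` and `build_window_nodes` are hypothetical helpers, the regex splitter is a naive stand-in for a real sentence tokenizer, and the dictionary layout only mirrors the "original_text" and "window" metadata mentioned in the question.

```python
import re


def split_sentences(text):
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def build_window_nodes(text, window_size):
    # One node per sentence; the surrounding sentences go into metadata only.
    sentences = split_sentences(text)
    nodes = []
    for i, sentence in enumerate(sentences):
        start = max(0, i - window_size)
        end = min(len(sentences), i + window_size + 1)
        nodes.append(
            {
                "text": sentence,
                "metadata": {
                    "original_text": sentence,
                    "window": " ".join(sentences[start:end]),
                },
            }
        )
    return nodes


doc = "Cats sleep a lot. Dogs bark. Birds sing at dawn. Fish swim."
small = build_window_nodes(doc, window_size=1)
large = build_window_nodes(doc, window_size=3)

# The per-node text is the same for both window sizes; only the window differs.
print([n["text"] for n in small] == [n["text"] for n in large])
print(small[1]["metadata"]["window"])
print(large[1]["metadata"]["window"])
```

In this sketch the window size changes only the `window` metadata. Whether it also changes what gets embedded depends on whether that metadata is included in the text sent to the embedding model, which is worth checking in your own pipeline.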

The nodes created by the SentenceWindowNodeParser reach Faiss as embeddings, which are added to the Faiss index. At query time, Faiss retrieves the top k most similar embeddings from the index, so retrieval accuracy depends on the quality of the embeddings and on the Faiss search algorithm [6][7].
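As a rough illustration of top-k retrieval under the inner product metric mentioned in the question, the sketch below does a brute-force search in pure Python. It is a stand-in for the idea, not Faiss itself: `inner_product`, `top_k_inner_product`, and `normalize` are hypothetical helpers, and the two-dimensional vectors are made up for the example.

```python
def inner_product(a, b):
    # Plain dot product of two equal-length vectors.
    return sum(x * y for x, y in zip(a, b))


def top_k_inner_product(query, vectors, k):
    # Score every stored vector against the query and return the indices
    # of the k highest scores, best first.
    scored = [(inner_product(query, v), i) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]


def normalize(v):
    # Scale a vector to unit length so inner product equals cosine similarity.
    norm = sum(x * x for x in v) ** 0.5
    return [x / norm for x in v]


query = [1.0, 0.0]
vectors = [
    [0.9, 0.1],  # almost the same direction as the query, small magnitude
    [3.0, 3.0],  # 45 degrees away from the query, large magnitude
]

raw_ranking = top_k_inner_product(query, vectors, k=2)
unit_ranking = top_k_inner_product(
    normalize(query), [normalize(v) for v in vectors], k=2
)
print(raw_ranking, unit_ranking)
```

With raw inner product the larger vector ranks first, while after normalization the better-aligned vector ranks first. If the embedding model in use does not return unit-length vectors (an assumption to verify), the inner product ranking can differ from a cosine ranking.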

In summary, the window size directly influences the retrieval process and its accuracy due to the varying context provided by different window sizes.

To continue talking to Dosu, mention @dosu.