Open JoaoPauloAires opened 2 months ago
Hello @JoaoPauloAires! I am a bot that can help you solve bugs, answer questions, and become a contributor. I am here to help you while we wait for a human maintainer.
Yes, observing different retrieval accuracies with different window sizes in SentenceWindowNodeParser is expected. The SentenceWindowNodeParser parses documents into single sentences per node, with each node containing a "window" of surrounding sentences. This window size can affect the granularity and context of the retrieved information, leading to variations in retrieval accuracy [1][2][3].
The SentenceSplitter ensures that the text is split at sentence boundaries, which is crucial for the SentenceWindowNodeParser to accurately capture and process individual sentences along with their surrounding context. The SentenceWindowNodeParser uses a default sentence splitter to break the text into sentences and then adds a window of surrounding sentences to each node's metadata, ensuring that each node contains context from its neighboring sentences [4][5].
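The sentence-window idea described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the library's implementation: `split_sentences` and `build_window_nodes` are hypothetical names, the regex splitter is a naive stand-in for the default sentence splitter, and the nodes are plain dictionaries. The metadata keys "original_text" and "window" follow the names mentioned in this thread.

```python
import re


def split_sentences(text):
    # Naive splitter (an assumption for this sketch): break after '.', '!'
    # or '?' when followed by whitespace. The real splitter is more careful.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def build_window_nodes(text, window_size=3):
    # One node per sentence; the surrounding sentences go into metadata only.
    sentences = split_sentences(text)
    nodes = []
    for i, sentence in enumerate(sentences):
        start = max(0, i - window_size)
        end = min(len(sentences), i + window_size + 1)
        nodes.append(
            {
                "text": sentence,
                "metadata": {
                    "original_text": sentence,
                    "window": " ".join(sentences[start:end]),
                },
            }
        )
    return nodes
```

In this sketch the "text" of each node does not depend on `window_size`; only the "window" metadata does. Whether that also holds for the text that actually gets embedded in a given pipeline is worth checking directly, since it determines whether the window size can influence similarity scores.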
The Faiss vector search engine interacts with the nodes created by the SentenceWindowNodeParser by adding the embeddings of these nodes to the Faiss index. At query time, Faiss retrieves the top k most similar embeddings from the index, so retrieval accuracy depends on the quality of the embeddings and on the effectiveness of the Faiss search algorithm [6][7].
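The top-k retrieval step with an inner-product metric can be illustrated with a brute-force search in plain Python. This is a minimal sketch, not Faiss itself: `inner_product` and `top_k_inner_product` are hypothetical helper names, and the vectors are toy values rather than real embeddings.

```python
def inner_product(a, b):
    # Plain dot product; no normalization is applied.
    return sum(x * y for x, y in zip(a, b))


def top_k_inner_product(query, vectors, k=2):
    # Score every stored vector against the query, then keep the k best.
    scored = [(inner_product(query, v), idx) for idx, v in enumerate(vectors)]
    # Highest score first; ties are broken by the lower index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(idx, score) for score, idx in scored[:k]]


# Toy "embeddings" (made-up values, for illustration only).
stored = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
print(top_k_inner_product([1.0, 0.5], stored, k=2))
```

Because the inner product is not normalized, a vector with a larger norm can outrank one that points in a closer direction, as the third toy vector does here. If the embeddings are not unit length, this is one thing to rule out when comparing results across configurations.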
In summary, the window size directly influences the retrieval process and its accuracy due to the varying context provided by different window sizes.
Question
I am testing SentenceWindowNodeParser with different window sizes. However, the retrieval accuracy (finding the correct context source for a given question) changes depending on the window size. As far as I understand, this should not happen: the comparison is between the query and a single sentence, so regardless of the window size, the same sentence should always be found, and the only change should be the size of the returned context (which depends on the window size). I am currently using Faiss for vector search with inner product as the similarity metric, and SentenceSplitter with default parameters. While trying to identify the problem, I noticed that the "original_text" metadata contains a single sentence, while node.text from source_nodes is the combination of that sentence and its surrounding ones. I am following the instructions from this page, but using a chat_engine.
I would like to know if this is expected or some kind of bug.
Thanks in advance.