run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Question]: BM25 Retriever Accessing All Documents in Document Store Instead of VectorStore Index #16588

Open JPC612 opened 1 month ago

JPC612 commented 1 month ago

Question

Hello, I am currently using the following configuration for the BM25 retriever:

from llama_index.retrievers.bm25 import BM25Retriever

# index is the existing VectorStoreIndex
bm25_retriever = BM25Retriever.from_defaults(
    docstore=index.docstore,
    similarity_top_k=5,
)

I’ve noticed that the BM25 retriever seems to access all documents in the docstore, not just the ones contained within the VectorStoreIndex (index). Is this behavior expected or intended?

In my use case, the BM25 retriever should only return documents that are also present in the corresponding VectorStoreIndex (index). Otherwise, hybrid search returns documents that users should not see, because they are not part of that specific index. Is there a way to configure BM25 so that it only retrieves documents that are also in the VectorStoreIndex?

For context, I am using Qdrant as my vector store and RedisDocumentStore as my document store.

Thanks in advance for your help!

dosubot[bot] commented 1 month ago

The BM25Retriever, as currently implemented, builds its BM25 model from the nodes it reads from the document store. It has no built-in mechanism to limit retrieval to the documents present in a specific VectorStoreIndex, so when it is given docstore=index.docstore it searches every document in that document store, not just those in the VectorStoreIndex [1][2].

There is also no configuration option or method on BM25Retriever that filters documents by their presence in a VectorStoreIndex, which explains the behavior you're observing [1][2].
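
One possible workaround, offered as a sketch and not a confirmed fix: BM25Retriever.from_defaults also accepts a nodes argument, so the application can filter the docstore contents itself and pass in only the nodes that belong to the target index. The example below shows just the filtering step with a stand-in node class so that it runs without LlamaIndex installed. FakeNode, restrict_to_index, the sample data, and the index_a_ids set are all hypothetical; how the set of allowed IDs is obtained (for example, recorded at ingestion time) is an assumption that depends on your setup.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass
class FakeNode:
    """Hypothetical stand-in for a LlamaIndex node; only the fields used here."""
    node_id: str
    text: str


def restrict_to_index(all_nodes: Iterable[FakeNode], allowed_ids: Set[str]) -> List[FakeNode]:
    """Keep only the nodes whose IDs belong to the target index (order preserved)."""
    return [node for node in all_nodes if node.node_id in allowed_ids]


# Everything stored in the shared docstore (made-up sample data).
docstore_nodes = [
    FakeNode("a1", "quarterly report for team A"),
    FakeNode("a2", "roadmap for team A"),
    FakeNode("b1", "confidential report for team B"),
]

# IDs that belong to the index the user may search.
# Assumption: the application tracks these itself, e.g. at ingestion time.
index_a_ids = {"a1", "a2"}

visible_nodes = restrict_to_index(docstore_nodes, index_a_ids)

# With LlamaIndex installed and real nodes, the filtered list would be passed
# to the retriever in place of the whole docstore:
# bm25_retriever = BM25Retriever.from_defaults(nodes=visible_nodes, similarity_top_k=5)
```

Filtering before the retriever is built means BM25 never scores the excluded documents at all, which is safer for the access-control concern in the question than dropping unwanted results after retrieval.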
