[Question]: Initializing BM25 Retriever Using Pre-filled Vector Store Data

Leonschmitt commented 11 months ago

Question Validation

[X] I have searched both the documentation and discord for an answer.

Question

I am developing a Streamlit application that leverages LlamaIndex, and I'm attempting to integrate a BM25 Retriever as outlined in a tutorial available here. My current challenge involves initializing the BM25 Retriever using an existing Weaviate vector store.

Here's a snippet of my code:

index = VectorStoreIndex.from_vector_store(vector_store=vector_store, service_context=service_context)

vector_retriever = index.as_retriever(similarity_top_k=5)

bm25_retriever = BM25Retriever.from_defaults(
   docstore=index.docstore, similarity_top_k=5
)

When I run this, initializing the BM25Retriever raises a ZeroDivisionError. My setup uses a Weaviate database (vector_store) into which PDF documents are loaded with a SimpleDirectoryReader. The data is processed with a SentenceSplitter and a TitleExtractor, an embedding is computed for each node, and the nodes are then stored with vector_store.add(nodes). Here is the ingestion code:

vector_store = WeaviateVectorStore(
    weaviate_client=client, index_name="Intranet"
)
node_parser = SentenceSplitter(chunk_size=480)
extractor = TitleExtractor(llm=llm)

# use transforms directly
nodes = node_parser(docs)
nodes = extractor(nodes)
for node in nodes:
    node_embedding = embed_model.get_text_embedding(
        node.get_content(metadata_mode="all")
    )
    node.embedding = node_embedding
vector_store.add(nodes)

I looked for solutions on Discord, and the common advice was that the nodes must be supplied to the BM25 Retriever directly. However, my preprocessing and storage are handled by a separate service and run ahead of time. So how can I initialize the BM25 Retriever when I only have access to an already populated Weaviate vector store?

I am unsure if I have overlooked something or made a mistake in the process. I am reaching out to see if anyone can relate to this problem or provide insights into resolving it.

dosubot[bot] commented 11 months ago

🤖

Hi @Leonschmitt! Good to see you again. I hope your project with LlamaIndex is coming along well. Let's dive into your issue here.

Based on the information you've provided and a similar issue I found in the LlamaIndex repository, the ZeroDivisionError you're encountering occurs because the docstore on your index (index.docstore) is empty. With a vector db integration like Weaviate, all nodes are serialized into the vector store itself. That simplifies storage, but it leaves the index's docstore unpopulated, so the BM25Retriever is handed zero documents.
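A minimal sketch of why an empty docstore can surface as a ZeroDivisionError. It assumes the BM25 backend computes the average document length as total tokens divided by document count, which is how BM25 implementations typically normalize for length. The names average_doc_length and corpus_is_usable are hypothetical and not part of LlamaIndex.

```python
def average_doc_length(tokenized_corpus):
    """Average number of tokens per document (used by BM25 length normalization)."""
    total_tokens = sum(len(doc) for doc in tokenized_corpus)
    # With zero documents this divides by zero, mirroring the reported error.
    return total_tokens / len(tokenized_corpus)


def corpus_is_usable(tokenized_corpus):
    """Hypothetical guard to run before building a BM25 retriever."""
    return len(tokenized_corpus) > 0
```

Checking that the node list (or docstore) is non-empty before constructing the retriever turns an opaque arithmetic error into an explicit one.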

The easiest solution is to manually keep track of a docstore. Here's how you can do it:

from llama_index.storage.docstore import SimpleDocumentStore

docstore = SimpleDocumentStore()
docstore.add_documents(documents)

docstore.persist(persist_path="./docstore.json")
new_docstore = SimpleDocumentStore.from_persist_path("./docstore.json")

This way, you can manage your documents separately and avoid the ZeroDivisionError. You can find more details about this solution in the related issue I found.

Regarding your question about initializing the BM25Retriever when you only have access to an already existing Weaviate vector store, you can use the from_vector_store class method provided in the VectorStoreIndex class. This method allows you to create an instance of VectorStoreIndex using an existing vector store. However, please ensure that your vector store stores text to avoid the ZeroDivisionError. If it doesn't, you might need to adjust your vector store configuration or use a different vector store that supports text storage. Here is how you can initialize the VectorStoreIndex:

from llama_index.indices.vector_store.base import VectorStoreIndex

# Assuming `vector_store` is your existing Weaviate vector store
vector_store_index = VectorStoreIndex.from_vector_store(vector_store)

After initializing the VectorStoreIndex, you can use the as_retriever method to get a retriever:

retriever = vector_store_index.as_retriever()

You can find more details about this in the LlamaIndex codebase.

Finally, the BM25Retriever in the LlamaIndex framework is a class that extends the BaseRetriever and is responsible for retrieving relevant nodes (documents) based on a given query. It uses the BM25 (Best Matching 25) algorithm, a popular ranking function used by search engines to rank matching documents according to their relevance to a given search query. The BM25Retriever interacts with the VectorStoreIndex in its from_defaults class method. If an instance of VectorStoreIndex is passed to this method, the BM25Retriever uses the document store (docstore) from the VectorStoreIndex to get the nodes (documents) that it will use for retrieval. You can find more details about this in the LlamaIndex codebase.
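As a rough illustration of the ranking function described above, here is a self-contained sketch of Okapi BM25 scoring over pre-tokenized documents. This is not LlamaIndex's implementation: the function name bm25_scores is hypothetical, and k1=1.5 and b=0.75 are commonly used defaults assumed here for illustration.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, tokenized_corpus, k1=1.5, b=0.75):
    """Score every document in tokenized_corpus against query_tokens (Okapi BM25)."""
    n_docs = len(tokenized_corpus)
    if n_docs == 0:
        raise ValueError("BM25 needs at least one document")
    avgdl = sum(len(doc) for doc in tokenized_corpus) / n_docs

    # Document frequency: the number of documents each term appears in.
    doc_freq = Counter()
    for doc in tokenized_corpus:
        doc_freq.update(set(doc))

    scores = []
    for doc in tokenized_corpus:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue
            df = doc_freq[term]
            # Smoothed IDF; the "+ 1.0" keeps the value non-negative.
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
            tf = term_freq[term]
            # Term-frequency saturation with document-length normalization.
            norm = tf + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores
```

Because the scores depend on corpus-wide statistics (document frequencies and average length), the retriever needs access to the full set of nodes up front, which is why an empty docstore is fatal.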

I hope this helps! If you have any other questions or need further clarification, feel free to ask.


logan-markewich commented 11 months ago

@Leonschmitt The docstore isn't used by default when using a vectordb integration

Keep the nodes and do bm25_retriever = BM25Retriever.from_defaults(nodes=nodes, ...)

Which you've already pointed out doesn't work for you.

Weaviate doesn't really have an API to get all nodes. The closest is

# 10,000 is the max weaviate can fetch
source_nodes = index.as_retriever(similarity_top_k=10000).retrieve("Dummy query")
nodes = [x.node for node in source_nodes]
bm25_retriever = BM25Retriever.from_defaults(nodes=nodes, ...)

Sudar278 commented 8 months ago

Thanks @logan-markewich very helpful

playma commented 6 months ago

@logan-markewich typo

nodes = [x.node for node in source_nodes] should be nodes = [x.node for x in source_nodes]

playma commented 6 months ago

10000 is a hard-coded number. If the count of documents is greater than this number, will it cause a problem?

The BM25 retriever needs to read all nodes from the vector store, which is not efficient. If I use Elasticsearch, will it have an efficient way to perform BM25 retrieval?
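On the hard-coded limit: one way to at least detect silent truncation is to check whether the number of returned nodes reaches the cap. The sketch below assumes only what the snippet above shows, namely a retriever object with a retrieve(query) method whose results expose a .node attribute. The helper name fetch_nodes_with_cap is hypothetical, and the sketch does not address the Elasticsearch question.

```python
def fetch_nodes_with_cap(retriever, cap=10000, query="Dummy query"):
    """Return (nodes, maybe_truncated) for a retriever whose top-k equals cap.

    `retriever` is any object with a .retrieve(query) method returning
    results that carry a .node attribute.
    """
    results = retriever.retrieve(query)
    nodes = [result.node for result in results]
    # Reaching the cap suggests the store may hold more nodes than were fetched.
    maybe_truncated = len(nodes) >= cap
    return nodes, maybe_truncated
```

If maybe_truncated is True, the BM25 index built from nodes would likely cover only part of the corpus, so it is safer to fail loudly or source the nodes another way than to continue.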
