Closed Leonschmitt closed 11 months ago
Hi @Leonschmitt! Good to see you again. I hope your project with LlamaIndex is coming along well. Let's dive into your issue here.
Based on the information you've provided and a similar issue I found in the LlamaIndex repository, the ZeroDivisionError you're encountering seems to occur because the docstore in your vector_retriever is empty. When using a vector db integration like Weaviate, all nodes are serialized into the vector store. This simplifies storage, but it leaves the docstore empty, which leads to the issue you're experiencing.
The easiest solution is to manually keep track of a docstore. Here's how you can do it:
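from llama_index.storage.docstore import SimpleDocumentStore
# `documents` is assumed to be the list of documents/nodes you ingested
docstore = SimpleDocumentStore()
docstore.add_documents(documents)
# Persist to disk, then reload wherever the retriever is built
docstore.persist(persist_path="./docstore.json")
new_docstore = SimpleDocumentStore.from_persist_path("./docstore.json")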
This way, you can manage your documents separately and avoid the ZeroDivisionError. You can find more details about this solution in the related issue I found.
Regarding your question about initializing the BM25Retriever when you only have access to an already existing Weaviate vector store, you can use the from_vector_store class method provided in the VectorStoreIndex class. This method allows you to create an instance of VectorStoreIndex using an existing vector store. However, please ensure that your vector store stores text to avoid the ZeroDivisionError. If it doesn't, you might need to adjust your vector store configuration or use a different vector store that supports text storage. Here is how you can initialize the VectorStoreIndex:
from llama_index.indices.vector_store.base import VectorStoreIndex
# Assuming `vector_store` is your existing Weaviate vector store
vector_store_index = VectorStoreIndex.from_vector_store(vector_store)
After initializing the VectorStoreIndex, you can use the as_retriever method to get a retriever:
retriever = vector_store_index.as_retriever()
You can find more details about this in the LlamaIndex codebase.
Finally, the BM25Retriever in the LlamaIndex framework is a class that extends the BaseRetriever and is responsible for retrieving relevant nodes (documents) based on a given query. It uses the BM25 (Best Matching 25) algorithm, a popular ranking function used by search engines to rank matching documents according to their relevance to a given search query. The BM25Retriever interacts with the VectorStoreIndex in its from_defaults class method. If an instance of VectorStoreIndex is passed to this method, the BM25Retriever uses the document store (docstore) from the VectorStoreIndex to get the nodes (documents) that it will use for retrieval. You can find more details about this in the LlamaIndex codebase.
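To make the link between an empty docstore and the ZeroDivisionError concrete, here is a minimal pure-Python sketch of Okapi BM25 scoring. It is not the LlamaIndex implementation; the function name `bm25_scores` and the default values for `k1` and `b` are assumptions chosen for illustration. The point is that BM25 normalizes by the average document length, and that average cannot be computed when the corpus has no documents.

```python
import math
from collections import Counter


def bm25_scores(query, corpus, k1=1.5, b=0.75):
    """Score each tokenized document in `corpus` against `query` (illustrative sketch)."""
    n_docs = len(corpus)
    # Average document length: this division fails when the corpus is empty.
    avgdl = sum(len(doc) for doc in corpus) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in corpus for term in set(doc))
    scores = []
    for doc in corpus:
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


corpus = [
    "weaviate stores vectors and text".split(),
    "bm25 ranks documents by term overlap".split(),
]
scores = bm25_scores("bm25 documents".split(), corpus)

# An empty corpus (for example, an empty docstore) triggers the error.
try:
    bm25_scores("bm25".split(), [])
    empty_corpus_error = None
except ZeroDivisionError as exc:
    empty_corpus_error = exc
```

In this sketch the second document scores higher than the first because it contains both query terms, while the call with an empty corpus raises ZeroDivisionError before any scoring happens.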
I hope this helps! If you have any other questions or need further clarification, feel free to ask.
@Leonschmitt The docstore isn't used by default when using a vector db integration.
The usual approach is to keep the nodes and do bm25_retriever = BM25Retriever.from_defaults(nodes=nodes, ...), which you've already pointed out doesn't work for you.
Weaviate doesn't really have an API to get all nodes. The closest is:
# 10,000 is the max weaviate can fetch
source_nodes = index.as_retriever(similarity_top_k=10000).retrieve("Dummy query")
nodes = [x.node for node in source_nodes]
bm25_retriever = BM25Retriever.from_defaults(nodes=nodes, ...)
Thanks @logan-markewich very helpful
@logan-markewich typo:
nodes = [x.node for node in source_nodes]
should be
nodes = [x.node for x in source_nodes]
10000 is a hard-coded number. If the count of documents is greater than this number, will it cause a problem?
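One thing that follows from how top-k retrieval works: a query with similarity_top_k=10000 returns at most 10,000 results, so any nodes beyond that would simply be absent from the BM25 corpus. A minimal sketch of a guard around the node extraction is below. The helper `collect_nodes` and the stand-in class `FakeScoredNode` are hypothetical names used only so the example runs on its own; in practice the results would come from the retriever call shown earlier in the thread.

```python
import warnings
from dataclasses import dataclass


@dataclass
class FakeScoredNode:
    """Hypothetical stand-in for a retrieval result that exposes `.node`."""
    node: str


def collect_nodes(source_nodes, top_k):
    """Unwrap `.node` from each result; flag empty or possibly truncated fetches."""
    nodes = [x.node for x in source_nodes]
    if not nodes:
        # Building BM25 on an empty corpus is what leads to the ZeroDivisionError.
        raise ValueError("No nodes retrieved; BM25 cannot be built on an empty corpus.")
    if len(nodes) >= top_k:
        warnings.warn(
            f"Fetched {len(nodes)} nodes, which reaches top_k={top_k}; "
            "the corpus may be truncated."
        )
    return nodes


results = [FakeScoredNode(node=f"chunk-{i}") for i in range(3)]
nodes = collect_nodes(results, top_k=10_000)
```

If the warning fires, the BM25 retriever would only see part of the store, so some other way of exporting the nodes would be needed.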
The BM25 retriever needs to read all nodes from the vector store, which is not efficient. If I use Elasticsearch, will it have an efficient way to perform BM25 retrieval?
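For context on the Elasticsearch question: as far as I know, Elasticsearch uses BM25 as its default similarity for text fields, so a plain match query is ranked server-side and no nodes need to be pulled into memory. The sketch below only builds the request body; the field name "content" and the helper name `build_bm25_query` are assumptions, and how this would plug into LlamaIndex is not covered here.

```python
import json


def build_bm25_query(text, field="content", size=10):
    """Build a search body for a full-text `match` query (field name is an assumption)."""
    return {
        "size": size,  # number of hits to return
        "query": {"match": {field: text}},
    }


body = build_bm25_query("how to initialize a retriever", size=5)
print(json.dumps(body, indent=2))
```

The resulting body would be sent to the index's search endpoint with whichever Elasticsearch client you already use.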
Question Validation
Question
I am developing a Streamlit application that leverages LlamaIndex, and I'm attempting to integrate a BM25 Retriever as outlined in a tutorial available here. My current challenge involves initializing the BM25 Retriever using an existing Weaviate vector store.
Here's a snippet of my code:
When executing this, I run into a ZeroDivisionError during the initialization of the BM25Retriever. My application setup includes a Weaviate database (vector_store) where PDF documents are loaded using a SimpleDirectoryReader. The data is processed with a SentenceSplitter and a TitleExtractor, and embeddings are created for each node (node), which are then stored using vector_store.add(nodes). Here is my code snippet:
I sought solutions on Discord, but the common advice pointed to the necessity of supplying nodes to the BM25 Retriever. However, my preprocessing and storing processes are outsourced to a service and done prior in time. This leads me to question how to initialize the BM25 Retriever when I only have access to an already existing Weaviate vector store.
I am unsure if I have overlooked something or made a mistake in the process. I am reaching out to see if anyone can relate to this problem or provide insights into resolving it.