Closed mw19930312 closed 2 months ago
The "division by zero" error you're encountering with the BM25Retriever
typically happens when the algorithm tries to calculate scores for a query against an empty corpus, or the documents in your corpus result in no tokens after preprocessing. This can occur if the documents are empty, or if the tokenization process filters out all content from the documents.
To resolve this, ensure your corpus is not empty and contains valid, tokenizable content. If your corpus is valid but you're still facing this issue, it might be due to the tokenization process removing all tokens from your documents, leaving them empty. In such cases, you might need to adjust the tokenization process to ensure that documents are left with tokens after preprocessing.
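To see where the division comes from, consider BM25's average-document-length statistic, which divides by the corpus size. The sketch below is stdlib-only and illustrative; the function name is mine, not rank-bm25's, but the library computes an equivalent quantity during initialization:

```python
# Minimal sketch of BM25's average-document-length statistic.
# rank-bm25 computes something equivalent when it is constructed;
# the names here are illustrative, not the library's own.

def average_doc_length(corpus: list[list[str]]) -> float:
    # If the corpus is empty, len(corpus) == 0 and this division
    # fails with ZeroDivisionError -- the same symptom seen in
    # BM25Retriever when the docstore has no usable documents.
    total_tokens = sum(len(doc) for doc in corpus)
    return total_tokens / len(corpus)

# A non-empty, tokenizable corpus works fine:
print(average_doc_length([["hello", "world"], ["foo"]]))  # 1.5

# An empty corpus (or one where every document tokenized to nothing
# and was filtered out) reproduces the division by zero:
try:
    average_doc_length([])
except ZeroDivisionError as exc:
    print("division by zero:", exc)
```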
Here's a modification you can make to the BM25Retriever initialization to filter out empty token lists and avoid the "division by zero" error:
```python
class BM25Retriever(BaseRetriever):
    def __init__(
        self,
        nodes: List[BaseNode],
        tokenizer: Optional[Callable[[str], List[str]]] = None,
        similarity_top_k: int = 100,
        verbose: bool = False,
    ) -> None:
        self._nodes = nodes
        self._tokenizer = tokenizer or default_tokenizer
        # Tokenize each node once, keeping only non-empty token lists
        tokenized = (self._tokenizer(node.get_content()) for node in self._nodes)
        self._corpus = [tokens for tokens in tokenized if tokens]
        if not self._corpus:
            raise ValueError(
                "The corpus is empty or all documents resulted in no tokens "
                "after tokenization."
            )
        self.bm25 = BM25Okapi(self._corpus)
```
This ensures that the corpus passed to BM25Okapi does not contain any empty token lists, which should help avoid the "division by zero" error. It also raises a ValueError if the corpus is empty or if every document results in no tokens after tokenization, making debugging and error handling easier.
Please try adjusting your tokenization process or ensuring your corpus contains valid content and let me know if the issue persists.
This only happens if index.docstore is empty
My guess is you are using some vector db integration
Try passing in the nodes directly: `BM25Retriever.from_defaults(nodes=nodes, ...)`
@logan-markewich Thanks for the reply! We initialized the vector db with

```python
index = VectorStoreIndex(
    nodes=[],
    embed_model=opensearch_client.os_client.embedding_model,
    storage_context=StorageContext.from_defaults(vector_store=vector_store),
    callback_manager=callback_manager,
)
```
and this to initialize the BM25 retriever:

```python
bm25_retriever = BM25Retriever.from_defaults(
    docstore=index.docstore, similarity_top_k=config.top_k
)
```
Are you suggesting that the BM25Retriever is not dynamically updated as the index gains more nodes (the CitationQueryEngine is able to do so)? I kinda feel that breaking into the pipeline is nasty...
Also, what is `nodes` in your previous comment? Is it all the nodes in the vector db? I suppose bm25 returns the top k nodes from all existing nodes in the db, right?
The BM25 retriever does not interact with your vector db
It needs the nodes and stores them in memory, and it's up to you to get the nodes to create the retriever.
Normally the nodes come from the docstore, but when using a vector db integration, the docstore is disabled to simplify storage. You could manually create and maintain a docstore with the nodes if you wanted.
Otherwise, in your case, you need a way to get the nodes from your vector db so that you can create the retriever
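To make the "maintain your own docstore" option concrete, here is a stdlib-only toy that keeps node texts keyed by id and persists them as JSON, so they can be reloaded to build a BM25 retriever at startup. It's only an illustration of the idea; in practice llama-index's `SimpleDocumentStore` plays this role, and the class and file path below are made up for the example:

```python
import json
import os
import tempfile


class TinyNodeStore:
    """Toy stand-in for a docstore: keeps node texts keyed by id,
    persisted as JSON so they can be reloaded later to build a
    BM25 retriever alongside the vector db."""

    def __init__(self) -> None:
        self._nodes = {}  # node_id -> text

    def add(self, node_id: str, text: str) -> None:
        self._nodes[node_id] = text

    def all_texts(self) -> list:
        return list(self._nodes.values())

    def persist(self, path: str) -> None:
        with open(path, "w") as f:
            json.dump(self._nodes, f)

    @classmethod
    def from_persist_path(cls, path: str) -> "TinyNodeStore":
        store = cls()
        with open(path) as f:
            store._nodes = json.load(f)
        return store


# Ingestion time: record each node as you also push it to the vector db.
store = TinyNodeStore()
store.add("n1", "opensearch hybrid retrieval")
store.add("n2", "bm25 sparse scoring")
path = os.path.join(tempfile.gettempdir(), "nodes.json")
store.persist(path)

# Query-engine startup: reload the texts to construct the BM25 retriever.
reloaded = TinyNodeStore.from_persist_path(path)
print(reloaded.all_texts())
```

The point of the sketch is the shape of the workflow, not the storage format: whatever pushes nodes into the vector db also writes them somewhere cheap to reload.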
@logan-markewich Thanks for the reply! After carefully going over our code, I'm considering overwriting the retrieve function in llama index for our use case. However, I'm not sure how to retrieve all the nodes available in the index and pass it to the bm25 retriever. How can I do it? Or should I overwrite something higher-level?
```python
async def aretrieve(self, query_bundle: QueryBundle) -> List[NodeWithScore]:
    nodes = await self._retriever.aretrieve(query_bundle)
    for postprocessor in self._node_postprocessors:
        nodes = postprocessor.postprocess_nodes(nodes, query_bundle=query_bundle)
    return nodes
```
where self._retriever is a base retriever with the aretrieve function defined as
```python
async def aretrieve(self, str_or_query_bundle: QueryType) -> List[NodeWithScore]:
    self._check_callback_manager()
    dispatch_event = dispatcher.get_dispatch_event()
    dispatch_event(
        RetrievalStartEvent(
            str_or_query_bundle=str_or_query_bundle,
        )
    )
    if isinstance(str_or_query_bundle, str):
        query_bundle = QueryBundle(str_or_query_bundle)
    else:
        query_bundle = str_or_query_bundle
    with self.callback_manager.as_trace("query"):
        with self.callback_manager.event(
            CBEventType.RETRIEVE,
            payload={EventPayload.QUERY_STR: query_bundle.query_str},
        ) as retrieve_event:
            import pdb; pdb.set_trace()  # breakpoint added for debugging
            nodes = await self._aretrieve(query_bundle=query_bundle)
            nodes = await self._ahandle_recursive_retrieval(
                query_bundle=query_bundle, nodes=nodes
            )
            retrieve_event.on_end(
                payload={EventPayload.NODES: nodes},
            )
    dispatch_event(
        RetrievalEndEvent(
            str_or_query_bundle=str_or_query_bundle,
            nodes=nodes,
        )
    )
    return nodes
```
You could retrieve with a super high top k (like 20000) and give those nodes to bm25. Or just store your nodes somewhere they are easy to access (like a docstore)
I don't think overriding methods is quite necessary here. But up to you
The problem with the suggested fixes is that you will always have to load nodes in memory, rendering the optimizations offered by the vector DB service pointless. The point is to leverage them using your search method of choice. Not only that, but in the context of a real-time app, you will have to always manually update your BM25-based query engine to sync with your third-party index. BM25 in llama, and any other search method, should have the ability to be synced with the vector-store of choice.
@abdelatifsd bm25 in llama-index is using the rank-bm25 library, so it's limited to what that package offers, which is all in memory.
Some vector dbs have bm25 built in.
The thing with bm25 is that it's static. If any documents are added to the index, ALL sparse embeddings need to be updated.
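The staleness point can be made concrete with BM25's IDF term: it depends on the total document count N, so adding even one document shifts the weight of every term, including terms the new document never mentions. A stdlib sketch, using one common IDF form (the log(1 + ...) variant popularized by Lucene; rank-bm25's Okapi implementation uses a closely related formula):

```python
import math


def idf(total_docs: int, docs_with_term: int) -> float:
    # One common BM25 IDF form; illustrative, not rank-bm25's exact code.
    return math.log(
        (total_docs - docs_with_term + 0.5) / (docs_with_term + 0.5) + 1
    )


# Suppose "apple" appears in 3 of 10 documents:
before = idf(10, 3)

# Add 5 new documents, none containing "apple". N changes, so the IDF
# of *every* term changes -- even terms untouched by the new documents:
after = idf(15, 3)

print(before != after)  # True -- all precomputed term weights are stale
```

This is why an in-memory BM25 index can't be incrementally synced with a growing vector store: every insert invalidates the whole set of sparse weights, not just the new document's.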
Bug Description
I'm following the tutorial https://docs.llamaindex.ai/en/stable/examples/retrievers/reciprocal_rerank_fusion/ to create a fusion retriever. However, I encounter an error of division by zero in bm25 retriever.
I've provided a code example below. Everything works until I add the line `retriever=retriever` in CitationQueryEngine. Can anyone take a look and provide some help? Or at least let me know how I should debug? Thanks!
Version
0.10.16
Steps to Reproduce
```python
from llama_index.retrievers.bm25 import BM25Retriever
from llama_index.core.retrievers import QueryFusionRetriever

vector_retriever = index.as_retriever(similarity_top_k=config.top_k)

bm25_retriever = BM25Retriever.from_defaults(
    docstore=index.docstore, similarity_top_k=config.top_k
)

retriever = QueryFusionRetriever(
    [vector_retriever, bm25_retriever],
    similarity_top_k=config.top_k,
    num_queries=4,  # set this to 1 to disable query generation
    mode="reciprocal_rerank",
    use_async=True,
    verbose=True,
)

citation_query_engine = CitationQueryEngine.from_args(
    llm=final_response_llm,
    retriever=retriever,
    index=index,
    # here we can control how granular citation sources are,
    # set to match the max ingestion chunk size
)
```
Relevant Logs/Tracebacks