run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License
36.71k stars 5.26k forks

[Bug]: Division by zero in bm25 retriever #12732

Closed: mw19930312 closed this issue 2 months ago

mw19930312 commented 7 months ago

Bug Description

I'm following the tutorial https://docs.llamaindex.ai/en/stable/examples/retrievers/reciprocal_rerank_fusion/ to create a fusion retriever. However, I encounter an error of division by zero in bm25 retriever.

I've provided a code example below. Everything works before I added the line of retriever=retriever in CitationQueryEngine.

Can anyone take a look and provide some help? Or at least let me know how I should debug? Thanks!

Version

0.10.16

Steps to Reproduce

from llama_index.retrievers.bm25 import BM25Retriever
from llama_index.core.retrievers import QueryFusionRetriever

vector_retriever = index.as_retriever(similarity_top_k=config.top_k)

bm25_retriever = BM25Retriever.from_defaults(
    docstore=index.docstore, similarity_top_k=config.top_k
)

retriever = QueryFusionRetriever(
    [vector_retriever, bm25_retriever],
    similarity_top_k=config.top_k,
    num_queries=4,  # set this to 1 to disable query generation
    mode="reciprocal_rerank",
    use_async=True,
    verbose=True,
)

citation_query_engine = CitationQueryEngine.from_args(
    llm=final_response_llm,
    retriever=retriever,
    index=index,
    # here we can control how granular citation sources are,
    # set to match the max ingestion chunk size
    citation_chunk_size=DEFAULT_MAX_TOKEN_SIZE,
    citation_qa_template=citation_qa_template,
    citation_refine_template=citation_refine_template,
    node_postprocessors=[text_processer],
    similarity_top_k=config.top_k,
    filters=data_filters,
)

Relevant Logs/Tracebacks

Traceback (most recent call last):
  File "/Users/weimiao/Library/Caches/pypoetry/virtualenvs/gateway-7LwWi3Lq-py3.11/lib/python3.11/site-packages/anyio/streams/memory.py", line 97, in receive
    return self.receive_nowait()
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/weimiao/Library/Caches/pypoetry/virtualenvs/gateway-7LwWi3Lq-py3.11/lib/python3.11/site-packages/anyio/streams/memory.py", line 92, in receive_nowait
    raise WouldBlock
anyio.WouldBlock

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/weimiao/Library/Caches/pypoetry/virtualenvs/gateway-7LwWi3Lq-py3.11/lib/python3.11/site-packages/starlette/middleware/base.py", line 159, in call_next
    message = await recv_stream.receive()
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/weimiao/Library/Caches/pypoetry/virtualenvs/gateway-7LwWi3Lq-py3.11/lib/python3.11/site-packages/anyio/streams/memory.py", line 112, in receive
    raise EndOfStream
anyio.EndOfStream

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/weimiao/Desktop/platform/gateway/gateway/api/middleware/log_middleware.py", line 29, in dispatch
    response = await call_next(request)
               ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/weimiao/Library/Caches/pypoetry/virtualenvs/gateway-7LwWi3Lq-py3.11/lib/python3.11/site-packages/starlette/middleware/base.py", line 165, in call_next
    raise app_exc
  File "/Users/weimiao/Library/Caches/pypoetry/virtualenvs/gateway-7LwWi3Lq-py3.11/lib/python3.11/site-packages/starlette/middleware/base.py", line 151, in coro
    await self.app(scope, receive_or_disconnect, send_no_error)
  File "/Users/weimiao/Library/Caches/pypoetry/virtualenvs/gateway-7LwWi3Lq-py3.11/lib/python3.11/site-packages/starlette/middleware/cors.py", line 83, in __call__
    await self.app(scope, receive, send)
  File "/Users/weimiao/Library/Caches/pypoetry/virtualenvs/gateway-7LwWi3Lq-py3.11/lib/python3.11/site-packages/starlette/middleware/exceptions.py", line 62, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/Users/weimiao/Library/Caches/pypoetry/virtualenvs/gateway-7LwWi3Lq-py3.11/lib/python3.11/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/Users/weimiao/Library/Caches/pypoetry/virtualenvs/gateway-7LwWi3Lq-py3.11/lib/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/Users/weimiao/Library/Caches/pypoetry/virtualenvs/gateway-7LwWi3Lq-py3.11/lib/python3.11/site-packages/starlette/routing.py", line 758, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/Users/weimiao/Library/Caches/pypoetry/virtualenvs/gateway-7LwWi3Lq-py3.11/lib/python3.11/site-packages/starlette/routing.py", line 778, in app
    await route.handle(scope, receive, send)
  File "/Users/weimiao/Library/Caches/pypoetry/virtualenvs/gateway-7LwWi3Lq-py3.11/lib/python3.11/site-packages/starlette/routing.py", line 299, in handle
    await self.app(scope, receive, send)
  File "/Users/weimiao/Library/Caches/pypoetry/virtualenvs/gateway-7LwWi3Lq-py3.11/lib/python3.11/site-packages/starlette/routing.py", line 79, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/Users/weimiao/Library/Caches/pypoetry/virtualenvs/gateway-7LwWi3Lq-py3.11/lib/python3.11/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/Users/weimiao/Library/Caches/pypoetry/virtualenvs/gateway-7LwWi3Lq-py3.11/lib/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/Users/weimiao/Library/Caches/pypoetry/virtualenvs/gateway-7LwWi3Lq-py3.11/lib/python3.11/site-packages/starlette/routing.py", line 74, in app
    response = await func(request)
               ^^^^^^^^^^^^^^^^^^^
  File "/Users/weimiao/Library/Caches/pypoetry/virtualenvs/gateway-7LwWi3Lq-py3.11/lib/python3.11/site-packages/fastapi/routing.py", line 299, in app
    raise e
  File "/Users/weimiao/Library/Caches/pypoetry/virtualenvs/gateway-7LwWi3Lq-py3.11/lib/python3.11/site-packages/fastapi/routing.py", line 294, in app
    raw_response = await run_endpoint_function(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/weimiao/Library/Caches/pypoetry/virtualenvs/gateway-7LwWi3Lq-py3.11/lib/python3.11/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/weimiao/Desktop/platform/gateway/gateway/api/endpoints/chat.py", line 44, in stream
    return await handle_chat_stream(
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/weimiao/Library/Caches/pypoetry/virtualenvs/gateway-7LwWi3Lq-py3.11/lib/python3.11/site-packages/sentry_sdk/tracing_utils_py3.py", line 41, in func_with_tracing
    return await func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/weimiao/Desktop/platform/gateway/gateway/messaging.py", line 487, in handle_chat_stream
    chat_engine, tools = get_chat_engine(
                         ^^^^^^^^^^^^^^^^
  File "/Users/weimiao/Library/Caches/pypoetry/virtualenvs/gateway-7LwWi3Lq-py3.11/lib/python3.11/site-packages/sentry_sdk/tracing_utils_py3.py", line 64, in func_with_tracing
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/weimiao/Desktop/platform/gateway/gateway/llama_index/engines.py", line 216, in get_chat_engine
    citation_query_engine = get_unstructure_data_retrieval_engine(system_prompt, config)
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/weimiao/Library/Caches/pypoetry/virtualenvs/gateway-7LwWi3Lq-py3.11/lib/python3.11/site-packages/sentry_sdk/tracing_utils_py3.py", line 64, in func_with_tracing
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/weimiao/Desktop/platform/gateway/gateway/llama_index/engines.py", line 165, in get_unstructure_data_retrieval_engine
    bm25_retriever = BM25Retriever.from_defaults(
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/weimiao/Library/Caches/pypoetry/virtualenvs/gateway-7LwWi3Lq-py3.11/lib/python3.11/site-packages/llama_index/retrievers/bm25/base.py", line 73, in from_defaults
    return cls(
           ^^^^
  File "/Users/weimiao/Library/Caches/pypoetry/virtualenvs/gateway-7LwWi3Lq-py3.11/lib/python3.11/site-packages/llama_index/retrievers/bm25/base.py", line 40, in __init__
    self.bm25 = BM25Okapi(self._corpus)
                ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/weimiao/Library/Caches/pypoetry/virtualenvs/gateway-7LwWi3Lq-py3.11/lib/python3.11/site-packages/rank_bm25.py", line 83, in __init__
    super().__init__(corpus, tokenizer)
  File "/Users/weimiao/Library/Caches/pypoetry/virtualenvs/gateway-7LwWi3Lq-py3.11/lib/python3.11/site-packages/rank_bm25.py", line 27, in __init__
    nd = self._initialize(corpus)
         ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/weimiao/Library/Caches/pypoetry/virtualenvs/gateway-7LwWi3Lq-py3.11/lib/python3.11/site-packages/rank_bm25.py", line 52, in _initialize
    self.avgdl = num_doc / self.corpus_size
                 ~~~~~~~~^~~~~~~~~~~~~~~~~~
ZeroDivisionError: division by zero
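The traceback bottoms out in rank_bm25's `_initialize`, which computes the average document length as `num_doc / self.corpus_size`; with nothing in the docstore the corpus has zero documents. A minimal stand-in (plain Python, not the rank_bm25 source) reproduces the failing arithmetic:

```python
# Minimal stand-in for the failing step in rank_bm25's _initialize:
# the average document length divides the total token count by the
# number of documents, so an empty corpus divides by zero.
def average_doc_length(tokenized_corpus):
    corpus_size = len(tokenized_corpus)
    num_doc = sum(len(doc) for doc in tokenized_corpus)
    return num_doc / corpus_size  # ZeroDivisionError when corpus_size == 0

print(average_doc_length([["hello", "world"], ["foo"]]))  # 1.5
try:
    average_doc_length([])
except ZeroDivisionError:
    print("empty corpus -> division by zero")
```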
dosubot[bot] commented 7 months ago

The "division by zero" error you're encountering with the BM25Retriever typically happens when the algorithm tries to calculate scores for a query against an empty corpus, or the documents in your corpus result in no tokens after preprocessing. This can occur if the documents are empty, or if the tokenization process filters out all content from the documents.

To resolve this, ensure your corpus is not empty and contains valid, tokenizable content. If your corpus is valid but you're still facing this issue, it might be due to the tokenization process removing all tokens from your documents, leaving them empty. In such cases, you might need to adjust the tokenization process to ensure that documents are left with tokens after preprocessing.

Here's a modification you can make to the BM25Retriever initialization to filter out empty token lists and avoid the "division by zero" error:

class BM25Retriever(BaseRetriever):
    def __init__(
        self,
        nodes: List[BaseNode],
        tokenizer: Optional[Callable[[str], List[str]]] = None,
        similarity_top_k: int = 100,
        verbose: bool = False,
    ) -> None:
        self._nodes = nodes
        self._tokenizer = tokenizer or default_tokenizer
        # Tokenize each node's content once and keep only non-empty token lists
        tokenized = (self._tokenizer(node.get_content()) for node in self._nodes)
        self._corpus = [tokens for tokens in tokenized if tokens]
        if not self._corpus:
            raise ValueError(
                "The corpus is empty or all documents resulted in no tokens "
                "after tokenization."
            )
        self.bm25 = BM25Okapi(self._corpus)

This ensures that the corpus passed to BM25Okapi does not contain any empty token lists, which should help in avoiding the "division by zero" error. Additionally, it raises a ValueError if the corpus is empty or if all documents result in no tokens after tokenization, allowing for easier debugging and error handling.

Please try adjusting your tokenization process or ensuring your corpus contains valid content and let me know if the issue persists.
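The guard in the snippet above can be exercised without llama_index at all. This dependency-free sketch (the function name and default tokenizer are illustrative) shows the same tokenize-filter-validate pattern:

```python
# Dependency-free sketch of the guard: tokenize each document once,
# drop documents whose token list comes back empty, and fail loudly
# if nothing survives instead of letting BM25 divide by zero later.
def build_corpus(documents, tokenizer=str.split):
    corpus = [tokens for tokens in (tokenizer(doc) for doc in documents) if tokens]
    if not corpus:
        raise ValueError("No tokenizable content in any document.")
    return corpus

print(build_corpus(["hello world", "", "   ", "foo"]))
# [['hello', 'world'], ['foo']]
```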

logan-markewich commented 7 months ago

This only happens if index.docstore is empty

My guess is you are using some vector db integration

Try passing in the nodes directly: BM25Retriever.from_defaults(nodes=nodes, ...)

mw19930312 commented 7 months ago

@logan-markewich Thanks for the reply! We initiated the vector db by

index = VectorStoreIndex(
    nodes=[],
    embed_model=opensearch_client.os_client.embedding_model,
    storage_context=StorageContext.from_defaults(vector_store=vector_store),
    callback_manager=callback_manager,
)

and this to initialize the bm retriever.

bm25_retriever = BM25Retriever.from_defaults(
    docstore=index.docstore, similarity_top_k=config.top_k
)

Are you suggesting that the BM25Retriever is not dynamically updated as the index gains more nodes (the CitationQueryEngine is able to do so)? I kinda feel that breaking into the pipeline is nasty...

Also, what are the nodes in your previous comment? Are they all the nodes in the vector db? I suppose bm25 returns the top k nodes from all existing nodes in the db, right?

logan-markewich commented 7 months ago

The BM25 retriever does not interact with your vector db

It needs the nodes, and stores them in memory, and its up to you to get the nodes to create the retriever.

These can come from the docstore, but when using a vector db integration, the docstore is disabled to simplify storage. You could manually create and maintain a docstore with the nodes if you wanted.

Otherwise, in your case, you need a way to get the nodes from your vector db so that you can create the retriever
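The "maintain a docstore yourself" idea can be sketched with a minimal in-memory stand-in. The class and method names here are illustrative, not the llama_index docstore API; the point is that BM25 needs every node's text, which the vector db integration alone does not hand back:

```python
# Minimal in-memory stand-in for a manually maintained docstore:
# every node inserted into the vector store is also recorded here,
# so the full node set is available whenever the BM25 retriever
# needs to be (re)built.
class NodeStore:
    def __init__(self):
        self._nodes = {}  # node_id -> node text

    def add_nodes(self, nodes):
        # call this alongside every insert into the vector store
        for node_id, text in nodes:
            self._nodes[node_id] = text

    def all_texts(self):
        return list(self._nodes.values())

store = NodeStore()
store.add_nodes([("a", "hello world"), ("b", "foo bar")])
corpus = [text.split() for text in store.all_texts()]
print(len(corpus))  # 2 documents available to rebuild BM25 from
```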

mw19930312 commented 7 months ago

@logan-markewich Thanks for the reply! After carefully going over our code, I'm considering overriding the retrieve function in llama index for our use case. However, I'm not sure how to retrieve all the nodes available in the index and pass them to the bm25 retriever. How can I do it? Or should I override something higher-level?

async def aretrieve(self, query_bundle: QueryBundle) -> List[NodeWithScore]:
    nodes = await self._retriever.aretrieve(query_bundle)

    for postprocessor in self._node_postprocessors:
        nodes = postprocessor.postprocess_nodes(nodes, query_bundle=query_bundle)

    return nodes

where self._retriever is a base retriever with the aretrieve function defined as

 async def aretrieve(self, str_or_query_bundle: QueryType) -> List[NodeWithScore]:
        self._check_callback_manager()
        dispatch_event = dispatcher.get_dispatch_event()

        dispatch_event(
            RetrievalStartEvent(
                str_or_query_bundle=str_or_query_bundle,
            )
        )
        if isinstance(str_or_query_bundle, str):
            query_bundle = QueryBundle(str_or_query_bundle)
        else:
            query_bundle = str_or_query_bundle
        with self.callback_manager.as_trace("query"):
            with self.callback_manager.event(
                CBEventType.RETRIEVE,
                payload={EventPayload.QUERY_STR: query_bundle.query_str},
            ) as retrieve_event:
                import pdb; pdb.set_trace()
                nodes = await self._aretrieve(query_bundle=query_bundle)
                nodes = await self._ahandle_recursive_retrieval(
                    query_bundle=query_bundle, nodes=nodes
                )
                retrieve_event.on_end(
                    payload={EventPayload.NODES: nodes},
                )
        dispatch_event(
            RetrievalEndEvent(
                str_or_query_bundle=str_or_query_bundle,
                nodes=nodes,
            )
        )
        return nodes
logan-markewich commented 7 months ago

You could retrieve with a super high top k (like 20000) and give those nodes to bm25. Or just store your nodes somewhere they are easy to access (like a docstore)

logan-markewich commented 7 months ago

I don't think overriding methods is quite necessary here. But up to you

abdelatifsd commented 5 months ago

The problem with the suggested fixes is that you will always have to load nodes in memory, rendering the optimizations offered by the vector DB service pointless. The point is to leverage them using your search method of choice. Not only that, but in the context of a real-time app you will always have to manually update your BM25-based query engine to keep it in sync with your third-party index. BM25 in llama, like any other search method, should have the ability to be synced with the vector store of choice.

logan-markewich commented 5 months ago

@abdelatifsd bm25 in llama-index is using the rank-bm25 library, so it's limited to what that package offers, which is all in memory.

Some vector dbs have bm25 built in.

The thing with bm25 is that it's static. If any documents are added to the index, ALL sparse embeddings need to be updated.