run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Feature Request]: Add batch retrieval because the current implementation is not feasible even on small datasets #11073

Closed · hkristof03 closed this issue 1 week ago

hkristof03 commented 6 months ago

Feature Description

Hi, thanks for this awesome library.

I am trying to benchmark the components of a RAG pipeline up to and including the retrieval step (chunking, embedding models, rerankers, etc.) on the QuAC and SQuAD datasets, which are relatively small.

I am using the following components:

- RetrieverEvaluator
- EmbeddingQAFinetuneDataset
- aevaluate_dataset()

I am also using FAISS based on this tutorial: https://docs.llamaindex.ai/en/stable/examples/vector_stores/FaissIndexDemo.html
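
For context, here is a minimal sketch of my setup (import paths assume a recent llama-index release; `nodes` stands for the chunked documents and `qa_path` for a serialized EmbeddingQAFinetuneDataset):

```python
import faiss
from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.core.evaluation import (
    EmbeddingQAFinetuneDataset,
    RetrieverEvaluator,
)
from llama_index.vector_stores.faiss import FaissVectorStore


async def evaluate_retrieval(nodes, qa_path: str):
    # FAISS-backed index, as in the linked tutorial; 384 is the
    # embedding dimension of the model under test.
    faiss_index = faiss.IndexFlatL2(384)
    storage_context = StorageContext.from_defaults(
        vector_store=FaissVectorStore(faiss_index=faiss_index)
    )
    index = VectorStoreIndex(nodes, storage_context=storage_context)

    evaluator = RetrieverEvaluator.from_metric_names(
        ["mrr", "hit_rate"],
        retriever=index.as_retriever(similarity_top_k=10),
    )
    qa_dataset = EmbeddingQAFinetuneDataset.from_json(qa_path)
    # Internally this retrieves for each (query, expected ids) pair
    # one at a time, which is the bottleneck described below.
    return await evaluator.aevaluate_dataset(qa_dataset, workers=10)
```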

The problem is that the retrieval evaluation is inefficient: it doesn't use FAISS's batch search capabilities, and instead iterates over the queries and performs the search one by one, as in the code linked below.

https://github.com/run-llama/llama_index/blob/da5f941662b65d2e3fe2100f2b58c3ba98d49e90/llama-index-core/llama_index/core/evaluation/retrieval/base.py#L189
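
For reference, plain FAISS already accepts a whole matrix of queries in a single search call; a minimal standalone sketch (random vectors, nothing llama-index specific):

```python
import faiss
import numpy as np

d, n_corpus, k = 384, 3_000, 10
corpus = np.random.rand(n_corpus, d).astype("float32")
queries = np.random.rand(n_corpus, d).astype("float32")

index = faiss.IndexFlatL2(d)
index.add(corpus)

# One by one, as the evaluator effectively does: one call per query.
for q in queries:
    _dists, _ids = index.search(q.reshape(1, -1), k)

# Batched: a single call over the full (n_queries, d) matrix, which
# FAISS handles internally and can parallelize across threads.
dists, ids = index.search(queries, k)  # both results have shape (3000, k)
```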

Moreover, I don't see any speed-up when I use multiple workers: the computation time is almost the same with 1 worker and with 10.

For comparison, in my experience, querying a FAISS index of ~140k vectors (with much higher dimensionality than the 384 used here) with the whole vector set as a single batch took at most ~3 minutes on the same hardware (a high-end consumer laptop). Here, with a cardinality of only 3k, it took ~10 minutes.

It is currently not feasible to use this feature even on small datasets.

Reason

The current implementation of retrieval evaluation is not usable due to its slowness.

Value of Feature

Batch retrieval would make it feasible to evaluate the Retrieval component of a RAG pipeline.

dosubot[bot] commented 6 months ago

🤖

Hey @hkristof03, great to see you back! Hope you're doing well.

Thank you for your detailed feature request and for pointing out the inefficiency in the current retrieval implementation. I understand your concern about the current implementation not utilizing the batch search capabilities of FAISS, which results in slower performance, especially on larger datasets.

The current implementation of the QueryFusionRetriever class in LlamaIndex indeed iterates over each query and retriever one by one, which can be inefficient. However, modifying the _run_sync_queries and _run_async_queries methods to perform a batch search on the FAISS index would require the addition of new methods to the retriever classes, which is not recommended unless explicitly requested or if the suggested feature requires it.

That being said, your suggestion to utilize the batch search capabilities of FAISS is valid and could significantly improve the efficiency of the retrieval implementation in LlamaIndex. This would involve collecting all queries and performing a batch search on the FAISS index, instead of iterating over each query one by one.
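
As a rough illustration of that idea (a hypothetical helper, not existing LlamaIndex code; `get_text_embedding_batch` on the embedding model and `search` on the FAISS index are the only real APIs used):

```python
import numpy as np


def batch_retrieve(embed_model, faiss_index, queries: list[str], k: int = 10):
    """Hypothetical batched path: embed all queries, then one FAISS call."""
    # Embed every query string in one batched request.
    embeddings = embed_model.get_text_embedding_batch(queries)
    query_matrix = np.asarray(embeddings, dtype="float32")
    # A single search over the whole matrix replaces the per-query loop.
    distances, ids = faiss_index.search(query_matrix, k)
    return distances, ids  # one row of k neighbours per query
```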

However, implementing this feature would require a significant change in the current codebase and might introduce new complexities. It would be best if this feature is implemented by the maintainers of the LlamaIndex repository, as they have a deep understanding of the codebase and can ensure that the new feature integrates well with the existing code.

I recommend creating a new issue in the LlamaIndex repository detailing your feature request. This will allow the maintainers to evaluate your request and decide on the best course of action. Please include as much detail as possible in your issue, including the problem you're facing, your proposed solution, and the benefits of implementing your feature.

I hope this helps, and thank you for your contribution to improving LlamaIndex. If you have any other questions or need further clarification, please don't hesitate to ask.

For more information on the current structure of the QueryFusionRetriever class in the LlamaIndex codebase, you can refer to the source code.

For more information on how to use FAISS for batch search, you can refer to the FAISS documentation.

For more information on how to create an issue in a GitHub repository, you can refer to the GitHub documentation.

Please note that Dosu is an AI and does not have the ability to create pull requests or issues.

hkristof03 commented 6 months ago

@dosu-bot Hi AI bro!

I haven't mentioned QueryFusionRetriever. The method aevaluate_dataset is a part of BaseRetrievalEvaluator class. The aevaluate_dataset expects a dataset: EmbeddingQAFinetuneDataset parameter. I think from the name it is clear that both the method and the dataset were designed for fine-tuning and its related evaluation. I think the design decision to retrieve nearest neighbors from the vector store one by one is not a good one, as you already confirmed that the process is unreasonably slow for evaluation. The one by one async retrieval is good when the system is deployed / streaming, but for that, you don't need an 'EmbeddingQAFinetuneDataset', nor an aevaluate_dataset.

The current implementation makes it infeasible to evaluate the Retrieval component of the RAG pipeline even on small datasets. Can you recommend a feasible approach with the current version of the library, without me having to implement my own solution from scratch?

jphme commented 4 months ago

Also found there is currently no viable option to speed this up.

Didn't do detailed profiling, but the pipeline is CPU-bound, and you can only use multiple processes efficiently by instantiating multiple retrievers (which are themselves GPU/VRAM-bound with a local embedding model?)...

dosubot[bot] commented 4 months ago

To address the inefficiencies with the current implementation of BaseRetrievalEvaluator and EmbeddingQAFinetuneDataset in LlamaIndex, especially regarding the utilization of FAISS's batch search capabilities and the performance with multiple workers, consider the following approaches:

  1. Optimize FAISS Batch Search: Modify the BaseRetrievalEvaluator class to leverage FAISS's batch search capabilities. This involves changing the aevaluate method to retrieve a batch of nearest neighbors from the vector store instead of one by one, utilizing the search method of the FAISS index with a batch of queries. This adjustment should significantly enhance the retrieval process's speed.

  2. Parallel Processing: Optimize the use of multiple workers. Although the current implementation doesn't show performance improvement with multiple workers, managing a pool of workers where each is responsible for a subset of the dataset could be beneficial. This can be achieved using Python's concurrent.futures module, ensuring the workload is evenly distributed among workers to avoid bottlenecks (see the sketch after this list).

  3. Dataset Cardinality Handling: Dynamically adjust the batch size or the number of queries processed in parallel based on the dataset's characteristics, especially for datasets with high variability in query complexity or length.

  4. Custom BatchEvalRunner: Extend or modify the BatchEvalRunner class to better handle batch operations with FAISS, incorporating efficient batch search and parallel processing logic tailored to the characteristics of the QUAC and SQUAD datasets.

  5. Monitoring and Profiling: Identify bottlenecks in the evaluation process using profiling tools. This can help pinpoint inefficiencies in the batch search or parallel processing implementation. Also, monitor system resources to ensure that the parallel processing implementation does not lead to resource contention.
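
As a sketch of suggestion 2, the dataset could be sharded across processes with concurrent.futures. Here `build_evaluator` is a hypothetical user-supplied factory, since retriever objects generally cannot be pickled and must be constructed inside each worker:

```python
from concurrent.futures import ProcessPoolExecutor


def evaluate_shard(pairs):
    # Each worker builds its own evaluator; build_evaluator is a
    # hypothetical factory you would supply, not existing library code.
    evaluator = build_evaluator()
    # BaseRetrievalEvaluator.evaluate scores one (query, expected_ids) pair.
    return [evaluator.evaluate(query, expected_ids) for query, expected_ids in pairs]


def parallel_evaluate(pairs, n_workers: int = 4):
    # Round-robin split so each process receives a similar-sized slice.
    shards = [pairs[i::n_workers] for i in range(n_workers)]
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        return [r for shard in pool.map(evaluate_shard, shards) for r in shard]
```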

These suggestions might require some adjustments based on your specific use case and the characteristics of your datasets.

logan-markewich commented 1 week ago

The pipeline is indeed hardware-bound unless you use a remote vector store and models

Remote vector stores and models support async, allowing you to run a bunch of stuff concurrently
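
e.g. something along these lines, where aretrieve is the standard async retriever method and the semaphore just caps how many requests are in flight at once:

```python
import asyncio


async def retrieve_all(retriever, queries, max_concurrency: int = 32):
    # With a remote vector store each call is network-bound, so many
    # can overlap; the semaphore prevents flooding the backend.
    sem = asyncio.Semaphore(max_concurrency)

    async def one(query):
        async with sem:
            return await retriever.aretrieve(query)

    return await asyncio.gather(*(one(q) for q in queries))
```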

Going to close this out because tbh I don't see batch retrieval being supported anytime soon, especially when async exists