run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Feature Request]: Implement a clustering-based retrieval method for RAG pipelines. #14614

Open Kirushikesh opened 1 month ago

Kirushikesh commented 1 month ago

Feature Description

Implement a clustering-based retrieval method for RAG pipelines using algorithms like DBSCAN. This feature would cluster document embeddings during indexing and retrieve documents based on cluster proximity at query time. The approach would involve:

a) Clustering document embeddings during the indexing phase.
b) Finding the nearest cluster(s) to the query embedding during retrieval.
c) Retrieving the top-k documents from the identified cluster(s).
d) Optionally implementing a hybrid approach: clustering for initial filtering, then vector similarity for final ranking within clusters.
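The steps above could be sketched roughly as follows. This is a minimal illustration using scikit-learn's `DBSCAN` and NumPy, not a proposed LlamaIndex API; the function names (`build_cluster_index`, `retrieve`) and the centroid-based nearest-cluster lookup are assumptions for the sketch:

```python
import numpy as np
from sklearn.cluster import DBSCAN


def build_cluster_index(embeddings: np.ndarray, eps: float = 0.5, min_samples: int = 2):
    """Step (a): cluster document embeddings at indexing time.

    Returns per-document cluster labels and a centroid per cluster.
    DBSCAN labels points that fit no cluster as -1 (noise); those are
    skipped here, but could be handled separately.
    """
    labels = DBSCAN(eps=eps, min_samples=min_samples, metric="cosine").fit_predict(embeddings)
    centroids = {
        label: embeddings[labels == label].mean(axis=0)
        for label in set(labels)
        if label != -1
    }
    return labels, centroids


def retrieve(query_emb: np.ndarray, embeddings: np.ndarray, labels, centroids, top_k: int = 3):
    """Steps (b)-(d): find the nearest cluster by centroid similarity,
    then rank that cluster's members by vector similarity (the hybrid
    filter-then-rank idea) and return the top-k document indices."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    best = max(centroids, key=lambda c: cos(query_emb, centroids[c]))
    members = np.where(labels == best)[0]
    ranked = sorted(members, key=lambda i: cos(query_emb, embeddings[i]), reverse=True)
    return ranked[:top_k]
```

A real integration would likely wrap this in a `BaseRetriever` subclass and pull embeddings from the vector store, but the core indexing/retrieval split is as above.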

Reason

While LlamaIndex currently supports various vector store integrations for similarity search, it doesn't offer a clustering-based retrieval method. This approach hasn't been tried within LlamaIndex yet, but it has shown promise in other systems. For instance, SweepAI uses DBSCAN for clustering and retrieval in their assistant, demonstrating the potential of this approach in production environments.

Value of Feature

This feature could provide several benefits:

a) Improved efficiency for large-scale document retrieval, potentially reducing search time for massive datasets.
b) Better thematic grouping of documents, which could enhance context-aware retrieval.
c) The possibility of identifying and handling outlier documents, improving retrieval quality.
d) Potential for hierarchical retrieval strategies if implemented with hierarchical clustering algorithms.
e) Enhanced flexibility for users dealing with diverse retrieval scenarios, particularly in large-scale or topic-focused applications.
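On the outlier point: DBSCAN gives this detection essentially for free, since it labels points that belong to no cluster as `-1`. A toy illustration (the embeddings are made up for the example):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Three embeddings pointing in roughly the same direction form a thematic
# group; the last one points elsewhere and gets the DBSCAN noise label -1,
# which a retrieval pipeline could surface as an outlier document.
embeddings = np.array([
    [1.0, 0.0],
    [0.98, 0.05],
    [0.97, 0.02],   # tight thematic group
    [-1.0, 0.9],    # off-topic outlier
])
labels = DBSCAN(eps=0.2, min_samples=2, metric="cosine").fit_predict(embeddings)
outliers = np.where(labels == -1)[0]  # indices of outlier documents
```

Outliers could be excluded from cluster-based retrieval, routed to a fallback brute-force search, or flagged for data-quality review.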

logan-markewich commented 1 month ago

Technically this is just what RAPTOR is, but a little smarter. I've been meaning to make the RAPTOR llama-pack a more dedicated/supported module