run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Question]: composite retrieval on top of the structured data #12504

Closed ipoletaev closed 4 months ago

ipoletaev commented 7 months ago

Question

Hi!

What's the right way to force retrieval against both a pandas query engine, which does structured search over a table, and a parallel vector index, which implements semantic search over the same table (using rows as the text representation)? The goal is that whenever the parent recursive retriever decides to explore the table info at all, it always runs both the pandas query engine and this vector index.

Namely:

# Non-table information.
other_nodes = [...]

# An entry node to make traversing table's information optional. 
table_entry_node = IndexNode(text="table description", index_id="table")

# Table information.
pandas_query_engine = PandasQueryEngine(...)

# Same table but with semantic search over rows.
table_rows_vector_index = VectorStoreIndex(...)

and I'd like to see whether an existing implementation (if any) can be reused to accomplish this:

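composite_retriever = SomeClassName([pandas_query_engine, table_rows_vector_index], ...)
# retriever_dict maps ids to retrievers, so the root index is wrapped with as_retriever().
root_retriever = core.VectorStoreIndex(other_nodes + [table_entry_node]).as_retriever()
hierarchical_retriever = retrievers.RecursiveRetriever(
    "root",
    retriever_dict={"root": root_retriever, "table": composite_retriever},
)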

The motivation is that I already have a good re-ranker implementation to fuse together semantic search + table retrievals and I just want to always use both.

Thanks!

dosubot[bot] commented 7 months ago

It seems you want to combine a PandasQueryEngine (structured search) with a VectorStoreIndex (semantic search) in a single composite retriever, so that both are always used together in one retrieval step.

The LlamaIndex repository has no example that matches this use case exactly, but it does contain examples of combining different search methods. For instance, the SummaryIndex class can aggregate different node types (TextNode and IndexNode) and retrieve over them through a single instance. This is similar in spirit to a composite retriever; see the test_composable_retrieval function in test_composable_retriever.py.

The VectorIndexAutoRetriever class in auto_retriever.py uses a large language model (LLM) to set vector store query parameters automatically. It wraps a VectorStoreIndex for vector-based retrieval and does not run structured search itself, but it may be a useful pattern to adapt if you want to add structured search similar to what a PandasQueryEngine provides.

These examples are a starting point for combining search methods, but they do not cover your requirement that the pandas query engine and the vector index always run together. That would likely need additional logic to run both searches and merge their results.
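As a rough illustration of that extra logic, here is a minimal sketch in plain Python that does not use the LlamaIndex API. `ScoredHit`, `AlwaysBothRetriever`, `structured_lookup`, and `semantic_lookup` are hypothetical names invented for this example; the two lookup functions stand in for the pandas query engine and the vector index. The composite runs every sub-retriever concurrently for each query and merges the hits, leaving the final ordering to a downstream re-ranker.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence


@dataclass(frozen=True)
class ScoredHit:
    """Hypothetical stand-in for a retrieved node with a relevance score."""
    text: str
    score: float
    source: str


# A sub-retriever is any callable that maps a query string to scored hits.
SubRetriever = Callable[[str], List[ScoredHit]]


class AlwaysBothRetriever:
    """Runs every sub-retriever on each query and merges their hits."""

    def __init__(self, sub_retrievers: Sequence[SubRetriever]) -> None:
        if not sub_retrievers:
            raise ValueError("at least one sub-retriever is required")
        self._subs = list(sub_retrievers)

    def retrieve(self, query: str) -> List[ScoredHit]:
        # Run all sub-retrievers concurrently; map() keeps their order.
        with ThreadPoolExecutor(max_workers=len(self._subs)) as pool:
            per_source = list(pool.map(lambda sub: sub(query), self._subs))

        # Deduplicate by text, keeping the highest-scoring copy of each hit.
        best: Dict[str, ScoredHit] = {}
        for hits in per_source:
            for hit in hits:
                kept = best.get(hit.text)
                if kept is None or hit.score > kept.score:
                    best[hit.text] = hit
        # No sorting here: ranking is left to the downstream re-ranker.
        return list(best.values())


def structured_lookup(query: str) -> List[ScoredHit]:
    # Placeholder for the pandas query engine (assumed behavior).
    return [ScoredHit("row 3: widget, 42", 1.0, "pandas")]


def semantic_lookup(query: str) -> List[ScoredHit]:
    # Placeholder for the vector index over table rows (assumed behavior).
    return [
        ScoredHit("row 3: widget, 42", 0.7, "vector"),
        ScoredHit("row 8: gadget, 17", 0.6, "vector"),
    ]


composite = AlwaysBothRetriever([structured_lookup, semantic_lookup])
for hit in composite.retrieve("widget price"):
    print(hit.source, hit.score, hit.text)
```

In LlamaIndex itself, the same shape would presumably be a subclass of BaseRetriever that implements `_retrieve` and returns NodeWithScore objects, with the pandas query engine's response wrapped into a node. That wrapping step is an assumption here and is not shown.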

Adapting the mentioned examples to fit your specific needs could be a viable starting point. If you require further assistance or have more detailed questions about implementing this composite retriever, feel free to provide additional context or specifics about your implementation goals.
