[Question]: Hybrid search?

chengyjonathan commented 7 months ago

Question Validation

[X] I have searched both the documentation and discord for an answer.

Question

Hi there!

I'm wondering where the best place to get started would be if I wanted to implement a hybrid of lexical, semantic, and graph search.

I see the examples for doing lexical + semantic retrieval.

And I was wondering if there are any guides for doing semantic + kg retrieval.

dosubot[bot] commented 7 months ago

Hey @chengyjonathan! Great to see you venturing into new territories again 🚀

Based on the information provided in the LlamaIndex codebase, it appears that the framework already supports a hybrid search mechanism involving lexical, semantic, and knowledge graph retrieval. This is primarily facilitated through the KGTableRetriever class, where the retriever_mode parameter can be set to KGRetrieverMode.HYBRID, indicating a hybrid approach combining both keywords and embeddings to find relevant triplets in a knowledge graph.

The hybrid search functionality is implemented in the _retrieve method of the KGTableRetriever class. This method uses both lexical search (via keywords) and semantic search (via embeddings) to retrieve relevant nodes from the knowledge graph. If the retriever_mode is set to HYBRID, the method combines results from both lexical and semantic searches to enhance the retrieval process.

Here's a brief overview of how the hybrid search works:

1. Keyword Extraction: The method starts by extracting keywords from the query string using the _get_keywords method, which leverages a language model for keyword extraction.
2. Lexical Search: For each extracted keyword, it searches the knowledge graph for relevant nodes using the keyword search functionality (self._index_struct.search_node_by_keyword). This represents the lexical aspect of the search.
3. Semantic Search: If embeddings are used (self._retriever_mode != KGRetrieverMode.KEYWORD), the method computes the query embedding using the provided embedding model (self._embed_model.get_text_embedding). It then retrieves relevant text snippets based on their semantic similarity to the query embedding, using the embeddings stored in self._index_struct.embedding_dict.
4. Combining Results: If the retriever_mode is HYBRID, the method combines nodes found through both lexical and semantic searches. It ensures uniqueness of the results and prioritizes them based on relevance.
5. Final Node Retrieval: The method retrieves the final set of nodes based on the combined results from both searches, along with their associated text snippets if include_text is set to True.

You can find this implementation in the llama-index-core/llama_index/core/indices/knowledge_graph/retrievers.py file.
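
The hybrid flow described above can be illustrated with a small, library-free sketch. This is not the KGTableRetriever code: the triplets, the two-dimensional embeddings, and the functions keyword_search, embedding_search, and hybrid_search are all hypothetical, and keywords are passed in directly instead of being extracted by a language model.

```python
import math
from typing import Dict, List, Tuple

Triplet = Tuple[str, str, str]

# Toy knowledge graph with hand-made 2-d "embeddings" (hypothetical data).
TRIPLETS: List[Triplet] = [
    ("Paris", "capital_of", "France"),
    ("Berlin", "capital_of", "Germany"),
    ("Seine", "flows_through", "Paris"),
]
EMBEDDINGS: Dict[Triplet, List[float]] = {
    TRIPLETS[0]: [1.0, 0.0],
    TRIPLETS[1]: [0.0, 1.0],
    TRIPLETS[2]: [0.8, 0.6],
}


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def keyword_search(keywords: List[str]) -> List[Triplet]:
    """Lexical step: triplets whose subject or object equals a keyword."""
    wanted = {k.lower() for k in keywords}
    return [t for t in TRIPLETS if t[0].lower() in wanted or t[2].lower() in wanted]


def embedding_search(query_embedding: List[float], top_k: int) -> List[Triplet]:
    """Semantic step: the top_k triplets by cosine similarity to the query."""
    ranked = sorted(
        TRIPLETS,
        key=lambda t: cosine(query_embedding, EMBEDDINGS[t]),
        reverse=True,
    )
    return ranked[:top_k]


def hybrid_search(
    keywords: List[str], query_embedding: List[float], top_k: int = 1
) -> List[Triplet]:
    """Combine both result lists: keyword hits first, duplicates dropped."""
    combined = keyword_search(keywords) + embedding_search(query_embedding, top_k)
    return list(dict.fromkeys(combined))  # dict preserves first-seen order
```

Calling hybrid_search(["Germany"], [1.0, 0.0]) returns the Berlin triplet from the keyword match followed by the Paris triplet from the embedding match, which shows how the two signals can surface different parts of the graph.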

I hope this helps! If you have any further questions or need more clarification, feel free to ask.


logan-markewich commented 7 months ago

I think any custom retriever could do this. Or using the query fusion retriever and passing in all the retrievers you want to mix

https://docs.llamaindex.ai/en/stable/examples/retrievers/relative_score_dist_fusion/?h=query+fusion
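
For intuition about what mixing retrievers by score can look like, here is a minimal sketch of one relative-score style of fusion: each retriever's scores are min-max normalized and then averaged. It is a plain-Python illustration under assumed inputs, not the library's implementation; min_max_normalize, fuse_relative_scores, and the sample scores are hypothetical.

```python
from typing import Dict, List, Tuple


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale one retriever's scores to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores equal: treat every hit as equally strong.
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def fuse_relative_scores(
    result_sets: List[Dict[str, float]], top_k: int = 3
) -> List[Tuple[str, float]]:
    """Average the normalized scores across retrievers (equal weights)."""
    if not result_sets:
        return []
    weight = 1.0 / len(result_sets)
    fused: Dict[str, float] = {}
    for scores in result_sets:
        for doc, score in min_max_normalize(scores).items():
            fused[doc] = fused.get(doc, 0.0) + weight * score
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)[:top_k]


# Hypothetical raw scores from three retrievers on very different scales.
bm25_scores = {"a": 10.0, "b": 5.0, "c": 0.0}
vector_scores = {"b": 0.9, "c": 0.5, "d": 0.1}
kg_scores = {"b": 2.0, "a": 1.0}

fused = fuse_relative_scores([bm25_scores, vector_scores, kg_scores])
```

Normalizing first matters because BM25 scores, cosine similarities, and graph-based scores are not directly comparable; without it, the retriever with the largest raw numbers would dominate the mix.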

chengyjonathan commented 7 months ago

> I think any custom retriever could do this. Or using the query fusion retriever and passing in all the retrievers you want to mix
>
> https://docs.llamaindex.ai/en/stable/examples/retrievers/relative_score_dist_fusion/?h=query+fusion

Forgive me if this is wrong, but would I need two separate indexes? One composed of kg triplets?

Or could I just mix bm25, embedding based, and a kg based retriever?

logan-markewich commented 7 months ago

@chengyjonathan Three retrievers, which then get composed into one. How you get those retrievers is up to you 👍🏻
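
A rough sketch of that composition, assuming each retriever can be treated as a function from a query to a ranked list of document ids. ComposedRetriever and the three stub retrievers are hypothetical stand-ins for a BM25, an embedding-based, and a KG-based retriever, and reciprocal rank fusion is used here only as one simple way to merge rankings.

```python
from typing import Callable, Dict, List

# A retriever is modeled as: query -> ranked document ids, best first.
Retriever = Callable[[str], List[str]]


class ComposedRetriever:
    """Merges several retrievers with reciprocal rank fusion (RRF)."""

    def __init__(self, retrievers: List[Retriever], k: int = 60) -> None:
        self.retrievers = retrievers
        self.k = k  # RRF smoothing constant; 60 is a commonly used value

    def retrieve(self, query: str, top_k: int = 3) -> List[str]:
        scores: Dict[str, float] = {}
        for retriever in self.retrievers:
            for rank, doc_id in enumerate(retriever(query), start=1):
                # Each retriever contributes 1 / (k + rank) per document.
                scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (self.k + rank)
        return sorted(scores, key=lambda d: scores[d], reverse=True)[:top_k]


# Stub retrievers with fixed output (hypothetical; real ones would query an index).
def bm25_stub(query: str) -> List[str]:
    return ["doc1", "doc2", "doc3"]


def vector_stub(query: str) -> List[str]:
    return ["doc2", "doc4", "doc1"]


def kg_stub(query: str) -> List[str]:
    return ["doc2", "doc3"]


hybrid = ComposedRetriever([bm25_stub, vector_stub, kg_stub])
top_docs = hybrid.retrieve("example query")
```

Because the composed retriever only sees ranked outputs, it does not care whether the three retrievers share one index or each sit on their own.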

anhthyngo commented 6 months ago

> I think any custom retriever could do this. Or using the query fusion retriever and passing in all the retrievers you want to mix
>
> https://docs.llamaindex.ai/en/stable/examples/retrievers/relative_score_dist_fusion/?h=query+fusion

@logan-markewich

If I had a KnowledgeGraphRAGRetriever and a VectorIndexAutoRetriever and wanted to fuse them, would something like the SQLAutoVectorQueryEngine framework, but for knowledge graphs, be a better approach than this fusion retriever?

https://docs.llamaindex.ai/en/stable/examples/query_engine/SQLAutoVectorQueryEngine/

run-llama / llama_index