run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Bug]: Tree Traversal Retrieval Issue in RAPTOR with Azure AI Search #14484

Open ZeinabSh12 opened 1 month ago

ZeinabSh12 commented 1 month ago

Bug Description

There is an issue with the tree_traversal retrieval mode in RAPTOR when using Azure AI Search as the vector database. The retrieval process fails to return any nodes at level 0, resulting in no final nodes being retrieved. This issue persists across varied test sets.

Environment:

Code Snippet:

```python
# Define model and embedding names
model_deployment_name = "gpt-4"
embedding_deployment_name = "text-embedding-ada-002"

# Simplified example of the RAPTOR pack setup
raptor_pack = RaptorPack(
    documents,
    embed_model=OpenAIEmbedding(model=embedding_deployment_name),
    llm=OpenAI(model=model_deployment_name, temperature=0.1),
    vector_store=vector_store,  # Azure AI Search
    similarity_top_k=2,
    mode="tree_traversal",
    transformations=[SentenceSplitter(chunk_size=400, chunk_overlap=50)],
    verbose=True,
)

# Retrieval call
nodes = raptor_pack.run("What baselines is raptor compared against?", mode="tree_traversal")
print(len(nodes))
print(nodes[0].text if nodes else "No nodes retrieved")
```

Issue Description:

Ingestion works as expected:

  1. Nodes are successfully inserted at each level (0 to N).
  2. Parent IDs are correctly assigned and logged.

Retrieval does not:

  1. Retrieval starts from the top level (level N) and works downwards.
  2. Nodes are retrieved successfully at higher levels (N-1, N-2, etc.).
  3. Retrieval fails at level 0: no parent IDs are retrieved, resulting in an empty final node list.

Expected Behavior: Nodes should be successfully retrieved at level 0, allowing for complete traversal and retrieval of relevant nodes.

Request: Please investigate the issue with the tree_traversal mode in RAPTOR when using Azure AI Search as the vector database. Any insights or fixes would be greatly appreciated.
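For context, the failure mode can be reproduced with a toy, in-memory version of the RAPTOR node layout (a hypothetical sketch using plain dicts, not the actual llama-index classes; all IDs and names are made up): each node records its level and the ID of the summary node one level above it, and traversal follows those links from the top down. If the level-0 lookup by parent ID matches nothing, the final result is empty even though every higher level succeeded.

```python
# Toy RAPTOR-style store: every node records its level and the ID of the
# summary node one level above it (hypothetical sketch, not llama-index classes).
tree = {
    "s2": {"level": 2, "parent_id": None},   # top-level summary
    "s1": {"level": 1, "parent_id": "s2"},   # mid-level summary
    "c0": {"level": 0, "parent_id": "s1"},   # leaf chunk
}

def children_of(parent_ids):
    """IDs of nodes whose parent_id points at one of the given parents."""
    return [nid for nid, n in tree.items() if n["parent_id"] in parent_ids]

def tree_traversal(tree_depth):
    """Start at the top level, then follow parent links down to level 0."""
    current = [nid for nid, n in tree.items() if n["level"] == tree_depth - 1]
    for _ in range(tree_depth - 1):
        current = children_of(current)
        if not current:
            return []          # the reported symptom: nothing reaches level 0
    return current

print(tree_traversal(3))       # ['c0'] when parent links are intact

tree["c0"]["parent_id"] = None  # simulate a lost or mismatched parent link
print(tree_traversal(3))        # [] — the reported empty result at level 0
```

This suggests checking whether the level-0 documents in the Azure index actually carry the `parent_id` values that the higher-level retrieval returns.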

Version

llama-index-core==0.10.37

Steps to Reproduce

This is the reference, but I ran it with Azure AI Search: raptor pack. This is the code with Azure AI Search:

```python
# Client for chat model
openai_client = AzureOpenAI(
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    api_version=os.getenv("OPENAI_API_VERSION"),
    azure_deployment="gpt-4",
)

# Client for embedding model
embedding_client = AzureOpenAIEmbedding(
    api_base=os.getenv("AZURE_OPENAI_ENDPOINT"),
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    api_version=os.getenv("OPENAI_API_VERSION"),
    azure_deployment="text-embedding-ada-002",
)

# Define index name
index_name: str = "llamaindex-raptor-index-v5"

# Define search key
search_key = AzureKeyCredential(vector_store_password)

# Initiate search clients (to connect to Azure AI Search)
search_client = SearchClient(endpoint=vector_store_address, index_name=index_name, credential=search_key)
index_client = SearchIndexClient(endpoint=vector_store_address, credential=search_key)

metadata_fields = {
    "page_label": "page_label",
    "level": ("level", MetadataIndexFieldType.INT32),
    "parent_id": "parent_id",
}

vector_store = AzureAISearchVectorStore(
    search_or_index_client=index_client,
    filterable_metadata_field_keys=metadata_fields,
    index_name=index_name,
    index_management=IndexManagement.CREATE_IF_NOT_EXISTS,
    id_field_key="id",
    chunk_field_key="chunk",
    embedding_field_key="embedding",
    embedding_dimensionality=1536,
    metadata_string_field_key="metadata",
    doc_id_field_key="doc_id",
    language_analyzer="en.lucene",
    vector_algorithm_type="exhaustiveKnn",
)

# Load documents from the specified directory
dir_path = "data/raptor_sample_paper"
documents = SimpleDirectoryReader(dir_path).load_data()

raptor_pack = RaptorPack(
    documents,
    embed_model=embedding_client,  # used for embeddings
    llm=openai_client,             # used for generating summaries
    vector_store=vector_store,     # used for storage
    similarity_top_k=2,            # top k for each layer, or overall top-k for collapsed
    mode="tree_traversal",         # sets default mode
    transformations=[SentenceSplitter(chunk_size=400, chunk_overlap=50)],  # applied during ingestion
    verbose=True,                  # enable verbose logging for debugging
)

nodes = raptor_pack.run("What baselines is raptor compared against?", mode="tree_traversal")
print(len(nodes))
print(nodes[0].text)
```
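Azure AI Search evaluates metadata filters as OData expressions, so a `level` or `parent_id` lookup ultimately becomes a filter string along these lines (a rough stand-alone illustration with a hypothetical helper, not the library's actual translation code):

```python
def odata_eq(key, value):
    """Build a simple OData equality filter of the kind Azure AI Search accepts.
    Strings are quoted; numbers are emitted bare. Hypothetical helper for
    illustration only, not llama-index's actual filter translation."""
    if isinstance(value, str):
        return f"{key} eq '{value}'"
    return f"{key} eq {value}"

# A level filter (declared as an INT32 field in metadata_fields above)
print(odata_eq("level", 0))          # level eq 0
# A parent-ID filter (a plain string field)
print(odata_eq("parent_id", "id3"))  # parent_id eq 'id3'
```

Since `level` is mapped to `MetadataIndexFieldType.INT32` while `parent_id` is a string field, it may be worth confirming in the Azure portal that level-0 documents were indexed with the expected field types and values; a type or value mismatch would make an equality filter silently match nothing.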

Relevant Logs/Tracebacks

Starting retrieval at level 2
Retrieving nodes at level 2
Retrieved parent IDs from level 2: ['id1', 'id2']
Starting retrieval at level 1
Retrieving nodes at level 1
Retrieved parent IDs from level 1: ['id3', 'id4']
Starting retrieval at level 0
Retrieving nodes at level 0
Retrieved parent IDs from level 0: []
No parent IDs retrieved at level 0
Final nodes retrieved: []
dosubot[bot] commented 1 month ago

Hey there, @ZeinabSh12! I'm here to help with any bugs, questions, or contributions you might have. Let's squash those bugs together! 🤖

To fix the issue with the tree_traversal mode in RAPTOR when using Azure AI Search as the vector database, where the retrieval process fails to return any nodes at level 0, you need to ensure that `level` is decremented correctly in the `tree_traversal_retrieval` method. The current implementation does not decrement the `level` variable on every iteration of the loop.

Here is the corrected code for the tree_traversal_retrieval method:

async def tree_traversal_retrieval(self, query_str: str) -> Response:
    """Query the index as a tree, traversing the tree from the top down."""
    # get top k nodes for each level, starting with the top
    parent_ids = None
    nodes = []
    level = self.tree_depth - 1
    while level >= 0:
        # retrieve nodes at the current level
        if parent_ids is None:
            nodes = await self.index.as_retriever(
                similarity_top_k=self.similarity_top_k,
                filters=MetadataFilters(
                    filters=[MetadataFilter(key="level", value=level)]
                ),
            ).aretrieve(query_str)
            parent_ids = [node.id_ for node in nodes]
            if self._verbose:
                print(f"Retrieved parent IDs from level {level}: {parent_ids!s}")
        # retrieve nodes that are children of the nodes at the previous level
        elif parent_ids is not None and len(parent_ids) > 0:
            nested_nodes = await asyncio.gather(
                *[
                    self.index.as_retriever(
                        similarity_top_k=self.similarity_top_k,
                        filters=MetadataFilters(
                            filters=[MetadataFilter(key="parent_id", value=id_)]
                        ),
                    ).aretrieve(query_str)
                    for id_ in parent_ids
                ]
            )

            nodes = [node for nested in nested_nodes for node in nested]

            if self._verbose:
                print(f"Retrieved {len(nodes)} from parents at level {level}.")

            # carry the newly retrieved node IDs forward as the parents
            # for the next level down
            parent_ids = [node.id_ for node in nodes]

        level -= 1

    return nodes

The key change is moving the `level -= 1` statement outside of the `elif` block so that `level` is decremented on every iteration of the loop, ensuring the retrieval process correctly traverses down the tree levels [1].
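To see the intended control flow in isolation, the loop can be simulated without any retrieval at all (a stand-alone sketch, not the library code): decrementing `level` once per iteration visits each level exactly once, top level first, which is what the verbose log should show.

```python
def visited_levels(tree_depth: int) -> list:
    """Simulate the traversal loop: one pass per level, top level first."""
    visited = []
    level = tree_depth - 1
    while level >= 0:
        visited.append(level)  # stand-in for "retrieve nodes at this level"
        level -= 1             # decremented on every iteration, as in the fix
    return visited

print(visited_levels(3))  # [2, 1, 0]
```

For a three-level tree this matches the posted log, which walks levels 2, 1, and 0 in order.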

Additionally, ensure that the Azure AI Search vector store is correctly configured to handle the metadata fields and vector search. The AzureAISearchVectorStore class in the llama_index library provides the necessary methods to create and manage the index, including setting up the metadata fields and vector search configurations. Make sure your index is properly set up with the required fields and configurations as shown in the provided AzureAISearchVectorStore class implementation [2][3].

To continue talking to Dosu, mention @dosu.