Open ZeinabSh12 opened 1 month ago
Hey there, @ZeinabSh12! I'm here to help with any bugs, questions, or contributions you might have. Let's squash those bugs together! 🤖
To fix the issue with the `tree_traversal` mode in RAPTOR when using Azure AI Search as the vector database, where the retrieval process fails to return any nodes at level 0, you need to ensure that `level` is decremented correctly in the `tree_traversal_retrieval` method. The current implementation does not decrement the `level` variable correctly within the loop.

Here is the corrected code for the `tree_traversal_retrieval` method:
```python
async def tree_traversal_retrieval(self, query_str: str) -> Response:
    """Query the index as a tree, traversing the tree from the top down."""
    # get top k nodes for each level, starting with the top
    parent_ids = None
    nodes = []
    level = self.tree_depth - 1
    while level >= 0:
        # retrieve nodes at the current level
        if parent_ids is None:
            nodes = await self.index.as_retriever(
                similarity_top_k=self.similarity_top_k,
                filters=MetadataFilters(
                    filters=[MetadataFilter(key="level", value=level)]
                ),
            ).aretrieve(query_str)
            parent_ids = [node.id_ for node in nodes]
            if self._verbose:
                print(f"Retrieved parent IDs from level {level}: {parent_ids!s}")
        # retrieve nodes that are children of the nodes at the previous level
        elif len(parent_ids) > 0:
            nested_nodes = await asyncio.gather(
                *[
                    self.index.as_retriever(
                        similarity_top_k=self.similarity_top_k,
                        filters=MetadataFilters(
                            filters=[MetadataFilter(key="parent_id", value=id_)]
                        ),
                    ).aretrieve(query_str)
                    for id_ in parent_ids
                ]
            )
            nodes = [node for nested in nested_nodes for node in nested]
            if self._verbose:
                print(f"Retrieved {len(nodes)} from parents at level {level}.")
            # carry the retrieved children forward as the next set of parents
            parent_ids = [node.id_ for node in nodes]
        # decrement on every iteration so the loop reaches level 0
        level -= 1
    return nodes
```
The key change is moving the `level -= 1` statement outside of the `elif` block, so that the level is decremented on every iteration of the loop. This ensures that the retrieval process correctly traverses down the tree levels all the way to level 0 [1].
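To see why the decrement placement matters, here is a minimal pure-Python simulation of the traversal loop's control flow (toy in-memory data standing in for the vector store; no llama-index involved):

```python
# Toy tree: each node has an id, a level, and a parent_id (None for roots).
nodes_db = [
    {"id": "root", "level": 2, "parent_id": None},
    {"id": "mid-a", "level": 1, "parent_id": "root"},
    {"id": "mid-b", "level": 1, "parent_id": "root"},
    {"id": "leaf-a1", "level": 0, "parent_id": "mid-a"},
    {"id": "leaf-b1", "level": 0, "parent_id": "mid-b"},
]

def traverse(tree_depth):
    """Simulate tree_traversal_retrieval: 'retrieval' is just list filtering."""
    parent_ids = None
    nodes = []
    level = tree_depth - 1
    while level >= 0:
        if parent_ids is None:
            # top level: filter by level, like the level MetadataFilter
            nodes = [n for n in nodes_db if n["level"] == level]
        else:
            # lower levels: filter by parent_id, like the parent_id MetadataFilter
            nodes = [n for n in nodes_db if n["parent_id"] in parent_ids]
        parent_ids = [n["id"] for n in nodes]
        level -= 1  # decremented every iteration, not only inside one branch
    return nodes

print([n["id"] for n in traverse(3)])  # ['leaf-a1', 'leaf-b1']
```

With the decrement at the bottom of the loop, the walk visits level 2, then the children at level 1, then the children at level 0, and returns the leaf nodes; if the decrement only runs in one branch, the loop never advances past the top level.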
Additionally, ensure that the Azure AI Search vector store is correctly configured to handle the metadata fields and vector search. The `AzureAISearchVectorStore` class in the `llama_index` library provides the necessary methods to create and manage the index, including setting up the metadata fields and vector search configurations. Make sure your index is properly set up with the required fields and configurations as shown in the provided `AzureAISearchVectorStore` class implementation [2][3].
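One configuration detail worth verifying (an illustration, not a confirmed root cause for this issue): if `level` is not mapped with `MetadataIndexFieldType.INT32`, the field can end up stored as a string, and an integer equality filter will then match nothing. The type mismatch is easy to show in plain Python:

```python
# Toy illustration of why the metadata field type matters: an int filter value
# compared against string-typed stored metadata silently matches no documents.
stored_as_string = [{"id": "n1", "level": "0"}, {"id": "n2", "level": "1"}]
stored_as_int = [{"id": "n1", "level": 0}, {"id": "n2", "level": 1}]

def filter_by_level(nodes, level):
    """Stand-in for an equality metadata filter on the 'level' field."""
    return [n for n in nodes if n["level"] == level]

print(filter_by_level(stored_as_string, 0))  # [] -- type mismatch, no hits
print(filter_by_level(stored_as_int, 0))     # [{'id': 'n1', 'level': 0}]
```

The `metadata_fields` mapping shown later in this issue (`'level': ('level', MetadataIndexFieldType.INT32)`) is the correct way to declare the field as an integer, so filters like `level == 0` behave as expected.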
To continue talking to Dosu, mention @dosu.
Bug Description
There is an issue with the tree_traversal retrieval mode in RAPTOR when using Azure AI Search as the vector database. The retrieval process fails to return any nodes at level 0, resulting in no final nodes being retrieved. This issue persists across varied test sets.
Environment:
Code Snippet:

```python
# Define model and embedding name
model_deployment_name = "gpt-4"
embedding_deployment_name = "text-embedding-ada-002"

# Simplified example of the RAPTOR pack setup
raptor_pack = RaptorPack(
    documents,
    embed_model=OpenAIEmbedding(model=embedding_deployment_name),
    llm=OpenAI(model=model_deployment_name, temperature=0.1),
    vector_store=vector_store,  # Azure AI Search
    similarity_top_k=2,
    mode="tree_traversal",
    transformations=[SentenceSplitter(chunk_size=400, chunk_overlap=50)],
    verbose=True,
)

# Retrieval call
nodes = raptor_pack.run("What baselines is raptor compared against?", mode="tree_traversal")
print(len(nodes))
print(nodes[0].text if nodes else "No nodes retrieved")
```

Issue Description:
Version
llama-index-core==0.10.37
Steps to Reproduce
This is the reference, but I ran it with Azure AI Search: raptor pack. This is the code with Azure AI Search:

```python
# Client for chat model
openai_client = AzureOpenAI(
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    api_version=os.getenv("OPENAI_API_VERSION"),
    azure_deployment="gpt-4",
)

# Client for embedding model
embedding_client = AzureOpenAIEmbedding(
    api_base=os.getenv("AZURE_OPENAI_ENDPOINT"),
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    api_version=os.getenv("OPENAI_API_VERSION"),
    azure_deployment="text-embedding-ada-002",
)

# Define index name
index_name: str = "llamaindex-raptor-index-v5"

# Define search key
search_key = AzureKeyCredential(vector_store_password)

# Initiate search clients (to connect to Azure AI Search)
search_client = SearchClient(endpoint=vector_store_address, index_name=index_name, credential=search_key)
index_client = SearchIndexClient(endpoint=vector_store_address, credential=search_key)

metadata_fields = {
    "page_label": "page_label",
    "level": ("level", MetadataIndexFieldType.INT32),
    "parent_id": "parent_id",
}

vector_store = AzureAISearchVectorStore(
    search_or_index_client=index_client,
    filterable_metadata_field_keys=metadata_fields,
    index_name=index_name,
    index_management=IndexManagement.CREATE_IF_NOT_EXISTS,
    id_field_key="id",
    chunk_field_key="chunk",
    embedding_field_key="embedding",
    embedding_dimensionality=1536,
    metadata_string_field_key="metadata",
    doc_id_field_key="doc_id",
    language_analyzer="en.lucene",
    vector_algorithm_type="exhaustiveKnn",
)

# Define the directory path
dir_path = "data/raptor_sample_paper"

# Load documents from the specified directory
documents = SimpleDirectoryReader(dir_path).load_data()
documents

raptor_pack = RaptorPack(
    documents,
    embed_model=embedding_client,  # used for embeddings
    llm=openai_client,  # used for generating summaries
    vector_store=vector_store,  # used for storage
    similarity_top_k=2,  # top k for each layer, or overall top-k for collapsed
    mode="tree_traversal",  # sets default mode
    transformations=[SentenceSplitter(chunk_size=400, chunk_overlap=50)],  # transformations applied for ingestion
    verbose=True,  # enable verbose logging for debugging
)

nodes = raptor_pack.run(
    "What baselines is raptor compared against?", mode="tree_traversal"
)
print(len(nodes))
print(nodes[0].text)
```
Relevant Logs/Tracebacks