run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Bug]: LlamaIndex Not Returning similarity_top_k Chunks with Azure Cognitive Search #16209

Closed: FatimaKhan04 closed this issue 1 month ago

FatimaKhan04 commented 1 month ago

Bug Description

When using LlamaIndex with Azure Cognitive Search, setting similarity_top_k=20 and max_search_results=20 doesn't retrieve the expected 20 chunks. Instead, fewer chunks (e.g., 7-8) are returned.

Version

0.11.13

Steps to Reproduce

  1. Initialize AzureAISearchVectorStore with max_search_results=20.
  2. Set similarity_top_k=20 in the retriever configuration.
  3. Execute a query using the LlamaIndex query engine.
  4. Observe that fewer than 20 chunks are returned.
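
A minimal sketch of these steps, assuming placeholder endpoint, key, index name, and field keys (substitute your own values; embedding model configuration is omitted):

from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from llama_index.core import VectorStoreIndex
from llama_index.vector_stores.azureaisearch import AzureAISearchVectorStore

search_client = SearchClient(
    endpoint="https://<service>.search.windows.net",  # placeholder
    index_name="<index-name>",                        # placeholder
    credential=AzureKeyCredential("<api-key>"),       # placeholder
)

# Step 1: the report above also passes max_search_results=20 here.
vector_store = AzureAISearchVectorStore(
    search_or_index_client=search_client,
    id_field_key="id",
    chunk_field_key="chunk",
    embedding_field_key="embedding",
    doc_id_field_key="doc_id",
    metadata_string_field_key="metadata",
)

index = VectorStoreIndex.from_vector_store(vector_store=vector_store)
query_engine = index.as_query_engine(similarity_top_k=20)  # step 2
response = query_engine.query("<your query>")              # step 3
print(len(response.source_nodes))  # step 4: ~7-8 instead of 20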

Relevant Logs/Tracebacks

No response

dosubot[bot] commented 1 month ago

Hey @FatimaKhan04, I'm here to help you with bug fixes, questions, and more! Let's squash those bugs together. 😊

I found a similar closed issue that might be relevant to your problem. The issue was discussed in [1], and the suggested solution was to configure the retriever with the similarity_top_k parameter like this:

retriever = index.as_retriever(similarity_top_k=5)

In your case, you might want to ensure that the similarity_top_k and max_search_results parameters are correctly set in the VectorIndexRetriever class. The _retrieve method in this class retrieves nodes based on the similarity_top_k parameter, but the final result can be influenced by additional constraints such as metadata filters, document IDs, and the specific query mode used.

Here is a relevant code snippet from the VectorIndexRetriever class:

def _retrieve(
    self,
    query_bundle: QueryBundle,
) -> List[NodeWithScore]:
    # Compute the query embedding lazily if the vector store expects an
    # embedding-based query and none was supplied with the bundle.
    if self._vector_store.is_embedding_query:
        if query_bundle.embedding is None and len(query_bundle.embedding_strs) > 0:
            query_bundle.embedding = (
                self._embed_model.get_agg_embedding_from_queries(
                    query_bundle.embedding_strs
                )
            )
    return self._get_nodes_with_embeddings(query_bundle)

The final number of chunks returned can be influenced by the availability of relevant nodes, the effectiveness of the embedding model, and any additional filters or constraints applied during the query process [2].

Make sure to check these aspects in your configuration to ensure that the parameters are correctly set and that there are no additional constraints affecting the retrieval process.

To continue talking to Dosu, mention @dosu.

FatimaKhan04 commented 1 month ago

@dosu I'm already doing exactly as suggested by setting similarity_top_k=20 in my configuration, as mentioned in my original report. I've verified that similarity_top_k and max_search_results are set correctly. Additionally, I have confirmed that there are enough nodes available in my index (84 nodes in total), so availability isn't the issue.

It seems that despite these settings, the retrieval process still returns fewer chunks than expected. Could you please check if there's any other internal constraint that might be causing this?
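
One way to confirm the shortfall happens at the retrieval layer rather than during response synthesis is to call the retriever directly; a sketch, reusing the index from the reproduction steps above:

retriever = index.as_retriever(similarity_top_k=20)
nodes = retriever.retrieve("<same query as above>")  # placeholder query
print(len(nodes))  # observed: 7-8, expected: 20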

dosubot[bot] commented 1 month ago

Yes, there are internal constraints in the VectorIndexRetriever class that could limit the number of chunks returned, even if similarity_top_k and max_search_results are set correctly. Specifically, the filters, node_ids, and doc_ids parameters can constrain the search results:

class VectorIndexRetriever(BaseRetriever):
    ...
    def __init__(
        self,
        ...
        filters: Optional[MetadataFilters] = None,
        node_ids: Optional[List[str]] = None,
        doc_ids: Optional[List[str]] = None,
        ...
    ) -> None:
        ...
        self._filters = filters
        self._node_ids = node_ids
        self._doc_ids = doc_ids
        ...

These parameters are used to build the VectorStoreQuery:

def _build_vector_store_query(
    self, query_bundle_with_embeddings: QueryBundle
) -> VectorStoreQuery:
    return VectorStoreQuery(
        ...
        node_ids=self._node_ids,
        doc_ids=self._doc_ids,
        filters=self._filters,
        ...
    )

If any of these parameters (filters, node_ids, doc_ids) are set, they will limit the search results returned by the VectorIndexRetriever, regardless of the similarity_top_k and max_search_results settings. Additionally, the sparse_top_k parameter can also influence the results if it is set [1][2][3].
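
As a quick check, constructing the retriever explicitly makes it easy to confirm that none of these constraints are in play; a sketch (all three parameters default to None):

retriever = index.as_retriever(
    similarity_top_k=20,
    filters=None,   # no metadata filters
    node_ids=None,  # no node-id restriction
    doc_ids=None,   # no doc-id restriction
)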

Moreover, the _get_nodes_with_embeddings method applies additional filtering and ranking that can reduce the number of returned chunks. It builds its query through _build_vector_store_query, which passes similarity_top_k, node_ids, doc_ids, filters, sparse_top_k, and hybrid_top_k to constrain the search results:

def _build_vector_store_query(
    self, query_bundle_with_embeddings: QueryBundle
) -> VectorStoreQuery:
    return VectorStoreQuery(
        query_embedding=query_bundle_with_embeddings.embedding,
        similarity_top_k=self._similarity_top_k,
        node_ids=self._node_ids,
        doc_ids=self._doc_ids,
        query_str=query_bundle_with_embeddings.query_str,
        mode=self._vector_store_query_mode,
        alpha=self._alpha,
        filters=self._filters,
        sparse_top_k=self._sparse_top_k,
        hybrid_top_k=self._hybrid_top_k,
    )

These parameters filter and rank the results, which can reduce the number of returned chunks [2][1].

Additionally, the integration with Azure Cognitive Search has specific behavior that could affect the retrieval process. The content_field parameter in the AzCognitiveSearchReader class specifies the field name of the document content. If this field does not exist in the search results, it could lead to issues in retrieving the document content correctly [4][5].

Please check these parameters and ensure that no additional constraints are affecting the retrieval process.

To continue talking to Dosu, mention @dosu.

FatimaKhan04 commented 1 month ago

@dosu I've implemented all the recommendations you provided:

  1. Verified Field Names: Confirmed that all field names in my code match those in my Azure Cognitive Search index.
  2. Checked for Constraints: Ensured that there are no additional constraints like filters, node_ids, or doc_ids affecting the retrieval.
  3. Configured Retriever: Set similarity_top_k=20 when creating the query engine.

Despite these changes, I'm still not receiving the expected number of chunks; the issue remains unresolved.

dosubot[bot] commented 1 month ago

It seems like you've already taken several steps to address the issue, including verifying field names, checking for constraints, and configuring the retriever correctly. However, the problem persists.

One potential area to investigate further is the configuration of the vector search algorithms and profiles in your Azure AI Search index. Specifically, the ef_search parameter in the HNSW algorithm configuration controls how many candidate nodes the graph search keeps at query time, so a small value can surface fewer distinct candidates than the requested top count. The default value for ef_search is 500, but you can try increasing it so the search algorithm explores more nodes and potentially returns more results.

Here is an example of how you might adjust the ef_search parameter:

from azure.search.documents.indexes.models import (
    HnswAlgorithmConfiguration,
    HnswParameters,
    VectorSearchAlgorithmKind,
    VectorSearchAlgorithmMetric,
)

HnswAlgorithmConfiguration(
    name="myHnsw",
    kind=VectorSearchAlgorithmKind.HNSW,
    parameters=HnswParameters(
        m=4,
        ef_construction=400,
        ef_search=1000,  # increased from the default 500 so more candidates are explored
        metric=VectorSearchAlgorithmMetric.COSINE,
    ),
)

Additionally, ensure that the similarity_top_k parameter is correctly passed to the search query. The top parameter in the search method should be set to the value of similarity_top_k:

results = self._search_client.search(
    search_text=search_query,
    vector_queries=vector_queries,
    top=self._query.similarity_top_k,  # Ensure this is set correctly
    select=self._select_fields,
    filter=self._odata_filter,
)

By increasing the ef_search parameter and ensuring the similarity_top_k parameter is correctly set, you may improve the likelihood of retrieving the expected number of chunks [1][2][3].

Given the complexity of this issue, it might also be helpful to get additional insights from a maintainer. @logan-markewich, could you take a look at this issue?

To continue talking to Dosu, mention @dosu.

FatimaKhan04 commented 1 month ago

Increasing ef_search from 500 to 1000 worked, and I’m now able to retrieve the expected number of chunks.
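
For anyone hitting the same behavior, here is a sketch of applying that change to an existing index with the azure-search-documents SDK (endpoint, key, and index name are placeholders, and the loop assumes the index has an HNSW algorithm configuration):

from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexClient

index_client = SearchIndexClient(
    endpoint="https://<service>.search.windows.net",  # placeholder
    credential=AzureKeyCredential("<api-key>"),       # placeholder
)

# Fetch the existing index definition, raise ef_search on its HNSW
# algorithm, and push the updated definition back to the service.
index = index_client.get_index("<index-name>")
for algo in index.vector_search.algorithms:
    if algo.kind == "hnsw":
        algo.parameters.ef_search = 1000  # was 500, the service default
index_client.create_or_update_index(index)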