run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Bug]: Vector embeddings not created in Azure AI Search #16087

Open FestinaIsm opened 1 month ago

FestinaIsm commented 1 month ago

Bug Description

I'm creating the vector store:

vector_store = AzureAISearchVectorStore(
    search_or_index_client=index_client,
    filterable_metadata_field_keys=metadata_fields,
    index_name=index_name,
    index_management=IndexManagement.CREATE_IF_NOT_EXISTS,
    id_field_key="id",
    chunk_field_key="content",
    embedding_field_key="embedding",
    metadata_string_field_key="metadata",
    doc_id_field_key="doc_id",
    embedding_dimensionality=1536,
    language_analyzer="en.lucene",
    vector_algorithm_type="exhaustiveKnn",
)

and initializing the index_client, together with the storage context and settings:

storage_context = StorageContext.from_defaults(vector_store=vector_store)
Settings.llm = llm
Settings.embed_model = embed_model

They are all initialized correctly and the index is created. However, the "embedding" field is not present in the index, meaning that the vector embedding is missing altogether.

Any ideas? Thank you!

Version

0.11.10

Steps to Reproduce

There is no error; the vector embeddings are simply not being created.

Relevant Logs/Tracebacks

No response

dosubot[bot] commented 1 month ago

Hey @FestinaIsm! I'm here to help you with any bugs, questions, or contributions you have. Let's work together to tackle this issue!

I found a similar issue where the "embedding" field was missing in the index when creating a vector store with AzureAISearchVectorStore in version 0.11.10. The problem might be due to the embedding_field_key parameter not being correctly utilized or mapped in the actual implementation of the AzureAISearchVectorStore.

In the internal implementation of AzureAISearchVectorStore, the embedding_field_key should be mapped correctly to the index schema. The relevant part of the code shows that the embedding field is configured as a SearchField with the appropriate type and properties:

fields = [
    SimpleField(name=self._field_mapping["id"], type="Edm.String", key=True),
    SearchableField(
        name=self._field_mapping["chunk"],
        type="Edm.String",
        analyzer_name=self._language_analyzer,
    ),
    SearchField(
        name=self._field_mapping["embedding"],
        type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
        searchable=True,
        vector_search_dimensions=self._embedding_dimensionality,
        vector_search_profile_name=self._vector_profile_name,
    ),
    SimpleField(name=self._field_mapping["metadata"], type="Edm.String"),
    SimpleField(
        name=self._field_mapping["doc_id"], type="Edm.String", filterable=True
    ),
]

This indicates that the embedding field should be present in the index schema if the embedding_field_key is correctly mapped. If the "embedding" field is still missing, it might be due to an issue in how the index is being created or updated. Ensure that the index creation process is correctly executed and that the embedding_field_key is properly utilized when defining the index schema [1][2][3].

Additionally, there is documentation in the codebase that explains the expected behavior and usage of the embedding_field_key in the AzureAISearchVectorStore [3][4][5].


non-nil commented 1 month ago

tl;dr: the retrievable flag for vector fields (i.e. "embedding" in LlamaIndex) now defaults to False in the latest Azure SDK

After some digging, the issue seems to be introduced by changes in the latest Azure SDK schema:

non-nil commented 1 month ago

So if one wants to force the embedding field to be retrievable, we probably have to override the _create_index method:

    def _create_index(self, index_name: Optional[str]) -> None:
        ...

        fields = [
            SimpleField(name=self._field_mapping["id"], type="Edm.String", key=True),
            SearchableField(
                name=self._field_mapping["chunk"],
                type="Edm.String",
                analyzer_name=self._language_analyzer,
            ),
            SearchField(
                name=self._field_mapping["embedding"],
                hidden=False,  # Force the `SearchField` to be retrievable
                type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
                searchable=True,
                vector_search_dimensions=self._embedding_dimensionality,
                vector_search_profile_name=self._vector_profile_name,
            ),
            SimpleField(name=self._field_mapping["metadata"], type="Edm.String"),
            SimpleField(
                name=self._field_mapping["doc_id"], type="Edm.String", filterable=True
            ),
        ]

        ...
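
Before overriding anything, it may help to confirm which case applies: the "embedding" field is absent from the schema, or it exists but is hidden (not retrievable). Below is a minimal sketch of such a check. `classify_field` is a hypothetical helper, not part of LlamaIndex or the Azure SDK; it only assumes an object shaped like the SDK's `SearchIndex`, i.e. a `.fields` list whose items expose `.name` and `.hidden`. The `demo_index` stand-in is invented for illustration.

```python
from types import SimpleNamespace


def classify_field(index, field_name="embedding"):
    """Return 'absent', 'hidden', or 'retrievable' for a field in an index schema.

    Hypothetical helper: `index` is assumed to look like the Azure SDK's
    SearchIndex, i.e. it has a `.fields` list of objects with `.name`
    and `.hidden` attributes.
    """
    for field in index.fields:
        if field.name == field_name:
            # A hidden field is still part of the schema and can be searched,
            # but its values are not returned with query results.
            return "hidden" if getattr(field, "hidden", False) else "retrievable"
    return "absent"


# Stand-in schema for illustration only. Against a real service you would
# pass the result of index_client.get_index(index_name) instead, assuming
# index_client is a SearchIndexClient.
demo_index = SimpleNamespace(
    fields=[
        SimpleNamespace(name="id", hidden=False),
        SimpleNamespace(name="embedding", hidden=True),
    ]
)

print(classify_field(demo_index, "embedding"))
```

If the result is "hidden", the vectors are likely stored and searchable but simply not returned, which would match the behaviour described in this thread; "absent" would point to a problem with index creation instead.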