[BUG] Unable to connect to an existing Azure AI Search index

dionoid commented 4 months ago

Describe the bug

I'm using LangChain4j with an existing Azure AI Search index, which was created using the "Import and vectorize data" feature of Azure AI Search. When connecting to this index with the AzureAiSearchContentRetriever, I found that the underlying AbstractAzureAiSearchEmbeddingStore doesn't allow me to override the default field names, metadata fields or index name, so I'm blocked. Also, metadata mapping in the AzureAiSearchContentRetriever seems to be implemented only for pure Vector queries, not for FullText, Hybrid or HybridWithReranking.

Log and Stack trace

N.A.

To Reproduce

Import and vectorize documents into a new Azure AI Search index using the "Import and vectorize data" feature, or use the "Add your data" feature in the Azure OpenAI Studio playground. There is then no way to connect these indexes to the AzureAiSearchContentRetriever and use them in LangChain4j.

Expected behavior

It would be nice if I could configure the Azure AI Search index settings and field names in the AzureAiSearchContentRetriever.builder:
- INDEX_NAME
- DEFAULT_FIELD_ID
- DEFAULT_FIELD_CONTENT
- DEFAULT_FIELD_CONTENT_VECTOR
- SEMANTIC_SEARCH_CONFIG_NAME
Same goes for metadata mapping.
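
For illustration, the configuration requested above could be captured in a small value object along the lines of the sketch below. This is only a sketch under stated assumptions: `IndexFieldMapping` and its builder methods are hypothetical names and are not part of LangChain4j, and the default strings are placeholders assumed here, not values copied from the library.

```java
import java.util.Objects;

// Hypothetical value object (not a LangChain4j class) holding the index
// settings that the issue asks to make configurable.
final class IndexFieldMapping {
    final String indexName;
    final String idField;
    final String contentField;
    final String contentVectorField;
    final String semanticSearchConfigName;

    private IndexFieldMapping(Builder b) {
        this.indexName = b.indexName;
        this.idField = b.idField;
        this.contentField = b.contentField;
        this.contentVectorField = b.contentVectorField;
        this.semanticSearchConfigName = b.semanticSearchConfigName;
    }

    static Builder builder() {
        return new Builder();
    }

    static final class Builder {
        // Placeholder defaults (assumed for this sketch); any setter that is
        // not called leaves the corresponding default in place.
        private String indexName = "vectorsearch";
        private String idField = "id";
        private String contentField = "content";
        private String contentVectorField = "content_vector";
        private String semanticSearchConfigName = "semantic-search-config";

        Builder indexName(String v) { indexName = Objects.requireNonNull(v, "indexName"); return this; }
        Builder idField(String v) { idField = Objects.requireNonNull(v, "idField"); return this; }
        Builder contentField(String v) { contentField = Objects.requireNonNull(v, "contentField"); return this; }
        Builder contentVectorField(String v) { contentVectorField = Objects.requireNonNull(v, "contentVectorField"); return this; }
        Builder semanticSearchConfigName(String v) { semanticSearchConfigName = Objects.requireNonNull(v, "semanticSearchConfigName"); return this; }

        IndexFieldMapping build() {
            return new IndexFieldMapping(this);
        }
    }
}
```

A caller pointing at an index produced by a wizard might then write something like `IndexFieldMapping.builder().indexName("my-index").idField("chunk_id").contentField("chunk").build()`, where the index and field names are purely illustrative.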

Please complete the following information:

LangChain4j version: 0.30.0
Java version: 17

Additional context

I would be happy to contribute to this project. I'm reaching out to @jdubois to learn what he thinks is the best way to solve this issue; maybe we can work on it together? I also have some additional questions:

- The DEFAULT_FIELD_METADATA_SOURCE field with value "source" doesn't seem to be used anywhere. Would it be an option to move the key-value pairs (currently under metadata->attributes) to just metadata? Then "source" could be just a metadata key.
- It seems the SearchIndex parameter of AbstractAzureAiSearchEmbeddingStore's initialize method is only used for creating a new index, but never for retrieving one. Is this intended?
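
To make the first question concrete, here is a minimal sketch of the proposed flattening, where key-value pairs move from metadata->attributes up into metadata and "source" becomes an ordinary key. The types `Attribute`, `NestedMetadata` and `MetadataFlattener` are hypothetical stand-ins for the nested shape described above, not the library's actual classes, and the rule for a conflicting "source" key is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of the nested shape: metadata -> source + attributes.
record Attribute(String key, String value) {}

record NestedMetadata(String source, List<Attribute> attributes) {}

final class MetadataFlattener {
    // Copies every attribute into one flat map, then adds "source" as a plain key.
    static Map<String, String> flatten(NestedMetadata nested) {
        Map<String, String> flat = new LinkedHashMap<>();
        for (Attribute a : nested.attributes()) {
            flat.put(a.key(), a.value());
        }
        // Assumption: an explicit source value wins over an attribute named "source".
        if (nested.source() != null) {
            flat.put("source", nested.source());
        }
        return flat;
    }
}
```

With this shape, a retriever would only need to read one flat map per document, regardless of the query type.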

dosubot[bot] commented 4 months ago

To address the issue of customizing default field names, metadata, and index name in LangChain4j when using an existing Azure AI Search index, it seems like the current implementation doesn't provide direct support for this level of customization. Your suggestion to enhance the AzureAiSearchContentRetriever.builder to allow configuration of index field names and metadata mapping is valid and would indeed improve flexibility and usability.

For the DEFAULT_FIELD_METADATA_SOURCE field not being used, moving key-value pairs from metadata->attributes to just metadata and treating "source" as a metadata key could simplify the structure and make it more intuitive. This approach would require adjustments in the codebase to ensure compatibility and maintain the integrity of data retrieval and indexing processes.

Regarding the SearchIndex parameter in the AbstractAzureAiSearchEmbeddingStore's initialize method being used only for creating a new index and not for retrieving, this seems to be by design. However, revisiting this design could offer more flexibility in managing indexes, especially for use cases that involve connecting to and interacting with existing indexes.
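
One possible shape for such a revisited design is a "reuse if present, create only if asked" initialization step, sketched below. Everything here is hypothetical: `IndexAdmin`, `IndexInitializer` and `InMemoryIndexAdmin` are illustrative names, not the Azure SDK or LangChain4j API, and the in-memory implementation exists only so the sketch runs on its own.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical abstraction over index management (not the Azure SDK API).
interface IndexAdmin {
    boolean exists(String indexName);
    void create(String indexName);
}

final class IndexInitializer {
    // Returns true if a new index was created, false if an existing one was reused.
    static boolean ensureIndex(IndexAdmin admin, String indexName, boolean createIfMissing) {
        if (admin.exists(indexName)) {
            return false; // connect to the existing index and leave its schema untouched
        }
        if (!createIfMissing) {
            throw new IllegalStateException("Index not found: " + indexName);
        }
        admin.create(indexName);
        return true;
    }
}

// In-memory stand-in used only to exercise the sketch.
final class InMemoryIndexAdmin implements IndexAdmin {
    private final Set<String> names = new HashSet<>();

    @Override
    public boolean exists(String indexName) {
        return names.contains(indexName);
    }

    @Override
    public void create(String indexName) {
        names.add(indexName);
    }
}
```

The `createIfMissing` flag lets a caller that targets a pre-existing index fail fast when the name is wrong, instead of silently creating an empty index with default fields.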

Collaborating with the project maintainers, like reaching out to @jdubois as you mentioned, would be the best course of action to discuss these enhancements and potentially contribute to the project. Your willingness to contribute and your insights could lead to significant improvements in LangChain4j's Azure AI Search integration.

#### Details

**Sources**
- [langchain4j-azure-ai-search/src/main/java/dev/langchain4j/store/embedding/azure/search/AbstractAzureAiSearchEmbeddingStore.java](https://github.com/langchain4j/langchain4j/blob/main/langchain4j-azure-ai-search/src/main/java/dev/langchain4j/store/embedding/azure/search/AbstractAzureAiSearchEmbeddingStore.java)
- [langchain4j-azure-ai-search/src/main/java/dev/langchain4j/rag/content/retriever/azure/search/AzureAiSearchContentRetriever.java](https://github.com/langchain4j/langchain4j/blob/main/langchain4j-azure-ai-search/src/main/java/dev/langchain4j/rag/content/retriever/azure/search/AzureAiSearchContentRetriever.java)
- [CONTRIBUTING.md](https://github.com/langchain4j/langchain4j/blob/main/CONTRIBUTING.md)
- [.github/pull_request_template.md](https://github.com/langchain4j/langchain4j/blob/main/.github/pull_request_template.md)
- [langchain4j-azure-ai-search/src/test/java/dev/langchain4j/rag/content/retriever/azure/search/AzureAiSearchContentRetrieverIT.java](https://github.com/langchain4j/langchain4j/blob/main/langchain4j-azure-ai-search/src/test/java/dev/langchain4j/rag/content/retriever/azure/search/AzureAiSearchContentRetrieverIT.java)
- [langchain4j-pgvector/src/main/java/dev/langchain4j/store/embedding/pgvector/DefaultMetadataStorageConfig.java](https://github.com/langchain4j/langchain4j/blob/main/langchain4j-pgvector/src/main/java/dev/langchain4j/store/embedding/pgvector/DefaultMetadataStorageConfig.java)

jdubois commented 4 months ago

Thanks @dionoid! This seems linked to #1062, so let me finish that one first. I didn't know about the Azure AI Search "Import and vectorize data" feature; it should be supported and documented! There is a lot of room for improvement here, so I'm happy to work with you on this, or to help you contribute!

jdubois commented 4 months ago

@langchain4j can you assign this issue to me?

jdubois commented 2 months ago

@yoshioterada from my team at Microsoft might have the time to work on this during the summer, so I'm pinging him here.

AlexHird commented 1 week ago

Is there any progress here? I saw that the pull request was closed on July 5th.

jdubois commented 1 week ago

No, sorry. I really haven't had the time lately, and I'm not sure when I will be able to work on this.

langchain4j / langchain4j

[BUG] Unable to connect to an existing Azure AI Search index #1098