spring-projects / spring-ai

An Application Framework for AI Engineering
https://docs.spring.io/spring-ai/reference/index.html
Apache License 2.0
3.21k stars 812 forks source link

Elasticsearch vector store - Wrong error reported when missing index #1316

Closed l-trotta closed 1 month ago

l-trotta commented 1 month ago

Hello, I'm the author of parts of the implementation of the Elasticsearch Vector store (see #592).

I was testing the new release and I noticed that this PR introduced a check that prevents documents to be added to the bulk only if the index already exists.

This is problematic because if the user set initializeSchema as false in the initial configuration and did not previously create the index (which is allowed by Elasticsearch, since it will automatically configure and create the index), the user will receive the following error:

Caused by: co.elastic.clients.util.MissingRequiredPropertyException: Missing required property 'BulkRequest.operations'
    at co.elastic.clients.util.ApiTypeHelper.requireNonNull(ApiTypeHelper.java:76) ~[elasticsearch-java-8.13.4.jar:na]
    at co.elastic.clients.util.ApiTypeHelper.unmodifiableRequired(ApiTypeHelper.java:141) ~[elasticsearch-java-8.13.4.jar:na]
    at co.elastic.clients.elasticsearch.core.BulkRequest.<init>(BulkRequest.java:122) ~[elasticsearch-java-8.13.4.jar:na]
    at co.elastic.clients.elasticsearch.core.BulkRequest.<init>(BulkRequest.java:77) ~[elasticsearch-java-8.13.4.jar:na]
    at co.elastic.clients.elasticsearch.core.BulkRequest$Builder.build(BulkRequest.java:518) ~[elasticsearch-java-8.13.4.jar:na]
    at org.springframework.ai.vectorstore.ElasticsearchVectorStore.doAdd(ElasticsearchVectorStore.java:130) ~[spring-ai-elasticsearch-store-1.0.0-20240904.212955-466.jar:1.0.0-SNAPSHOT]
    at com.example.demo.Service.init(Service.java:88) ~[classes/:na]
    at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103) ~[na:na]
    at java.base/java.lang.reflect.Method.invoke(Method.java:580) ~[na:na]
    at org.springframework.beans.factory.annotation.InitDestroyAnnotationBeanPostProcessor$LifecycleMethod.invoke(InitDestroyAnnotationBeanPostProcessor.java:457) ~[spring-beans-6.1.10.jar:6.1.10]
    at org.springframework.beans.factory.annotation.InitDestroyAnnotationBeanPostProcessor$LifecycleMetadata.invokeInitMethods(InitDestroyAnnotationBeanPostProcessor.java:401) ~[spring-beans-6.1.10.jar:6.1.10]
    at org.springframework.beans.factory.annotation.InitDestroyAnnotationBeanPostProcessor.postProcessBeforeInitialization(InitDestroyAnnotationBeanPostProcessor.java:219) ~[spring-beans-6.1.10.jar:6.1.10]
    ... 18 common frames omitted

which doesn't really explain what is going on, as this is an exception that occurs when a BulkRequest is built without the necessary properties.

I have 3 possible solutions for this:

  1. Remove the check, let Elasticsearch create the index automatically with the default configuration (so only cosine allowed as similarity function).
  2. Leave the check, throwing an appropriate exception if the index does not exist.
  3. Leave the check, throwing the exception only if the user selected a similarity function different from cosine.

Let me know which one is the more appropriate and I will implement it.

Personally I would go with number 3, so that we both keep the benefit of autoconfiguration when possible, and avoid users not knowing when the index actually needs to be configured or not; in any case I'd like to add some more information around the index creation both in the code and the documentation.

Thank you for your time.

sobychacko commented 1 month ago

@l-trotta Either option 2 or 3 is fine. I would have gone with option 2 because if the index does not exist at that point in the flow, we want to fail fast regardless of the similarity function type. How does similarity type play a role here? Thanks!

l-trotta commented 1 month ago

@sobychacko basically if no index exist when vectors are added, Elasticsearch is capable of creating an index with the correct configuration, understanding which vector dimension to use and which mapping to create, so for default cases there's no need for the user to configure ElasticsearchVectorStoreOptions nor create an index.

The only limitation is that the automatically created index will always be configured to use cosine as similarity function, while the other options available (dot_product and l2_norm) must always be manually configured.

So option number 3 allows inexperienced users not to worry about configuring the correct vector dimension or any configuration and to just easily setup the most basic usecase regardless of which EmbeddingModel they are using.

sobychacko commented 1 month ago

If I understand you correctly, you are proposing that if the index does not exist and the similarity function provided is the default cosine, then we allow Elastic to create the index. This is the change we disabled for all the vector stores by globally turning initialize-schema to false. By making such a change, I am afraid we are creating an exception to this rule. We have to fail fast if the index does not exist. In other words, there is no need to check the similarity type in the conditional, but if the indexExists() is false, throw the exception. What do you think?

l-trotta commented 1 month ago

Got it, that's a pity, but it's better not to create special cases. I updated the PR to simply fail if there's no index as you proposed (and updated the documentation to reflect the changes). I also added a default user header for observability, the explanation is in the PR description.