Closed l-trotta closed 1 month ago
@l-trotta Either option 2 or 3 is fine. I would have gone with option 2 because if the index does not exist at that point in the flow, we want to fail fast regardless of the similarity function type. How does similarity type play a role here? Thanks!
@sobychacko basically if no index exist when vectors are added, Elasticsearch is capable of creating an index with the correct configuration, understanding which vector dimension to use and which mapping to create, so for default cases there's no need for the user to configure ElasticsearchVectorStoreOptions
nor create an index.
The only limitation is that the automatically created index will always be configured to use cosine
as similarity function, while the other options available (dot_product
and l2_norm
) must always be manually configured.
So option number 3 allows inexperienced users not to worry about configuring the correct vector dimension or any configuration and to just easily setup the most basic usecase regardless of which EmbeddingModel they are using.
If I understand you correctly, you are proposing that if the index does not exist and the similarity function provided is the default cosine,
then we allow Elastic to create the index. This is the change we disabled for all the vector stores by globally turning initialize-schema
to false.
By making such a change, I am afraid we are creating an exception to this rule. We have to fail fast if the index does not exist. In other words, there is no need to check the similarity type in the conditional, but if the indexExists()
is false, throw the exception. What do you think?
Got it, that's a pity, but it's better not to create special cases. I updated the PR to simply fail if there's no index as you proposed (and updated the documentation to reflect the changes). I also added a default user header for observability, the explanation is in the PR description.
Hello, I'm the author of parts of the implementation of the Elasticsearch Vector store (see #592).
I was testing the new release and I noticed that this PR introduced a check that prevents documents to be added to the bulk only if the index already exists.
This is problematic because if the user set
initializeSchema
as false in the initial configuration and did not previously create the index (which is allowed by Elasticsearch, since it will automatically configure and create the index), the user will receive the following error:which doesn't really explain what is going on, as this is an exception that occurs when a BulkRequest is built without the necessary properties.
I have 3 possible solutions for this:
cosine
allowed as similarity function).cosine
.Let me know which one is the more appropriate and I will implement it.
Personally I would go with number 3, so that we both keep the benefit of autoconfiguration when possible, and avoid users not knowing when the index actually needs to be configured or not; in any case I'd like to add some more information around the index creation both in the code and the documentation.
Thank you for your time.