lior-k / fast-elasticsearch-vector-scoring

Score documents using embedding-vectors dot-product or cosine-similarity with ES Lucene engine
Apache License 2.0
395 stars 112 forks source link

Should we use refresh=true on production? (Test is not suitable for 7.9) #61

Open bart2001 opened 3 years ago

bart2001 commented 3 years ago

Hello, I would like to raise an issue about the test case.

According to fast-elasticsearch-vector-scoring/src/test/java/com/liorkn/elasticsearch/PluginTest.java file, the test case uses params.put("refresh", "true") on data insertion request, which makes vector scoring result works properly.

public void test() throws Exception {
  final Map<String, String> params = new HashMap<>();
  params.put("refresh", "true");

However, in production case, we often do not use refresh=true option on data insertion. If we do not use this option, the vector scoring result does not work properly. (such as there are same cosine similarity score among result documents...)

I think one of two options should be considered

  1. Modify the test code (removing refresh=true) -> this may cause code level modification of vector scoring plugin
  2. Mention in the document that refresh=true option must be provided on data insertion
lior-k commented 3 years ago

There is no need to use refresh=true in production. This plugin does not require that. Refresh=true just means the inserted doc is quariable immediately after insertion, which is a must for the test. https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-refresh.html

bart2001 commented 3 years ago

Hello, thank you for your reply.

The issue can be specified like this.

- Case 1. insert single data without `refresh=true` => vector scoring fail (cosine similarity)
- Case 2. insert singe data with `refresh=true` => vector scoring success (cosine similarity)
- Case 3. bulk insert with `refresh=true` => vector scoring fail (cosine similarity)

In summary, I think the problem is refresh=true is required to get "proper vector scoring result"...

This issue may be connected to compatibility with elasitcsearch core since the above cases all worked as vector scoring success on elasticsearch 7.5.2 version.

lior-k commented 3 years ago

right, thanks for clarifying. I was just able to recreate it in a test I have no clue at the moment as of why this unexpected change in behaviour in this ES version. for the meanwhile we need to use refresh=true in doc insertions which is not ideal