jina-ai / jina

☁️ Build multimodal AI applications with cloud-native stack
https://docs.jina.ai
Apache License 2.0
20.99k stars 2.22k forks

different sorting order for chunks and matches in query results #1224

Closed cristianmtr closed 3 years ago

cristianmtr commented 3 years ago

Describe your problem

When working on the multires-lyrics-example I noticed that the sorting differs between chunks and matches in the output from the search.

Chunks are sorted in ascending order, matches are sorted in descending order.

Is this intended? See screenshot below

What is your guess?

I am not sure whether this is intended, but if it is labeled as an issue, I can try to dig into it.


Environment

$ jina --version-full
jina                          0.7.4
jina-proto                    0.0.74
jina-vcs-tag                  (unset)
libzmq                        4.3.2
pyzmq                         1.19.4
protobuf                      3.13.0
proto-backend                 cpp
grpcio                        1.33.2
ruamel.yaml                   0.16.12
python                        3.7.5
platform                      Linux
platform-release              5.4.0-52-generic
platform-version              #57-Ubuntu SMP Thu Oct 15 10:57:00 UTC 2020
architecture                  x86_64
processor                     x86_64
jina-resources                /home/cristian/envs/lyrics/lib/python3.7/site-packages/jina/resources
JINA_ARRAY_QUANT              (unset)
JINA_BINARY_DELIMITER         (unset)
JINA_CONTRIB_MODULE           (unset)
JINA_CONTRIB_MODULE_IS_LOADING(unset)
JINA_CONTROL_PORT             (unset)
JINA_DB_COLLECTION            (unset)
JINA_DB_HOSTNAME              (unset)
JINA_DB_NAME                  (unset)
JINA_DB_PASSWORD              (unset)
JINA_DB_USERNAME              (unset)
JINA_DEFAULT_HOST             (unset)
JINA_DISABLE_UVLOOP           (unset)
JINA_EXECUTOR_WORKDIR         (unset)
JINA_FULL_CLI                 (unset)
JINA_IPC_SOCK_TMP             (unset)
JINA_LOG_CONFIG               (unset)
JINA_LOG_NO_COLOR             (unset)
JINA_POD_NAME                 (unset)
JINA_PROFILING                (unset)
JINA_RANDOM_PORTS             (unset)
JINA_SOCKET_HWM               (unset)
JINA_TEST_GPU                 (unset)
JINA_TEST_PRETRAINED          (unset)
JINA_VCS_VERSION              (unset)
JINA_WARN_UNNAMED             (unset)

Screenshots

Left: matches Right: chunks

image

JoanFM commented 3 years ago

Is this most likely because there is no Ranker or SortQL after the Chunk2DocRanker?

maximilianwerk commented 3 years ago

I wonder whether the ranker itself should have an order keyword. While it is true that SortQL can do the job, sorting in the right (configurable) order directly in the Chunk2DocRanker might be useful.

JoanFM commented 3 years ago

great catch!

The Chunk2DocRanker only assigns a score to each match; it does not reorder the matches. I think we need to add a SortQL as the last driver in the doc.yml for SearchRequest to sort by descending score.
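
To illustrate the difference between assigning scores and ordering by them, here is a minimal plain-Python sketch. It does not use Jina's API: the match dictionaries and the `sort_by_score` helper are hypothetical, and `reverse` is simply the flag of Python's built-in `sorted`.

```python
from operator import itemgetter

# Hypothetical matches as a ranker might leave them: scored, but not reordered.
matches = [
    {"id": "doc-a", "score": 0.2},
    {"id": "doc-b", "score": 0.9},
    {"id": "doc-c", "score": 0.5},
]


def sort_by_score(items, reverse=False):
    """Return a new list sorted by 'score'; reverse=True puts the largest first."""
    return sorted(items, key=itemgetter("score"), reverse=reverse)


ascending = sort_by_score(matches)                 # smallest score first
descending = sort_by_score(matches, reverse=True)  # largest score first

print([m["id"] for m in ascending])
print([m["id"] for m in descending])
```

Without the explicit sort step, the output order is whatever order the scoring step happened to produce.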

JoanFM commented 3 years ago

I wonder whether the ranker itself should have an order keyword. While it is true that SortQL can do the job, sorting in the right (configurable) order directly in the Chunk2DocRanker might be useful.

Well, it would be in the driver.

nan-wang commented 3 years ago

Here is the root issue. The SortQL driver attached to the Chunk2DocRanker by default does not set the reverse argument. https://github.com/jina-ai/jina/blob/master/jina/resources/executors.requests.BaseRanker.yml#L9

JoanFM commented 3 years ago

In this case I am not sure it is even applied, because I think we override the drivers. And if we do so, the default ones are not considered, right?

nan-wang commented 3 years ago

You are right. Adding a SortQL to pods/ranker.yml will solve the problem in the multires-lyrics example. However, we need to offer the option in the Chunk2DocRankDriver, because by default it ranks in ascending order and then does the slicing. When the user considers a larger score to be better, this will cause trouble.
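
A small plain-Python sketch of why this causes trouble, assuming the truncation step mentioned above keeps only the first k results after sorting. Nothing here is Jina code: the `top_k` helper and the `larger_is_better` flag are hypothetical names used only for illustration.

```python
def top_k(scores, k, larger_is_better):
    """Sort in the direction that puts the best scores first, then keep k of them."""
    ordered = sorted(scores, reverse=larger_is_better)
    return ordered[:k]


scores = [0.1, 0.8, 0.4, 0.95]

# Ascending sort followed by truncation keeps the two smallest scores.
kept_ascending = top_k(scores, 2, larger_is_better=False)

# If a larger score means a better result, the sort must be reversed first.
kept_descending = top_k(scores, 2, larger_is_better=True)

print(kept_ascending)
print(kept_descending)
```

Sorting again after the truncation cannot recover the dropped results, which is why the direction has to be configurable before that step.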

JoanFM commented 3 years ago

I agree, but offering this solution kind of invalidates the use of SortQL as the default attached driver, right? Nothing against it, just raising the point.

hanxiao commented 3 years ago

simply use SortQL

cristianmtr commented 3 years ago

From a discussion with @JoanFM, it seems there is some confusion around the score field.

Thus there is nothing wrong with the output of the example above.

The problem, to me, is a matter of documentation / conflicting definitions for the score field.

Possible solutions:

Something like

 "method": "MinRanker"  # or "Indexer"

in the output

image

cristianmtr commented 3 years ago

I will close this, as the original problem turned out not to be a problem. I have opened a new issue to clarify the meaning of 'score': https://github.com/jina-ai/jina/issues/1255