Enable post-retrieval scoring of document-match pairs

maximilianwerk commented 4 years ago

Describe the feature

Currently jina only supports ranking matches based on the scores that the retrieval step provides. We need a way to add further query <> match metrics in order to fine-tune the ranking. Possible applications range from a simple edit distance to complex deep learning scoring techniques such as BERT.

Furthermore, it might be necessary to add a WeightedRanker, which takes multiple scores encoded in the score.operands field and combines them into one top-level score per match.

Proposal

Add a driver for unpacking the document and its matches (named ContentMatchDriver).
Add an executor interface for scoring the document against its matches (named ContentMatcher). This should be configurable to either overwrite the match score or add a score in the score.operands field.
Add a concrete implementation of the ContentMatcher in the form of a simple Levenshtein distance in the hub (named LevenshteinMatcher).
Add a concrete implementation of the ContentMatcher in the form of BERT-based scoring in the hub (named BertMatcher).
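
To make the proposal more concrete, here is a minimal sketch of what the ContentMatcher interface and a LevenshteinMatcher could look like. It is plain Python and does not use any jina API: the Match dataclass, the `apply` method, the `overwrite` flag and the `name` attribute are all assumptions made for illustration, and the driver that would unpack real documents is left out.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Match:
    """Hypothetical stand-in for a retrieved match; not a jina type."""
    text: str
    score: float = 0.0  # top-level score from the retrieval step
    operands: Dict[str, float] = field(default_factory=dict)  # extra named scores


def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming, keeping one row at a time."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


class ContentMatcher:
    """Scores each (query, match) pair; subclasses implement `score`."""
    name = "content"

    def __init__(self, overwrite: bool = False):
        # overwrite=True replaces the match score, otherwise the new
        # score is stored under `name` in the operands mapping.
        self.overwrite = overwrite

    def score(self, query: str, match_text: str) -> float:
        raise NotImplementedError

    def apply(self, query: str, matches: List[Match]) -> List[Match]:
        for match in matches:
            value = self.score(query, match.text)
            if self.overwrite:
                match.score = value
            else:
                match.operands[self.name] = value
        return matches


class LevenshteinMatcher(ContentMatcher):
    name = "levenshtein"

    def score(self, query: str, match_text: str) -> float:
        return float(levenshtein(query, match_text))
```

Note that an edit distance is "lower is better", while retrieval scores are often "higher is better", so a later ranking step would have to account for the direction of each score.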

Adding a WeightedRanker as a follow-up step might be necessary. This could be a simple linear combination of the existing scores, or something like LambdaMART in the long run. Anyhow, I would rather handle this as a separate follow-up task, so as not to overload this issue.
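
As a rough illustration of the simple linear-combination variant, the sketch below combines several named scores per match into one top-level score. The function name `weighted_rank`, the tuple-based match representation and the operand names are assumptions for this example only, not part of jina.

```python
from typing import Dict, List, Tuple


def weighted_rank(
    matches: List[Tuple[str, Dict[str, float]]],
    weights: Dict[str, float],
) -> List[Tuple[str, float]]:
    """Combine named scores per match linearly and sort best-first."""
    combined = []
    for match_id, operands in matches:
        # Scores missing from a match contribute nothing to its total.
        total = sum(w * operands.get(name, 0.0) for name, w in weights.items())
        combined.append((match_id, total))
    # Higher combined score ranks first.
    return sorted(combined, key=lambda pair: pair[1], reverse=True)


# Hypothetical operands: a retrieval score (higher is better) and an
# edit distance (lower is better, hence the negative weight).
matches = [
    ("a", {"retrieval": 0.9, "levenshtein": 5.0}),
    ("b", {"retrieval": 0.8, "levenshtein": 1.0}),
]
ranked = weighted_rank(matches, {"retrieval": 1.0, "levenshtein": -0.1})
```

With these made-up weights, match "b" overtakes "a" despite its lower retrieval score, because its edit distance is much smaller. A learned ranker such as LambdaMART would replace the hand-set weights.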


JoanFM commented 4 years ago

Important to keep in mind that Rankers should be chainable to allow different phases of ranking.

sync-by-unito[bot] commented 4 years ago

➤ Nan Wang commented:

BertQA is an extractive QA model, which extracts the answer from the context text. This is not exactly what we want.

jina-ai / serve

Enable post-retrieval scoring of document-match pairs #891