Show best chunks of found document

janandreschweiger commented 3 years ago

Hi, Jina team!

Feature: We use Jina to search short and long text documents. We split our documents into chunks (paragraphs) at index time. Our problem is that we want to show the best paragraph (= chunk) to the user, similar to Google.

Problem: For ranking the best documents based on the matching chunks, one uses

- a subclass of the Chunk2DocRanker
- the CollectMatches2DocRankDriver

Unfortunately, the matching chunks (= paragraphs) are not stored in the chunks property of the returned documents.

Solution: I tried to implement a custom driver and a custom executor that inherit from the classes mentioned above. It's easy to change the CollectMatches2DocRankDriver, as one can see below:

class CollectMatches2DocRankDriver(BaseRankDriver):
    # ... uninteresting code
    def _apply_all(self, docs: 'DocumentSet', context_doc: 'Document', *args, **kwargs) -> None:
        # ... uninteresting code

        # self.exec_fn should return [(doc_id, score, chunks), ...] but returns [(doc_id, score), ...]
        docs_scores = self.exec_fn(match_idx, query_chunk_meta, match_chunk_meta)

        context_doc.ClearField('matches')
        # use the driver's executor (self.exec), not the Python builtin exec
        op_name = self.exec.__class__.__name__
        for int_doc_id, score, chunks in docs_scores:
            m = Document(id=int_doc_id)
            m.score.value = score
            m.score.op_name = op_name
            m.chunks = chunks  # add chunks
            context_doc.matches.append(m)

The problem, however, is that self.exec_fn above must also return the chunks of the document. Therefore, one has to alter the Chunk2DocRanker's score function (which is the exec_fn at runtime). Unfortunately, I was unable to implement that successfully.
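To make the desired return shape concrete, here is a minimal sketch in plain Python. It does not use Jina's API: the function name score_with_chunks and the flat (doc_id, chunk_id, distance) input are assumptions made purely for illustration, and the "min" aggregation is just one possible strategy.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical flat record for one matched chunk: (parent_doc_id, chunk_id, distance).
ChunkMatch = Tuple[str, str, float]
ScoredChunk = Tuple[str, float]


def score_with_chunks(chunk_matches: List[ChunkMatch]) -> List[Tuple[str, float, List[ScoredChunk]]]:
    """Group matched chunks by parent document and return them next to the score.

    The document score is the smallest chunk distance (lower is better).
    This is an illustration, not Jina's Chunk2DocRanker.score.
    """
    grouped: Dict[str, List[ScoredChunk]] = defaultdict(list)
    for doc_id, chunk_id, distance in chunk_matches:
        grouped[doc_id].append((chunk_id, distance))

    results = []
    for doc_id, chunks in grouped.items():
        chunks.sort(key=lambda c: c[1])  # closest chunk first
        results.append((doc_id, chunks[0][1], chunks))
    results.sort(key=lambda r: r[1])  # best document first
    return results


matches = [("doc1", "c1", 0.8), ("doc2", "c7", 0.3), ("doc1", "c2", 0.1)]
ranked = score_with_chunks(matches)
```

With triples like these, a driver loop of the form "for doc_id, score, chunks in docs_scores" would have the chunks available for each match.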

Intention: I posted this request because most production-ready search engines have such a feature, so others may be interested as well. I totally understand if you currently do not have time to implement it. However, it would be great if someone could take a closer look at this and tell us how we could do it ourselves.

Thank you.

JoanFM commented 3 years ago

Hey @janandreschweiger.

First of all, what is your strategy for computing the match scores once the chunk scores are computed?

For instance, we have MinRanker, which assigns the matched document the score of its minimum-scored chunk (the chunk with the minimum distance), and there are some other strategies as well.
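As a rough illustration of what such an aggregation strategy means, the sketch below collapses per-chunk scores into one score per document. It is plain Python under stated assumptions: the STRATEGIES table and the aggregate function are hypothetical names, not Jina's rankers, and the real rankers may transform the score further.

```python
from statistics import mean
from typing import Callable, Dict, List

# Hypothetical strategy table; the keys are illustrative, not Jina's options.
STRATEGIES: Dict[str, Callable[[List[float]], float]] = {
    "min": min,
    "max": max,
    "mean": mean,
}


def aggregate(chunk_scores: Dict[str, List[float]], strategy: str = "min") -> Dict[str, float]:
    """Collapse the chunk scores of each document into a single document score."""
    fn = STRATEGIES[strategy]
    return {doc_id: fn(scores) for doc_id, scores in chunk_scores.items()}


chunk_scores = {"doc1": [0.8, 0.1], "doc2": [0.3]}
doc_scores = aggregate(chunk_scores, "min")
```

Note that any such aggregation discards the individual chunks unless they are carried along separately.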

If you have a strategy like this one, what you need is Chunk2DocRankDriver and not CollectMatches2DocRankDriver. The concept is similar but there is a slight difference.

I hope this helps.

Another solution is to use RevertQLDriver following your Driver to revert the score.

janandreschweiger commented 3 years ago

Thank you @JoanFM, I had a look at this. We currently use the SimpleAggregateRanker with strategy="min", which corresponds to the MinRanker (as far as I know).

However, I think the issue remains the same:

class Chunk2DocRankDriver(BaseRankDriver):
    # other code
    def _apply_all(self, docs: 'DocumentSet', context_doc: 'Document', *args, **kwargs) -> None:
        # other code
        docs_scores = self.exec_fn(match_idx, query_chunk_meta, match_chunk_meta)  # the grouped chunks are not returned again
        # should return [(doc_id, score, chunks), ...] but returns [(doc_id, score), ...]

        # use the driver's executor (self.exec), not the Python builtin exec
        op_name = self.exec.__class__.__name__
        for int_doc_id, score in docs_scores:  # cannot add chunks, because they are not returned by the score function of the ranker
            m = Document(id=int_doc_id)
            m.score.value = score
            m.score.op_name = op_name
            context_doc.matches.append(m)

https://docs.jina.ai/_modules/jina/drivers/rank.html#Chunk2DocRankDriver

For getting the matching chunks (not of the query, but of the indexed documents), the Chunk2DocRanker.score method must return the chunks of each indexed document. The Chunk2DocRanker is the base class for both MinRanker and SimpleAggregateRanker.

https://docs.jina.ai/_modules/jina/executors/rankers.html#Chunk2DocRanker

JoanFM commented 3 years ago

Yes @janandreschweiger,

the SimpleAggregateRanker with strategy min is equivalent to MinRanker.

So is your problem that the chunk information is lost, and that the order of the output is not as expected?

Would you be able to provide a minimal example, with a very small sample, showing how two documents would be indexed and what you expect in return?

That would be very helpful for understanding the use case better.

Thanks a lot

janandreschweiger commented 3 years ago

@JoanFM Of course, I will do that.

The order and the score are fine. I solved these issues some days ago. So yes, our problem is that the chunks (=paragraphs) are lost.

I will provide you with an example soon.

janandreschweiger commented 3 years ago

Hi @JoanFM! Thanks again for having a look at our problem. I have built a minimal working example: https://github.com/janandreschweiger/jina_example

Our problem is that there are no chunks at app.py line 23. You can run it yourself:

python app.py index
python app.py query

Are there any good tutorials?

I will come back here tomorrow. Good night!

JoanFM commented 3 years ago

The first thing you can do is remove chunks from the fields excluded in 'ExcludeQL'. That would give you the chunks in the output. But I guess what you would like to do is keep the scores of the chunks that matched?

janandreschweiger commented 3 years ago

@JoanFM Exactly, we use the ExcludeQL only at index time. We need the scores of the matching chunks. We just want to show the best matching chunk to the user similar to Google.

Again, one can easily override the driver as described above. But additionally, the Chunk2DocRanker.score function, which groups the chunks by their documents, must also return the matching chunks. And I failed to do that.
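Once the matching chunks and their scores are available on each returned document, the display step itself is small. The following is only a sketch in plain Python: best_snippet and the (text, distance) pairs are hypothetical and not part of Jina, and it assumes that a lower distance means a better match.

```python
from typing import List, Tuple


def best_snippet(chunks: List[Tuple[str, float]], max_len: int = 80) -> str:
    """Return the text of the closest chunk, truncated for display."""
    text, _ = min(chunks, key=lambda c: c[1])  # smallest distance wins
    if len(text) <= max_len:
        return text
    return text[: max_len - 3].rstrip() + "..."


chunks = [
    ("Jina is a neural search framework.", 0.42),
    ("Chunks are paragraphs of a document.", 0.07),
]
snippet = best_snippet(chunks)
```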

ace-kay-law-neo commented 3 years ago

Hi @janandreschweiger and @JoanFM! This feature is also of interest to me. Thanks for creating a draft @JoanFM.

JoanFM commented 3 years ago

Hey @janandreschweiger and @ace-kay-law-neo, PR #1494 provides this feature, but be aware that there is a change in the naming of the Collect2MatchesRankDriver.

janandreschweiger commented 3 years ago

Wow, awesome, thank you very much @JoanFM!

janandreschweiger commented 3 years ago

Hey @JoanFM, your driver works like a charm. Thank you!

You may want to rename the keep_old_matches_as_chunks in the docs, because it is outdated: https://docs.jina.ai/api/jina.drivers.rank.aggregate.html?highlight=keep_old_matches_as_chunks

jina-ai / serve

Show best chunks of found document #1491