Closed janandreschweiger closed 3 years ago
Hey @janandreschweiger.
First of all, what is your strategy for computing the match scores once the chunk scores are computed?
For instance, we have MinRanker, which assigns each matched document the score of its best chunk (the chunk with the minimum distance), and there are other strategies as well.
If you have a strategy like this one, what you need is Chunk2DocRankDriver and not CollectMatches2DocRankDriver. The concept is similar but there is a slight difference.
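To make the "min" strategy concrete, here is a minimal plain-Python sketch. It is not Jina's MinRanker implementation; the function name `min_rank` and the `(parent_doc_id, distance)` input format are assumptions made only for illustration.

```python
from typing import Dict, List, Tuple


def min_rank(chunk_matches: List[Tuple[str, float]]) -> List[Tuple[str, float]]:
    """Give every parent document the distance of its closest chunk.

    chunk_matches holds (parent_doc_id, distance) pairs, one per matched chunk
    (a hypothetical input format). Smaller distance means a better match.
    """
    best: Dict[str, float] = {}
    for doc_id, distance in chunk_matches:
        if doc_id not in best or distance < best[doc_id]:
            best[doc_id] = distance
    # Best (smallest distance) documents first.
    return sorted(best.items(), key=lambda item: item[1])


chunk_matches = [("doc_a", 0.42), ("doc_b", 0.17), ("doc_a", 0.08), ("doc_b", 0.30)]
ranked = min_rank(chunk_matches)
print(ranked)  # [('doc_a', 0.08), ('doc_b', 0.17)]
```

Note that this aggregation keeps only one number per document, which is why the chunk-level information is no longer available afterwards.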
I hope this helps.
Another solution is to use RevertQLDriver following your Driver to revert the score.
Thank you @JoanFM, I had a look at this. We currently use the SimpleAggregateRanker with strategy="min", which corresponds to the MinRanker (as far as I know).
However, I think the issue remains the same:
class Chunk2DocRankDriver(BaseRankDriver):
    # other code
    def _apply_all(self, docs: 'DocumentSet', context_doc: 'Document', *args, **kwargs) -> None:
        # other code
        # the grouped chunks are not returned again:
        # should return [(doc_id, score, chunks), ...] but returns [(doc_id, score), ...]
        docs_scores = self.exec_fn(match_idx, query_chunk_meta, match_chunk_meta)
        op_name = self.exec.__class__.__name__
        # cannot add chunks, because they are not returned by the score function of the ranker
        for int_doc_id, score in docs_scores:
            m = Document(id=int_doc_id)
            m.score.value = score
            m.score.op_name = op_name
            context_doc.matches.append(m)
https://docs.jina.ai/_modules/jina/drivers/rank.html#Chunk2DocRankDriver
For getting the matching chunks (not of the query, but of the indexed documents), the Chunk2DocRanker.score method must return the chunks of each indexed document. The Chunk2DocRanker is the base class for both MinRanker and SimpleAggregateRanker.
https://docs.jina.ai/_modules/jina/executors/rankers.html#Chunk2DocRanker
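As a rough illustration of the return shape asked for above, the following plain-Python sketch groups chunk hits by parent document and returns `(doc_id, score, chunks)` triples. It does not use or reproduce the Jina ranker API; `score_with_chunks` and the `(parent_doc_id, chunk_id, distance)` tuple layout are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical layout of one chunk hit: (parent_doc_id, chunk_id, distance)
ChunkHit = Tuple[str, str, float]
DocResult = Tuple[str, float, List[Tuple[str, float]]]


def score_with_chunks(hits: List[ChunkHit]) -> List[DocResult]:
    """Group chunk hits per document and keep the chunks in the result."""
    grouped: Dict[str, List[Tuple[str, float]]] = defaultdict(list)
    for doc_id, chunk_id, distance in hits:
        grouped[doc_id].append((chunk_id, distance))

    results: List[DocResult] = []
    for doc_id, chunks in grouped.items():
        chunks.sort(key=lambda chunk: chunk[1])  # closest chunk first
        doc_score = chunks[0][1]                 # "min" strategy
        results.append((doc_id, doc_score, chunks))

    results.sort(key=lambda result: result[1])   # best document first
    return results


hits = [("doc_a", "a1", 0.42), ("doc_b", "b1", 0.17), ("doc_a", "a2", 0.08)]
results = score_with_chunks(hits)
for doc_id, score, chunks in results:
    print(doc_id, score, chunks)
```

With triples like these, a driver loop could attach the chunks to each match instead of discarding them.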
Yes @janandreschweiger,
the SimpleAggregateRanker with strategy min is equivalent to MinRanker.
So is your problem that the chunk information is lost, and that the order in the output is not as expected?
Would you be able to provide a minimal example, with a very small sample, showing how two documents would be indexed and what you expect in return?
That would be very helpful for understanding the use case.
Thanks a lot
@JoanFM Of course, I will do that.
The order and the score are fine. I solved these issues some days ago. So yes, our problem is that the chunks (=paragraphs) are lost.
I will provide you with an example soon.
Hi @JoanFM! Thanks again for having a look at our problem. I have built a minimal working example: https://github.com/janandreschweiger/jina_example
Our problem is that there are no chunks at app.py line 23. You can run it yourself:
python app.py index
python app.py query
Are there any good tutorials?
I will come back here tomorrow. Good night!
The first thing you can do is remove the chunks from the fields excluded in 'ExcludeQL'. That would give you the chunks in the output. But I guess what you would like to do is keep the scores of the chunks that matched?
@JoanFM Exactly, we use the ExcludeQL only at index time. We need the scores of the matching chunks. We just want to show the best matching chunk to the user similar to Google.
Again, one can easily override the driver as described above. But in addition, the Chunk2DocRanker.score function, which groups the chunks by their documents, must also return the matching chunks. And I failed to do that.
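For illustration, the desired end result (one best-matching paragraph per document, shown as a snippet) can be sketched in plain Python as follows. This is independent of Jina; the `ChunkMatch` type and `best_snippets` function are hypothetical names, and lower distance is assumed to mean a better match.

```python
from typing import Dict, List, NamedTuple


class ChunkMatch(NamedTuple):
    """Hypothetical record for one matched chunk (paragraph)."""
    doc_id: str
    text: str
    distance: float


def best_snippets(matches: List[ChunkMatch]) -> Dict[str, str]:
    """Return, for each document, the text of its closest matching chunk."""
    best: Dict[str, ChunkMatch] = {}
    for match in matches:
        current = best.get(match.doc_id)
        if current is None or match.distance < current.distance:
            best[match.doc_id] = match
    return {doc_id: match.text for doc_id, match in best.items()}


matches = [
    ChunkMatch("doc_a", "Introduction paragraph.", 0.42),
    ChunkMatch("doc_a", "Paragraph that answers the query.", 0.08),
    ChunkMatch("doc_b", "Loosely related paragraph.", 0.30),
]
snippets = best_snippets(matches)
for doc_id, snippet in snippets.items():
    print(f"{doc_id}: {snippet}")
```

This only works if the chunk text and chunk score survive the ranking step, which is exactly what the request is about.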
Hi @janandreschweiger and @JoanFM! This feature is also of interest to me. Thanks for creating a draft @JoanFM.
Hey @JoanFM and @ace-kay-law-neo, PR #1494 provides this feature, but be aware that there is a change in the naming of the Collect2MatchesRankDriver.
Wow, awesome, thank you very much @JoanFM!
Hey @JoanFM, your driver works like a charm. Thank you!
You may want to rename the keep_old_matches_as_chunks in the docs, because it is outdated: https://docs.jina.ai/api/jina.drivers.rank.aggregate.html?highlight=keep_old_matches_as_chunks
Hi, Jina team!
Feature
We use Jina for searching short and long text documents. We split our documents into chunks (paragraphs) at index time. Our problem is that we want to show the best paragraph (= chunk) to the user, similar to Google:
Problem
For ranking the best documents based on the matching chunks, one uses
Solution
I tried to implement a custom driver and a custom executor that inherit from the classes mentioned above. It's easy to change the CollectMatches2DocRankDriver, as one can see below:
The problem, however, is that the self.exec_fn above must also return the chunks of the document. Therefore one has to alter the Chunk2DocRanker's score function (which is the exec_fn at runtime). Unfortunately, I was unable to implement that successfully.
Intention
I posted this request because most production-ready search engines have such a feature, so others may be interested as well. I totally understand if you currently do not have time to implement it. However, it would be great if someone could take a closer look at this and tell us how we could do it ourselves.
Thank you.