Open fulmicoton opened 6 years ago
Is the best way to achieve this right now a disjunction of a PhraseQuery and individual TermQueries?
E.g. in my search engine I would search for Michael Jackson, which is logically Michael OR Jackson. If I add a phrase to the equation, like "Michael Jackson"^3 OR Michael OR Jackson, I would achieve a higher score for the exact phrase match.
But is the idea really about dynamically increasing the score based on the proximity of the next term to the previous term ("term clustering")? E.g. given 3 indexed texts, Michael foo Jackson, Michael foo bar Jackson, and Michael foo bar baz Jackson, the first one would score slightly better than the others.
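To make the "term clustering" flavor concrete, here is a minimal sketch in plain Rust (no tantivy calls). It assumes whitespace tokenization and a bonus of 1/gap; both choices, and the names `min_ordered_gap` and `proximity_bonus`, are hypothetical and only illustrate the ranking described above.

```rust
/// Smallest gap (in token positions) between an occurrence of `a`
/// and a later occurrence of `b`. Returns `None` if `b` never follows `a`.
fn min_ordered_gap(tokens: &[&str], a: &str, b: &str) -> Option<usize> {
    let mut last_a: Option<usize> = None;
    let mut best: Option<usize> = None;
    for (pos, tok) in tokens.iter().enumerate() {
        if *tok == a {
            // Remember the most recent position of the first term.
            last_a = Some(pos);
        } else if *tok == b {
            if let Some(pa) = last_a {
                let gap = pos - pa;
                best = Some(best.map_or(gap, |g| g.min(gap)));
            }
        }
    }
    best
}

/// Hypothetical proximity bonus: 1.0 when the terms are adjacent,
/// decaying as 1/gap, and 0.0 when they never appear in order.
fn proximity_bonus(text: &str, a: &str, b: &str) -> f32 {
    // Assumption: plain whitespace tokenization, case-sensitive matching.
    let tokens: Vec<&str> = text.split_whitespace().collect();
    match min_ordered_gap(&tokens, a, b) {
        Some(gap) => 1.0 / gap as f32,
        None => 0.0,
    }
}
```

With this decay, Michael foo Jackson gets a larger bonus than Michael foo bar Jackson, which in turn beats Michael foo bar baz Jackson. A real scorer would add such a bonus on top of the usual term scores rather than replace them.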
I do not know which flavor is best... I don't think this is a one-size-fits-all problem, so different users will ask for different things.
FYI... Maybe there is a corresponding issue on Lucene: https://issues.apache.org/jira/browse/LUCENE-3320 I've started to investigate it; I'd like to help or give feedback here if it works on Lucene.
Sweet! Looking forward to reading about your progress, @mocobeta!
Conceptually it would just look like the phrasequery
You could base it on the phrase query with slop: multiply the weight by the length of the query divided by the average length of the matches, or something like this? So the more slop you need for a given match, the less it contributes to the score.
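The arithmetic of that suggestion can be sketched as follows. This is plain Rust, not tantivy's scorer API; the function name `slop_scaled_weight` and the idea of passing matched span lengths as a slice are assumptions made for illustration.

```rust
/// Scale a phrase weight by `query_len / average matched span length`.
///
/// `match_spans` holds, for each match, the number of token positions it
/// covers: equal to `query_len` for an exact phrase, larger when slop was
/// needed. Assumes every span is at least `query_len`.
fn slop_scaled_weight(base_weight: f32, query_len: usize, match_spans: &[usize]) -> f32 {
    if match_spans.is_empty() || query_len == 0 {
        // No match (or an empty query) contributes nothing.
        return 0.0;
    }
    let total: usize = match_spans.iter().sum();
    let avg_span = total as f32 / match_spans.len() as f32;
    base_weight * query_len as f32 / avg_span
}
```

For a two-term query with weight 3.0, an exact phrase match (span 2) keeps the full 3.0, while a match spread over 4 positions contributes 1.5, so looser matches count for less.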
I'm hoping to use quickwit/tantivy to search fairly long documents, so proximity boosting is a must for sensible results. Is this still not possible? Solr/Lucene seems to support it in some ways at least. Here is one, the Extended DisMax query parser, which improves proximity boosting by using word shingles:
https://solr.apache.org/guide/solr/latest/query-guide/edismax-query-parser.html
You will have to implement your own Query/Weight/Scorer.
Michael Jackson should get a better score if the two terms appear close to each other in the same document.
It would be nice to have a query that computes something equivalent to
Conceptually it would just look like the phrasequery, except that the part that checks whether the phrase is present would only affect the score of the document.