Closed mistcat closed 3 years ago
Thanks for the report, @mistcat.
However I believe what is should call instead in order to correctly weight the fields as described is:
Can you provide more background information for this suggestion?
For context, $dismax->setBoostFunctions
was originally added in bfcc9fd8de9bc099e838a4e619acdd010c8cb7ff and became filterable with 003195d25622254bfc3e139154225ffddcf128e5
I've been in a Solr hole trying to figure out how to implement boosts for posts in certain post types and categories that led me to this issue.
While I am by no means a Solr expert, my research so far has led me to the same conclusion -- boost functions are not the right way to boost result scores based on term matches in post title / content fields. I'm having trouble finding the exact reason in the docs (and in general it's been difficult working with this plugin / Solr on Pantheon due to the outdated Solr version being light on docs) but you can use query debugging to show why.
Here's an example Solr query with (a simplified version of) the current Solr Power query: http://localhost:8983/solr/select?q.alt=test%20query&defType=dismax&fl=*,score&bf=post_title^2%20post_content^1.2&wt=json&debugQuery=true
Here's how this query is parsed according to debug.parsedquery
:
+(text:test text:queri) FunctionQuery((ord(post_title))^2.0) FunctionQuery((ord(post_content))^1.2)
Note that the FunctionQuery
wraps the field in the ord
function. As I understand it this is because function queries evaluate all non-numeric fields to numeric values (https://wiki.apache.org/solr/FunctionQuery#field). The ord
function produces a number that represents the numeric index of the query term in "lucene index order (lexicographically ordered by unicode value), starting at 1". This means the resulting boost value is arbitrarily determined by the position of the search term in a list of all values for that field (sorted alphabetically). The end result of using this approach:
981.05255
, compared to 0.2949645
without the bf
).Here's an alternate version that uses the qf
parameter as proposed:
http://localhost:8983/solr/select?q.alt=test%20query&defType=dismax&fl=*,score&qf=post_title^2%20post_content^1.2&wt=json&debugQuery=true
With this query the scores go back down to a normal range, but if you look at the debug.parsedquery
it's not quite right:
+(text:test text:queri)
It's not actually using our field boosts (it's using the default search field, as specified in the Solr schema). The culprit is the use of q.alt
instead of q
, a behavior that was introduced by adding this line in https://github.com/pantheon-systems/solr-power/pull/143/files.
Switching from q.alt
to q
fixes it:
http://localhost:8983/solr/select?q=test%20query&defType=dismax&fl=*,score&qf=post_title^2%20post_content^1.2&wt=json&debugQuery=true
Which results in a debug.parsedquery
of:
+((DisjunctionMaxQuery((post_title:test^2.0 | post_content:test^1.2)) DisjunctionMaxQuery((post_title:queri^2.0 | post_content:queri^1.2)))~2) ()
This produces results with scores in normal ranges, and if you look at the result fields the order can be linked back to our field boosts (i.e. if you change the boost factors, the scores change in expected ways).
The other thing to note here is that this query is only searching post title and content fields. With qf
you are telling it which fields to search, so you need to add back the default search field (the multi-valued text
field):
http://localhost:8983/solr/select?q=test%20query&defType=dismax&fl=*,score&qf=text%20post_title^2%20post_content^1.2&wt=json&debugQuery=true
+((DisjunctionMaxQuery((text:test | post_title:test^2.0 | post_content:test^1.2)) DisjunctionMaxQuery((text:queri | post_title:queri^2.0 | post_content:queri^1.2)))~2) ()
This produced the best results in my testing.
Here's what I think should happen:
q.alt
to q
for all queries except the ones that need to use q.alt
. I know it's needed to calculate the index stats; I'm not sure if anything else introduced in https://github.com/pantheon-systems/solr-power/pull/143 needs it.bf
to qf
for implementing field boosts. Example: text post_title^2 post_content^1.2
I'd be happy to submit a PR, but 1 is a bit of a blocker for me as I really don't understand why the switch to q.alt
was made in https://github.com/pantheon-systems/solr-power/pull/143. I'd also appreciate it if someone with more Solr experience weighed in here.
I'd be happy to submit a PR, but 1 is a bit of a blocker for me as I really don't understand why the switch to
q.alt
was made in #143. I'd also appreciate it if someone with more Solr experience weighed in here.
I imagine it was a "tried something else and it worked" implementation. Because it's only used for fetching stats, we could run an entirely separate Solr query for it, and ignore the code path used for the main Solr query.
I'd be fine with a pull request including your suggested changes as long as they also include a reasonable amount of test coverage around them.
Looks like this is a bit more involved than I originally thought. The reason that you're using q.alt
instead of q
because of the way all the filter queries are set up in SolrPower_WP_Query::build_query()
. For example, here's the query it builds to filter by post type:
star wars AND(post_type:post)
The dismax parser, being geared towards humans, doesn't support that syntax for filtering. The standard query parser, which is used for q.alt
, does and is therefore required.
I think the bigger question is whether we should use the standard query parser or dismax by default. My feeling is dismax, for all the reasons described in the docs overview. It's less powerful, but 99% of search users won't know the difference. It's also more forgiving / less error prone, which feels appropriate as well.
I don't currently have the bandwidth to PR a switch in default parser, but I imagine the approach would be switching to filter queries for post type / status / tax / meta filters (which I believe will also be more performant).
The one thing I do recommend is removing the $dismax->setBoostFunctions()
call / filter that started this issue. It's not doing what it says it's doing, and completely throws off relevancy for sites with more than a handful of posts. (The unit tests don't have enough posts to demonstrate this).
Can confirm that modifying solr_boost_query
with all kinds of different variations never changes the search results.
I don't think we should be using the q.alt at all - after a lot of debugging it doesn't seem to support the boost syntax. Also strongly recommend removing the post_title hack.
I think using q.alt was a misunderstanding of the note under "selecting all documents" at the following link - it only applies if you're trying to use standard solr syntax in a dismax query: https://solarium.readthedocs.io/en/stable/queries/select-query/building-a-select-query/components/dismax-component/
To get boosting working, I had to do the following. I'm planning on following up with a PR with these changes:
Comment out the part in class-solrpowr-api.php that uses q.alt:
function dismax_query( $query, $dismax ) {
// $dismax->setQueryAlternative( $query->getQuery() );
// $query->setQuery( '' );
return $query;
}
Comment out the post_title hack, especially since I've incorporated into my wordpress post type a "boosted keywords" field that I'm using in instances where I want the client to be able to influence search results. This is in class-solrpower-wp-query.php on line 185:
if ( $query->get( 's' ) ) {
//$extra['sort_before'][ "query({!dismax qf=post_title v='\"" . addslashes( $query->get( 's' ) ) . "\"'})" ] = 'desc';
}
Get rid of the default boostFunctions because it's the incorrect way to set them (this is in a custom plugin):
add_filter( 'solr_boost_query',
function ( $solr_boost_query ) {
return false;
}
, 10, 1 );
Add them in manually:
add_filter( 'solr_dismax_query',
function( $dismax, $query ) {
$dismax->setQueryFields('text^1.2 post_title^2 boosted_keywords^5');
}
, 10, 2 );
@jimkeller I have no idea how any of this works, but this made our search queries work a lot more reliably. Kudos.
I created a pull request at: https://github.com/pantheon-systems/solr-power/pull/495 to workaround this problem. The pull request allows you to define a constant SOLRPOWER_DISABLE_QUERY_ALT that will circumvent the code that causes the boost not to work.
So simply put the following in wp-config.php:
define('SOLRPOWER_DISABLE_QUERY_ALT', true);
Note that you will still have to set your own boost filter as shown:
add_filter( 'solr_dismax_query',
function( $dismax, $query ) {
$dismax->setQueryFields('text^1.2 post_title^2');
}
, 10, 2 );
Also attached is a patch against the master branch. wp-solrpower-disable-query-alt.patch.txt
Shipped the constant with v2.2.4. Thanks again for your pull request, @jimkeller !
So exciting to see this get sorted out! Great work @jimkeller & @mgburns ! I'm sorry I didn't have more time to help contribute a fix myself, but am glad to see all the lovely community support. Let me know if I can help roadtest anything or contribute to documentation for it!
in includes/class-solrpower-api.php
the plugin currently calls:
However I believe what is should call instead in order to correctly weight the fields as described is: