pantheon-systems / solr-power

A WordPress plugin to connect to Pantheon's Apache Solr search infrastructure, or your own!
https://wordpress.org/plugins/solr-power/
GNU General Public License v2.0
126 stars 61 forks source link

Dismax Boost may not be set correctly for weighting fields in queries #371

Closed mistcat closed 3 years ago

mistcat commented 6 years ago

in includes/class-solrpower-api.php

the plugin currently calls:

$solr_boost_query = apply_filters( 'solr_boost_query', 'post_title^2 post_content^1.2' );
            if ( false !== $solr_boost_query ) {
                $dismax->setBoostFunctions( $solr_boost_query );
            }

However I believe what is should call instead in order to correctly weight the fields as described is:

$solr_boost_query = apply_filters( 'solr_boost_query', 'post_title^2 post_content^1.2' );
            if ( false !== $solr_boost_query ) {
                $dismax->setQueryFields( $solr_boost_query );
            }
danielbachhuber commented 6 years ago

Thanks for the report, @mistcat.

However I believe what is should call instead in order to correctly weight the fields as described is:

Can you provide more background information for this suggestion?

For context, $dismax->setBoostFunctions was originally added in bfcc9fd8de9bc099e838a4e619acdd010c8cb7ff and became filterable with 003195d25622254bfc3e139154225ffddcf128e5

mgburns commented 6 years ago

I've been in a Solr hole trying to figure out how to implement boosts for posts in certain post types and categories that led me to this issue.

While I am by no means a Solr expert, my research so far has led me to the same conclusion -- boost functions are not the right way to boost result scores based on term matches in post title / content fields. I'm having trouble finding the exact reason in the docs (and in general it's been difficult working with this plugin / Solr on Pantheon due to the outdated Solr version being light on docs) but you can use query debugging to show why.

Here's an example Solr query with (a simplified version of) the current Solr Power query: http://localhost:8983/solr/select?q.alt=test%20query&defType=dismax&fl=*,score&bf=post_title^2%20post_content^1.2&wt=json&debugQuery=true

Here's how this query is parsed according to debug.parsedquery:

+(text:test text:queri) FunctionQuery((ord(post_title))^2.0) FunctionQuery((ord(post_content))^1.2)

Note that the FunctionQuery wraps the field in the ord function. As I understand it this is because function queries evaluate all non-numeric fields to numeric values (https://wiki.apache.org/solr/FunctionQuery#field). The ord function produces a number that represents the numeric index of the query term in "lucene index order (lexicographically ordered by unicode value), starting at 1". This means the resulting boost value is arbitrarily determined by the position of the search term in a list of all values for that field (sorted alphabetically). The end result of using this approach:

Here's an alternate version that uses the qf parameter as proposed: http://localhost:8983/solr/select?q.alt=test%20query&defType=dismax&fl=*,score&qf=post_title^2%20post_content^1.2&wt=json&debugQuery=true

With this query the scores go back down to a normal range, but if you look at the debug.parsedquery it's not quite right:

+(text:test text:queri)

It's not actually using our field boosts (it's using the default search field, as specified in the Solr schema). The culprit is the use of q.alt instead of q, a behavior that was introduced by adding this line in https://github.com/pantheon-systems/solr-power/pull/143/files.

Switching from q.alt to q fixes it: http://localhost:8983/solr/select?q=test%20query&defType=dismax&fl=*,score&qf=post_title^2%20post_content^1.2&wt=json&debugQuery=true

Which results in a debug.parsedquery of:

+((DisjunctionMaxQuery((post_title:test^2.0 | post_content:test^1.2)) DisjunctionMaxQuery((post_title:queri^2.0 | post_content:queri^1.2)))~2) ()

This produces results with scores in normal ranges, and if you look at the result fields the order can be linked back to our field boosts (i.e. if you change the boost factors, the scores change in expected ways).

The other thing to note here is that this query is only searching post title and content fields. With qf you are telling it which fields to search, so you need to add back the default search field (the multi-valued text field): http://localhost:8983/solr/select?q=test%20query&defType=dismax&fl=*,score&qf=text%20post_title^2%20post_content^1.2&wt=json&debugQuery=true

+((DisjunctionMaxQuery((text:test | post_title:test^2.0 | post_content:test^1.2)) DisjunctionMaxQuery((text:queri | post_title:queri^2.0 | post_content:queri^1.2)))~2) ()

This produced the best results in my testing.

Here's what I think should happen:

  1. Switch from q.alt to q for all queries except the ones that need to use q.alt. I know it's needed to calculate the index stats; I'm not sure if anything else introduced in https://github.com/pantheon-systems/solr-power/pull/143 needs it.
  2. Switch from bf to qf for implementing field boosts. Example: text post_title^2 post_content^1.2
  3. Consider removing the sort function that forces exact matches in post title to the top (introduced in https://github.com/pantheon-systems/solr-power/pull/272). Fixing the post title / content boosts should accomplish this in a way that doesn't disregard relevancy.

I'd be happy to submit a PR, but 1 is a bit of a blocker for me as I really don't understand why the switch to q.alt was made in https://github.com/pantheon-systems/solr-power/pull/143. I'd also appreciate it if someone with more Solr experience weighed in here.

danielbachhuber commented 6 years ago

I'd be happy to submit a PR, but 1 is a bit of a blocker for me as I really don't understand why the switch to q.alt was made in #143. I'd also appreciate it if someone with more Solr experience weighed in here.

I imagine it was a "tried something else and it worked" implementation. Because it's only used for fetching stats, we could run an entirely separate Solr query for it, and ignore the code path used for the main Solr query.

I'd be fine with a pull request including your suggested changes as long as they also include a reasonable amount of test coverage around them.

mgburns commented 6 years ago

Looks like this is a bit more involved than I originally thought. The reason that you're using q.alt instead of q because of the way all the filter queries are set up in SolrPower_WP_Query::build_query(). For example, here's the query it builds to filter by post type:

  star wars AND(post_type:post)

The dismax parser, being geared towards humans, doesn't support that syntax for filtering. The standard query parser, which is used for q.alt, does and is therefore required.

I think the bigger question is whether we should use the standard query parser or dismax by default. My feeling is dismax, for all the reasons described in the docs overview. It's less powerful, but 99% of search users won't know the difference. It's also more forgiving / less error prone, which feels appropriate as well.

I don't currently have the bandwidth to PR a switch in default parser, but I imagine the approach would be switching to filter queries for post type / status / tax / meta filters (which I believe will also be more performant).

The one thing I do recommend is removing the $dismax->setBoostFunctions() call / filter that started this issue. It's not doing what it says it's doing, and completely throws off relevancy for sites with more than a handful of posts. (The unit tests don't have enough posts to demonstrate this).

defunctl commented 5 years ago

Can confirm that modifying solr_boost_query with all kinds of different variations never changes the search results.

jimkeller commented 4 years ago

I don't think we should be using the q.alt at all - after a lot of debugging it doesn't seem to support the boost syntax. Also strongly recommend removing the post_title hack.

I think using q.alt was a misunderstanding of the note under "selecting all documents" at the following link - it only applies if you're trying to use standard solr syntax in a dismax query: https://solarium.readthedocs.io/en/stable/queries/select-query/building-a-select-query/components/dismax-component/

To get boosting working, I had to do the following. I'm planning on following up with a PR with these changes:

Comment out the part in class-solrpowr-api.php that uses q.alt:

function dismax_query( $query, $dismax ) {
  // $dismax->setQueryAlternative( $query->getQuery() );
  // $query->setQuery( '' );

  return $query;
}

Comment out the post_title hack, especially since I've incorporated into my wordpress post type a "boosted keywords" field that I'm using in instances where I want the client to be able to influence search results. This is in class-solrpower-wp-query.php on line 185:

if ( $query->get( 's' ) ) {
  //$extra['sort_before'][ "query({!dismax qf=post_title v='\"" . addslashes( $query->get( 's' ) ) . "\"'})" ] = 'desc';
}

Get rid of the default boostFunctions because it's the incorrect way to set them (this is in a custom plugin):

add_filter( 'solr_boost_query', 
  function ( $solr_boost_query ) {
    return false; 
   }
, 10, 1 );

Add them in manually:

add_filter( 'solr_dismax_query', 
  function( $dismax, $query ) {
    $dismax->setQueryFields('text^1.2 post_title^2 boosted_keywords^5');
  }
, 10, 2 );
mrobit commented 4 years ago

@jimkeller I have no idea how any of this works, but this made our search queries work a lot more reliably. Kudos.

jimkeller commented 3 years ago

I created a pull request at: https://github.com/pantheon-systems/solr-power/pull/495 to workaround this problem. The pull request allows you to define a constant SOLRPOWER_DISABLE_QUERY_ALT that will circumvent the code that causes the boost not to work.

So simply put the following in wp-config.php: define('SOLRPOWER_DISABLE_QUERY_ALT', true);

Note that you will still have to set your own boost filter as shown:

add_filter( 'solr_dismax_query', 
  function( $dismax, $query ) {
    $dismax->setQueryFields('text^1.2 post_title^2');
  }
, 10, 2 );

Also attached is a patch against the master branch. wp-solrpower-disable-query-alt.patch.txt

danielbachhuber commented 3 years ago

Shipped the constant with v2.2.4. Thanks again for your pull request, @jimkeller !

mistcat commented 3 years ago

So exciting to see this get sorted out! Great work @jimkeller & @mgburns ! I'm sorry I didn't have more time to help contribute a fix myself, but am glad to see all the lovely community support. Let me know if I can help roadtest anything or contribute to documentation for it!