Investigate sources of errors for boolean changes in sw-dev index

cmharlow commented 6 years ago

See feigenbaum analysis for this.

cbeer commented 6 years ago

Info from feigenbaum:

Bess asked me to look into the problem further, with an eye towards fixing the problem for Feigenbaum and our other existing uses for boolean-based searching.

tl;dr:

we can quickly do something kinda hacky that probably "works" and might not cause other problems (spending time on user acceptance testing would improve our confidence in this);
the next best thing is to add blacklight_advanced_search; it’ll take a little more developer effort but has fewer side effects, and brings it in line with SearchWorks behavior;
the best possible thing takes even more time and effort, but would actually fix many problems with the current boolean query behavior in edismax, and we might be able to trick interested partners into helping.
adding blacklight_advanced_search is probably the right way to get accurate boolean searches.

—

Prior to Solr 5.5, whenever the edismax query parser saw a boolean operator (AND/OR/NOT), it parsed the entire query as a boolean query, leading to some surprising results for some types of queries, particularly with the NOT operator. In Solr 5.5 (as part of https://issues.apache.org/jira/browse/SOLR-2649), they adjusted this behavior so the minimum match parameter (mm), if provided, still controls how the search terms are matched.

—

In https://jirasul.stanford.edu/jira/browse/VUF-1387, we collected a variety of queries that were affected by this bug, and presumed fixed by this change, including, e.g.:

digestive organs NOT disease

Pre-5.5, this was parsed as “digestive AND organs NOT disease”

Post-5.5, this is parsed (essentially) as a query for "digestive organs", but excluding those results with “disease”, probably better fitting the user’s expectation.

The collateral damage from this change is that boolean OR queries now also respect the mm parameter, so e.g.:

CRYSALIS PUFF OR MYCIN OR HASP

Pre-5.5, this was parsed as expected*, retrieving all documents with any of those terms.

Post-5.5, this is now parsed as (crysalis AND (any two of: puff or mycin or hasp)).
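One way to check how a particular Solr version parses these queries is to send them with debugQuery=true and read the parsed query out of the debug section of the response. A minimal sketch that only builds the request URL; the host, core name, and mm value are placeholders I made up, not our actual configuration:

```python
from urllib.parse import urlencode

# Hypothetical Solr endpoint; substitute the real host and core.
SOLR_SELECT_URL = "http://localhost:8983/solr/example-core/select"


def debug_query_url(user_query, mm):
    """Build a select URL asking edismax to explain how it parsed the query."""
    params = {
        "defType": "edismax",   # use the edismax query parser
        "q": user_query,        # the raw user query, boolean operators included
        "mm": mm,               # minimum-match value (placeholder, not our config)
        "debugQuery": "true",   # ask Solr to include debug info in the response
        "rows": 0,              # we only care about the parse, not the documents
    }
    return SOLR_SELECT_URL + "?" + urlencode(params)


url = debug_query_url("CRYSALIS PUFF OR MYCIN OR HASP", "2")
print(url)
```

Fetching that URL against a pre-5.5 and a post-5.5 index and comparing the parsed query in each response should show the difference described above.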

—

One quick fix we can apply to Feigenbaum/exhibits.stanford.edu is to add another parameter to the query we send to Solr, invoking the previous behavior when we detect a boolean OR-like query. The biggest danger of this approach is that it invokes the legacy behavior in ways that produce generally worse queries**. I’ve done some analysis using the SearchWorks index relevancy tests for boolean queries, and this fix appears to produce results similar to the previous behavior*** while not impacting boolean queries that were improved by SOLR-2649. Adding this to exhibits isn’t particularly difficult, but it deserves significant user acceptance testing to corroborate my query analysis work.
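To make the shape of the quick fix concrete, here is a rough sketch of "detect an OR-like query, then add an override parameter". The comment above does not name the parameter, so the override used here (mm=1) is purely an assumption for illustration, as are the function name and the uppercase-only detection rule:

```python
import re

# Assumption: only an uppercase, standalone OR counts as the operator.
OR_OPERATOR = re.compile(r"\bOR\b")
# Quoted phrases are ignored so that "this OR that" inside quotes is not detected.
QUOTED_PHRASE = re.compile(r'"[^"]*"')

# Hypothetical override; the real parameter is not specified in this thread.
LEGACY_OVERRIDE = {"mm": "1"}


def params_for_query(user_query, base_params):
    """Return Solr params, adding the override only for OR-like queries."""
    params = dict(base_params)  # copy so the caller's defaults are untouched
    params["q"] = user_query
    unquoted = QUOTED_PHRASE.sub(" ", user_query)
    if OR_OPERATOR.search(unquoted):
        params.update(LEGACY_OVERRIDE)
    return params


defaults = {"defType": "edismax", "mm": "3<-1"}  # placeholder defaults
or_params = params_for_query("CRYSALIS PUFF OR MYCIN OR HASP", defaults)
not_params = params_for_query("digestive organs NOT disease", defaults)
```

Queries without an OR operator, such as the NOT example, keep the default parameters, which is the property that should keep the SOLR-2649 improvements intact. Whether the detection rule is too broad or too narrow is exactly what the user acceptance testing would need to establish.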

Even with this quick fix, however, we should be aware that Solr’s support for parsing boolean algebra from user input has a very questionable history of producing useful and expected behavior:

http://robotlibrarian.billdueber.com/solr-and-boolean-operators/ - https://lucidworks.com/blog/2011/12/28/why-not-and-or-and-not/

In SearchWorks, it appears we’ve sidestepped this by only advertising boolean operators on the advanced search page where user input is munged using jrochkind’s query pre-parser to produce expected queries. Perhaps a better fix for Feigenbaum/Exhibits is to add advanced search and instruct Scott to use it for his boolean searching needs, as it will produce higher quality results for boolean queries and avoid introducing regressions in other boolean queries. Adding the advanced search gem to exhibits might need a little design work, and might take a day to get it to play nice within an exhibit context.

Possibly the best option, most likely to produce expected behavior, support arbitrary boolean queries from a simple search box, and insulate us from these woes, is to create a new solr query parser to parse boolean-based queries in sane and expected ways (maybe modeled off jrochkind’s work in ruby). With a decent test suite (which, I’d note, is lacking from the edismax parser), we can better protect ourselves from regressions like that introduced in SOLR-2649. This is significant work (although likely suited for an outside contractor), but would fix our woes and might be a good contribution to the Solr community. Until then, we can only pile on hacks to try to emulate certain boolean-like behaviors.

Finally, I have questions in to Bill Dueber and Erik Hatcher asking for their input on the problem and whether either of them is aware of alternative approaches, but I have yet to hear back from them.

Unless I get new information from Bill or Erik, implementing advanced search into exhibits seems like the best available option, as we can avoid introducing new unexpected behavior, and can generally improve the quality of boolean searching for users like Scott.

* Not quite true, either; in reality, this is “crysalis OR puff OR mycin OR hasp”; the user probably intended “(crysalis puff) OR mycin OR hasp”

** although hopefully mitigated by the fact that a very, very small set of users will ever make boolean queries

*** for CRYSALIS PUFF OR MYCIN OR HASP, we still can’t reproduce the previous behavior with the same query string; we’ll need to fix the query anyway to what was actually intended, e.g. “(CRYSALIS PUFF) OR MYCIN OR HASP”

cbeer commented 6 years ago

And we should note the different sw_index_tests that now /pass/ because of this change.

sul-dlss / sw_index_tests

Investigate sources of errors for boolean changes in sw-dev index #172