themarshallproject / hall-of-justice

Working with criminal justice data.
http://hallofjustice.sunlightfoundation.com/
BSD 3-Clause "New" or "Revised" License

match query doesn't support search syntax, simple_query doesn't support synonyms #24

Closed by dcloud 9 years ago

dcloud commented 9 years ago

From the note on multiword synonyms:

… because the query_string query supports a terse mini search-syntax, it could frequently lead to surprising results or even syntax errors.

One of the gotchas of this query involves multiword synonyms…

As part of this parsing process, it breaks up the query string on whitespace, and passes each word that it finds to the relevant analyzer separately. This means that your synonym analyzer will never receive a multiword synonym. Instead of seeing United States as a single string, the analyzer will receive United and States separately.

So our multi-word synonyms like "close management" will break because queries are parsed into tokens that won't match the synonym mapping.
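To make the failure concrete, here is a toy model in plain Python. It is not Elasticsearch code: `synonym_analyzer`, `analyze_whole`, and `analyze_per_word` are hypothetical helpers, and the `close management => shu` mapping is only an illustrative entry. It compares handing the analyzer the whole query with handing it one whitespace-separated word at a time, as the quoted note describes.

```python
# Illustrative synonym table; the real mapping lives in the index settings.
SYNONYMS = {"close management": "shu"}


def synonym_analyzer(text, synonyms):
    """Lowercase, tokenize on whitespace, and replace mapped phrases."""
    tokens = text.lower().split()
    out = []
    i = 0
    while i < len(tokens):
        # Try the longest phrase starting at position i first.
        for n in range(len(tokens) - i, 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in synonyms:
                out.append(synonyms[phrase])
                i += n
                break
        else:
            # No mapped phrase starts here; keep the token unchanged.
            out.append(tokens[i])
            i += 1
    return out


def analyze_whole(query, synonyms):
    """The analyzer sees the full query, so multiword entries can match."""
    return synonym_analyzer(query, synonyms)


def analyze_per_word(query, synonyms):
    """The query is split on whitespace first; each word is analyzed alone."""
    out = []
    for word in query.split():
        out.extend(synonym_analyzer(word, synonyms))
    return out
```

In this model, `analyze_whole("close management", SYNONYMS)` collapses the phrase to the synonym, while `analyze_per_word` returns the two words untouched because the analyzer never sees them together.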

How to fix?

dcloud commented 9 years ago

Mapping multi-word phrases to a single token works some of the time:

"use force,officer-involved shooting,death custody,arrest-related death=>useofforce",

That appears to work for "death in custody", which becomes useofforc. It does not work without the quotation marks. Oddly, "officer-involved shooting" and the other phrases do not get this behavior.

dcloud commented 9 years ago

Perhaps the solution is to do away with the synonym mapping and instead expand tags when they are put into the search index.

A prepare_tags method could make sure that items tagged with "close management" or "solitary housing unit" also get "shu", etc. whenever the index is created or updated. As long as the tag expansion happens at index time, all of the variants would be searchable; the downside is a larger search index.
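A minimal sketch of that idea, written as a standalone function so it can run outside the project. The `TAG_EXPANSIONS` table and the function signature are assumptions for illustration; in a Haystack `SearchIndex` the equivalent hook would be a `prepare_tags(self, obj)` method that reads the tags from `obj`.

```python
# Hypothetical expansion table: each tag maps to extra tags to index with it.
TAG_EXPANSIONS = {
    "close management": ["shu"],
    "solitary housing unit": ["shu"],
}


def prepare_tags(tags, expansions=TAG_EXPANSIONS):
    """Return the tags plus their expansions, normalized and de-duplicated."""
    expanded = []
    seen = set()
    for tag in tags:
        key = tag.strip().lower()
        # Index the tag itself first, then anything it expands to.
        for candidate in [key] + expansions.get(key, []):
            if candidate not in seen:
                seen.add(candidate)
                expanded.append(candidate)
    return expanded
```

With this in place, a search for "shu" would match an item tagged only "close management" without any query-time synonym handling, at the cost of storing the extra tags.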

dcloud commented 9 years ago

Wherein I record how I am dumb on GitHub for all to see. Essentially, there are additional methods that SimpleESBackend should override, because searches like q=juvenile justice are getting transformed into "(juvenile justice)" (query string syntax) and we are then passing that into a match search.
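For illustration, here is roughly how the two request bodies differ in the Elasticsearch query DSL. The helper names and the `text` field are assumptions, not the project's code. A `match` query analyzes its input as plain text, so the parentheses carry no grouping meaning there, while `query_string` parses them as syntax.

```python
def match_body(field, text):
    """Build a match query: the text is analyzed, not parsed as syntax."""
    return {"query": {"match": {field: text}}}


def query_string_body(field, text):
    """Build a query_string query: the text is parsed as mini search-syntax."""
    return {"query": {"query_string": {"default_field": field, "query": text}}}


# The same wrapped string ends up in two very different kinds of query.
wrapped = "(juvenile justice)"
as_match = match_body("text", wrapped)
as_query_string = query_string_body("text", wrapped)
```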

Stack for processing/building query that gets sent to build_search_kwargs:

dcloud commented 9 years ago

Possible solution in 66f4e469e629d8e91ed5725589a26e441a138aae.

dcloud commented 9 years ago

Calling this done.