mysociety / alaveteli

Provide a Freedom of Information request system for your jurisdiction
https://alaveteli.org
Other
389 stars 195 forks source link

Authority search results don't prioritise direct matches to quoted search terms #4426

Open crowbot opened 6 years ago

crowbot commented 6 years ago

dq6n_7gx0aaodtb

garethrees commented 6 years ago

This should really return direct matches at the top of the list, even if they're not quoted (https://github.com/mysociety/alaveteli/issues/1179#issuecomment-68866540).

screen shot 2017-12-28 at 10 22 29

gbp commented 6 years ago

Related to https://github.com/mysociety/alaveteli-professional/issues/349

lizconlan commented 6 years ago

Ok, so it looks like part of what's happening in the first case might be that it's searching on the notes field as well as the name so there are more matches for "City College Birmingham" and it ends up at the top with a higher weighting. (We're sorting on relevance by default.)

Local results without notes

screen shot 2018-01-30 at 17 42 26

Local results with the same notes as WDTK

screen shot 2018-01-30 at 17 44 15

lizconlan commented 6 years ago

The fix might be as simple as:

def select_authority
  …
-   @xapian_requests = typeahead_search(query, :model => PublicBody)
+   @xapian_requests = typeahead_search(query,
+                                       model: PublicBody,
+                                       prefix: 'name')

Which should ~both look in the correct field and~ [1] sort by value first, then relevance and get us back to:

screen shot 2018-01-30 at 18 04 03


[1] looks like I've either got the field name wrong here or made some other syntax error - todo: investigate the document structure in the index in order to test prioritising and/or searching only on the name field so that we can decide whether that's useful

lizconlan commented 6 years ago

Ah, looks like we're building all the PublicBody model fields into a single Term List which explains why our results are off (we also have V for variety and U for tag (I've not got tags set up which is why they're not appearing below)

e.g. (my dev box)

Term List for record #36: 1 2012 IPublicBody-11 MPublicBody Vauthority Za Zamp Zaugust Zbirmingham Zbodi Zciti Zcolleg Zcom Zform Zhref Zhttps Zmerg Zon Zsouth Zsouth_city_college_birmingham Zto Zwhatdotheyknow Zwith Zwww

reindexing with name added to terms in the model looks more promising:

  :terms => [
    [:variety, 'V', "variety"],
+   [:name, 'N', "name"],
    [:tag_array_for_search, 'U', "tag"]
  ],
Term List for record #10: 1 2012 IPublicBody-11 MPublicBody NCity College Birmingham Vauthority Za Zamp Zaugust Zbirmingham Zbodi Zciti Zcolleg Zcom Zform Zhref Zhttps Zmerg Zon Zsouth Zsouth_city_college_birmingham Zto Zwhatdotheyknow Zwith Zwww

...still to do - work out how to build the search query to use the name field (cheap alternative - reindex without the notes field)

lizconlan commented 6 years ago

Ok, this seems to work (although ActsAsXapian doesn't implement weighting - although it does have access to it via the standard ruby bindings - so implementing this may not be entirely straightforward).

1. Add a new term prefix to PublicBody

  :terms => [
    [:variety, 'V', "variety"],
+   [:name_array_for_search, 'N', 'name'],
    [:tag_array_for_search, 'U', "tag"]
  ],

  ...

+  def name_array_for_search
+    [
+      name.downcase,
+      name.downcase.split,
+      name,
+      name.split
+    ].flatten
+  end

So when we reindex we get:

Data for record #106:
PublicBody-107
Term List for record #106: IPublicBody-107 MPublicBody NCardiff NCardiff Council NCouncil Ncardiff Ncardiff council Ncouncil Ucardiff Ucoins Ucoins:W552 Ulocal_council Vauthority Zcardiff Zcouncil cardiff council

2. Add a subquery to the search (joined to the original user query with an OR operator), adding extra weight to the N term prefix lookup

So we should end up saying - "perform the original search but also look at the data we extracted from the name field and show it above the other results if you find anything".

e.g.

    model_query = Xapian::Query.new(Xapian::Query::OP_OR, model_classes.map { |mc| "M#{ mc }" })
    user_query = $query_parser.parse_query(
                           query_string,
                           Xapian::QueryParser::FLAG_BOOLEAN | Xapian::QueryParser::FLAG_PHRASE |
                           Xapian::QueryParser::FLAG_LOVEHATE |
                           Xapian::QueryParser::FLAG_SPELLING_CORRECTION)

    # add 20% weight to the N (name) term prefix - this effectively treats it as a phrase query/direct match
    user_n_query = Xapian::Query.new("N#{ query_string }")
    user_query2 = Xapian::Query.new(Xapian::Query::OP_SCALE_WEIGHT, user_n_query, 1.2)

    # join it with our original as an OR search
    user_query_combined = Xapian::Query.new(Xapian::Query::OP_OR, user_query, user_query2)

    # per original example, join as an AND to only search the PublicBody model data
    query = Xapian::Query.new(Xapian::Query::OP_AND, model_query, user_query_combined)

    # show the actual query being sent to Xapian
    p query.description
    # "Xapian::Query((MPublicBody AND ((cardiff:(pos=1) AND council:(pos=2)) OR 1.1999999999999999556 * Ncardiff council)))"

    limit = options[:limit] || 25
    offset = options[:offset] || 0
    sort_by_ascending = true

    $enquire.query = query
    $enquire.collapse_key = 1

    @matches = $enquire.mset(offset, limit, 100)

Original result ordering (on my dev box, 106 is the one we want to appear first):

[
  #<Xapian::Match:0x0000000ae8c718 @docid=2842>,
    @rank=0, @weight=9.846239003879624, @collapse_count=0, @percent=100>,
  #<Xapian::Match:0x0000000ae8c6a0 @docid=106>,
    @rank=1, @weight=9.579644964708582, @collapse_count=0, @percent=97>,
  #<Xapian::Match:0x0000000ae8c600 @docid=2857>,
    @rank=2, @weight=8.348687831525604, @collapse_count=0, @percent=84>,
  #<Xapian::Match:0x0000000ae8c588 @docid=2856>,
    @rank=3, @weight=7.668920438865873, @collapse_count=0, @percent=77>,
  #<Xapian::Match:0x0000000ae8c4e8 @docid=15631>,
    @rank=4, @weight=7.554302187758436, @collapse_count=1, @percent=76>,
  #<Xapian::Match:0x0000000ae8c470 @docid=15633>,
    @rank=5, @weight=7.554302187758436, @collapse_count=3, @percent=76>,
  #<Xapian::Match:0x0000000ae8c3f8 @docid=3480>,
    @rank=6, @weight=7.44743991794151, @collapse_count=0, @percent=75>,
  #<Xapian::Match:0x0000000ae8c380 @docid=1457>,
    @rank=7, @weight=5.539167927279307, @collapse_count=0, @percent=56>,
  #<Xapian::Match:0x0000000ae8c2e0 @docid=2540>,
    @rank=8, @weight=3.3380500227472685, @collapse_count=0, @percent=33>
] 

New result ordering (again, 106 is the one we're looking for):

  #<Xapian::Match:0x0000000a78d070 @docid=106,
    @rank=0, @weight=21.67781737965247, @collapse_count=0, @percent=100>,
  #<Xapian::Match:0x0000000a78cff8 @docid=2842,
    @rank=1, @weight=9.614087170710766, @collapse_count=0, @percent=44>,
  #<Xapian::Match:0x0000000a78cd78 @docid=2857,
    @rank=2, @weight=8.020960426820963, @collapse_count=0, @percent=37>,
  #<Xapian::Match:0x0000000a78ccd8 @docid=2856, 
    @rank=3, @weight=7.431409141698338, @collapse_count=0, @percent=34>,
  #<Xapian::Match:0x0000000a78cc10 @docid=15634,
    @rank=4, @weight=7.335637435965685, @collapse_count=3, @percent=33>,
  #<Xapian::Match:0x0000000a78c990 @docid=3480,
    @rank=5, @weight=7.140584030278554, @collapse_count=0, @percent=32>,
  #<Xapian::Match:0x0000000a78c8f0 @docid=15631,
    @rank=6, @weight=6.93390727730389, @collapse_count=1, @percent=31>,
  #<Xapian::Match:0x0000000a78c878 @docid=1457,
    @rank=7, @weight=5.539515976038405, @collapse_count=0, @percent=25>,
  #<Xapian::Match:0x0000000a78c800 @docid=2540,
    @rank=8, @weight=3.3383779390258526, @collapse_count=0, @percent=15>
]
garethrees commented 6 years ago

Noting down how to find xapian info for a particular public body:

# Grep for the database record id (`107`) to find the Xapian internal ID (`106`)
$ xapian-delve -d -t MPublicBody lib/acts_as_xapian/xapiandbs/production -1 | grep PublicBody-107
106 [PublicBody-107]

# Get data for the record
$ xapian-delve -d lib/acts_as_xapian/xapiandbs/production -r 106
Data for record #106:
PublicBody-107
Term List for record #106: IPublicBody-107 MPublicBody Ucardiff Ucoins Ucoins:W552 Ulocal_council Vauthority Zcardiff Zcouncil cardiff council

Note: Its just delve on Ubuntu

garethrees commented 6 years ago

Interesting. Currently can't get the direct match to the top in production. We have at least got it on to the first page though.

Development (Research Export)

screen shot 2018-03-22 at 14 52 57

Production

screen shot 2018-03-22 at 14 52 47

Even tried de-prioritising texts down to 200 with terms at 1:

diff --git a/app/models/public_body.rb b/app/models/public_body.rb
index 03b009023..4cbe5222b 100644
--- a/app/models/public_body.rb
+++ b/app/models/public_body.rb
@@ -113,7 +113,7 @@ class PublicBody < ActiveRecord::Base
                    [:created_at_numeric, 1, "created_at", :number]
                  ],
                  :terms => [
-                   [:name_for_search, 'N', 'name'],
+                   [:name_for_search, 'N', 'name', 1],
                    [:variety, 'V', "variety"],
                    [:tag_array_for_search, 'U', "tag"]
                  ],
diff --git a/lib/acts_as_xapian/acts_as_xapian.rb b/lib/acts_as_xapian/acts_as_xapian.rb
index 85adf4772..e9b92ef3c 100644
--- a/lib/acts_as_xapian/acts_as_xapian.rb
+++ b/lib/acts_as_xapian/acts_as_xapian.rb
@@ -960,7 +960,7 @@ module ActsAsXapian
           # default to a relatively low weight to give us flexibility either
           # side.
           xapian_value = xapian_value(text, nil, true)
-          ActsAsXapian.term_generator.index_text(xapian_value, 100)
+          ActsAsXapian.term_generator.index_text(xapian_value, 200)
         end
       end
garethrees commented 6 years ago

Ahhh, I had an empty #name attribute for the cy translation for Cardiff Council. I had translations for other authorities, so not sure what happened there.

$ delve -d lib/acts_as_xapian/xapiandbs/development -r 106
Data for record #106:
PublicBody-107
Term List for record #106: IPublicBody-107 MPublicBody Ncardiff council Ucardiff Ucoins Ucoins:W552 Ulocal_council Vauthority Zcardiff Zcouncil cardiff council

$ xapian-delve -d lib/acts_as_xapian/xapiandbs/production -r 106
Data for record #106:
PublicBody-107
Term List for record #106: IPublicBody-107 MPublicBody Ncardiff council Ucardiff Ucoins Ucoins:W552 Ulocal_council Vauthority Zcaerdydd Zcardiff Zcouncil Zcyngor caerdydd cardiff council cyngor

Now I get Cardiff Council appearing third in the list in development:

screen shot 2018-03-22 at 15 09 57

$ delve -d lib/acts_as_xapian/xapiandbs/development -r 106
Data for record #106:
PublicBody-107
Term List for record #106: IPublicBody-107 MPublicBody Ncardiff council Ucardiff Ucoins Ucoins:W552 Ulocal_council Vauthority Zcaerdydd Zcardiff Zcouncil Zcyngor caerdydd cardiff council cyngor
garethrees commented 6 years ago

For reference https://stackoverflow.com/questions/11966602/how-to-use-the-term-position-parameter-in-xapian-query-constructors

garethrees commented 6 years ago

Some searches are now not returning results when not capitalised:

screen shot 2018-04-03 at 09 32 31

screen shot 2018-04-03 at 09 32 13

This is not universal:

screen shot 2018-04-03 at 10 35 09

crowbot commented 6 years ago

Just to note that when this originally occurred, it seemed like capitalization might not be the issue - might instead be some transient issue with search results.

garethrees commented 6 years ago

Yeah, there's something strange going on.

If you search for "Home Office", exit the search (hit escape or click elsewhere), then search for "home office" you get a result:

screen shot 2018-04-03 at 09 32 58

garethrees commented 6 years ago

Through the standard search, both "Home Office" and "home office" have the result appearing on page 3:

garethrees commented 6 years ago

TypeaheadSearch does find the authority:

search_results =
  TypeaheadSearch.new('home office', model: PublicBody, per_page: 25, page: 1).
  xapian_search.
  results

pp search_results.map { |r| r[:model] }.map(&:name);nil
["International Policing Assistance Board",
 "Police Remuneration Review Body",
 "National Policing Improvement Agency",
 "Tenant Services Authority",
 "National Fraud Authority",
 "Home Office Science Advisory Committee",
 "Stabilisation Unit",
 "Internet Watch Foundation",
 "Independent Anti-Slavery Commissioner",
 "The Police ICT Company Limited",
 "National Crime Agency",
 "Home Office",                                # <<<<<<<<<<<<<<<<<<<<<<<<<<
 "Aire Valley Homes Leeds",
 "East North East Homes Leeds",
 "West North West Homes Leeds",
 "Homes and Communities Agency",
 "CityWest Homes Limited",
 "Ascham Homes",
 "Homes for Northumberland",
 "Bolton at Home Limited",
 "Homes for Islington Limited",
 "Hounslow Homes",
 "Lewisham Homes",
 "Sandwell Homes",
 "Sheffield Homes Limited"]
garethrees commented 6 years ago

Ah…

garethrees commented 6 years ago

Though this doesn't seem to matter locally…

screen shot 2018-04-03 at 11 22 33

garethrees commented 6 years ago

We're passing a URL encoded string to Xapian:

Xapian query (0.00184s) Search: home%20office
garethrees commented 6 years ago

If we drop the encodeURIComponent we end up with the search term that we'd expect:

+++ b/app/assets/javascripts/alaveteli_pro/authority_select.js
@@ -88,9 +88,10 @@
       },
       load: function(query, callback) {
         if (!query.length) return callback();
         $.getJSON(
           searchUrl,
-          { query: encodeURIComponent(query) },
+          { query: query },
           function (data) {
             callback(data)
             $selectizeInstance.trigger('type')

What we send:

XHR finished loading: GET "http://10.10.10.30:3000/alaveteli_pro/public_bodies?query=home+ofice".

The Xapian search:

 Xapian query (0.00073s) Search: home office
garethrees commented 6 years ago

Alternatively, we can explicitly decode on the server side:

diff --git a/app/controllers/alaveteli_pro/public_bodies_controller.rb b/app/controllers/alaveteli_pro/public_bodies_controller.rb
index 45735f81f..cf27380dc 100644
--- a/app/controllers/alaveteli_pro/public_bodies_controller.rb
+++ b/app/controllers/alaveteli_pro/public_bodies_controller.rb
@@ -3,7 +3,8 @@ class AlaveteliPro::PublicBodiesController < AlaveteliPro::BaseController
   include AlaveteliPro::PublicBodiesHelper

   def index
-    query = params[:query] || ""
+    query = URI.decode(params[:query]) || ""
     xapian_results = typeahead_search(query, :model => PublicBody,
                                              :exclude_tags => [ 'defunct',
                                                                 'not_apply' ])

What we send:

XHR finished loading: GET "http://10.10.10.30:3000/alaveteli_pro/public_bodies?query=home%2520office".

The Xapian search:

Xapian query (0.00133s) Search: home office
garethrees commented 6 years ago

There's obviously some weird double encoding happening somewhere in this chain.

Its not clear that this is the problem yet though.

garethrees commented 6 years ago

Passing home%20office to the search definitely seems to be preventing us finding the authority in production:

search_results =
  TypeaheadSearch.new('home%20office', model: PublicBody, per_page: 25, page: 1).
  xapian_search.
  results

pp search_results.map { |r| r[:model] }.map(&:name);nil
["Aire Valley Homes Leeds",
 "East North East Homes Leeds",
 "West North West Homes Leeds",
 "Homes and Communities Agency",
 "CityWest Homes Limited",
 "Ascham Homes",
 "Homes for Northumberland",
 "Bolton at Home Limited",
 "Homes for Islington Limited",
 "Hounslow Homes",
 "Lewisham Homes",
 "Sandwell Homes",
 "Sheffield Homes Limited",
 "Stockport Homes Limited",
 "Air Accidents Investigation Branch",
 "English Partnerships",
 "Home-Grown Cereals Authority",
 "National Policing Improvement Agency",
 "Tristar Homes Limited",
 "Your Homes Newcastle",
 "Tenant Services Authority",
 "Ashfield Homes",
 "Barnet Homes",
 "Berneslai Homes",
 "Cheltenham Borough Homes"]

Works fine in development though – but haven't tried against full authority list yet.

garethrees commented 6 years ago

Just noting here that I wonder whether we need to add optional weight values to each indexed texts as well as terms, e.g.:

  acts_as_xapian :texts => [
                   [:name, 1],
                   [:short_name, 3], 
                   [:notes, 50]
                 ],
                 :values => [
                   # for sorting
                   [:created_at_numeric, 1, "created_at", :number]
                 ],
                 :terms => [
                   [:name_for_search, 'N', 'name'],
                   [:variety, 'V', "variety"],
                   [:tag_array_for_search, 'U', "tag"]
                 ],
                 :eager_load => [:translations]
garethrees commented 6 years ago

I think the underlying issue here is https://github.com/mysociety/alaveteli/issues/723.

garethrees commented 6 years ago

I should define the difference between a term and a text.

For terms, we add the text to be indexed directly to the Xapian::Document:

doc.add_term(term[1] + value)
doc.add_posting(term[1] + value, 1, Integer(term[3])) if term[3]

For texts, we run the text to be indexed through a term_generator:

ActsAsXapian.term_generator.index_text(xapian_value, 100)

The TermGenerator FAQ covers this difference. The TL;DR is that running text through a TermGenerator performs stemming and splitting words in to separate terms. Adding directly to the Xapian::Document does none of these and adds the text as-is.

RichardTaylor commented 6 years ago

A user wrote to WhatDoTheyKnow after failing to find NHS England on the site.

A search for NHS England without quotes doesn't return NHS England on the first page of search results as might reasonably be expected. (It's currently on page 3)

MattK1234 commented 5 years ago

Just to note that a user contacted the WhatDoTheyKnow support team regarding this matter.

"I tried searching for "Home Office" and "The Home Office" in the find an authority search box initially but it fails to find any home office category at all."

The user found an alternative route to accessing the Home Office authority page (via 'view all authorities' and then refining).

mdeuk commented 5 years ago

Just to note that we've had another WhatDoTheyKnow user contact us -this time asking that we add the Home Office as a new body.

I've reproduced a search on WDTK this morning and can see that Alaveteli is showing four pages of search results, with the Home Office being the last entry on the last page. 😥

Steps to reproduce

  1. Enter the query "Home Office", sans quotes, into any search box (e.g. top of the portal, or, homepage)
  2. Notice that (at time of writing), Alaveteli generates search results for "Public authorities 1 to 5 of 18", with the first body being "The Police ICT Company Limited" (13024)
  3. Notice that the last body listed, on page 4, is the "Home Office" (7) image

Reproducibility

Always.

Another notable entry is the DWP - searching for their acronym (which is their configured short name), the body "Department for Work and Pensions" (24) is fifth on the list. The first entry is Motability (64685), whose only reference to the DWP acronym appears to be contained in a hyperlink within the 'Notes'.

FOIMonkey commented 2 years ago

A user has contacted us to point out that searching for the ICO doesn't bring up the ICO. https://www.whatdotheyknow.com/body/list/all?utf8=%E2%9C%93&public_body_query=information+commissioner&commit=Search

FOIMonkey commented 2 years ago

A user has written asking us to add the CAA, having failed to find them listed.

WilliamWDTK commented 2 years ago

I've not been able to re-produce the particular CAA problem.

At the moment, for me, a "everything" search for "CAA" brings up the CAA as the second authority: image

The same search for "Civil Aviation Authority" does the same: image Changing the order from "newest" to "most relevant" reverses the order: image

The "make a request" search also returns it: image

HelenWDTK commented 1 year ago

+1 I've dealt with a couple of user support cases from users who have not found the authority they were looking for when using the site search due to this issue.

HelenWDTK commented 1 year ago

+1 A user has written to us saying that a search for NHS England didn't return the authority in the first 5 pages of search results.

garethrees commented 11 months ago

We didn't explicitly speak about it, but there's general feeling that removing notes from the indexable content would be worth doing to improve the situation here. We should try just removing notes from tags as a first step, as directly applied notes are often used to note alternate names for authorities.

dracos commented 9 months ago

In the first comment an issue there is because searches with capital letters in all search words default to sorting by date, not relevance. In all lower case, cardiff council is number 2, still not perfect, but still better, and also deals with the issue with DWP. I don't see why searches with capital letters in every word should change the default sort order, PR at https://github.com/mysociety/alaveteli/pull/8109 Doesn't help with bodies with notes being returned over the body itself, the other main issue raised I can see, but don't see any downsides and is certainly more consistent behaviour.