Open crowbot opened 6 years ago
This should really return direct matches at the top of the list, even if they're not quoted (https://github.com/mysociety/alaveteli/issues/1179#issuecomment-68866540).
Ok, so it looks like part of what's happening in the first case might be that it's searching on the notes field as well as the name so there are more matches for "City College Birmingham" and it ends up at the top with a higher weighting. (We're sorting on relevance by default.)
The fix might be as simple as:
def select_authority
…
- @xapian_requests = typeahead_search(query, :model => PublicBody)
+ @xapian_requests = typeahead_search(query,
+ model: PublicBody,
+ prefix: 'name')
Which should ~both look in the correct field and~ [1] sort by value first, then relevance and get us back to:
[1] looks like I've either got the field name wrong here or made some other syntax error - todo: investigate the document structure in the index in order to test prioritising and/or searching only on the name field so that we can decide whether that's useful
Ah, looks like we're building all the PublicBody
model fields into a single Term List which explains why our results are off (we also have V for variety and U for tag (I've not got tags set up which is why they're not appearing below)
e.g. (my dev box)
Term List for record #36: 1 2012 IPublicBody-11 MPublicBody Vauthority Za Zamp Zaugust Zbirmingham Zbodi Zciti Zcolleg Zcom Zform Zhref Zhttps Zmerg Zon Zsouth Zsouth_city_college_birmingham Zto Zwhatdotheyknow Zwith Zwww
reindexing with name
added to terms in the model looks more promising:
:terms => [
[:variety, 'V', "variety"],
+ [:name, 'N', "name"],
[:tag_array_for_search, 'U', "tag"]
],
Term List for record #10: 1 2012 IPublicBody-11 MPublicBody NCity College Birmingham Vauthority Za Zamp Zaugust Zbirmingham Zbodi Zciti Zcolleg Zcom Zform Zhref Zhttps Zmerg Zon Zsouth Zsouth_city_college_birmingham Zto Zwhatdotheyknow Zwith Zwww
...still to do - work out how to build the search query to use the name field (cheap alternative - reindex without the notes field)
Ok, this seems to work (although ActsAsXapian
doesn't implement weighting - although it does have access to it via the standard ruby bindings - so implementing this may not be entirely straightforward).
:terms => [
[:variety, 'V', "variety"],
+ [:name_array_for_search, 'N', 'name'],
[:tag_array_for_search, 'U', "tag"]
],
...
+ def name_array_for_search
+ [
+ name.downcase,
+ name.downcase.split,
+ name,
+ name.split
+ ].flatten
+ end
So when we reindex we get:
Data for record #106:
PublicBody-107
Term List for record #106: IPublicBody-107 MPublicBody NCardiff NCardiff Council NCouncil Ncardiff Ncardiff council Ncouncil Ucardiff Ucoins Ucoins:W552 Ulocal_council Vauthority Zcardiff Zcouncil cardiff council
So we should end up saying - "perform the original search but also look at the data we extracted from the name field and show it above the other results if you find anything".
e.g.
model_query = Xapian::Query.new(Xapian::Query::OP_OR, model_classes.map { |mc| "M#{ mc }" })
user_query = $query_parser.parse_query(
query_string,
Xapian::QueryParser::FLAG_BOOLEAN | Xapian::QueryParser::FLAG_PHRASE |
Xapian::QueryParser::FLAG_LOVEHATE |
Xapian::QueryParser::FLAG_SPELLING_CORRECTION)
# add 20% weight to the N (name) term prefix - this effectively treats it as a phrase query/direct match
user_n_query = Xapian::Query.new("N#{ query_string }")
user_query2 = Xapian::Query.new(Xapian::Query::OP_SCALE_WEIGHT, user_n_query, 1.2)
# join it with our original as an OR search
user_query_combined = Xapian::Query.new(Xapian::Query::OP_OR, user_query, user_query2)
# per original example, join as an AND to only search the PublicBody model data
query = Xapian::Query.new(Xapian::Query::OP_AND, model_query, user_query_combined)
# show the actual query being sent to Xapian
p query.description
# "Xapian::Query((MPublicBody AND ((cardiff:(pos=1) AND council:(pos=2)) OR 1.1999999999999999556 * Ncardiff council)))"
limit = options[:limit] || 25
offset = options[:offset] || 0
sort_by_ascending = true
$enquire.query = query
$enquire.collapse_key = 1
@matches = $enquire.mset(offset, limit, 100)
Original result ordering (on my dev box, 106 is the one we want to appear first):
[
#<Xapian::Match:0x0000000ae8c718 @docid=2842>,
@rank=0, @weight=9.846239003879624, @collapse_count=0, @percent=100>,
#<Xapian::Match:0x0000000ae8c6a0 @docid=106>,
@rank=1, @weight=9.579644964708582, @collapse_count=0, @percent=97>,
#<Xapian::Match:0x0000000ae8c600 @docid=2857>,
@rank=2, @weight=8.348687831525604, @collapse_count=0, @percent=84>,
#<Xapian::Match:0x0000000ae8c588 @docid=2856>,
@rank=3, @weight=7.668920438865873, @collapse_count=0, @percent=77>,
#<Xapian::Match:0x0000000ae8c4e8 @docid=15631>,
@rank=4, @weight=7.554302187758436, @collapse_count=1, @percent=76>,
#<Xapian::Match:0x0000000ae8c470 @docid=15633>,
@rank=5, @weight=7.554302187758436, @collapse_count=3, @percent=76>,
#<Xapian::Match:0x0000000ae8c3f8 @docid=3480>,
@rank=6, @weight=7.44743991794151, @collapse_count=0, @percent=75>,
#<Xapian::Match:0x0000000ae8c380 @docid=1457>,
@rank=7, @weight=5.539167927279307, @collapse_count=0, @percent=56>,
#<Xapian::Match:0x0000000ae8c2e0 @docid=2540>,
@rank=8, @weight=3.3380500227472685, @collapse_count=0, @percent=33>
]
New result ordering (again, 106 is the one we're looking for):
#<Xapian::Match:0x0000000a78d070 @docid=106,
@rank=0, @weight=21.67781737965247, @collapse_count=0, @percent=100>,
#<Xapian::Match:0x0000000a78cff8 @docid=2842,
@rank=1, @weight=9.614087170710766, @collapse_count=0, @percent=44>,
#<Xapian::Match:0x0000000a78cd78 @docid=2857,
@rank=2, @weight=8.020960426820963, @collapse_count=0, @percent=37>,
#<Xapian::Match:0x0000000a78ccd8 @docid=2856,
@rank=3, @weight=7.431409141698338, @collapse_count=0, @percent=34>,
#<Xapian::Match:0x0000000a78cc10 @docid=15634,
@rank=4, @weight=7.335637435965685, @collapse_count=3, @percent=33>,
#<Xapian::Match:0x0000000a78c990 @docid=3480,
@rank=5, @weight=7.140584030278554, @collapse_count=0, @percent=32>,
#<Xapian::Match:0x0000000a78c8f0 @docid=15631,
@rank=6, @weight=6.93390727730389, @collapse_count=1, @percent=31>,
#<Xapian::Match:0x0000000a78c878 @docid=1457,
@rank=7, @weight=5.539515976038405, @collapse_count=0, @percent=25>,
#<Xapian::Match:0x0000000a78c800 @docid=2540,
@rank=8, @weight=3.3383779390258526, @collapse_count=0, @percent=15>
]
Noting down how to find xapian info for a particular public body:
# Grep for the database record id (`107`) to find the Xapian internal ID (`106`)
$ xapian-delve -d -t MPublicBody lib/acts_as_xapian/xapiandbs/production -1 | grep PublicBody-107
106 [PublicBody-107]
# Get data for the record
$ xapian-delve -d lib/acts_as_xapian/xapiandbs/production -r 106
Data for record #106:
PublicBody-107
Term List for record #106: IPublicBody-107 MPublicBody Ucardiff Ucoins Ucoins:W552 Ulocal_council Vauthority Zcardiff Zcouncil cardiff council
Note: Its just delve
on Ubuntu
Interesting. Currently can't get the direct match to the top in production. We have at least got it on to the first page though.
Even tried de-prioritising texts down to 200
with terms at 1
:
diff --git a/app/models/public_body.rb b/app/models/public_body.rb
index 03b009023..4cbe5222b 100644
--- a/app/models/public_body.rb
+++ b/app/models/public_body.rb
@@ -113,7 +113,7 @@ class PublicBody < ActiveRecord::Base
[:created_at_numeric, 1, "created_at", :number]
],
:terms => [
- [:name_for_search, 'N', 'name'],
+ [:name_for_search, 'N', 'name', 1],
[:variety, 'V', "variety"],
[:tag_array_for_search, 'U', "tag"]
],
diff --git a/lib/acts_as_xapian/acts_as_xapian.rb b/lib/acts_as_xapian/acts_as_xapian.rb
index 85adf4772..e9b92ef3c 100644
--- a/lib/acts_as_xapian/acts_as_xapian.rb
+++ b/lib/acts_as_xapian/acts_as_xapian.rb
@@ -960,7 +960,7 @@ module ActsAsXapian
# default to a relatively low weight to give us flexibility either
# side.
xapian_value = xapian_value(text, nil, true)
- ActsAsXapian.term_generator.index_text(xapian_value, 100)
+ ActsAsXapian.term_generator.index_text(xapian_value, 200)
end
end
Ahhh, I had an empty #name
attribute for the cy
translation for Cardiff Council. I had translations for other authorities, so not sure what happened there.
$ delve -d lib/acts_as_xapian/xapiandbs/development -r 106
Data for record #106:
PublicBody-107
Term List for record #106: IPublicBody-107 MPublicBody Ncardiff council Ucardiff Ucoins Ucoins:W552 Ulocal_council Vauthority Zcardiff Zcouncil cardiff council
$ xapian-delve -d lib/acts_as_xapian/xapiandbs/production -r 106
Data for record #106:
PublicBody-107
Term List for record #106: IPublicBody-107 MPublicBody Ncardiff council Ucardiff Ucoins Ucoins:W552 Ulocal_council Vauthority Zcaerdydd Zcardiff Zcouncil Zcyngor caerdydd cardiff council cyngor
Now I get Cardiff Council appearing third in the list in development:
$ delve -d lib/acts_as_xapian/xapiandbs/development -r 106
Data for record #106:
PublicBody-107
Term List for record #106: IPublicBody-107 MPublicBody Ncardiff council Ucardiff Ucoins Ucoins:W552 Ulocal_council Vauthority Zcaerdydd Zcardiff Zcouncil Zcyngor caerdydd cardiff council cyngor
Some searches are now not returning results when not capitalised:
This is not universal:
Just to note that when this originally occurred, it seemed like capitalization might not be the issue - might instead be some transient issue with search results.
Yeah, there's something strange going on.
If you search for "Home Office", exit the search (hit escape or click elsewhere), then search for "home office" you get a result:
Through the standard search, both "Home Office" and "home office" have the result appearing on page 3:
TypeaheadSearch
does find the authority:
search_results =
TypeaheadSearch.new('home office', model: PublicBody, per_page: 25, page: 1).
xapian_search.
results
pp search_results.map { |r| r[:model] }.map(&:name);nil
["International Policing Assistance Board",
"Police Remuneration Review Body",
"National Policing Improvement Agency",
"Tenant Services Authority",
"National Fraud Authority",
"Home Office Science Advisory Committee",
"Stabilisation Unit",
"Internet Watch Foundation",
"Independent Anti-Slavery Commissioner",
"The Police ICT Company Limited",
"National Crime Agency",
"Home Office", # <<<<<<<<<<<<<<<<<<<<<<<<<<
"Aire Valley Homes Leeds",
"East North East Homes Leeds",
"West North West Homes Leeds",
"Homes and Communities Agency",
"CityWest Homes Limited",
"Ascham Homes",
"Homes for Northumberland",
"Bolton at Home Limited",
"Homes for Islington Limited",
"Hounslow Homes",
"Lewisham Homes",
"Sandwell Homes",
"Sheffield Homes Limited"]
Ah…
%2520
: https://www.whatdotheyknow.com/alaveteli_pro/public_bodies?query=home%2520office%20
the result includes home office: https://www.whatdotheyknow.com/alaveteli_pro/public_bodies?query=home%20officeThough this doesn't seem to matter locally…
We're passing a URL encoded string to Xapian:
Xapian query (0.00184s) Search: home%20office
If we drop the encodeURIComponent
we end up with the search term that we'd expect:
+++ b/app/assets/javascripts/alaveteli_pro/authority_select.js
@@ -88,9 +88,10 @@
},
load: function(query, callback) {
if (!query.length) return callback();
$.getJSON(
searchUrl,
- { query: encodeURIComponent(query) },
+ { query: query },
function (data) {
callback(data)
$selectizeInstance.trigger('type')
What we send:
XHR finished loading: GET "http://10.10.10.30:3000/alaveteli_pro/public_bodies?query=home+ofice".
The Xapian search:
Xapian query (0.00073s) Search: home office
Alternatively, we can explicitly decode on the server side:
diff --git a/app/controllers/alaveteli_pro/public_bodies_controller.rb b/app/controllers/alaveteli_pro/public_bodies_controller.rb
index 45735f81f..cf27380dc 100644
--- a/app/controllers/alaveteli_pro/public_bodies_controller.rb
+++ b/app/controllers/alaveteli_pro/public_bodies_controller.rb
@@ -3,7 +3,8 @@ class AlaveteliPro::PublicBodiesController < AlaveteliPro::BaseController
include AlaveteliPro::PublicBodiesHelper
def index
- query = params[:query] || ""
+ query = URI.decode(params[:query]) || ""
xapian_results = typeahead_search(query, :model => PublicBody,
:exclude_tags => [ 'defunct',
'not_apply' ])
What we send:
XHR finished loading: GET "http://10.10.10.30:3000/alaveteli_pro/public_bodies?query=home%2520office".
The Xapian search:
Xapian query (0.00133s) Search: home office
There's obviously some weird double encoding happening somewhere in this chain.
Its not clear that this is the problem yet though.
Passing home%20office
to the search definitely seems to be preventing us finding the authority in production:
search_results =
TypeaheadSearch.new('home%20office', model: PublicBody, per_page: 25, page: 1).
xapian_search.
results
pp search_results.map { |r| r[:model] }.map(&:name);nil
["Aire Valley Homes Leeds",
"East North East Homes Leeds",
"West North West Homes Leeds",
"Homes and Communities Agency",
"CityWest Homes Limited",
"Ascham Homes",
"Homes for Northumberland",
"Bolton at Home Limited",
"Homes for Islington Limited",
"Hounslow Homes",
"Lewisham Homes",
"Sandwell Homes",
"Sheffield Homes Limited",
"Stockport Homes Limited",
"Air Accidents Investigation Branch",
"English Partnerships",
"Home-Grown Cereals Authority",
"National Policing Improvement Agency",
"Tristar Homes Limited",
"Your Homes Newcastle",
"Tenant Services Authority",
"Ashfield Homes",
"Barnet Homes",
"Berneslai Homes",
"Cheltenham Borough Homes"]
Works fine in development though – but haven't tried against full authority list yet.
Just noting here that I wonder whether we need to add optional weight values to each indexed texts
as well as terms
, e.g.:
acts_as_xapian :texts => [
[:name, 1],
[:short_name, 3],
[:notes, 50]
],
:values => [
# for sorting
[:created_at_numeric, 1, "created_at", :number]
],
:terms => [
[:name_for_search, 'N', 'name'],
[:variety, 'V', "variety"],
[:tag_array_for_search, 'U', "tag"]
],
:eager_load => [:translations]
I think the underlying issue here is https://github.com/mysociety/alaveteli/issues/723.
I should define the difference between a term
and a text
.
For terms, we add the text to be indexed directly to the Xapian::Document
:
doc.add_term(term[1] + value)
doc.add_posting(term[1] + value, 1, Integer(term[3])) if term[3]
For texts, we run the text to be indexed through a term_generator
:
ActsAsXapian.term_generator.index_text(xapian_value, 100)
The TermGenerator
FAQ covers this difference. The TL;DR is that running text through a TermGenerator
performs stemming and splitting words in to separate terms. Adding directly to the Xapian::Document
does none of these and adds the text as-is.
A user wrote to WhatDoTheyKnow after failing to find NHS England on the site.
A search for NHS England without quotes doesn't return NHS England on the first page of search results as might reasonably be expected. (It's currently on page 3)
Just to note that a user contacted the WhatDoTheyKnow support team regarding this matter.
"I tried searching for "Home Office" and "The Home Office" in the find an authority search box initially but it fails to find any home office category at all."
The user found an alternative route to accessing the Home Office authority page (via 'view all authorities' and then refining).
Just to note that we've had another WhatDoTheyKnow user contact us -this time asking that we add the Home Office as a new body.
I've reproduced a search on WDTK this morning and can see that Alaveteli is showing four pages of search results, with the Home Office being the last entry on the last page. 😥
Always.
Another notable entry is the DWP - searching for their acronym (which is their configured short name), the body "Department for Work and Pensions" (24) is fifth on the list. The first entry is Motability (64685), whose only reference to the DWP acronym appears to be contained in a hyperlink within the 'Notes'.
A user has contacted us to point out that searching for the ICO doesn't bring up the ICO. https://www.whatdotheyknow.com/body/list/all?utf8=%E2%9C%93&public_body_query=information+commissioner&commit=Search
A user has written asking us to add the CAA, having failed to find them listed.
I've not been able to re-produce the particular CAA problem.
At the moment, for me, a "everything" search for "CAA" brings up the CAA as the second authority:
The same search for "Civil Aviation Authority" does the same: Changing the order from "newest" to "most relevant" reverses the order:
The "make a request" search also returns it:
+1 I've dealt with a couple of user support cases from users who have not found the authority they were looking for when using the site search due to this issue.
+1 A user has written to us saying that a search for NHS England didn't return the authority in the first 5 pages of search results.
We didn't explicitly speak about it, but there's general feeling that removing notes from the indexable content would be worth doing to improve the situation here. We should try just removing notes from tags as a first step, as directly applied notes are often used to note alternate names for authorities.
In the first comment an issue there is because searches with capital letters in all search words default to sorting by date, not relevance. In all lower case, cardiff council is number 2, still not perfect, but still better, and also deals with the issue with DWP. I don't see why searches with capital letters in every word should change the default sort order, PR at https://github.com/mysociety/alaveteli/pull/8109 Doesn't help with bodies with notes being returned over the body itself, the other main issue raised I can see, but don't see any downsides and is certainly more consistent behaviour.