Open crowbot opened 10 years ago
Tagging as new so we can discuss this in sprint prioritisation. The lack of localised stemming seems to be hitting search in the Norwegian install. See https://alaveteli-dev.nuug.no/nb/search/Finansdepartementet/all versus https://alaveteli-dev.nuug.no/nn/search/finans/all
Having looked into this a bit more, it doesn't look like localised stemming would resolve the Norwegian example given above. 'finans' is not returning results because Xapian is using whole-word search. This is in contrast to the public authority search, which uses a double-wildcarded LIKE SQL query.
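To make the difference concrete, here is a rough plain-Ruby illustration of the two matching strategies. Neither function is Alaveteli or Xapian code; `whole_word_match?` and `substring_match?` are hypothetical stand-ins for a term lookup and a `LIKE '%...%'` query respectively.

```ruby
# Illustration only: plain-Ruby stand-ins for the two matching strategies.

# Whole-word matching: the query must equal one of the text's terms
# exactly (case-insensitively).
def whole_word_match?(query, text)
  terms = text.downcase.scan(/[[:alnum:]]+/)
  terms.include?(query.downcase)
end

# Substring matching: roughly what a double-wildcarded LIKE does.
def substring_match?(query, text)
  text.downcase.include?(query.downcase)
end

title = 'Finansdepartementet'

puts whole_word_match?('finans', title)              # false
puts whole_word_match?('Finansdepartementet', title) # true
puts substring_match?('finans', title)               # true
```

A stemmer only reduces a word to its stem, so it would not be expected to turn a long compound like this into 'finans'; that is why the prefix query still finds nothing under whole-word matching.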
Xapian has several built in stemmer languages (including all three Norwegian) http://xapian.org/docs/apidoc/html/classXapian_1_1Stem.html#a6c46cedf2047b159a7e4c9d4468242b1
Unless you're actively against it, I would like to tackle this ticket. It's actually pretty simple: the line in acts_as_xapian needs english swapped for a config option, something like SEARCH_ENGINE_STEMMER_LOCALE, which can default to english in the code if an invalid value is provided (i.e. one that does not match one of the available stemmers listed in the comment above).
Rebuild the index, and it works.
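As a minimal sketch of that fallback, assuming the config value arrives as a string or nil: `stemmer_language`, `KNOWN_STEMMERS` and `FALLBACK_STEMMER` are hypothetical names, and the list is only a subset of stemmer names, since the full set depends on the installed Xapian version.

```ruby
# Hypothetical subset of stemmer names; the real list depends on the
# installed Xapian version and should be taken from its documentation.
KNOWN_STEMMERS = %w[danish dutch english finnish french german italian
                    norwegian portuguese spanish swedish].freeze

FALLBACK_STEMMER = 'english'

# Returns the configured stemmer if it is recognised, otherwise the fallback.
def stemmer_language(configured)
  name = configured.to_s.strip.downcase
  KNOWN_STEMMERS.include?(name) ? name : FALLBACK_STEMMER
end

puts stemmer_language('french')   # french
puts stemmer_language('klingon')  # english
puts stemmer_language(nil)        # english
```

The result would then be passed to `Xapian::Stem.new` in place of the fixed english value.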
I'm looking at this right now for madada.fr, and the difference from just switching from english to fr is quite striking. It now considers cheval and chevaux (horse and horses) to be the same, but no longer chevalier (knight), which the english stemmer did treat as the same.
Likewise, ministere now matches ministère (with the accent), which has been one of the most frustrating search queries for us, as it's a common typo.
I suspect a few other sites which use accented languages are likely to see a similar benefit, at least if a stemmer exists (I'm thinking Sweden for instance).
I think @gbp has been drafting some thoughts for this, but while I'm looking, my thought was whether we could avoid another config value and set it via the AVAILABLE_LOCALES / DEFAULT_LOCALE keys. Looks like we can use the two-letter code there. I'd go for just dropping in the default locale with a fallback to english. At some point we could look at using the current locale for multilingual sites, but that might be a bit of a stretch.
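A rough sketch of deriving the stemmer from the default locale could look like the following. `stemmer_from_locale` and `STEMMER_CODES` are hypothetical, the code list is an assumed subset that should be checked against the Xapian docs, and the handling of region suffixes such as `fr_FR` is my own assumption about what the locale keys may contain.

```ruby
# Hypothetical subset of two-letter codes accepted as stemmer names;
# check the Xapian docs for the list your version supports.
STEMMER_CODES = %w[da de en es fi fr it nb nl nn no pt sv].freeze

# 'fr', 'fr_FR' and 'fr-FR' all map to 'fr'; anything unrecognised
# (or blank) falls back to 'english'.
def stemmer_from_locale(locale)
  code = locale.to_s.split(/[_-]/).first.to_s.downcase
  STEMMER_CODES.include?(code) ? code : 'english'
end

puts stemmer_from_locale('fr')     # fr
puts stemmer_from_locale('nb_NO')  # nb
puts stemmer_from_locale('cy')     # english
```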
Reusing the existing locales config value sounds good to me (probably because it's the same value for us), but I wonder if it might make things worse in some cases. I'm thinking of Belgium, where the two languages they use are substantially different: using the fr stemmer might worsen the experience with Dutch, which I think is closer to English than French. I guess for multilingual sites we'd ideally want one search index per language for public bodies, but that sounds like a much bigger job than what we're talking about here.
Yeah, that's a fair point. Could do something to choose the first from increasingly general options along the lines of:
# Pick the most specific configured locale that's valid
stemming_locale =
  [SEARCH_ENGINE_STEMMER_LOCALE, DEFAULT_LOCALE, 'english'].
    find { |locale| valid_stemming_locale?(locale) }

Xapian::Stem.new(stemming_locale)

def valid_stemming_locale?(locale)
  # check that the given string is valid for Xapian to use
end
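For anyone wanting to try the selection logic without Xapian installed, here is a self-contained version of the same idea. The allow-list is a hypothetical stand-in for a real check against Xapian, and `pick_stemming_locale` is an illustrative name rather than anything in the codebase.

```ruby
# Hypothetical allow-list; stands in for a real check against Xapian.
VALID_STEMMERS = %w[english en french fr dutch nl norwegian nb nn no].freeze

def valid_stemming_locale?(locale)
  VALID_STEMMERS.include?(locale.to_s.downcase)
end

# Most specific first: explicit override, then default locale, then English.
def pick_stemming_locale(override, default_locale)
  [override, default_locale, 'english'].find { |l| valid_stemming_locale?(l) }
end

puts pick_stemming_locale('nl', 'fr')  # nl
puts pick_stemming_locale(nil, 'fr')   # fr
puts pick_stemming_locale(nil, 'xx')   # english
```

With this ordering, a site like the Belgian one could set the override explicitly, while single-language sites would pick up the stemmer from their default locale without extra config.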
We shouldn't be using an English stemmer in other locales. https://github.com/mysociety/alaveteli/blob/rails-3-develop/lib/acts_as_xapian/acts_as_xapian.rb#L127