mysociety / alaveteli

Provide a Freedom of Information request system for your jurisdiction
https://alaveteli.org

Xapian stemmer should be localised #1574

Open crowbot opened 10 years ago

crowbot commented 10 years ago

We shouldn't be using an English stemmer in other locales. https://github.com/mysociety/alaveteli/blob/rails-3-develop/lib/acts_as_xapian/acts_as_xapian.rb#L127

crowbot commented 9 years ago

Tagging as new so we can discuss this in sprint prioritisation. The lack of localised stemming seems to be hitting search in the Norwegian install. See https://alaveteli-dev.nuug.no/nb/search/Finansdepartementet/all versus https://alaveteli-dev.nuug.no/nn/search/finans/all

crowbot commented 9 years ago

Having looked into this a bit more, it doesn't look like localised stemming would resolve the Norwegian example given above. 'finans' is not returning results because Xapian is using whole-word search. This is in contrast to the public authority search, which uses a SQL LIKE query with a wildcard on both sides.

garethrees commented 9 years ago

Xapian has several built-in stemmer languages (including all three Norwegian variants): http://xapian.org/docs/apidoc/html/classXapian_1_1Stem.html#a6c46cedf2047b159a7e4c9d4468242b1

laurentS commented 9 months ago

Unless you're actively against it, I would like to tackle this ticket. It's actually pretty simple: the line in acts_as_xapian needs the hard-coded english swapped for a config option, something like SEARCH_ENGINE_STEMMER_LOCALE, which can default to english in the code if an invalid value is provided (i.e. one that does not match one of the available stemmers listed in the comment above). Rebuild the index, and it works.

I'm looking at this right now for madada.fr, and the difference from just switching from english to fr is quite striking. It now treats cheval and chevaux (horse and horses) as the same word, but no longer treats chevalier (knight) as the same, which it does with the english stemmer. Likewise, ministere now matches ministère (with the accent), which has been one of the most frustrating search queries for us, as it's a common typo. I suspect a few other sites which use accented languages are likely to see a similar benefit, at least if a stemmer exists (I'm thinking Sweden for instance).

garethrees commented 9 months ago

I think @gbp has been drafting some thoughts for this, but while I'm looking, my thought was whether we could avoid another config value and set it via the AVAILABLE_LOCALES / DEFAULT_LOCALE keys. Looks like we can use the two-letter code there. I'd go for just dropping in the default locale with a fallback to english. At some point we could look at using the current locale for multilingual sites, but that might be a bit of a stretch.
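As a rough sketch of the two-letter-code idea: the helper name `stemmer_code_for` is hypothetical, and it assumes locale strings look like `fr`, `fr_BE` or `nb-NO`. It only normalises the string; it does not check that Xapian actually ships a stemmer for the resulting code.

```ruby
# Hypothetical helper (not part of Alaveteli): reduce a locale string
# such as "fr_BE" to the two-letter language code that could then be
# handed to Xapian::Stem.new. Falls back when no such code is found.
def stemmer_code_for(locale, fallback = 'english')
  # Take the part before any "_" or "-" region suffix.
  code = locale.to_s.split(/[_-]/).first.to_s.downcase
  code.match?(/\A[a-z]{2}\z/) ? code : fallback
end

puts stemmer_code_for('fr_BE')
puts stemmer_code_for(nil)
```

Something like this would still need to be combined with a validity check before the value reaches Xapian.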

laurentS commented 9 months ago

Reusing the existing Locales config value sounds good to me, probably because it's the same value for us. But I wonder if it might make things worse in some cases? I'm thinking of Belgium, where the 2 languages they use are substantially different, and I wonder if using the fr stemmer might worsen the experience with the Dutch language, which I think is closer to English than French. I guess for multilingual sites, we'd ideally want to have one search index per language for public bodies, but that sounds like a much bigger job than what we're talking about here.

garethrees commented 9 months ago

Yeah, that's a fair point. Could do something to choose the first from increasingly general options along the lines of:

# Pick the most specific configured locale that's valid.
# `find` returns the first matching element (`select` would return an
# Array, which Xapian::Stem.new cannot take).
stemming_locale =
  [SEARCH_ENGINE_STEMMER_LOCALE, DEFAULT_LOCALE, 'english'].
  find { |locale| valid_stemming_locale?(locale) }

Xapian::Stem.new(stemming_locale)

def valid_stemming_locale?(locale)
  # check that the given string is valid for Xapian to use
end
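
One possible way to fill in the stub, as a sketch only: compare the candidate against a list of stemmer names. `KNOWN_STEMMERS` below is a hand-written, partial list for illustration, not Xapian's authoritative list, and `pick_stemming_locale` is a hypothetical wrapper that takes the config values as plain arguments so the snippet runs without Xapian installed.

```ruby
# Illustrative, partial list of stemmer names and two-letter codes.
# Assumption: the real list should be taken from the Xapian docs
# linked earlier in this thread rather than hard-coded like this.
KNOWN_STEMMERS = %w[
  danish da dutch nl english en french fr german de
  norwegian no nb nn spanish es swedish sv
].freeze

def valid_stemming_locale?(locale)
  KNOWN_STEMMERS.include?(locale.to_s.downcase)
end

# Return the first valid candidate, ordered from most to least
# specific, falling back to 'english' if none of them is usable.
def pick_stemming_locale(*candidates)
  candidates.find { |locale| valid_stemming_locale?(locale) } || 'english'
end

puts pick_stemming_locale(nil, 'fr', 'english')
puts pick_stemming_locale('klingon', nil)
```

The explicit `|| 'english'` means an unset or misspelled config value never reaches `Xapian::Stem.new` as nil.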