manastech / middleman-search

LunrJS-based search for Middleman
MIT License

Spanish stemming not working for accented words #23

Closed eemi closed 7 years ago

eemi commented 7 years ago

Guys, it's me again! While testing the gem, we realised that searching for jubilación and jubilacion returns different result sets. We have search.language = 'es' in our config.rb, and we are loading lunr.min, lunr.stemmer.support and lunr.es.

Looking at search.json, it seems that the stemmer strips the accents from words (like jubilación --> jubilacion), but then the function that should strip "ación" doesn't work as the word now has no accents. Does this make sense?
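The behaviour described above can be illustrated with a minimal sketch. The `toyStem` function below is a hypothetical stand-in, not the real lunr.es stemmer; it only assumes, as reported in this thread, that the suffix rule is written with the accent ("ación") and has no unaccented counterpart.

```javascript
// Toy suffix stripper, NOT the real lunr.es stemmer: it only knows two
// suffixes, chosen to mirror the behaviour described in this thread.
function toyStem(word) {
  // "ación" is listed with its accent; there is no unaccented "acion" rule.
  const suffixes = ["aciones", "ación"];
  for (const suffix of suffixes) {
    if (word.endsWith(suffix) && word.length > suffix.length) {
      return word.slice(0, -suffix.length);
    }
  }
  return word; // no rule matched, so the word is returned unchanged
}

const stemAccented = toyStem("jubilación");   // the suffix rule matches
const stemUnaccented = toyStem("jubilacion"); // accent already gone, no match

console.log(stemAccented, stemUnaccented);
```

Under these assumptions the two spellings end up as different index terms, which would explain why they return different result sets.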

I'm not registering a new pipeline function to remove tildes (as in the README example) because, as far as I understand, this should be done by the stemmer.

In a standalone use of Lunr.js/Lunr-languages at ANSES, we are not seeing this issue and that's why I'm reporting the issue here.

Any help is appreciated, kind regards. Emiliano

spalladino commented 7 years ago

Hey @eemi, thanks for the report. You seem to be right that the stemmer only strips "ación" and not "acion". Note, however, that on the ANSES site there is a unicodeNormalizer script being loaded that removes all tildes (and other non-traditional characters), similar to the remove_tildes function in the README; this is not being handled by the stemmer itself.

Since we are not loading the unicodeNormalizer in middleman-search, you'd need to manually register either it, or a similar function like remove_tildes if you want jubilacion and jubilación to yield the same search results.

That being said, we're unsure as to which should be the default behaviour:

  1. If we always remove tildes and other non-traditional characters from the text, the Spanish stemmer will fail to process suffixes like "ación".
  2. If we don't remove tildes, the stemmer will work right, but searching for jubilacion and jubilación yields different results, which is somewhat confusing.
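The trade-off between the two defaults can be sketched as follows. Both `stripAccents` and `toyStem` are hypothetical stand-ins (not the real unicodeNormalizer or lunr.es stemmer), and the sketch assumes the same pipeline is applied to indexed words and to queries.

```javascript
// Toy stand-ins, NOT the real unicodeNormalizer or lunr.es stemmer.
function stripAccents(word) {
  // Decompose accented letters, then drop the combining marks (U+0300 to U+036F).
  return word.normalize("NFD").replace(/[\u0300-\u036f]/g, "");
}

function toyStem(word) {
  for (const suffix of ["aciones", "ación"]) {
    if (word.endsWith(suffix) && word.length > suffix.length) {
      return word.slice(0, -suffix.length);
    }
  }
  return word;
}

// Option 1: always strip accents, then stem.
const stripThenStem = (word) => toyStem(stripAccents(word));
// Option 2: stem only, accents left as typed.
const stemOnly = (word) => toyStem(word);

// A query matches an indexed word when both reduce to the same term.
function matches(pipeline, indexedWord, query) {
  return pipeline(indexedWord) === pipeline(query);
}

const report = {};
for (const [name, pipeline] of [["stripThenStem", stripThenStem], ["stemOnly", stemOnly]]) {
  report[name] = {
    unaccentedQuery: matches(pipeline, "jubilación", "jubilacion"),
    pluralQuery: matches(pipeline, "jubilación", "jubilaciones"),
  };
}
console.log(report);
```

With these toy rules, option 1 matches the unaccented query but loses the singular/plural match, while option 2 does the opposite. A real stemmer has many more rules, so the exact cases that break may differ.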

Anyway, could you confirm if adding either a remove_tildes or the entire unicodeNormalizer to middleman-search solves the issue?

eemi commented 7 years ago

@spalladino thanks for jumping in, thanks for taking the time to check our site!

Adding remove_tildes doesn't change anything.

I'm trying to register all the unicodeNormalizer functions but it's a bit beyond my knowledge. I will check this as soon as I can and report back to you. Maybe this works on our site because unicodeNormalizer normalises the input to the stemmer and removes EVERY accent?

Again, thanks for your help.

spalladino commented 7 years ago

Yes, unicodeNormalizer is removing every accent before a token is processed. See line 10 of http://www.anses.gob.ar/js/lunr.unicodeNormalizer.js.

(In reply to Emiliano Castaño's comment of Feb 7, 2017, 11:05 PM.)

eemi commented 7 years ago

@spalladino I'm closing as this should be solved by using unicodeNormalizer.

Thanks again!

matiasgarciaisaia commented 7 years ago

I'm not sure we want to close this issue.

Maybe this one, but we definitely want to tackle this problem from within the extension.

From a high-level PoV, I can't think of a single case where I'd want full-text search to return different results for a word with accents than for the same word without them. I'm pretty sure you don't want that for Spanish; I'm less sure whether, say, replacing ç with c is desirable in Portuguese.

So I think it really makes sense to have that transformation done by the extension. Maybe provide an option to disable that (so jubilacion differs from jubilación), but I think the default should be to provide the same results.

If that's something we should do via unicodeNormalizer, or if it is a bug in lunr-languages, I don't know. But there's something fundamentally wrong to me if we don't match between corré and corre, even if those were different words with different meanings.

eemi commented 7 years ago

I agree that this is something that needs to be solved, but it might not be this gem's responsibility.

In order to include Unicode Normaliser in our pipeline, we did the following:

  1. Modified unicode normalizer to extend the tokenizer function rather than overwrite it completely (this was already merged into the repo)
  2. Forked this gem at https://gitlab.com/ANSES/middleman-search/ and included unicode normalizer inside the gem.
  3. Included unicode normalizer when loading the index.
  4. Dance, now it's working 🎉
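Step 1 above, extending a tokenizer instead of overwriting it, could look roughly like the sketch below. `baseTokenizer`, `stripAccents` and `withAccentStripping` are hypothetical names for illustration; this is not the actual unicodeNormalizer code or the Lunr tokenizer.

```javascript
// Hypothetical tokenizer standing in for the library's own: lowercases
// and splits on whitespace.
function baseTokenizer(text) {
  return text.toLowerCase().split(/\s+/).filter(Boolean);
}

function stripAccents(text) {
  // Decompose accented letters, then drop the combining marks.
  // Note: this also turns "ñ" into "n".
  return text.normalize("NFD").replace(/[\u0300-\u036f]/g, "");
}

// Extend instead of overwrite: keep the original tokenizer and call it
// on the normalized text, so its splitting rules stay in effect.
function withAccentStripping(tokenizer) {
  return (text) => tokenizer(stripAccents(text));
}

const tokenizer = withAccentStripping(baseTokenizer);
const tokens = tokenizer("Jubilación y pensión");
console.log(tokens);
```

Wrapping keeps whatever the original tokenizer already does, so only the accent handling changes.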

I'm a bit surprised that this is not handled at the lunr-languages level; I was expecting it to extend Lunr so that it can be used in Spanish, and the a_0 thing looks a lot like something related to replacing accents. Anyway, I'm not a developer, just an enthusiast.

As usual, thanks for the help. Count on us for anything.