eemi closed this issue 7 years ago
Hey @eemi, thanks for the report. You seem to be right that the stemmer only strips "ación" and not "acion". Note, however, that the ANSES site loads a unicodeNormalizer script that removes all tildes (and other non-traditional characters), similar to the remove_tildes function in the README; the stemmer itself does not handle this.
Since we are not loading unicodeNormalizer in middleman-search, you'd need to manually register either it or a similar function like remove_tildes if you want jubilacion and jubilación to yield the same search results.
That being said, we're unsure which should be the default behaviour:
1- If we always remove tildes and other non-traditional characters from the text, the Spanish stemmer will fail to process suffixes like "ación".
2- If we don't remove tildes, the stemmer will work correctly, but searching for jubilacion and jubilación yields different results, which is somewhat confusing.
Anyway, could you confirm whether adding either remove_tildes or the entire unicodeNormalizer to middleman-search solves the issue?
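The trade-off between the two options above can be sketched with a toy pipeline. This is only an illustration, not the real lunr or lunr-languages code: `removeTildes`, `toyStem` and `runPipeline` are hypothetical names, and the stemmer is assumed to know only the accented suffix "ación".

```javascript
// Toy stand-ins, NOT the real lunr / lunr-languages implementation.

// Decompose accented letters (NFD), then drop the combining marks.
// Note that this also turns "ñ" into "n".
function removeTildes(token) {
  return token.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
}

// Hypothetical stemmer that only recognises the accented suffix "ación".
const SUFFIX = 'aci\u00f3n';
function toyStem(token) {
  return token.endsWith(SUFFIX) ? token.slice(0, -SUFFIX.length) : token;
}

// Run a token through each step in order.
function runPipeline(steps, token) {
  return steps.reduce((current, step) => step(current), token);
}

const normalizeFirst = [removeTildes, toyStem]; // option 1
const stemFirst = [toyStem, removeTildes];      // option 2

console.log(runPipeline(normalizeFirst, 'jubilación')); // jubilacion (suffix not stripped)
console.log(runPipeline(normalizeFirst, 'jubilacion')); // jubilacion
console.log(runPipeline(stemFirst, 'jubilación'));      // jubil
console.log(runPipeline(stemFirst, 'jubilacion'));      // jubilacion (differs from jubil)
```

With accents removed first, both spellings end up as the same token but the suffix is never stripped; with the stemmer first, the suffix is stripped only for the accented spelling, so the two spellings no longer match.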
@spalladino thanks for jumping in, thanks for taking the time to check our site!
Adding remove_tildes doesn't change anything.
I'm trying to register all the unicodeNormalizer functions but it's a bit beyond my knowledge. I will check this as soon as I can and report back to you. Maybe this works on our site because unicodeNormalizer is normalising the stemmer and removing EVERY accent?
Again, thanks for your help.
Yes, unicodeNormalizer is removing every accent before a token is processed. See line 10 of http://www.anses.gob.ar/js/lunr.unicodeNormalizer.js.
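To illustrate why removing accents before a token is processed makes both spellings match, here is a minimal sketch of a toy inverted index. It is not lunr and not the actual unicodeNormalizer script; `stripAccents`, `buildIndex` and `search` are hypothetical helpers, and the sketch assumes the same normalisation is applied to indexed text and to queries.

```javascript
// Toy index, NOT lunr: maps each normalised token to the ids of the documents containing it.

// Lowercase, decompose accented letters (NFD), then drop the combining marks.
function stripAccents(token) {
  return token.toLowerCase().normalize('NFD').replace(/[\u0300-\u036f]/g, '');
}

function buildIndex(docs, normalize) {
  const index = new Map();
  for (const [id, text] of Object.entries(docs)) {
    for (const raw of text.split(/\s+/).filter(Boolean)) {
      const token = normalize(raw);
      if (!index.has(token)) index.set(token, new Set());
      index.get(token).add(id);
    }
  }
  return index;
}

// The query goes through the same normalisation as the indexed text.
function search(index, query, normalize) {
  return Array.from(index.get(normalize(query)) || []);
}

const docs = { a: 'Trámite de jubilación', b: 'Consulta de jubilacion' };
const index = buildIndex(docs, stripAccents);

console.log(search(index, 'jubilación', stripAccents)); // [ 'a', 'b' ]
console.log(search(index, 'jubilacion', stripAccents)); // [ 'a', 'b' ]
```

If the normalisation ran on only one side (indexing or querying), the two spellings would produce different tokens and the lookups would diverge again.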
@spalladino I'm closing as this should be solved by using unicodeNormalizer.
Thanks again!
I'm not sure we want to close this issue.
Maybe this one, but we definitely want to tackle this problem from within the extension.
From a high-level PoV, I can't think of a single case in which I have full-text search and want looking up a word with accents to yield different results than looking it up without accents. I'm pretty sure you don't want that for Spanish; I'm less sure whether, say, replacing ç with c is desired in Portuguese.
So I think it really makes sense to have that transformation done by the extension. Maybe provide an option to disable that (so jubilacion differs from jubilación), but I think the default should be to provide the same results.
If that's something we should do via unicodeNormalizer, or if it is a bug in lunr-languages, I don't know. But there's something fundamentally wrong to me if we don't match between corré and corre, even if those were different words with different meanings.
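The proposal above (fold accents by default, with a way to opt out) could look roughly like the following sketch. This is not an existing middleman-search or lunr option: `makeAccentFolder` and its `enabled` and `keep` parameters are hypothetical, with `keep` listing characters that should survive untouched, such as ç for Portuguese.

```javascript
// Hypothetical accent-folding step with an opt-out; NOT an existing
// middleman-search or lunr option.
function makeAccentFolder({ enabled = true, keep = '' } = {}) {
  // Characters listed in `keep` are passed through unchanged.
  const keepSet = new Set(Array.from(keep.normalize('NFC')));
  return function foldAccents(token) {
    if (!enabled) return token;
    return Array.from(token.normalize('NFC'))
      .map((ch) =>
        keepSet.has(ch)
          ? ch
          // Decompose the character and drop its combining marks.
          : ch.normalize('NFD').replace(/[\u0300-\u036f]/g, ''))
      .join('');
  };
}

const foldAll = makeAccentFolder();
const keepCedilla = makeAccentFolder({ keep: 'ç' });
const disabled = makeAccentFolder({ enabled: false });

console.log(foldAll('corré'));       // corre
console.log(foldAll('jubilación'));  // jubilacion
console.log(keepCedilla('coração')); // coraçao
console.log(disabled('jubilación')); // jubilación
```

Keeping the folding behind one function makes the default (same results with or without accents) easy to switch off, or to narrow for languages where a given character should stay distinct.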
I agree that this is something that needs to be solved, but it might not be this gem's responsibility.
In order to include Unicode Normaliser in our pipeline, we did the following:
I'm a bit surprised that this is not handled at the lunr-languages level. I was expecting it to extend Lunr so that it can be used in Spanish, and the a_0 thing looks a lot like something related to replacing accents. Anyway, I'm not a developer, just an enthusiast.
As usual, thanks for the help. Count on us for anything.
Guys, it's me again! While testing the gem, we realised that searching for jubilación and jubilacion returns different result sets. We have search.language = 'es' in our config.rb and we are loading lunr.min, lunr.stemmer.support and lunr.es. Looking at search.json, it seems that the stemmer strips the accents from words (like jubilación --> jubilacion), but then the function that should strip "ación" doesn't work as the word now has no accents. Does this make sense?
I'm not registering a new pipeline function to remove tildes (as in the example in the README) because, as far as I understand, this should be done by the stemmer.
In a standalone use of Lunr.js/Lunr-languages at ANSES, we are not seeing this issue and that's why I'm reporting the issue here.
Any help is appreciated, kind regards. Emiliano