sillsdev / web-languageforge

Language Forge: Online Collaborative Dictionary Building on the Web and Phone.
https://languageforge.org
MIT License
44 stars, 29 forks

Searching with diacritics #64

Closed EThornton closed 7 years ago

EThornton commented 7 years ago

The search does not work with Unicode accented vowels. I will check for compatibility with other characters. Glottal stops work fine.

[image: screen shot 2017-01-19 at 10 16 20 am]

It is easy enough to just search for the non-diacritic parts of words.

[image: screen shot 2017-01-19 at 10 16 47 am]

A further look reveals that if I copy and paste directly from the entry, the accented a comes through as a base letter plus a combining diacritic, and the search works.

[image: screen shot 2017-01-19 at 10 21 52 am]

Thanks! Elliot

megahirt commented 7 years ago

Hi Elliot,

Thanks for finding this one and documenting it well. I've seen this in the wild before with one other user, so this confirms the problem for me. We'll keep you updated on the fix.

Chris

rmunn commented 7 years ago

@EThornton - What keyboard layout are you using to type the Unicode-accented vowels?

@megahirt - I suspect this is an issue of NFC vs. NFD normalization, and that Elliot's keyboard is producing NFC when he types accented characters like á. I looked at Elliot's project and found that most entries were stored in NFD form, but a few entries were stored in NFC instead. Typing á (U+00E1 LATIN SMALL LETTER A WITH ACUTE) into the search bar produced 5 matches, but when I copied and pasted the decomposed form (U+0061 LATIN SMALL LETTER A plus U+0301 COMBINING ACUTE ACCENT) into the search bar, I got 1669 matches. FLEx stores Unicode data in NFD, and we need to make sure that LF is doing the same.
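To make the NFC/NFD difference concrete, here is a small JavaScript sketch. It is only an illustration: the helper name `codePoints` and the sample word are made up for this example and are not taken from Language Forge or from Elliot's project.

```javascript
// Two spellings of the same visible character.
const composed = "\u00E1";    // NFC: U+00E1 LATIN SMALL LETTER A WITH ACUTE
const decomposed = "a\u0301"; // NFD: U+0061 plus U+0301 COMBINING ACUTE ACCENT

// Hypothetical helper: list the code points of a string as U+XXXX labels.
function codePoints(text) {
  return Array.from(text, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  );
}

console.log(codePoints(composed));   // [ 'U+00E1' ]
console.log(codePoints(decomposed)); // [ 'U+0061', 'U+0301' ]

// The two strings are not equal, so a plain substring search for the
// composed form misses text that was stored in decomposed form.
const storedWord = "ka\u0301na"; // made-up entry, stored in NFD
console.log(composed === decomposed);         // false
console.log(storedWord.includes(composed));   // false
console.log(storedWord.includes(decomposed)); // true
```

This would be consistent with the symptom described above: a keyboard that emits the composed form only finds the few entries that happen to be stored composed.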

rmunn commented 7 years ago

Incidentally, the five matches I got when typing á (the composed form) into the search bar were:

I haven't yet found a common pattern among these five words other than the fact that they all use composed á (U+00E1) instead of decomposed á (U+0061 U+0301), but I'll continue looking into it.

megahirt commented 7 years ago

@rmunn you are correct that this is a Unicode normalization form issue. Here's the broad-strokes solution:

My initial thought is that we should use String.prototype.normalize() in JavaScript to accomplish this: https://developer.mozilla.org/en/docs/Web/JavaScript/Reference/Global_Objects/String/normalize
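A minimal sketch of that idea, assuming both the search query and the entry text are normalized to NFD before comparison. The function names (`toNfd`, `entryMatches`, `searchEntries`) and the sample entries are hypothetical; this is not the code that was actually shipped in Language Forge.

```javascript
// Hypothetical helper: convert text to NFD, the form FLEx is said to store.
function toNfd(text) {
  return text.normalize("NFD");
}

// Hypothetical predicate: compare query and entry in the same normalization form.
function entryMatches(entryText, query) {
  return toNfd(entryText).includes(toNfd(query));
}

// Hypothetical search over a plain array of entry strings.
function searchEntries(entries, query) {
  return entries.filter((entryText) => entryMatches(entryText, query));
}

// Made-up data: one entry stored in NFD, one in NFC, one without a diacritic.
const entries = ["ka\u0301na", "k\u00E1la", "kana"];

console.log(searchEntries(entries, "\u00E1").length);  // 2 (composed query)
console.log(searchEntries(entries, "a\u0301").length); // 2 (decomposed query)
```

Normalizing on both sides means the result no longer depends on which form the keyboard produces. Normalizing stored data once on save, rather than at every search, may be cheaper, but that is a design choice this thread does not settle.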

irahopkinson commented 7 years ago

Sorry, this fix has not yet shipped to the live server. We will close this again when it has reached languageforge.org.

megahirt commented 7 years ago

@EThornton We have fixed this issue on languageforge.org and so I'm closing this issue now. Thanks.