Match fails for Turkish Locale due to Turkish "i" problem

mdahamiwal commented 7 years ago

Its a common problem in programming while comparing string in Turkish language eg.

Turkish language has four i's including case: "ı" (dotless lowercase) & "I" (dotless uppercase) "i" (dotted lowercase) & "İ" (dotted uppercase)

The library internally uses toLowerCase() and toUpperCase() , this results in mismatch if query or candidates contains dotless is as "I".toLowerCase() = "i"

What we need is "I".toLocaleLowerCase() = "ı"

Fix: Use toLocaleLowerCase() to ensure that turkish locale is honored while lowercase/uppercase conversions.

jeancroy commented 7 years ago

Question Turkish user will never try to user "i" (dotted lowercase) as a lazy selector for "I" (dot less uppercase) ?

I ask because (at least in the case of atom) the library is focused on programming language & path. Someone with Turkish locale may still need to deal with a InstrumentationManager class & the like. I'm a bit hesitant to having a fix that would break English use case.

In case of doubt, I prefer to match sightly too often. In this case lowercase "i" could match both uppercase "I". That would support mixed language use.

I would be ok with having user defined "lowercaseFunc(x) => x.toLocaleLowercase()" and similar "uppercaseFunc" if you are certain your user won't mix languages.

mdahamiwal commented 7 years ago

Nope, if the current locale is Turkish, the two i s are considered different. That said, user won't expect "i" to match "I". If user is looking for results matching "I", they would use locale specific characters. Otherwise, there is no way to handle these scenarios well in code. I can re-check with internal testers in my team but I pretty sure of the behavior. Thanks

jeancroy commented 7 years ago

Basically the question is do Turkish people have different expectation for Turkish text and English text. It's very possible your suggestion is the way to go. If so I'll probably merge and cut a version soon.

mdahamiwal commented 7 years ago

I confirmed with an actual Turkish user and he mentioned that you should only take care of case insensitive match, as we can never know if search term is Turkish or English. So, as far as we are using toLocaleLowerCase() we should be covered.

jeancroy commented 7 years ago

you should only take care of case insensitive match, as we can never know if search term is Turkish or English

I'm not sure I understand that sentence. toLocaleLowerCase will both enable case-intensive Turkish match i<>İ, ı<>I and break case-insensitive English ones i<>I.

Do you mean case sensitive instead ? (If one never know the language, one does not apply language specific transform, I think this one works out of the box now)

Are you OK with this being an option ? Something like : fuzzaldrin.filter: (candidates, query, {localeCase:true})

I feel like the reasonable thing to do it letting more user interact with it and see if they like it. I may have locale enabled by default, do you know if it's much slower ? Or it's all pre-computed table and similar speed ?

mdahamiwal commented 7 years ago

that you should only take care of case insensitive match, as we can never know if search term is Turkish or English.

That means we should take care of matching for current locale (i<>I needn't match on an Turkish locale but i<>İ, ı<>I should). So, using toLocaleLowerCase() or toLocaleUpperCase() should cover it.

Are you OK with this being an option ? Something like : fuzzaldrin.filter: (candidates, query, {localeCase:true})

Sounds good to me.

do you know if it's much slower ? Or it's all pre-computed table and similar speed ?

That brings out a good point, I did a benchmark run on chrome and Edge and toLocaleLowerCase turned out to be 50-60% slower for full strings. https://jsperf.com/localeowercase-vs-lowercase

However, as per our logic, the only place we do full string toLowerCase is one time in query and before scoring the matched strings. So, that doesn't seem to be regressing the perf in most of the cases. thoughts ?

jeancroy commented 7 years ago

So, that doesn't seem to be regressing the perf in most of the cases. thoughts ?

That's the spirit. Some user will type "nondescript" query, as part of word, or by taping the name of a folder with large sub-hierarchy. In the past I've toyed with the idea of writing my own case-insensitive "indexOf" and getting rid of transforming the subject. Alternatively some cache of lowercase subjects may make the problem disappear over multiple search.

Per this file of exception in unicode case folding ftp://ftp.unicode.org/Public/UCD/latest/ucd/SpecialCasing.txt Composite uppercase are handled by the truncated uppercase routine. Conditional case folding would be handling by "toLocaleLower" etc.

Since this issue is related to a few language Turkish, Azeri, Lithuanian and is data dependent I think I'll disable locale-lowercase by default, and enable user to opt in.

--

I'll be working on making your changes option-dependent later today or tomorrow.

jeancroy / fuzz-aldrin-plus

Match fails for Turkish Locale due to Turkish "i" problem #33