[Bug] Turkish İi Iı casing problem

disfated commented 11 months ago

Description

There's a problem with Turkish language.

There are 2 different "i" letters:

with dot: lowercase i, uppercase İ
without dot: lowercase ı, uppercase I

Thus, case conversion rules for Turkish are slightly different than for English (Latin):

için → IÇIN (latin, how it is now)
için → İÇİN (turkish, how it should be)
IŞIK → işik (latin, how it is now)
IŞIK → ışık (turkish, how it should be)

You can see this difference in browser console:

console.log('için'.toUpperCase())           // → IÇIN
console.log('için'.toLocaleUpperCase('tr')) // → İÇİN
console.log('IŞIK'.toLowerCase())           // → işik
console.log('IŞIK'.toLocaleLowerCase('tr')) // → ışık

This leads to all sorts of problems, to name a few:

words ışık and Işık are now stored separately.
words staring with İ cannot be highlighted. You can create term off of them, but it then doesn't get highlighted.
you cannot change casing of terms because Lute thinks that Işık and ışık are different words.
searching such terms you now have to account for their casing.

To Reproduce

You can use this text to play with.

Bla ışık bla

Bla Işık bla

Bla için bla

Bla İçin bla

Turkish works with English settings, so you don't need to create a new language.

jzohrab commented 11 months ago

Confirmed in my dev env w/ Lute v2.1.0, thanks for the detailed issue.

I created terms for all 4 terms, though the first two are similar:

Here's what's in the db:

sqlite> .headers on
sqlite> .mode column
sqlite> select wotext, wotextlc from words where wolgid = 8;
WoText  WoTextLC
------  --------
ışık    ışık    
Işık    işık    
için    için    
İçin    i̇çin

The downcasing now isn't handled by javascript, it's all done on the server (that's a recent change), so that needs to be fixed.

jzohrab commented 11 months ago

After some very brief searching, it looks like there's no built-in way to work around this in PHP. Every solution I looked at boiled down to just hardcoding a replace for those particular characters, for that language. One post at https://www.php.net/manual/tr/function.mb-strtolower.php suggested $str = mb_strtolower($str, mb_detect_encoding($str)); but that didn't make a difference on my machine.

The various solutions were all something like

function trkish_lower($str){
    $find = array('İ','I');
    $replace = array('i', 'ı');
    return str_replace($find,$replace,$str);
}

This would need to be in the Language itself ... I'm not sure if it's worth it to make something general (with field LgDowncaseMapping, containing e.g. for Turkish "İ,i;I,ı"), because as far as I've read Turkish is the only language requiring this! Optionally, Turkish could have a specialized "downcase" function, and that would be used where needed.

Annoyingly, Python seems to have the same problem with these Turkish letters, so the same hacky fix would be needed. string.casefold() exists in Python but it changes characters too much to be considered (e.g., it removes accents in some situations).

disfated commented 11 months ago

Yeah, I anticipated that PHP has no means to deal with this sort of thing.

Actually, it's crucial to consider this aspect before selecting a stack for code migration.

If there's no support for collations in the programming language and the database engine, implementing them can be quite challenging. And natural languages have all sort of weird things going on. Apart from case-sensitiviy there are also sorting and search problems to contend with. For instance, in Russian when searching, letters е and ё should be treated as the same, but distinct in all other cases.

jzohrab commented 11 months ago

Yes, it's a tough problem space. Even something as common as German separable verbs isn't handled well by any tool out there.

It's odd that Turkish continues to be such a common problem. For Turkish, it should be straightforward to implement the workaround. I don't know what I don't know, so test cases like you gave in the issue are a massive help.

For Russian searching, should е and ё be considered the same simply because usage appears to be non-standard? (at least according to https://www.rbth.com/education/328862-yo-russian-alphabet).

I am hoping that the code can decently handle the majority of the cases thrown at it. We do the best we can possibly do, given the constraints :-) As far as I can tell, Python and Sqlite are still fine choices, again for the majority. It may be that some special branches or fixes need to be made to support other cases ... but I hope not.

Thanks again for the detail in this ticket, maybe we can chat further on Discord if you have a sense that there are further difficulties with other languages (like the Russian yo issue).

jzohrab commented 11 months ago

I currently feel that the easiest way to do this is to create a Turkish parser, inheriting from Space Delimited parser, and to add some "downcase" rule to the abstract parser src/Domain/AbstractParser.php, e.g.

    /**
     * Many writing systems can just downcase:
     */
    public function getLowercase(string $text) {
        return mb_strtolower($text);
    }

then the Turkish parser could just hack that a touch:

function getLowercase($text){
    $find = array('İ','I');
    $replace = array('i', 'ı');
    $str = str_replace($find,$replace,$text);
    return mb_strtolower($text);
}

The turkishparser option would have to be added to the Language and perhaps to some other code as well, a search for the existing parser stuff should find it.

We'd need test cases for Turkish text to ensure no multibyte corruption weirdness! And then a test or two to ensure that the reading pane updates correctly.

jzohrab commented 11 months ago

Have some work in progress in iss_71_Turkish_wip branch pushed to this repo. All tests and a small sample is passing, including the cases listed in this ticket.

EDIT: not quite! On re-saving an existing term, adding a translation, I'm getting "Integrity constraint violation: 19 UNIQUE constraint failed: words.WoTextLC, words.WoLgID".

Need to clean up some of the code, the change introduced some hard-to-follow thrash.

jzohrab commented 11 months ago

Code cleaned up for Turkish, seems to work. Merged into the develop branch, will get launched after it bakes in my environment a bit.

Branch iss_71_Turkish is what was merged, it's in this repo.

jzohrab commented 11 months ago

Launch with 2.1.2, should be fixed. Thank you @disfated for all of the info.

jzohrab / lute

[Bug] Turkish İi Iı casing problem #71