Closed disfated closed 11 months ago
Confirmed in my dev env w/ Lute v2.1.0, thanks for the detailed issue.
I created terms for all 4 terms, though the first two are similar:
Here's what's in the db:
sqlite> .headers on
sqlite> .mode column
sqlite> select wotext, wotextlc from words where wolgid = 8;
WoText WoTextLC
------ --------
ışık ışık
Işık işık
için için
İçin i̇çin
The downcasing now isn't handled by javascript, it's all done on the server (that's a recent change), so that needs to be fixed.
After some very brief searching, it looks like there's no built-in way to work around this in PHP. Every solution I looked at boiled down to just hardcoding a replace for those particular characters, for that language. One post at https://www.php.net/manual/tr/function.mb-strtolower.php suggested $str = mb_strtolower($str, mb_detect_encoding($str));
but that didn't make a difference on my machine.
The various solutions were all something like
function trkish_lower($str){
$find = array('İ','I');
$replace = array('i', 'ı');
return str_replace($find,$replace,$str);
}
This would need to be in the Language itself ... I'm not sure if it's worth it to make something general (with field LgDowncaseMapping, containing e.g. for Turkish "İ,i;I,ı"), because as far as I've read Turkish is the only language requiring this! Optionally, Turkish could have a specialized "downcase" function, and that would be used where needed.
Annoyingly, Python seems to have the same problem with these Turkish letters, so the same hacky fix would be needed. string.casefold()
exists in Python but it changes characters too much to be considered (e.g., it removes accents in some situations).
Yeah, I anticipated that PHP has no means to deal with this sort of thing.
Actually, it's crucial to consider this aspect before selecting a stack for code migration.
If there's no support for collations in the programming language and the database engine, implementing them can be quite challenging. And natural languages have all sort of weird things going on. Apart from case-sensitiviy there are also sorting and search problems to contend with. For instance, in Russian when searching, letters е
and ё
should be treated as the same, but distinct in all other cases.
Yes, it's a tough problem space. Even something as common as German separable verbs isn't handled well by any tool out there.
It's odd that Turkish continues to be such a common problem. For Turkish, it should be straightforward to implement the workaround. I don't know what I don't know, so test cases like you gave in the issue are a massive help.
For Russian searching, should е
and ё
be considered the same simply because usage appears to be non-standard? (at least according to https://www.rbth.com/education/328862-yo-russian-alphabet).
I am hoping that the code can decently handle the majority of the cases thrown at it. We do the best we can possibly do, given the constraints :-) As far as I can tell, Python and Sqlite are still fine choices, again for the majority. It may be that some special branches or fixes need to be made to support other cases ... but I hope not.
Thanks again for the detail in this ticket, maybe we can chat further on Discord if you have a sense that there are further difficulties with other languages (like the Russian yo issue).
I currently feel that the easiest way to do this is to create a Turkish parser, inheriting from Space Delimited parser, and to add some "downcase" rule to the abstract parser src/Domain/AbstractParser.php
, e.g.
/**
* Many writing systems can just downcase:
*/
public function getLowercase(string $text) {
return mb_strtolower($text);
}
then the Turkish parser could just hack that a touch:
function getLowercase($text){
$find = array('İ','I');
$replace = array('i', 'ı');
$str = str_replace($find,$replace,$text);
return mb_strtolower($text);
}
The turkishparser option would have to be added to the Language and perhaps to some other code as well, a search for the existing parser stuff should find it.
We'd need test cases for Turkish text to ensure no multibyte corruption weirdness! And then a test or two to ensure that the reading pane updates correctly.
Have some work in progress in iss_71_Turkish_wip
branch pushed to this repo. All tests and a small sample is passing, including the cases listed in this ticket.
EDIT: not quite! On re-saving an existing term, adding a translation, I'm getting "Integrity constraint violation: 19 UNIQUE constraint failed: words.WoTextLC, words.WoLgID".
Need to clean up some of the code, the change introduced some hard-to-follow thrash.
Code cleaned up for Turkish, seems to work. Merged into the develop
branch, will get launched after it bakes in my environment a bit.
Branch iss_71_Turkish
is what was merged, it's in this repo.
Launch with 2.1.2, should be fixed. Thank you @disfated for all of the info.
Description
There's a problem with Turkish language.
There are 2 different "i" letters:
i
, uppercaseİ
ı
, uppercaseI
Thus, case conversion rules for Turkish are slightly different than for English (Latin):
için → IÇIN
(latin, how it is now)için → İÇİN
(turkish, how it should be)IŞIK → işik
(latin, how it is now)IŞIK → ışık
(turkish, how it should be)You can see this difference in browser console:
This leads to all sorts of problems, to name a few:
ışık
andIşık
are now stored separately.İ
cannot be highlighted. You can create term off of them, but it then doesn't get highlighted.Işık
andışık
are different words.To Reproduce
You can use this text to play with.
Turkish works with English settings, so you don't need to create a new language.