Improving Serbian normalization

stalker314314 commented 6 years ago

I noticed that Serbian normalization is not working properly and took a peek at the code. I know what is wrong and could fix it, but I am not sure what else would be affected.

Problem: Serbian uses two scripts, Cyrillic and Latin (name:sr and name:sr-Latn in OSM). The normalization in utfasciitable.h does not follow how people usually transliterate colloquially. To make matters interesting, it goes wrong for both the Cyrillic and the Latin script. Take for example the street "Ђаковачка", or in Latin "Đakovačka"[1]. Anyone in Serbia with a US keyboard would search for "Djakovacka" (Đ->Dj, č->c), and unfortunately that search finds nothing[2]. Let me break it down:

Cyrillic:
- Ђ -> dj ✔
- ч -> ch ✘
Latin:
- Đ -> d ✘
- č -> c ✔

So there are two problems and two possible solutions, either of which would fix this search:

- Cyrillic "ч" should be normalized to "c", or
- Latin "đ" should be normalized to "dj"
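To make the mismatch concrete, here is a minimal Python sketch. It is not Nominatim code: the tables are hypothetical and contain only the four mappings listed above plus the few plain letters needed for the example word, and `to_ascii` is an illustrative helper name.

```python
# Hypothetical tables: only the letters needed for the example word.
BASE = {"а": "a", "к": "k", "о": "o", "в": "v"}  # plain Cyrillic letters

# The four mappings described above as the current behaviour.
CURRENT = {**BASE, "ђ": "dj", "ч": "ch", "đ": "d", "č": "c"}

# The two proposed fixes, each changing exactly one mapping.
FIX_CYRILLIC = {**CURRENT, "ч": "c"}
FIX_LATIN = {**CURRENT, "đ": "dj"}


def to_ascii(text, table):
    """Lower-case the text and map each character, leaving unknown ones as-is."""
    return "".join(table.get(ch, ch) for ch in text.lower())


QUERY = "Djakovacka"
for label, table in [("current", CURRENT), ("fix ч", FIX_CYRILLIC), ("fix đ", FIX_LATIN)]:
    # Both scripts of the same street name, as stored in name:sr and name:sr-Latn.
    names = {to_ascii(n, table) for n in ("Ђаковачка", "Đakovačka")}
    print(label, sorted(names), to_ascii(QUERY, table) in names)
```

With the current table the query matches neither stored form ("djakovachka" and "dakovacka"). With either fix, one of the two forms becomes "djakovacka" and the query matches, provided both name tags are present.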

I am not sure whether we can just change these mappings without causing a major regression, or what other ways there are to fix it. A workaround would be a script that fills "name:en" or "alt_name" with the "common Serbian normalization" whenever one of these "incompatible combinations" occurs, but I would rather find a systematic solution (directly in Nominatim).

Here are the characters used: Ђ, ч, Đ and č.

[1] http://nominatim.openstreetmap.org/search.php?q=%D1%92%D0%B0%D0%BA%D0%BE%D0%B2%D0%B0%D1%87%D0%BA%D0%B0
[2] http://nominatim.openstreetmap.org/search.php?q=Djakovacka

stalker314314 commented 6 years ago

Just to give a bit more context: both name:sr and name:sr-Latn can be assumed to be populated in OSM, which is why either of the normalization fixes would work (assuming the scenario "user types Djakovacka and gets results").

lonvia commented 3 years ago

I was hoping that the new ICU-based transliteration fixes this but it looks like it still doesn't agree with you. It gives a transliteration of "d" for both "Ђ" and "Đ". The only way to solve this is to have our own language-specific transliterations. The good news is that now everything is in place to do that. The bad news is that I'm not sure when we'll get around to actually implementing it. But I see that you have worked around the issue by introducing 'int_name's in the meantime.
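As a side illustration of why a language-specific rule is needed, consider generic diacritic stripping in Python. This is neither ICU nor Nominatim code, only a standard-library sketch with a hypothetical helper `strip_marks`. It relies on the fact that "č" decomposes into "c" plus a combining caron, while "đ" (U+0111) has no Unicode decomposition, so no generic rule can turn it into "dj".

```python
import unicodedata


def strip_marks(text):
    """Decompose to NFD and drop combining marks (generic accent stripping)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# "č" loses its caron, but "đ" survives unchanged: it needs an explicit rule.
print(strip_marks("đakovačka"))
```

Whatever a generic transliterator does with the leftover "đ", it has no reason to pick the colloquial "dj" unless a Serbian-specific rule says so.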

stalker314314 commented 3 years ago

Yes, but int_name does not scale, is error-prone and does not work for new data. As you said, it is just a workaround. It is awesome that a solution for this problem is on the way! Is there any way I can help (either by implementing this for "Đ", or for a broader range of characters, or just by guiding someone on how it should work for Serbian)?

lonvia commented 3 years ago

We'd need a full set of transliteration rules for the language. If you can compile two sets of ICU rules, one for Cyrillic and one for Latin, describing how you usually transliterate Serbian into ASCII, that would be a great starting point.

Note that Nominatim will do a little bit of normalization to start with according to these rules. It's mostly getting rid of some odd unicode stuff but the important part is: it all starts with lower-case already.

stalker314314 commented 3 years ago

So, let me try it. I also checked ISO 9, but it doesn't help: it only converts to Latin, not to the English alphabet (for example "Ђ" becomes "Đ", and the problem is then how to normalize that further). Before proceeding, be aware that this is the first time I am hearing about ICU transforms (and I really hope this is already standardized somewhere). I will just give the mapping from Cyrillic to Latin here (without "<>" for potentially reversible characters and without trying to merge upper- and lower-case letters). If you need those two, they are beyond my capabilities. Here is Cyrillic upper case:

А > A ;
Б > B ;
В > V ;
Г > G ;
Д > D ;
Ђ > Dj ;
Е > E ;
Ж > Z ;
З > Z ;
И > I ;
Ј > J ;
К > K ;
Л > L ;
Љ > Lj ;
М > M ;
Н > N ;
Њ > Nj ;
О > O ;
П > P ;
Р > R ;
С > S ;
Т > T ;
Ћ > C ;
У > U ;
Ф > F ;
Х > H ;
Ц > C ;
Ч > C ;
Џ > Dz ;
Ш > S ;

And Cyrillic lower case:

а > a ;
б > b ;
в > v ;
г > g ;
д > d ;
ђ > dj ;
е > e ;
ж > z ;
з > z ;
и > i ;
ј > j ;
к > k ;
л > l ;
љ > lj ;
м > m ;
н > n ;
њ > nj ;
о > o ;
п > p ;
р > r ;
с > s ;
т > t ;
ћ > c ;
у > u ;
ф > f ;
х > h ;
ц > c ;
ч > c ;
џ > dz ;
ш > s ;

For Latin, I will ignore letters that map to the same letter in the English alphabet. The same Latin normalization is also useful for the Croatian, Bosnian and Slovene alphabets:

Č > C ;
Ć > C ;
Š > S ;
Đ > Dj ;
Ž > Z ;
Dž > Dz ; (already covered?)
č > c ;
ć > c ;
š > s ;
đ > dj ;
ž > z ;
dž > dz ; (already covered?)
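The lower-case tables above can be tried out with a small Python sketch. This is not the ICU rule set and not how Nominatim applies it; it is a plain per-character lookup under the assumption, mentioned earlier in the thread, that input is already lower-cased before transliteration. The names `CYRILLIC`, `LATIN` and `serbian_to_ascii` are made up for the example.

```python
# Lower-case Cyrillic mappings, copied from the table above.
CYRILLIC = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "dj",
    "е": "e", "ж": "z", "з": "z", "и": "i", "ј": "j", "к": "k",
    "л": "l", "љ": "lj", "м": "m", "н": "n", "њ": "nj", "о": "o",
    "п": "p", "р": "r", "с": "s", "т": "t", "ћ": "c", "у": "u",
    "ф": "f", "х": "h", "ц": "c", "ч": "c", "џ": "dz", "ш": "s",
}

# Lower-case Latin mappings; "dž" needs no entry because "ž" -> "z" covers it.
LATIN = {"č": "c", "ć": "c", "š": "s", "đ": "dj", "ž": "z"}

# str.maketrans accepts single-character keys with multi-character values.
TABLE = str.maketrans({**CYRILLIC, **LATIN})


def serbian_to_ascii(text: str) -> str:
    """Lower-case first, then apply the per-character table."""
    return text.lower().translate(TABLE)


for name in ("Ђаковачка", "Đakovačka", "Djakovacka"):
    print(name, "->", serbian_to_ascii(name))
```

All three spellings of the street name reduce to "djakovacka" here. In this per-character sketch the two-letter "dž" rule is indeed already covered, since "d" is unchanged and "ž" maps to "z".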

lonvia commented 3 years ago

Thank you. That was exactly what I was looking for. It will make a nice first use-case for language-specific transliteration.

mtmail commented 3 years ago

@lonvia Two lifetimes ago I used http://sphinxsearch.com/wiki/doku.php?id=charset_tables#slovak to tweak a search engine

osm-search / Nominatim

Improving Serbian normalization #862