koreader / crengine

This is the KOReader CREngine fork. It cross-pollinates with the official CoolReader repository at https://github.com/buggins/coolreader, in case you were looking for that one.

Combine hyphenation patterns for Serbian Cyrillic and Latin scripts #566

Closed eevan78 closed 2 months ago

eevan78 commented 3 months ago

This pull request continues pull request #372. Because the Serbian language uses two scripts with different codepoints, it is safe to combine the patterns into one file. That way it doesn't matter which script is used, and even texts that mix both scripts will be hyphenated properly. Only the primary part of the language tag in (X)HTML should be consulted to load the appropriate patterns, so sr, sr-Cyrl, sr-Latn, and regional variants of these (like sr_RS) should all load the same pattern file. This approach is already implemented successfully in ConTeXt.
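The tag handling described above can be sketched as follows. This is only an illustration in Python, not crengine's actual code: the `PATTERN_FILES` table, its file names, and both function names are hypothetical.

```python
import re

# Hypothetical table from primary language subtag to pattern file.
# The file names are assumptions for illustration only.
PATTERN_FILES = {
    "sr": "Serbian.pattern",
    "hr": "Croatian.pattern",
}


def primary_subtag(lang_tag):
    """Return the lowercase primary subtag of a language tag.

    Both '-' and '_' are accepted as separators, so 'sr-Cyrl',
    'sr_RS' and 'SR-Latn-RS' all yield 'sr'.
    """
    return re.split(r"[-_]", lang_tag.strip(), maxsplit=1)[0].lower()


def pattern_file_for(lang_tag):
    """Pick the pattern file from the primary subtag alone; None if unknown."""
    return PATTERN_FILES.get(primary_subtag(lang_tag))
```

With a lookup like this, script and region subtags never influence which file is loaded, so a combined Serbian file serves every `sr` variant.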

The patterns have been converted from https://devbase.net/dict-sr/, the same ones used in the LibreOffice extension Serbian Spellchecker.



Frenzie commented 3 months ago

> As Serbian language uses two scripts with different codepoints, it is safe to combine the patterns into one file.

You mean the Latin one is currently completely absent I presume? As phrased it sounds a bit like you forgot to delete it. :-)

poire-z commented 3 months ago

Pinging @strn @roshavagarga who contributed to #372 for thoughts and approval.

roshavagarga commented 3 months ago

@poire-z I'd say @strn would be able to give a better-informed opinion on whether this is something that should be done, as my understanding of Serbian and of the cultural connotations of the above change is fairly basic.

If it works out-of-the-box and there aren't any cultural reasons not to do this, I don't see an issue.

I would note, however, that I'm not sure how the source(s) used for this compare to the one we currently use for Serbian, so possibly something to compare and/or test? (Taken from here)

eevan78 commented 3 months ago

> As Serbian language uses two scripts with different codepoints, it is safe to combine the patterns into one file.
>
> You mean the Latin one is currently completely absent I presume? As phrased it sounds a bit like you forgot to delete it. :-)

You are right, they are currently absent. When I read a Serbian book written in Latin script, I have to change the language to Croatian. That loads the Croatian patterns, which are based on the same Latin script. Otherwise, there is no hyphenation.

eevan78 commented 3 months ago

> I would note, however, that I'm not sure how the source(s) used for this compare to the one we currently use for Serbian, so possibly something to compare and/or test? (Taken from here)

Those are the same patterns, made by Dejan Muhamedagić, that are used in TeX. I just had to convert them to UTF-8, as the originals are encoded in ISO 8859-2 (the Latin patterns) and ISO 8859-5 (the Cyrillic patterns).
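A minimal sketch of that conversion, assuming the two legacy files hold one pattern per line. It is not the script actually used for this pull request: the function names are hypothetical and the sample patterns in the usage section are made up. It also checks the claim that the two scripts share no letters before merging.

```python
def decode_patterns(raw, encoding):
    """Decode legacy-encoded bytes into a list of non-empty pattern lines."""
    return [line.strip() for line in raw.decode(encoding).splitlines()
            if line.strip()]


def letters_in(patterns):
    """Collect every alphabetic character used by a list of patterns."""
    return {ch for pattern in patterns for ch in pattern if ch.isalpha()}


def combine(latin, cyrillic):
    """Merge two pattern lists, refusing to do so if they share any letter.

    Disjoint alphabets mean no pattern from one script can ever match
    text written in the other, which is what makes one file safe.
    """
    shared = letters_in(latin) & letters_in(cyrillic)
    if shared:
        raise ValueError("scripts overlap on: " + "".join(sorted(shared)))
    return latin + cyrillic


if __name__ == "__main__":
    # Made-up sample patterns standing in for the real legacy files.
    latin_raw = "1ša\nč2k\n".encode("iso8859_2")
    cyrillic_raw = "1ша\nч2к\n".encode("iso8859_5")

    merged = combine(decode_patterns(latin_raw, "iso8859_2"),
                     decode_patterns(cyrillic_raw, "iso8859_5"))
    # Bytes ready to be written out as a single UTF-8 pattern file.
    utf8_bytes = "\n".join(merged).encode("utf-8")
    print(utf8_bytes.decode("utf-8"))
```

Decoding each file with its own codec first is the important step: the same byte values mean different letters in ISO 8859-2 and ISO 8859-5, so the files cannot be concatenated before conversion.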

Serbian hyphenation patterns are derived from official TeX patterns for Serbocroatian language (Cyrillic and Latin) created by Dejan Muhamedagić, version 2.02 from 22 June 2008 adopted for usage with Hyphen hyphenation library and released under GNU LGPL version 2.1 or later.

poire-z commented 2 months ago

Pinging again @strn - please give us some feedback.

strn commented 2 months ago

@poire-z , sorry for the late reply.

Yes, if the patterns are the same, then they should be used for hyphenating texts in the Serbian language, regardless of which script it is written in.

However, let me emphasize and remind you once again that only Serbian Cyrillic is a valid alphabet for the Serbian language. Use of the Croatian Latin alphabet comes from the Yugoslav era and is best left there.

eevan78 commented 2 months ago

As I've already said, this is just a technical matter that removes the need to change languages when reading books typeset in the Latin script.

@strn Can you please point to a valid reference that supports your claims? Are you saying that, for example, these are Croatian books? The Cyrillic script is defined as the official script in the Constitution, and both scripts are used in daily correspondence, media, newspapers and publishing, whether we like it or not. Personally, I use the Cyrillic script, but many other people I know do not. That's the only reason I'm proposing to unify the patterns in one file, purely as a convenience to the user.