Add typography rules for Russian

dmalinovsky commented 5 months ago

Russian typography frowns upon having one-letter prepositions and conjunctions hanging at the end of the line.

There are many Russian resources discussing this; I've linked some of the better-known ones: https://www.artlebedev.ru/kovodstvo/sections/62/ https://gramota.ru/spravka/vopros/294020 https://gramota.ru/spravka/vopros/219773

poire-z commented 5 months ago

Pinging @pkb @virxkane @hius07 @mergen3107 @ssvb for confirmation that this is the thing to do, and to ask if anyone knows whether this is true for Ukrainian and Belarusian?

(And if it is the right thing to do, why have you been fine without it for so long ? :) Because it's some minor expectation? How does the risk of having more hyphenation or more spacing between words compare to getting this nicety?)

dmalinovsky commented 5 months ago

I stopped using KOReader for a while after a device upgrade. Now I'm back to it, and this is pretty noticeable. As an afterthought, I think it's better to scale down this change and prohibit breaks after one-letter words only, so we'll have a balance between too much spacing and nice typography.

Frenzie commented 5 months ago

@poire-z

(And if it is the right thing to do, why have you been fine without it for so long ? :)

In English and possibly many other languages it'd be preferable to avoid it when reasonably possible as well (not to be confused with prohibiting it :-) but it's a fairly rare occurrence.

dmalinovsky commented 5 months ago

@poire-z

(And if it is the right thing to do, why have you been fine without it for so long ? :)

In English and possibly many other languages it'd be preferable to avoid it when reasonably possible as well (not to be confused with prohibiting it :-) but it's a fairly rare occurrence.

Okay, perhaps “prohibits” is too strong a word. I updated the PR description.

ssvb commented 5 months ago

for confirmation that this is the thing to do, and to ask if anyone knows whether this is true for Ukrainian and Belarusian?

The Belarusian hyphenation rules can be found here. Basically, in layman's terms:

It's incorrect to hyphenate a word in such a way that a single letter is left alone on a line.
"дж" and "дз" are digraphs and shouldn't be broken by hyphenation (except in words that start with the "пад-" or "ад-" prefixes).
The hyphenated half of the word on the second line can't start with an apostrophe or the letters "й", "ў", "ь".

dmalinovsky commented 5 months ago

The Belarusian hyphenation rules can be found

I was more curious about whether it's okay to leave single letter prepositions at the end of the line — e.g. "з", "ў", etc.

ssvb commented 5 months ago

I was more curious about whether it's okay to leave single letter prepositions at the end of the line — e.g. "з", "ў", etc.

Ah, sorry, I somehow thought that it was a question about hyphenation. I don't remember any rules regulating one-letter words left hanging at the beginning or at the end of a line. Probably nobody really cares. It's just a matter of aesthetic style, and if your patch makes the text look better, then go for it.

mergen3107 commented 5 months ago

@poire-z To answer your questions... I started using crengine in the form of CoolReader back in 2011 on a Nook Simple Touch with Glowlight. At first, I was very picky about hyphenation patterns, but then I learned that the hyphenation engine isn't perfect, and that it is hard to tweak all these cases.

As time went on, I became less and less picky, up to the point where I am probably illiterate in Russian hyphenation :D so I stopped recognizing these patterns, because I got what I wanted from hyphenation: saved space and straight text boxes.

But thank you @dmalinovsky for bringing this up, I'll revise all of these again and dig up my old notes with complaints :D

poire-z commented 5 months ago

Just some warnings - as I can't judge what's preferable, not reading Russian:

This is not about hyphenation, but about where not to wrap a line at a normal space that would usually allow wrapping.

Translated to English, https://www.artlebedev.ru/kovodstvo/sections/62/ says (and I think there is just that about this topic):

Каждый раз нужно вникать в смысл текста и привязывать предлоги и союзы к следующему за ними слову, а частицы — к предыдущему. Each time you need to delve into the meaning of the text and link prepositions and conjunctions to the word following them, and particles to the previous one.

That's quite little, and hardly reads as "Russian typography frowns upon having one letter words hanging at the end of the line." :) Maybe prepositions and conjunctions are mostly one-letter words, but couldn't "particles" be too? And reading Kafka, you would then never see "Joseph K" at the end of a line, but always:

blah blah and Joseph
K reconsidered something
blah blah blah blah blah

And to ensure that, the code may need to more often increase spacing between words:

blah   blah  and  Joseph
K reconsidered something

or hyphenate the following word:

blah blah and Joseph K re-
considered something blah

So it's not a free benefit that automatically looks better.
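
To make the trade-off concrete, here is a minimal sketch of a greedy, ragged-right wrapper that can optionally keep every one-letter word attached to the word after it. It is only an illustration under stated assumptions: `wrap_greedy` is a hypothetical helper, not crengine's layout code, it works on ASCII words split on whitespace, and it neither justifies nor hyphenates.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not crengine code): greedy line wrapping.
// When glue_one_letter_words is true, a one-letter word is kept on the
// same line as the word that follows it, so no line can end with it.
std::vector<std::string> wrap_greedy(const std::string& text,
                                     std::size_t width,
                                     bool glue_one_letter_words) {
    // Step 1: build unbreakable chunks ("K reconsidered" becomes one chunk).
    std::istringstream in(text);
    std::vector<std::string> chunks;
    std::string word, pending;
    while (in >> word) {
        pending += pending.empty() ? word : " " + word;
        const bool keep_with_next = glue_one_letter_words && word.size() == 1;
        if (!keep_with_next) {
            chunks.push_back(pending);
            pending.clear();
        }
    }
    if (!pending.empty()) chunks.push_back(pending);

    // Step 2: fill each line with as many chunks as fit in `width`.
    std::vector<std::string> lines;
    std::string line;
    for (const std::string& chunk : chunks) {
        if (!line.empty() && line.size() + 1 + chunk.size() > width) {
            lines.push_back(line);
            line.clear();
        }
        line += line.empty() ? chunk : " " + chunk;
    }
    if (!line.empty()) lines.push_back(line);
    return lines;
}
```

With a width of 24 and the text "blah blah and Joseph K reconsidered something", the plain variant ends the first line with "Joseph K", while the gluing variant ends it with "Joseph" and starts the next line with "K reconsidered something". The first line gets shorter, which is exactly the space a justified layout would then have to spread between words.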

It should also not be just a question of taste - or it should be a taste shared by many. For Polish, there is indeed some state/academic documentation that specifies the letters that are prepositions - and K is among them, so they would have this issue with Kafka :) Pinging @ptrm : do such false positives happen often?

The best way to be sure it's worth doing is to check a few books by good Russian publishers: select some text containing such single-letter words, use View HTML, switch to the debug view, and see whether these publishers have explicitly put a no-break space after such letters (we show them as ␣). If there are some but not many, and many single letters don't have them, then it really depends on the text/word/meaning/context, and we may not be able to do it automatically and have to expect publishers to do it by adding &nbsp; in the right places. I think we saw a lot of them in some Polish books at the time we added it, so it felt safe to enforce it via the code (even if I'm sure it causes false positives).

dmalinovsky commented 5 months ago

And reading Kafka, you would then never see "Joseph K" at the end of a line, but always:

In Russian, initials will have a period added, so it'll be "Joseph K." and won't be affected by my change. As far as I know, only prepositions and conjunctions should have one letter length.

Not all books abide by these rules, unfortunately, but it's considered a sign of good typography. For example, one of the biggest Russian ebook sellers, LitRes, is using non-breaking spaces in FB2 files it produces. They also have an English website, litres.com.

dmalinovsky commented 5 months ago

Here's a sample FB2 file from LitRes: 70244713.fb2.zip

Note that it uses character code 160 (U+00A0, the no-break space, which is outside ASCII) for the non-breaking spaces, so you'll have to use a hex editor or something similar to see them. They're added after 1 or 2 letter prepositions and short conjunctions ("а" and "и").
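
For anyone who prefers counting to eyeballing a hex dump, here is a rough sketch of the check. It assumes the book text has already been decoded to UTF-32 (decoding the FB2 file itself is out of scope), `count_spaces_after_short_words` and `SpaceCounts` are hypothetical names, and only U+0020 and U+00A0 are treated as word separators, so attached punctuation counts toward a word's length.

```cpp
#include <cstddef>
#include <string>

// Hypothetical result type: how short words are followed in the text.
struct SpaceCounts {
    std::size_t no_break = 0;  // short word followed by U+00A0
    std::size_t regular = 0;   // short word followed by U+0020
};

// Counts which kind of space follows each word of at most max_len
// characters. Simplification: only U+0020 and U+00A0 end a word.
SpaceCounts count_spaces_after_short_words(const std::u32string& text,
                                           std::size_t max_len = 2) {
    SpaceCounts counts;
    std::size_t word_len = 0;
    for (char32_t c : text) {
        const bool is_space = (c == U' ' || c == 0x00A0);
        if (!is_space) {
            ++word_len;
            continue;
        }
        if (word_len >= 1 && word_len <= max_len) {
            if (c == 0x00A0) {
                ++counts.no_break;
            } else {
                ++counts.regular;
            }
        }
        word_len = 0;
    }
    return counts;
}
```

A high `no_break` count relative to `regular` would suggest the publisher binds short words deliberately; a mixed result would suggest it depends on context.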

dmalinovsky commented 5 months ago

Okay, I found one exception which will be broken by my change. As @poire-z correctly noted, we should leave particles at the end of the line, and there's one 1-letter particle, "б".

Let me update the PR to explicitly list prepositions and conjunctions.
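
A minimal sketch of what such an explicit list could look like. This is not the code in the PR: both function names are hypothetical, the letter set (в, к, о, с, у, а, и) is an assumption about which one-letter prepositions and conjunctions to cover, and the "standalone word" test is simplified to "preceded by a space or the start of the text".

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: true for one-letter Russian prepositions and
// conjunctions, in both cases. The particle 'б' (U+0431) is deliberately
// absent, since particles bind to the previous word instead.
bool is_one_letter_preposition_or_conjunction(char32_t c) {
    switch (c) {
        case 0x0410: case 0x0430:  // 'А' 'а' (and, but)
        case 0x0412: case 0x0432:  // 'В' 'в' (in)
        case 0x0418: case 0x0438:  // 'И' 'и' (and)
        case 0x041A: case 0x043A:  // 'К' 'к' (to, towards)
        case 0x041E: case 0x043E:  // 'О' 'о' (about)
        case 0x0421: case 0x0441:  // 'С' 'с' (with, from)
        case 0x0423: case 0x0443:  // 'У' 'у' (at, by)
            return true;
        default:
            return false;
    }
}

// Hypothetical helper: true when the space at text[pos] should not be used
// as a line-break opportunity because it follows a standalone one-letter
// preposition or conjunction.
// Simplification: "standalone" means preceded by a space or by the start
// of the text; opening quotes or brackets are not handled.
bool forbid_break_at_space(const std::u32string& text, std::size_t pos) {
    if (pos == 0 || pos >= text.size() || text[pos] != U' ') return false;
    if (!is_one_letter_preposition_or_conjunction(text[pos - 1])) return false;
    return pos == 1 || text[pos - 2] == U' ';
}
```

For the text "кот в доме", the space after "кот" stays breakable, while the space after "в" is reported as a place not to break.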

hius07 commented 5 months ago

Pages from academic classical books. I personally do not see it as an issue.

dmalinovsky commented 5 months ago

Pages from academic classical books. I personally do not see it as an issue.

Fair enough. If others feel the same, I'll close this PR.

ptrm commented 5 months ago

@poire-z

It should also not be just a question of taste - or it should be a taste shared by many. For Polish, there is indeed some state/academic documentation that specifies the letters that are prepositions - and K is among them, so they would have this issue with Kafka :) Pinging @ptrm : do such false positives happen often?

TL;DR: Cyrillic U+043A к is not Latin U+006B k, and the same goes for the uppercase versions, so no risk of a false positive here :) I took a closer look at the comments and PR code only after writing the digressions below, but maybe they'll be of some use for future reference ;)
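
As a small illustration of that point (a sketch only, with hypothetical names; the Cyrillic block range U+0400 to U+04FF comes from Unicode, not from the PR):

```cpp
// Latin and Cyrillic look-alikes are distinct code points, so a check
// written against the Cyrillic values never matches a Latin "K".
constexpr char32_t kLatinSmallK = 0x006B;       // 'k'
constexpr char32_t kLatinCapitalK = 0x004B;     // 'K'
constexpr char32_t kCyrillicSmallKa = 0x043A;   // 'к'
constexpr char32_t kCyrillicCapitalKa = 0x041A; // 'К'

// Hypothetical helper: true for code points in the basic Cyrillic block.
constexpr bool in_cyrillic_block(char32_t c) {
    return c >= 0x0400 && c <= 0x04FF;
}

static_assert(kLatinSmallK != kCyrillicSmallKa, "distinct code points");
static_assert(kLatinCapitalK != kCyrillicCapitalKa, "distinct code points");
static_assert(in_cyrillic_block(kCyrillicCapitalKa), "Cyrillic K is in the block");
static_assert(!in_cyrillic_block(kLatinCapitalK), "Latin K is not");
```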

K is an archaic preposition that started disappearing in the 16th century, so in this case no ;) Also, initials are followed by a dot in Polish, so even "Józef W." or "Z." would not cause a false positive with contemporary prepositions. I think a false positive would require a very rare case, because e.g. the "A" team is in quotation marks, etc. ;)

Also, most publishers (I can't even think of a counter-example) add a non-breaking space after prepositions, so that never was an issue in my case. Dangling prepositions are frowned upon, but not considered errors, in Polish.

dmalinovsky commented 5 months ago

Here's a recommendation from another well known resource about Russian language, its grammar, etc.: https://gramota.ru/spravka/vopros/294020

In general, yes, because it is extremely undesirable to leave one-letter prepositions and conjunctions at the end of a line that begin a sentence. By the way, in book publications it is not recommended to leave single-letter conjunctions and prepositions at the end of a line, even in the middle of sentences (in magazines, newspapers, information publications and publications of operational printing, this is allowed).

dmalinovsky commented 5 months ago

I wish there was a way to make it a local only change with custom hyphenation rules, but alas...

poire-z commented 5 months ago

We can try it - there are a few weeks until the next KOReader release to see how good or bad it makes things.

Maybe enable it only for "ru" - will it be OK to switch to the Ukrainian or Belarusian typography to compare? Or are other typography rules, like hyphenation, different enough that other things will be at play and we won't really be able to compare?

Can you fix the indentation for case 'К': and case 'к': ?

Also, just for the culture of us non-Cyrillic readers, could you add the English meaning of these prepositions, as was kindly done for Polish: https://github.com/koreader/crengine/blob/e4426ac5aefd28bb2adcfc25806701232361cbb2/crengine/src/textlang.cpp#L591-L597

@ptrm: I was just asking about any false positives (not only about Joseph K. :) and maybe there are people whose last name is just a single letter). And maybe if there are false positives, they just don't get noticed.

poire-z commented 5 months ago

I wish there was a way to make it a local only change with custom hyphenation rules, but alas...

Note that we have 3 lang tags we can force set for Russian, so you could apply your tweaks to only one of ru-GB or ru-US (dunno if these reach deep down to our lang_tag here) - but then it won't really be tested.

https://github.com/koreader/koreader/blob/9387fcd2d0af29a0b915b53b598f561372522e88/frontend/apps/reader/modules/readertypography.lua#L87-L89

Not advising to do that, just mentioning it in case it gives other thoughts.

dmalinovsky commented 5 months ago

Maybe enable it only for "ru" - will it be OK to switch to the Ukrainian or Belarusian typography to compare? Or are other typography rules, like hyphenation, different enough that other things will be at play and we won't really be able to compare?

Sure, I think I was too hasty to suggest extending it to other languages as well.

Can you fix the indentation for case 'К': and case 'к': ?

Done.

Also, just for the culture of us non-Cyrillic readers, could you add the English meaning of these prepositions, as was kindly done for Polish:

Done.

Also, is it okay to specify Cyrillic letters as UTF-8? Do I need to do something special for the encoding?

poire-z commented 5 months ago

Thanks.

Also, is it okay to specify Cyrillic letters as UTF-8? Do I need to do something special for the encoding?

I guess it's fine - it reads fine in the GitHub web UI, and I guess you compiled and tested it and it works. (It's just me with my old latin1 environment that will see gibberish - but so is Cyrillic to me anyway :))

I may push a PR in the coming days - so I'll merge this one then, if nobody else stops us here - and bump everything into KOReader.

dmalinovsky commented 5 months ago

I guess it's fine - it reads fine in the GitHub web UI, and I guess you compiled and tested it and it works.

To be on the safe side, I've replaced raw letters with UTF-32 sequences. The same way is used for the quotes in the file anyway.

ptrm commented 5 months ago

@ptrm: I was just asking about any false positives (not only about Joseph K. :) and maybe there are people whose last name is just a single letter). And maybe if there are false positives, they just don't get noticed.

Yeah, and I think I answered about other single letters too, and still can't think of any cases. I think those false positives would be extremely rare, but sure, foreign names may cause such cases :)

And since we're talking corner cases, I guess checking if an uppercase letter is at the beginning of a sentence (otherwise not a preposition for sure) would be hard to do?

poire-z commented 5 months ago

Also, is it okay to specify Cyrillic letters as UTF-8? Do I need to do something special for the encoding?

I guess it's fine - it reads fine in the GitHub web UI, and I guess you compiled and tested it and it works. (It's just me with my old latin1 environment that will see gibberish - but so is Cyrillic to me anyway :))

Or I dunno. I remember seeing some non-ASCII char literals quoted with L'x'. https://en.cppreference.com/w/cpp/language/character_literal Maybe use 0x0432 like elsewhere, and put the literal Cyrillic char in the // comment?

To be on the safe side, I've replaced raw letters with UTF-32 sequences. The same way is used for the quotes in the file anyway.

Oh, I see you just did that. I think U"" (double quotes) is for a string of multiple chars, and U'' (single quotes) is for a single char. Maybe just use 0x1234 (without any quotes) to be similar to how non-ASCII chars are done elsewhere in textlang.cpp, just for consistency in style?
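
For reference, a small sketch of how the three spellings relate for U+0432 ('в'). The variable names are made up for illustration and nothing here is taken from textlang.cpp:

```cpp
#include <type_traits>

// Three ways to spell the same code point, U+0432 ('в').
constexpr char32_t as_char_literal = U'\u0432';      // one char32_t value
constexpr char32_t as_string_literal[] = U"\u0432";  // array: the char + a terminating 0
constexpr char32_t as_hex = 0x0432;                  // plain integer constant

// The character literal and the hex constant are the same value...
static_assert(as_char_literal == as_hex, "same code point");
// ...and the character literal already has type char32_t.
static_assert(std::is_same<decltype(U'\u0432'), char32_t>::value,
              "U'' yields char32_t");
// The string literal is an array of two elements, not a single value.
static_assert(sizeof(as_string_literal) / sizeof(char32_t) == 2,
              "one char plus terminator");
static_assert(as_string_literal[0] == as_hex, "first element is the char");
```

Either U'\u0432' or 0x0432 can be used directly as a `case` label in a `switch` over a character; the U"" form cannot, because it is an array rather than an integral constant.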

dmalinovsky commented 5 months ago

Maybe just use 0x1234 (without any quotes) to be similar to how non-ASCII chars are done elsewhere in textlang.cpp, just for consistency in style?

I looked at line 304, for example, and copied it. Looks like there are 2 styles in the file. :)

poire-z commented 5 months ago

I guess checking if an uppercase letter is at the beginning of a sentence (otherwise not a preposition for sure) would be hard to do?

I think so - I also don't want to put too much (any :)) heuristics about grammar (what is a sentence start) and how much to look ahead/behind, what to skip, etc... in that low level code :).

poire-z commented 5 months ago

I looked at line 304, for example, and copied it. Looks like there are 2 styles in the file. :)

Maybe I copied that list from elsewhere, or I thought this list of quotes could be ready for multi-codepoint quotes when needed - even if there are only single-codepoint quotes in it currently.

dmalinovsky commented 5 months ago

Maybe I copied that list from elsewhere, or I thought this list of quotes could be ready for multi-codepoint quotes when needed - even if there are only single-codepoint quotes in it currently.

Makes sense. I've updated the style to match.

dmalinovsky commented 5 months ago

@poire-z, thank you! Can you please also merge https://github.com/koreader/koreader/pull/11570 later? It's a cosmetic change.

poire-z commented 5 months ago

Yes I will, just after the PR I'll make to bump all this crengine stuff, to keep things in a logical order.

poire-z commented 5 months ago

I'm not sure you did test these changes, so please do :) Our ota/nightly build download server is down. In the meantime, you can download manually a zip at: https://gitlab.com/koreader/nightly-builds/-/pipelines by clicking on the right of the top most/most recent pipeline:

dmalinovsky commented 5 months ago

I tested the last nightly, and this change works.

Before:

After:

Note that the circled preposition "в" moved from the end of the line to the next one. Also, the total page count stayed the same, so the effect of this line-break restriction is pretty small.

koreader / crengine

Add typography rules for Russian #557