nlbdev / nordic-accessible-epub-guidelines


Consider clarifying the treatment of text capture of end-of-line syllabification and its proper normalisation #14

Open AndersEkl opened 1 year ago

AndersEkl commented 1 year ago

Originally written by @martinpub

Splitting a word in two across a line break, with a hyphen at the end of the line, is a common typographic convention. How such words should be normalised when the text is captured should probably be added to an update of the guidelines.

martinpub commented 1 year ago

Martin, Oscar, and Anders discussed this. Low priority. The two simple normalisation options would be to 1. always preserve the end-of-line hyphenation, or 2. always remove end-of-line hyphenation and merge the two word parts. More sophisticated approaches would require deeper language skills.

TorilBWM commented 1 year ago

We think absolutely option 2: always remove end-of-line hyphenation and merge the two word parts.

josteinaj commented 1 year ago

A third alternative could be to encode end-of-line hyphens as soft hyphens.

martinpub commented 1 year ago

> A third alternative could be to encode end-of-line hyphens as soft hyphens.

Very good suggestion @josteinaj. In many cases, this will be effectively the same as 2, right? With the addition that the hyphens remain retrievable for processing/checking, since they are disambiguated from other/hard hyphens.

josteinaj commented 1 year ago

Yes, they would be disambiguated from other/hard hyphens. It would improve line breaking in a normal e-reader, but also help other formats: a TTS engine should ignore them, and a braille layout engine could use them for hyphenation across lines. It wouldn't be a 100% accurate representation of the original, though. Sometimes it really is a hard hyphen, even though it's at the end of the line. But that's also a problem for option 2 when deleting hyphens. No perfect solution here :shrug:.

martinpub commented 1 year ago

So, let's add @josteinaj's suggestion as a third option. In my view, this is the best option. Thanks Jostein!
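For illustration only, here is a minimal Python sketch of the three options discussed in this thread, applied to plain text where the original line breaks are still present. The function name `normalise_eol_hyphens`, the mode names, and the regular expression are assumptions made for this example and are not taken from the guidelines. It also assumes that "preserve" means keeping the hyphen as an ordinary hyphen while joining the lines.

```python
import re

SOFT_HYPHEN = "\u00ad"  # U+00AD SOFT HYPHEN

# A hyphen directly after a word character at the end of a line,
# followed by the line break (and any indentation) and another word character.
_EOL_HYPHEN = re.compile(r"(?<=\w)-[ \t]*\n[ \t]*(?=\w)")


def normalise_eol_hyphens(text: str, mode: str) -> str:
    """Join words split across a line break (hypothetical helper).

    mode="preserve": option 1, keep the hyphen as a hard hyphen
    mode="remove":   option 2, drop the hyphen and merge the parts
    mode="soft":     option 3, replace the hyphen with a soft hyphen
    """
    replacements = {"preserve": "-", "remove": "", "soft": SOFT_HYPHEN}
    if mode not in replacements:
        raise ValueError(f"unknown mode: {mode!r}")
    return _EOL_HYPHEN.sub(replacements[mode], text)


sample = "normali-\nsation of well-\nknown words"
print(normalise_eol_hyphens(sample, "preserve"))  # normali-sation of well-known words
print(normalise_eol_hyphens(sample, "remove"))    # normalisation of wellknown words
print(repr(normalise_eol_hyphens(sample, "soft")))
```

The sample shows the trade-off mentioned above: option 1 leaves a spurious hyphen in "normali-sation", while options 2 and 3 lose the genuine hard hyphen in "well-known". With option 3 the affected positions can at least still be found afterwards by searching for U+00AD.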

martinpub commented 1 year ago

Note to selves: We should check how the spell checkers we use treat soft hyphens.
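If a spell checker turns out to treat soft hyphens as word breaks or as unknown characters, one possible workaround is to strip U+00AD before handing words to it. The sketch below assumes the checker accepts a plain list of words. The helper name `words_for_spellcheck` and the tokenising pattern are hypothetical and do not describe any particular tool.

```python
import re

SOFT_HYPHEN = "\u00ad"  # U+00AD SOFT HYPHEN

# Words, optionally joined by hard hyphens (for example "well-known").
_WORD = re.compile(r"\w+(?:-\w+)*")


def words_for_spellcheck(text: str) -> list[str]:
    """Return the words in text with soft hyphens removed (hypothetical helper).

    Soft hyphens are stripped before tokenising, because \\w does not
    match U+00AD and the word would otherwise be split in two.
    """
    return _WORD.findall(text.replace(SOFT_HYPHEN, ""))


print(words_for_spellcheck("normali\u00adsation of well-known words"))
# ['normalisation', 'of', 'well-known', 'words']
```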