Closed sujato closed 1 year ago
In fact it is better to do this on the back end, as sometimes text will be added/edited via the Github interface or locally.
We are having difficulty with the \u2013 en-dash in German translations. In German, the en-dash is normally surrounded by spaces, as in "Guten – Morgen". In translation these also appear at the END of a segment where a paragraph is broken at the dash: "Aufwiedersehen –". In the latter case we have a problem in that the TTS pronounces the solitary en-dash or em-dash as "minus". The fix is rather simple: we should surround such an en-dash with non-breaking spaces. Therefore we request that non-breaking spaces not be removed automatically in every case. We need to allow for language differences.
This isn’t a language problem, it’s a problem with the User Agent. En-dash, em-dash, and minus are all distinct characters, and it is up to the UA to handle them properly.
Using no-break-space both before and after en-dash would create problems where you end up with extraordinarily—immoderately even—long sequences of unbreakable characters, which would of course be even worse in German. This may then create other undesirable consequences, such as hyphenating a dashed word—something that is strongly avoided in typography—or worse, just forcing random break points.
It may be justified to use a non-break space before an en-dash, as you never want to begin a line with a dash. Would this solve the problem with your UA?
Let's try an example and see what it looks like here. With no break space on both sides:
extraordinarily – immoderately – extraordinarily – immoderately – extraordinarily – immoderately – extraordinarily – immoderately – extraordinarily – immoderately – extraordinarily – immoderately – extraordinarily – immoderately – extraordinarily – immoderately – extraordinarily – immoderately – extraordinarily – immoderately –
Hyphenation isn't working, so it just breaks words randomly. This is not acceptable.
With a no break space only before the en-dash:
extraordinarily – immoderately – extraordinarily – immoderately – extraordinarily – immoderately – extraordinarily – immoderately – extraordinarily – immoderately – extraordinarily – immoderately – extraordinarily – immoderately – extraordinarily – immoderately – extraordinarily – immoderately – extraordinarily – immoderately –
Looks fine.
And for comparison, unspaced em-dash:
extraordinarily—immoderately—extraordinarily—immoderately—extraordinarily—immoderately—extraordinarily—immoderately—extraordinarily—immoderately—extraordinarily—immoderately—extraordinarily—immoderately—extraordinarily—immoderately—extraordinarily—immoderately—extraordinarily—immoderately—
Also fine!
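The "no-break space only before the en-dash" variant could be applied mechanically. The following is only a minimal sketch, not existing bilara-data code: the function name `protect_en_dash` is hypothetical, and it assumes that any en-dash preceded by a space is a break dash rather than a range.

```python
import re

NBSP = "\u00a0"      # no-break space
EN_DASH = "\u2013"

# One or more regular or no-break spaces directly before an en-dash.
_SPACE_BEFORE_DASH = re.compile(rf"[ {NBSP}]+{EN_DASH}")


def protect_en_dash(text: str) -> str:
    """Replace the spacing before a spaced en-dash with one no-break space.

    The space after the dash is left alone, so the line can still break
    there. Unspaced en-dashes (ranges such as 3.2-3.7 written with an
    en-dash) do not match the pattern and are returned unchanged.
    """
    return _SPACE_BEFORE_DASH.sub(NBSP + EN_DASH, text)
```

Because the pattern also matches an existing no-break space, running the function twice gives the same result as running it once, which matters if corrections are re-triggered on every update.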
Good point about the double no-break space making lines immoderately long. We'll try to go with the single no-break space if we can't think of anything better. For Bilara, I think a simple warning would suffice: it would focus attention but not forbid use. Thanks, Bhante.
It is very common in raw texts for punctuation and spacing to be used in inconsistent and incorrect ways. We should process the text in bilara-data so that it forces correct and consistent usage as far as possible.
These transformations should be applied in the published branch. Any updates to published texts should trigger the corrections again.
Certain transformations may not be universally applicable. For example, inserting soft hyphens (to allow long Pali words to break) should be per-application. These should not be done in bilara-data.
List of changes
Thus we should transform:
We should also force correct spacing:
.,:;?!—-
.,:;?!
1:120
,1.23
,3,4
(the latter in European languages.)A.B.C.
—-
En-dash (–) requires careful handling. It is normally used to indicate a range, where it has no space before or after. Normally a range is easily detected with \d–\d, however it may include alphabetic characters (an 3.2–an3.7). En-dash is also sometimes used to indicate a break in text. SC house style mandates em-dash for this in English, but in German – and perhaps other languages – spaced en-dash is used instead. One problem with such usage is that it can leave a dash to start a line. Thus it may be preferable to use no-break space before the dash.
Languages
The transformations should be designed for English, then apply by default for all languages. However we should also provide a language override, where any of the rules can be overridden.
Quotes, for example, are language-specific. German is:
CJK languages use a special space. This should be inserted at the end of segments instead of the standard space.
https://unicode-table.com/en/3000/
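One way the "English defaults plus language override" idea might be structured is a default rule table merged with a small per-language table. This is a sketch under stated assumptions: the rule names (`segment_end_space`, `break_dash`), the language codes, and both helper functions are hypothetical and are not taken from bilara-data.

```python
from typing import Dict

IDEOGRAPHIC_SPACE = "\u3000"  # the CJK space linked above

# Defaults designed for English. Rule names are hypothetical.
DEFAULT_RULES: Dict[str, str] = {
    "segment_end_space": " ",
    "break_dash": "\u2014",  # unspaced em-dash, per SC house style for English
}

# Per-language overrides; only rules that differ need to be listed.
# The language codes here are assumptions for illustration.
LANGUAGE_OVERRIDES: Dict[str, Dict[str, str]] = {
    "de": {"break_dash": "\u00a0\u2013 "},  # spaced en-dash, no-break space before
    "zh": {"segment_end_space": IDEOGRAPHIC_SPACE},
    "ja": {"segment_end_space": IDEOGRAPHIC_SPACE},
}


def rules_for(lang: str) -> Dict[str, str]:
    """Return the default rules with any overrides for `lang` applied."""
    return {**DEFAULT_RULES, **LANGUAGE_OVERRIDES.get(lang, {})}


def end_segment(text: str, lang: str) -> str:
    """Strip trailing whitespace, then append the language's segment-final space."""
    return text.rstrip() + rules_for(lang)["segment_end_space"]
```

A language with no entry simply falls back to the English defaults, so new languages need no configuration until a rule actually differs.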
Application
The classic library for this is smartypants.pl.
https://daringfireball.net/projects/smartypants/
There are many ports in different languages; we should use a modern update, for example:
https://github.com/othree/smartypants.js/
This should be done via GitHub Actions, and be part of the bilara-data checking suite:
https://github.com/suttacentral/bilara-data/issues/181
Miscellaneous
Other random errors that could be transformed: