w3c / alreq

Documenting gaps and requirements for support of Arabic and Persian on the Web and in eBooks.

Line breaking rules #237

Open xfq opened 3 years ago

xfq commented 3 years ago

Reading the Chinese national standard GB/T 32411-2015 Information technology for the Uyghur, Kazakh, and Kirghiz editor common software, I noticed the following text:

No line should begin with period, comma, question mark, exclamation mark, exclamatory question mark, colon, dash, closing single quotation mark, closing double quotation mark, closing parenthesis, and closing book title mark.

No line should end with opening single quotation mark, opening double quotation mark, opening parenthesis, and opening book title marks.

I wonder if Arabic/Persian has something similar. If so, I think we should document it (perhaps in § 4.1 Line breaking, see similar sections in clreq and jlreq).


By the way, should we document requirements in other languages using the Arabic script? For example, Arabic-derived Uyghur/Uighur requires marking of all vowels and uses hyphenation, which is different from Arabic and Persian.

asmusf commented 3 years ago

Those requirements can be generalized to:

  1. No line should begin with sentence, clause or phrase-ending punctuation.
  2. No line should end with sentence, clause or phrase-starting punctuation. (*)
  3. No paired punctuation should appear on a line that does not contain some of the contents enclosed by the pair.

(*) This rule applies, for example, to the Spanish use of inverted question and exclamation marks. It's easier to treat them as an anti-parallel case to sentence-ending punctuation than to give the regular question and exclamation marks a dual nature by sometimes treating them as part of a pair.

These rules can then be combined with those that govern whether and how words themselves can be broken.
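As a rough illustration of rules 1 and 2 only, here is a minimal Python sketch. The names `NO_LINE_START`, `NO_LINE_END`, `break_allowed`, and `allowed_breaks` are hypothetical, and the character sets are illustrative assumptions rather than data taken from GB/T 32411-2015, UAX #14, or CLDR. It treats every space as a candidate break and ignores rule 3 and any word-breaking or hyphenation rules.

```python
# Illustrative sets only (assumptions, not from any standard or data file).
# U+060C ARABIC COMMA, U+061B ARABIC SEMICOLON, U+061F ARABIC QUESTION MARK
NO_LINE_START = set(".,!?:;)]}\u00BB\u060C\u061B\u061F")  # ending punctuation (rule 1)
NO_LINE_END = set("([{\u00AB\u00BF\u00A1")                # starting punctuation (rule 2)


def break_allowed(text: str, i: int) -> bool:
    """Return True if a line break between text[:i] and text[i:] obeys rules 1 and 2.

    Spaces around the break are skipped, so the check looks at the last
    visible character of the first line and the first visible character
    of the next line.
    """
    before = text[:i].rstrip()
    after = text[i:].lstrip()
    if not before or not after:
        return False  # nothing visible on one side of the break
    if after[0] in NO_LINE_START:
        return False  # rule 1: next line would begin with ending punctuation
    if before[-1] in NO_LINE_END:
        return False  # rule 2: this line would end with starting punctuation
    return True


def allowed_breaks(text: str) -> list[int]:
    """List the indexes that directly follow a space and pass break_allowed."""
    return [
        i
        for i in range(1, len(text))
        if text[i - 1] == " " and text[i] != " " and break_allowed(text, i)
    ]


if __name__ == "__main__":
    print(allowed_breaks("say ( hi ) now"))
```

Because the inverted Spanish marks sit in `NO_LINE_END` rather than being paired with `?` and `!`, the regular question and exclamation marks need no dual role, which matches the simplification described in the footnote above.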

xfq commented 3 years ago

A generalized summary would help, but I think we also need to write the requirements clearly, otherwise implementers won't know which punctuation marks count as starting or ending punctuation (I don't know Arabic, but for Chinese, I don't know whether connector marks or interpuncts are considered "ending punctuation"). Although there is some data in UAX #14 and CLDR/ICU, that data is not necessarily accurate, and we can make it clearer in the requirements.

r12a commented 3 years ago

> I wonder if Arabic/Persian has something similar. If so, I think we should document them (perhaps in § 4.1 Line breaking, see similar sections in clreq and jlreq).

Indeed, that's one of the more obvious sections for which the task force didn't yet provide detail.

> By the way, should we document requirements in other languages using the Arabic script? For example, Arabic-derived Uyghur/Uighur requires marking of all vowels and uses hyphenation, which is different from Arabic and Persian.

Certainly, but not in this document, whose scope was limited in the group charter to Arabic and Persian because they were similar and the group participants were not familiar with Uighur.

I'd certainly be interested in getting hold of a copy (in English) of the standard you mentioned, so that we can apply the information it contains to our language enablement program.

r12a commented 3 years ago

Actually, what needs to be said here is a little more complicated than listing characters that should or shouldn't appear at one particular end of a line. Fwiw, at https://r12a.github.io/scripts/arabic/#linebreak_props you can find the default Unicode line-break properties for the (non-ASCII) characters that I think are needed for Arabic (not Persian) language support (a slightly different list from the one in alreq, which was more closely tied to CLDR). It's possible that tailoring needs to be applied to the list for Arabic language text.