w3c / sealreq

Southeast Asian layout task force

Do Javanese & Balinese lines break at syllable or word boundaries? #2

Open r12a opened 6 years ago

r12a commented 6 years ago

The information i have come by so far indicates that Balinese, Javanese, and Batak have break opportunities between each syllable, rather than strongly preferring breaks between words like Thai and Khmer do.

Which of these is true?

  1. You can break text at line end for Balinese, Javanese, and Batak between syllables without being concerned about word boundaries.
  2. There is a preference for breaking at word boundaries, but breaking at syllable boundaries is also common.
  3. Text in those scripts should always break recognisable words at word boundaries.

Since a language like Javanese typically uses disyllabic word roots, are there differences between the expected behaviour of word roots and other parts of the text?

Where syllable-final consonants and syllable-initial consonants can stack, or take on special conjoined forms, do those ever get split? ie. are we talking about orthographic syllables here, rather than phonetic ones?

What happens if you have a longish string of text where all syllables are connected by stacking or special conjoined forms?

Advice (especially with examples) would be very much appreciated.

NorbertLindenberg commented 6 years ago

I’d suggest splitting this into separate questions for the scripts of Indonesia, as the answers are likely to differ. Batak is quite separate from Javanese and Balinese. Sundanese would be closer, but differs from Javanese and Balinese already in its use of spaces and lack of conjunct forms in modern writing.

adtbayuperdana commented 6 years ago

Speaking for Javanese, I think line breaks can occur between syllables without being concerned about word boundaries. However, there's a special caveat: a break may not occur in such a way that it would make a pangkon appear in the middle of the sentence.

[Image: line break in Javaansche brieven]

This is a random paragraph from Javaansche brieven berigten verslagen (1845). Virtually all of the lines feature a line break within the final word:

  1. ꦩ-ꦠꦼꦁ (ma-teng)
  2. ꦲꦸꦠꦸꦱ꧀ꦱꦤ꧀ꦤꦶ-ꦥꦸꦤ꧀ (utusassanni-pun, from the individual words ꦲꦸꦠꦸꦱꦤ꧀ utusan and ꦲꦶꦥꦸꦤ ipun)
  3. ꦠꦸ-ꦮꦤ꧀ (tu-wan)
  4. ꦔ-ꦤ꧀ꦤ (nga-nna; in this example, it cannot be broken into ꦔꦤ꧀-ꦤ ngan-na because that would trigger a pangkon in the middle of a sentence)
  5. ꦏ-ꦮꦺꦴꦤ꧀ (ka-won)
  6. ꦫꦩ꧀ꦥꦸꦁ-ꦔꦤ꧀ (rampung-ngan)
  7. ꦏꦸ-ꦭ (ku-la)
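The pangkon caveat can be expressed as a simple check on a candidate break position. The following is a minimal Python sketch, not an established algorithm: the function names and the code point range treated as "letter" are assumptions made for illustration.

```python
PANGKON = "\uA9C0"  # JAVANESE PANGKON


def is_javanese_letter(ch: str) -> bool:
    """Assumption: U+A984..U+A9B2 covers independent vowels and consonants."""
    return "\uA984" <= ch <= "\uA9B2"


def break_exposes_pangkon(text: str, index: int) -> bool:
    """True if a line break before text[index] would separate a pangkon
    from the letter it joins, leaving a visible pangkon mid-sentence."""
    return (
        0 < index < len(text)
        and text[index - 1] == PANGKON
        and is_javanese_letter(text[index])
    )


# nga, na, pangkon, na (example 4 in the comment above)
word = "\uA994\uA9A4\uA9C0\uA9A4"
print(break_exposes_pangkon(word, 1))  # break as nga | nna
print(break_exposes_pangkon(word, 3))  # break as ngan | na
```

Under these assumptions the first candidate is accepted and the second is rejected, which matches the behaviour described for nga-nna versus ngan-na.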

r12a commented 6 years ago

ꦏ-ꦮꦺꦴꦤ꧀ (ka-won)

I understand that in some printed material, when a new line begins with ◌ꦺ [U+A9BA JAVANESE VOWEL SIGN TALING], an additional spacing one is placed at the end of the previous line. This seems to be an example of that feature. Is that correct?

I wonder how one is expected to produce that. (Especially for the Web) I assume that it must be dependent on the algorithm used for the justification process, since widening or narrowing the browser window or margins will change the location of the break, and the duplication of taling we see here would only be appropriate if the window was just the right width.

[update] Note that it would be difficult to produce the text above otherwise, because the taling appears to the left of its base character, not to the right.

adtbayuperdana commented 6 years ago

Yes, it is that. This is a historical use that can be found easily in colonial era printed books and handwriting. However, I haven't found any contemporary texts that incorporate this as it is not mentioned in modern standards whether this behavior is mandatory or discretionary.

No idea how to produce this digitally with the current Javanese codepoints, unfortunately. Once, a font I made used a very quick fix: it mapped ᭢ [U+1B62 BALINESE MUSICAL SYMBOL DENG] to a Javanese TALING glyph, since the symbol is not combining and its glyph is visually equivalent to a taling.

adtbayuperdana commented 6 years ago

[Images: redun taling2, redun taling, redun taling1]

Examples of taling duplication at line breaks in 1800s handwriting. It seems that this behavior is not a quirk of metal typesetting, since handwritten examples can also be found.

NorbertLindenberg commented 6 years ago

Taling duplication would have to be implemented in a way similar to hyphenation – the additional taling can’t be part of the text, but has to be added by the line break algorithm when the line breaks before a syllable that has a taling.

As Bayu implies, one complication is that you can’t actually just insert taling itself, because it would either become part of the previous syllable and get reordered during font rendering, or it would be flagged as misplaced with a dotted circle. The Unicode standard (page 674 of version 10) suggests inserting U+00A0 NO-BREAK SPACE before the taling. This character is a valid base, and taling would get reordered around it so that there's no visible space in the text, but the additional space might confuse justification.
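As a rough illustration of the hyphenation-like approach, the sketch below computes what a line breaker might render at the end of a line, without storing it in the text. It assumes the caller already knows the orthographic syllable that starts the next line; the function name `line_end_addition` is hypothetical, and the NO-BREAK SPACE base follows the suggestion cited above.

```python
TALING = "\uA9BA"  # JAVANESE VOWEL SIGN TALING
NBSP = "\u00A0"    # NO-BREAK SPACE, used as a base for the standalone taling


def line_end_addition(next_line_first_syllable: str) -> str:
    """Return the extra text to render at the end of the current line.

    Assumption: the argument is a single orthographic syllable, so any
    taling in it belongs to the start of the next line.
    """
    if TALING in next_line_first_syllable:
        return NBSP + TALING
    return ""


# ka | wo(n): the next line starts with wa + taling + tarung
first, second = "\uA98F", "\uA9AE\uA9BA\uA9B4"
rendered_first_line = first + line_end_addition(second)
```

Because the addition is computed per break, it disappears again when the window is resized and the break moves elsewhere.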

NorbertLindenberg commented 6 years ago

If we accept that lines break at syllable boundaries, we need to clearly define what constitutes a syllable. The definition in the Unicode standard (page 673 of version 10 for Javanese) doesn’t match the one used by the Universal Shaping Engine, and actual usage in the real world may not match either. For example, I was given the word ᬫ᭄ᬩᬓ᭄ as an example for a single-syllable word, but the USE considers it two syllables: ᬫ᭄ᬩ + ᬓ᭄. For ᬢᬾᬫ᭄ᬧᭀ, I was given the possible line break ᬢᬾᬫ᭄ + ᬧᭀ, while the USE would break ᬢᬾ + ᬫ᭄ᬧᭀ.

adtbayuperdana commented 6 years ago

After scanning Javanese books again, it seems I was wrong. Many times, line breaks do not occur at syllable or word boundaries. For example, here in De Bråtå-Joedå de Råmå en de Ardjoena Såsrå (1845):

[Image: untitled-1]

The first instance is two words, aken perang (ꦲꦏꦼꦤ꧀ꦥꦼꦫꦁ), which are broken into ake-nperang (ꦲꦏꦼ-ꦤ꧀ꦥꦼꦫꦁ). The second is pangandika (ꦥꦔꦤ꧀ꦢꦶꦏ), which is broken into panga-ndika (ꦥꦔ-ꦤ꧀ꦢꦶꦏ).

I don't know how to describe this, but as long as there is no pangkon, anything goes it seems.

NorbertLindenberg commented 6 years ago

ꦲꦏꦼ-ꦤ꧀ꦥꦼꦫꦁ and ꦥꦔ-ꦤ꧀ꦢꦶꦏ are exactly what I'd expect from looking at the USE syllable definition.

Maybe Javanese simply has different orthographic and phonologic syllables? In speaking, it seems natural to treat the /n/ as a final consonant, but in writing the ꦤ are the base consonants for the clusters they're in.
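To make the orthographic reading concrete, here is a simplified Python sketch of cluster segmentation. It is only an approximation written for this discussion, not the actual USE grammar: the character classes are assumed from the Javanese block layout, and `orthographic_syllables` is a hypothetical name.

```python
PANGKON = "\uA9C0"  # JAVANESE PANGKON


def is_base(ch: str) -> bool:
    # Assumption: independent vowels and consonants, U+A984..U+A9B2
    return "\uA984" <= ch <= "\uA9B2"


def is_mark(ch: str) -> bool:
    # Assumption: signs U+A980..U+A983 and U+A9B3..U+A9BF attach to a base
    return "\uA980" <= ch <= "\uA983" or "\uA9B3" <= ch <= "\uA9BF"


def orthographic_syllables(text: str) -> list:
    """Split text into clusters: base, then marks and pangkon-joined bases."""
    clusters = []
    i, n = 0, len(text)
    while i < n:
        start = i
        i += 1
        if is_base(text[start]):
            while i < n:
                if is_mark(text[i]):
                    i += 1
                elif text[i] == PANGKON:
                    i += 1  # the pangkon stays with the preceding base
                    if i < n and is_base(text[i]):
                        i += 1  # the joined consonant continues the cluster
                    else:
                        break   # visible pangkon ends the cluster
                else:
                    break
        clusters.append(text[start:i])
    return clusters


# pa, nga, na + pangkon + da + wulu, ka (pangandika)
pangandika = "\uA9A5\uA994\uA9A4\uA9C0\uA9A2\uA9B6\uA98F"
print(" | ".join(orthographic_syllables(pangandika)))
```

With these classes, the /n/ of pangandika ends up as the base of the third cluster, so a break after the second cluster gives panga-ndika rather than pangan-dika.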

adtbayuperdana commented 6 years ago

In speaking, it seems natural to treat the /n/ as a final consonant, but in writing the ꦤ are the base consonants for the clusters they're in.

I'm inclined to agree with this.

r12a commented 6 years ago

@adtbayuperdana the type of breaking you are describing is what i was referring to above when i said 'orthographic'. It is common in scripts descended from Brahmi. For example, the word 'Hindi' in Hindi contains the phonetic syllables hin-di, but the orthographic syllables hi-ndi, meaning that were you to split the word across lines you'd move the n to the second line.

I should clarify that i'm referring to the word 'hindi' when written as हिन्दी (with a conjunct), rather than हिंदी (using an anusvara, where the orthographic syllables would be the same as the phonetic). Both spellings are possible for this word in Hindi.

adtbayuperdana commented 6 years ago

@r12a I see, my mistake then. I did not fully understand what you meant by orthographic syllable at the time.

adtbayuperdana commented 6 years ago

@r12a Bit of a nitpick: There's this bit in the hyphenation section of Javanese gap analysis:

There is a feature in use in print sometimes when a line starts with...

and this in linebreaking:

In 19th century texts, when a new line begins with...

Since taling hyphenation turns out to occur in handwritten manuscripts as well, and I've recently found such a taling in a handwritten Primbon from as late as the 1960s, maybe we should say that it is common in printed and handwritten colonial-era texts, but not used in contemporary standards. Something to that effect.

eteo commented 6 years ago

It might be interesting to take a look at the convention used in a short story book entitled Kiambang that was published in October 2015, in Javanese script by Adien Gunarta.

I wonder ...

  1. Are the line breaks used in this short story based on the Indonesian-language convention?
  2. What software did the publisher use to set the pages? How did they do the line breaks?
  3. What are the rules running in the background that determine where the justification and line breaks will be? I can't imagine doing them manually for 146 pages...

A digital copy of the entire book (146 pages) is available on Scribd here: https://www.scribd.com/document/348930669/KIAMBANG-oleh-Adien-Gunarta

@adtbayuperdana: Looks like your Pustaka and Aturra fonts were used in this book.

adtbayuperdana commented 6 years ago

@eteo I did help Adien write this book and provided fonts for it, actually, though only a little. He was the one who came up with the idea and went through the typing, layout, publishing, and such. It should be pointed out, though, that Kiambang is actually written in the Indonesian language with the Javanese script, rather than in Javanese language and script. At the time, we were both still familiarizing ourselves with Javanese orthography and may not yet have been aware of certain minutiae; from rereading it, though, my understanding is that it does not have glaring errors.

Line breaking-wise, Kiambang mostly uses orthographic syllables in line breaks. There are some odd bits, though, such as a Pada Lungsi that wraps to the second line within a single sentence, or (unrelated to line breaking) the use of Ra Agung for a proper geographic name.

r12a commented 6 years ago

@adtbayuperdana wrt

Bit of a nitpick:

Thanks. I moved all that into the hyphenation section, and added your comment.

r12a commented 6 years ago

So my current theory is the following:

Scripts that don't separate words with spaces (or anything else), but that stack or conjoin consonant clusters include Myanmar, Khmer, Javanese & Balinese. In the case of Myanmar and Khmer, the stacking behaviour only really occurs within a word (could be multisyllabic word or initial or final cluster). In Javanese & Balinese, however, the stacking can span word boundaries, such as in ꦥꦔꦤ꧀ꦢꦶꦏ (pangan dika) and ᬧᬓ᭄ᬭᬫᬦ᭄ (pak Raman).

In a situation where stacked/conjoined clusters span word boundaries, they cannot be separated, therefore it is much more likely that you NEED to break on syllable boundaries, rather than word boundaries. Since such stacks don't usually span word boundaries in Myanmar and Khmer, it's easy to wrap at word boundaries instead.

Does that make sense?

adtbayuperdana commented 6 years ago

So my current theory is the following:

I think it makes sense to me.

adtbayuperdana commented 6 years ago

While browsing through the British Library's Balinese lontar, I came across this use of pameneng at line breaks:

[Image: line break in Balinese]

It seems that Balinese bisah (ᬄ), which is equivalent to Javanese wignyan (​ꦃ), is able to stand alone unattached to a base with the use of pameneng. In this case, Balinese line breaks may not be strictly orthographic like Javanese.

The lontar in question, Or 14022, depicts scenes from the Ramayana and was copied in 1975.

r12a commented 5 years ago

For a page that lets you experiment with line-breaking in Javanese on various browsers, see https://w3c.github.io/i18n-tests/css-text/line-breaking/exp-jv-line-break-000.html, https://w3c.github.io/i18n-tests/css-text/line-breaking/exp-jv-line-break-zwsp-000.html and https://w3c.github.io/i18n-tests/css-text/line-breaking/exp-jv-line-break-textarea-000.html

For related experimental tests, see https://w3c.github.io/i18n-tests/results/exploring-linebreak

One noteworthy item i came across is that normal HTML text doesn't break lines in Javanese, but text in a textarea form does! Whether or not it does so correctly is something i'd like to know.
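One way to experiment with this, in the spirit of the ZWSP test page linked above, is to pre-process text so that it carries explicit break opportunities. The sketch below is an assumption-laden illustration rather than a recommended practice: the helper name `add_break_opportunities` and the character ranges are hypothetical choices, and it only marks boundaries before a Javanese letter that is not joined to a preceding pangkon.

```python
import re

ZWSP = "\u200B"  # ZERO WIDTH SPACE

# A boundary lies before a Javanese letter (U+A984..U+A9B2) that follows any
# Javanese letter or sign other than pangkon (U+A9C0 is outside the range).
_BOUNDARY = re.compile("(?<=[\uA980-\uA9BF])(?=[\uA984-\uA9B2])")


def add_break_opportunities(text: str) -> str:
    """Insert ZWSP at assumed orthographic syllable boundaries."""
    return _BOUNDARY.sub(ZWSP, text)


# pa, nga, na + pangkon + da + wulu, ka (pangandika)
pangandika = "\uA9A5\uA994\uA9A4\uA9C0\uA9A2\uA9B6\uA98F"
marked = add_break_opportunities(pangandika)
print(marked.count(ZWSP))
```

Text outside the Javanese block is left untouched, so the function can be applied to mixed-script content.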

NorbertLindenberg commented 4 years ago

Line breaks within an orthographic syllable, as discussed by @adtbayuperdana for Balinese above, seem to occur in Javanese too. In the book Kilas balik kelengkapan aksara Jawa dari masa ke masa I see a few cases:

  • page 46, before postjoined consonant ◌꧀ꦥ
  • page 51, before postjoined consonant ◌꧀ꦱ
  • right below, before final consonant ◌ꦃ

Is this common and covered by any standard, or is it just the author’s personal style?

NorbertLindenberg commented 4 years ago

@r12a I see different behavior in different browsers for Javanese line breaking in textarea. Firefox 73 does a pretty good job breaking at the boundaries of orthographic syllables unless the text area gets too small even for individual syllables. Safari 13.0.5, on the other hand, doesn’t always split ꦭꦲꦶꦫꦏꦺ, ꦏꦺꦏꦤ꧀ꦛꦶꦩꦂ, ꦠꦧꦠ꧀ꦭ, and ꦏ꧀ꦲꦏ꧀ꦏꦁꦥ into their component syllables, but does sometimes split the syllables ꦤ꧀ꦢꦂ, ꦤ꧀ꦲ, and ꦏ꧀ꦲ after the virama, and ꦮꦺꦴꦁ before the final consonant.

NorbertLindenberg commented 4 years ago

For Balinese, Ida Bagus Adi Sudewa’s description of the script says it traditionally was “common practice to break the sentence at any places”. Another expert also said that each horizontal character is standing alone, and gave me the following examples for line breaks:

  1. ᬫ᭄ᬩᭀ ᬾ᭠ + ᬫ᭄ᬩᬵ ᭢ᬫ᭄ᬩ᭠ + ᬵ
  2. ᬫ᭄ᬧᭀ ᬾ᭠ + ᬫ᭄ᬧᬵ ᬫ᭄ᬧᬾ᭠ + ᬵ ᬫᬾ᭠ + ᭄ᬧᬵ
  3. ᬧ᭄ᬭᬲᬵᬤ ᬧ᭄ᬭ᭠ + ᬲᬵᬤ ᬧ᭄ᬭᬲ᭠ + ᬵᬤ
  4. ᬓᬫᭀᬓ᭄ᬱᬦ᭄ ᬓ‌ᬾ᭠ + ᬫᬵᬓ᭄ᬱᬦ᭄ ᬓᬫᬾ᭠ + ᬵᬓ᭄ᬱᬦ᭄ ᬓᬫᭀᬓ᭠ + ᭄ᬱᬦ᭄ ᬓᬫᭀᬓ᭄ᬱᬦ᭠ + ᭄

Note that finding some of these breaks, the ones that separate ᬾ from the surrounding cluster, requires knowledge of vowel reordering. Note also that a number of them cause OpenType renderers to insert dotted circles, which would have to be worked around in some fashion.
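As a small illustration of the reordering knowledge involved: a two-part vowel sign such as U+1B40 BALINESE VOWEL SIGN TALING TEDUNG has a canonical decomposition into a pre-base part (taling) and a post-base part (tedung), and a line breaker would need those parts separately to place a break between them. The Python sketch below uses the standard `unicodedata` module; the helper name `split_vowel_sign` is hypothetical.

```python
import unicodedata

TALING_TEDUNG = "\u1B40"  # BALINESE VOWEL SIGN TALING TEDUNG


def split_vowel_sign(sign: str) -> list:
    """Return the canonical (NFD) parts of a vowel sign.

    A sign without a decomposition comes back as a one-element list.
    """
    return list(unicodedata.normalize("NFD", sign))


parts = split_vowel_sign(TALING_TEDUNG)
print([f"U+{ord(p):04X}" for p in parts])
```

This only exposes the two parts; deciding where they appear visually relative to the base, and avoiding dotted circles when they are rendered apart, is left to the renderer and is not addressed here.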

I.B. Adi Sudewa also says:

For modern writing, the following rules of thumb should apply:

  • No line breaks allowed between syllable and any of its signs
  • No line breaks allowed just before a colon, comma or full stop
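The two rules of thumb can be sketched as a predicate on a candidate break position. This is a minimal Python sketch under stated assumptions: the ranges treated as "signs" (U+1B00..U+1B04 and U+1B34..U+1B44) and the mapping of colon, comma and full stop to U+1B5D, U+1B5E and U+1B5F are my reading of the Balinese block, and `modern_break_allowed` is a hypothetical name. Conjunct handling after adeg adeg is deliberately left out.

```python
# Assumed Balinese equivalents of colon, comma and full stop
NO_BREAK_BEFORE = {"\u1B5D", "\u1B5E", "\u1B5F"}


def is_balinese_sign(ch: str) -> bool:
    # Assumption: combining signs, vowel signs, rerekan and adeg adeg
    return "\u1B00" <= ch <= "\u1B04" or "\u1B34" <= ch <= "\u1B44"


def modern_break_allowed(text: str, index: int) -> bool:
    """Apply only the two rules quoted above to a break before text[index]."""
    if not 0 < index < len(text):
        return False
    following = text[index]
    if is_balinese_sign(following):
        return False  # would separate a syllable from one of its signs
    if following in NO_BREAK_BEFORE:
        return False  # would start a line with punctuation
    return True


# a, nga, ulu, na, adeg adeg
word = "\u1B05\u1B17\u1B36\u1B26\u1B44"
print([i for i in range(len(word)) if modern_break_allowed(word, i)])
```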

NorbertLindenberg commented 4 years ago

I looked through the three volumes of Mardi Kawi to see its line breaking behavior (it’s mostly written in Javanese script). I did not see a single instance of line breaks within an orthographic syllable. I did see a few instances of taling duplication.

r12a commented 1 month ago

Here's a finer point that i'd like to get clarity on, if possible.

There are phonetic syllables in Javanese that end with a coda, such as -ꦤ꧀. If the word occurs at the end of a sentence, the pangkon remains visible. If the coda glyphs stray across the line end, is the coda wrapped alone to the next line, or is the expectation to wrap it with the preceding orthographic syllable?

For example, which of the following best represents how ꦲꦔꦶꦤ꧀ is wrapped?

  1. ꦲꦔꦶ-ꦤ꧀
  2. ꦲ-ꦔꦶꦤ꧀
  3. ꦲꦔꦶꦤ-꧀

I'm guessing that the coda is wrapped alone, but just wanted to check because of example 4 in Norbert's comment.

Same question for Balinese, eg. ᬅᬗᬶᬦ᭄.

adtbayuperdana commented 1 month ago

I have seen 1 and 2 for both Balinese and Javanese. I don't recall ever seeing 3 in Javanese. For Balinese, I think I saw it once in a lontar, but otherwise 1 and 2 are the most common.

r12a commented 3 weeks ago

Thanks @adtbayuperdana. That says to me that it isn't necessary to keep full phonetic syllables together when line breaking, but that the pangkon should typically be kept with any syllable coda.