w3c / jlreq

Text Layout Requirements for Japanese
https://w3c.github.io/jlreq/

How to do bunsetu-separated rendering #17

Open r12a opened 6 years ago

r12a commented 6 years ago

Makoto Murata is working on Accessibility Requirements on Japanese Typography.

He says the following about adding space between bunsetsu (word-like phrases in Japanese - see https://en.wikipedia.org/wiki/Japanese_grammar#Sentences,_phrases_and_words):

This is meant to help students with dyslexia. When Japan creates the next set of DAISY textbooks, we hope to use a single EPUB3 document for general-ruby/para-ruby/no-ruby rendering (総ルビ/パラルビ/ルビ無し) as well as space-separated-bunsetu rendering and normal rendering (分かち書き/普通の表示). Moreover, since DAISY textbooks are for students, 分かち書き has to be perfect (morphological analysis alone may provide an incorrect result).

I can also imagine that Mainland China and Taiwan have students with dyslexia, and that the same mechanism might be useful there.

Normal Japanese/Chinese text does not, in itself, indicate break opportunities for bunsetsu separation. It will be necessary to provide a mechanism that allows bunsetsu separation to be applied to normal text.

I have a number of questions around the topic:

  1. Murata-san, is bunsetsu-spacing a recognised and widely used technique in existing text? Or is this a new idea?

  2. Will this mechanism be different from the way line-breaking occurs in Japanese, since the grammatical particles are considered part of the bunsetsu unit?

  3. Would we be looking at a new CSS property? Styling seems appropriate, since the intent appears to be to use the text as normal elsewhere, and to apply the accessibility changes to existing text (ruling out the possible use of spaces, zero-width or otherwise).

  4. Given a new property for bunsetsu spacing, will it be necessary to change default line-breaking and justification behaviour, since presumably (?) the gaps will count as word separators?

murata2makoto commented 6 years ago

Murata-san, is bunsetsu-spacing a recognised and widely used technique in existing text? Or is this a new idea?

No, it is not new. It has been widely used for years with first- and second-grade students in elementary schools. It is also useful for students who have difficulties such as dyslexia. But the idea of using a single source for both wakachi-gaki rendering and non-wakachi-gaki rendering is new.

Will this mechanism be different from the way line-breaking occurs in Japanese, since the grammatical particles are considered part of the bunsetsu unit?

The JLreq WG of APL is discussing this topic. We will get back to you soon.

Would we be looking at a new CSS property? Styling seems appropriate, since the intent appears to be to use the text as normal elsewhere, and to apply the accessibility changes to existing text (ruling out the possible use of spaces, zero-width or otherwise).

Bunsetsu-based spacing and bunsetsu-based line-breaking are related but they are different.

I guess that we need a CSS property for enabling/disabling bunsetsu-based line-breaking.

Given a new property for bunsetsu spacing, will it be necessary to change default line-breaking and justification behaviour, since presumably (?) the gaps will count as word separators?

I think that the word-spacing property of CSS is good enough for bunsetsu-based spacing. But we need a Unicode character to mark the boundary.

duerst commented 6 years ago

@murata2makoto: You say "I think that word-spacing of CSS is good enough". I understand this to mean that it is possible to switch between spaced (for low grades and accessibility) and non-spaced (for general readers) display using CSS. I can imagine this to work in theory with word-spacing: -100 (see https://www.w3.org/TR/css-text-3/#propdef-word-spacing), but this would have 3 problems:

  1. Percentage values for the word-spacing property are currently at risk.

  2. The spec says that there may be implementation-dependent limits for negative values.

  3. Old browsers don't die out very quickly, so deployment on the general internet may be very slow. (This would be different if the property were needed only for the 'special' case, where interested readers may be motivated to upgrade.)

If you know of another property that would work, I'd like to know. Or did you just mean that having a new CSS property (adequately defined) would be enough to solve the problem? As a Unicode character, I guess U+200B ZERO WIDTH SPACE would be my first candidate.

murata2makoto commented 6 years ago

@duerst First of all, I am open to suggestions. At this stage, I would like to first make the requirements very clear. Comments on Accessibility Requirements on Japanese Typography are very welcome.

Having said that, I actually assumed some Unicode character that would occupy zero-width when letter-spacing and word-spacing are both normal. I did not assume negative values for these properties.

upsuper commented 6 years ago

My understanding is that word-spacing sounds like a good fit in this case, and U+200B does sound like the right thing to use for word boundaries. However, the current CSS Text spec says:

or if a word-separating character has a zero advance width (such as the zero width space U+200B) then the user agent must not create an additional spacing between words.

It's probably worth understanding why it says that, and whether there is going to be any webcompat impact if we change that behavior.
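
To make the constraint concrete, here is a minimal sketch of the single-source idea discussed above. The class names and the sample sentence are made up for illustration. It shows the markup one would like to write, and the comments note why the quoted rule means the second paragraph is not expected to render with gaps under the current spec text.

```html
<!DOCTYPE html>
<html lang="ja">
<head>
<meta charset="utf-8">
<title>Single source, two renderings (sketch)</title>
<style>
  /* Hypothetical class names; not from the thread. */
  .normal  { word-spacing: normal; }

  /* Intended effect: open a gap at every bunsetsu boundary.
     Under the CSS Text rule quoted above, a user agent must not add
     spacing at a zero-advance-width separator such as U+200B, so this
     declaration is not expected to produce visible gaps today. */
  .wakachi { word-spacing: 0.5em; }
</style>
</head>
<body>
  <!-- &#8203; is U+200B ZERO WIDTH SPACE, placed at each bunsetsu boundary. -->
  <p class="normal">わたしは&#8203;がっこうへ&#8203;いきます。</p>
  <p class="wakachi">わたしは&#8203;がっこうへ&#8203;いきます。</p>
</body>
</html>
```

Both paragraphs share identical text content, which is the point of the single-source requirement; only the styling would differ if the spec behaviour were changed or a new property were defined.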

r12a commented 6 years ago

I can imagine this to work in theory with word-spacing:

(not sure what the -100 means, but I think it's just a cut&paste glitch)

I'm not so sure. The word-spacing property currently allows quantitative control over spacing between words when a word separator character has been identified.

I'm thinking off the top of my head here, but I think it would be better to define a qualitative switch that says "turn on word separation for scriptio continua scripts". This would then allow us to apply accessibility improvements to ordinary text that hasn't been specially prepared in advance (ie. with insertion of ZWSP or whatever). It would also allow us to apply the same property to those SE Asian scripts where we also cannot expect people to insert ZWSP as a general rule.

Let's suppose we invent a new property called create-word-breaks. We could then have values that allow for definition of a word as a syllable, a short definition of a word, or a long definition (these are things that the SE Asian group is already thinking about for the needs of Thai and Khmer). We could then perhaps continue to use the word-spacing property to indicate the extent to which the words are separated.

We could also define create-word-breaks so that it changes the behaviour of word-spacing relative to ZWSP. This may tie in with a need being formulated in the SE Asian group to support line-breaking on ZWSP and ignore dictionaries for less developed languages where the available dictionary (for the standard language associated with a script) doesn't work (eg. Shan vs Burmese).

This is just brainstorming at this point.

kojiishi commented 6 years ago

It's not only for accessibility in Japanese, either. I have heard similar requests for presentation slides or short text in UI, where people want line breaking only at "word" boundaries. In the example below, the 2nd item from the bottom breaks early because "ニュース/スタンド" looks much nicer than "ニューススタ/ンド".

image

This site does this by:

<span style="word-break: keep-all">ニュース&#8203;スタンド</span>
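
For anyone who wants to try this, a self-contained sketch of the same technique follows. The list, the class name, and the 5em width are assumptions chosen only to force a wrap; they are not taken from the site in the screenshot.

```html
<!DOCTYPE html>
<html lang="ja">
<head>
<meta charset="utf-8">
<title>keep-all with ZWSP (sketch)</title>
<style>
  /* Narrow items so that the 8-character label cannot fit on one line. */
  li { width: 5em; border: 1px solid #ccc; }
  /* Hypothetical class name: suppress the implicit breaks between kana. */
  .phrase { word-break: keep-all; }
</style>
</head>
<body>
<ul>
  <!-- Default behaviour: the label may break wherever the line fills up. -->
  <li>ニューススタンド</li>
  <!-- keep-all plus U+200B: the only break opportunity left is at the ZWSP. -->
  <li><span class="phrase">ニュース&#8203;スタンド</span></li>
</ul>
</body>
</html>
```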

This is just brainstorming at this point.

Me too; I agree it would be great if we can solve this nicely. Maybe this has some similarity with the text-wrap property? We tentatively have balance and multi-line, but people seem to want many different ways to do it, probably more than all browsers can provide interoperably. That part seems similar to this issue to me.

murata2makoto commented 6 years ago

A recently announced DAISY reader supports bunsetsu-based line breaking and bunsetsu spacing.

http://www.plextalk.com/jp/education/products/e-reader/

I heard from the developers that they use morphological analysis and some manual adjustment for creating HTML markup that represents bunsetsu boundaries. Then, their reading system uses such HTML for bunsetsu-based line breaking and bunsetsu spacing.

macnmm commented 5 years ago

inserting extra characters could have the side-effect of breaking existing mojikumi spacing, in that the adjacent character class logic looks at the unicode of the space and not the character after it.

I agree the application of such a feature is useful for display type usage or social media graphics type layout, where breaking short lines on linguistic boundaries is more desirable than breaking anywhere. In such applications we are considering running the text through linguistic analysis to determine "desired" line breaks in addition to the strictly legal ones. If you put a special linguistic break marker (ignored by mojikumi processing) similar to how hyphens are inserted (and show or hide optionally like hyphens), that could work...

murata2makoto commented 5 years ago

@macnmm wrote:

inserting extra characters could have the side-effect of breaking existing mojikumi spacing, in that the adjacent character class logic looks at the unicode of the space and not the character after it.

@frivoal, @fantasai, @r12a and other APL members discussed this. We are inclined to use <wbr> elements and the zero-width space of Unicode. Will they cause problems?

kojiishi commented 5 years ago

We are inclined to use <wbr> elements and the zero-width space of Unicode.

Did you mean "<wbr> or zero-width space", or really using both of them?

<wbr> sounds reasonable to me.

murata2makoto commented 5 years ago

Yes, both. @frivoal, could you explain why both?

asmusf commented 5 years ago

However, the current CSS Text spec says:

or if a word-separating character has a zero advance width (such as the zero width space U+200B) then the user agent must not create an additional spacing between words.

Probably because the standard use of 200B is to mark invisible word boundaries and by default you don't want to add inter-word space there (The Unicode Standard suggests that adding inter-character space is expected, e.g. in justification).

Not sure whether the CSS wording would need to remain if a new CSS parameter were added that explicitly calls for increased spacing, as long as the default remained.

frivoal commented 5 years ago

Yes, both. @frivoal, could you explain why both?

We meant that implementations would have to support both, not that authors would have to use both together. Authors can use either.

As to why:

I've made a quick-and-dirty draft specification based on the discussion we had in Tokyo last month, including a few examples. Please have a look: https://specs.rivoal.net/css-space-expansion/

inserting extra characters could have the side-effect of breaking existing mojikumi spacing, in that the adjacent character class logic looks at the unicode of the space and not the character after it.

We should absolutely make sure that this is not the case. That sounds like an addition/clarification to https://drafts.csswg.org/css-text-4/#text-spacing-property

kojiishi commented 5 years ago

Just learned that the open source library budou can help with this topic. /cc @tushuhei

kojiishi commented 5 years ago

Question. I understood you want to use either ZWSP or <wbr> to insert a break opportunity. How are you planning to prevent break opportunities normally available?

I was guessing that you're planning to use word-break: keep-all, but I found that it cannot handle, for instance, "365 日の" or "Windows では" to be non-breakable. These "bunsetu" may even include a space character as in these examples.

duerst commented 5 years ago

I was guessing that you're planning to use word-break: keep-all, but I found that it cannot handle, for instance, "365 日の" or "Windows では" to be non-breakable. These "bunsetu" may even include a space character as in these examples.

What about using a nonbreaking space in these cases?

frivoal commented 5 years ago

Using &nbsp; (or the thinner NARROW NO-BREAK SPACE (U+202F)) is probably a pragmatic solution right now. Longer term, I think there should actually be no space character of any kind in the markup, and we should instead use text-spacing: ideograph-alpha and text-spacing: ideograph-numeric to control this visual separation at layout time.
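
A rough sketch of that pragmatic approach, with a made-up sentence, class name, and container width: a no-break space keeps the Latin or digit part attached to the rest of its bunsetsu, <wbr> marks the bunsetsu boundaries, and word-break: keep-all suppresses the ordinary breaks between kana and kanji.

```html
<!DOCTYPE html>
<html lang="ja">
<head>
<meta charset="utf-8">
<title>No-break spaces inside bunsetsu (sketch)</title>
<style>
  /* Hypothetical class name and width, chosen only to force wrapping. */
  .bunsetsu-wrap { word-break: keep-all; width: 8em; border: 1px solid #ccc; }
</style>
</head>
<body>
  <!-- &nbsp; is U+00A0 NO-BREAK SPACE; &#x202F; is NARROW NO-BREAK SPACE.
       Each <wbr> marks a bunsetsu boundary as a break opportunity. -->
  <p class="bunsetsu-wrap">Windows&nbsp;では、<wbr>365&#x202F;日の<wbr>サポートを<wbr>受けられます。</p>
</body>
</html>
```

This only covers the cases where a space is present in the source; it does not by itself address text written without a space.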

kojiishi commented 5 years ago

Sorry my question was misleading.

I understand people here do not want to use <span>s to wrap "bunsetsu", and instead want to use <wbr> or ZWSP to create break opportunities between "bunsetsu".

What is the recommended way to prohibit normal break opportunities within a "bunsetsu"? It is not only spaces: a "bunsetsu" can include "365日の" (without spaces) or EAW=A characters, where keep-all cannot prohibit the break opportunities.
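
For comparison, one fallback that does prohibit breaks inside a bunsetsu today is exactly the span wrapping that this thread hopes to avoid. A minimal sketch, with a made-up class name, sentence, and width, just to show what a lighter-weight mechanism would need to replace:

```html
<!DOCTYPE html>
<html lang="ja">
<head>
<meta charset="utf-8">
<title>Span-wrapped bunsetsu (sketch)</title>
<style>
  /* Hypothetical class name: no wrapping is allowed inside a bunsetsu. */
  .b { white-space: nowrap; }
  /* Narrow container, chosen only to force wrapping. */
  p { width: 8em; border: 1px solid #ccc; }
</style>
</head>
<body>
  <!-- Each bunsetsu is wrapped in a span; <wbr> marks the boundaries explicitly. -->
  <p><span class="b">365日の</span><wbr><span class="b">サポートを</span><wbr><span class="b">受けられます。</span></p>
</body>
</html>
```

The cost is the extra markup around every bunsetsu, which is the reason for asking whether <wbr> or ZWSP alone can be made sufficient.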