metanorma / metanorma-standoc

Metanorma for Standoc documents
BSD 2-Clause "Simplified" License

Support Japanese ruby #73

Closed opoudjis closed 7 months ago

opoudjis commented 5 years ago

Support Japanese ruby markup, per https://github.com/pasberth/asciidoctor-html5ruby

opoudjis commented 4 years ago

Need progress update on this.

ronaldtse commented 4 years ago

@opoudjis the task is not so simple. The cited code is literally 24 lines long and takes a simplistic view of what ruby is:

https://github.com/pasberth/asciidoctor-html5ruby/blob/0831824d1e19e2f972de303e1fcc7777caf33995/lib/asciidoctor-html5ruby/extension.rb#L1-L24

require 'asciidoctor'
require 'asciidoctor/extensions'

include ::Asciidoctor

class HTML5RubyMacro < Extensions::InlineMacroProcessor
  option :pos_attrs, ['rpbegin', 'rt', 'rpend']

  def process parent, target, attributes
    if attributes.size == 2 and attributes.key?(1) and attributes.key?("rpbegin")
      # for example, html5ruby:楽聖少女[がくせいしょうじょ]
      rt = attributes[1]
      rt ||= ""
      rpbegin = '('
      rpend = ')'
    else
      rpbegin = attributes['rpbegin']
      rt = attributes['rt']
      rpend = attributes['rpend']
    end

    %(<ruby>#{target}<rp>#{rpbegin}</rp><rt>#{rt}</rt><rp>#{rpend}</rp></ruby>)
  end
end

Ruby is also not just "Japanese", it applies to Chinese, Korean and Vietnamese as well.

HTML

We will need a Ruby encoding mechanism in XML as well as in Word, etc. Thus we have to update the lightweight document model ("BasicDoc") to support this.

HTML 5 is a presentation oriented language. Ruby is a "presentation format". But Metanorma is not.

There are two types of things that are presented in Ruby format:

Metanorma should clearly separate the encoding of them. In HTML, they can be rendered as HTML Ruby.

explain:[Hello][This is a form of greeting.]

// [{text}][{reading}][{script code of reading}]
label:日本語[lang=ja,script=Hira,reading=にほんご]

// Laozi's Daodejing has two famous commentaries.
// This is the first line of the text
// (4 phrases in 1 line, split for clarity)
[[tag1]] reading:道[dao] reading:可[ke] reading:道[dao]
[[tag2]] 非常道 
[[tag3]] 名可名 
[[tag4]] 非常名

// Commentary from HE Shanggong per phrase
[annotate,from=tag1,to=tag2,series=河上公]
謂經術政教之道也

[annotate,from=tag2,to=tag3,series=河上公]
非自然生長之道也。常道當以無為養神,無事安民,含光藏暉,滅跡匿端,不可稱道

[annotate,from=tag3,to=tag4,series=河上公]
謂富貴尊榮,高世之名也

[annotate,from=tag4,to=tag5,series=河上公]
非自然常在之名也。常名當如嬰兒之未言,雞子之未分,明珠在蚌中,美玉處石間,內雖昭昭,外如愚頑。

// Commentary from WANG Bi per line
[annotate,from=tag1,to=tag5,series=王弼]
可道之道,可名之名,指事造形。非其常也,故不可道、不可名也

Does something like this work?

In the HTML 5 Ruby scheme, they use the <rt> tag for pronunciation, and <rtc> tag for annotated meaning. In East Asian text, it is common to have BOTH at the same time. "Pronunciation" can span multiple characters, as can "annotated meaning".

This scheme of:

html5ruby:楽聖少女[がくせいしょうじょ]
html5ruby:{text}[{pronunciation}]

Is a bit simplistic even compared to HTML 5.

Here's an example of semantic ruby (https://developer.mozilla.org/en-US/docs/Web/HTML/Element/rtc)

[Screenshot: semantic ruby example from the MDN page linked above]

To be on par with HTML Ruby:

opoudjis commented 4 years ago

@zoras Progress report?

zoras commented 4 years ago

@opoudjis please see the PR https://github.com/metanorma/metanorma-standoc/pull/117/files. I'm not sure it is working correctly, though.

opoudjis commented 4 years ago

I have got it working through trial and error, since macros are not well documented. I will fill out the rest of the ruby functionality, but it's not urgent.

opoudjis commented 7 months ago

Ok, having read up on Ruby:

The markup you are proposing is nightmarishly complicated, and users will reject it. Metanorma can support a more semantic take on markup, without the Asciidoctor markup following suit.

We want:

That gives us the following Semantic XML:

ruby = element ruby {
  ruby_pronunciation*,
  ruby_annotation*,
  (text | ruby)
}

ruby_pronunciation = element pronunciation {
  attribute value { text },
  attribute script { text }?, # ISO code
  attribute lang { text }? # ISO code
}

ruby_annotation = element annotation {
  attribute value { text },
  attribute script { text }?, # ISO code
  attribute lang { text }? # ISO code
}

The base text is all that should be left in plain text, which we get by stripping tags; hence the annotation goes in attributes, ruby/pronunciation/@value and ruby/annotation/@value, rather than in element content. (I left index terms as tag content instead of attributes, and that caused a whole lot of exceptions.)
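
To illustrate the attribute-versus-content point, here is a minimal sketch. It assumes plain text is obtained by a naive tag-stripping pass; `strip_tags` is a hypothetical helper, not a Metanorma function, and the second XML string is a made-up counter-example.

```ruby
# Hypothetical helper: naive tag stripping, standing in for whatever
# plain-text extraction the toolchain actually performs.
def strip_tags(xml)
  xml.gsub(/<[^>]+>/, "")
end

# Annotation carried in an attribute: only the base text survives.
with_attribute = '<ruby><pronunciation value="とうきょう"/>東京</ruby>'

# Annotation carried as element content (counter-example): it leaks
# into the plain text alongside the base text.
with_content = "<ruby><pronunciation>とうきょう</pronunciation>東京</ruby>"

puts strip_tags(with_attribute)
puts strip_tags(with_content)
```

With the annotation in an attribute, stripping tags leaves just the base text; with the annotation as element content, the reading ends up glued to the base text.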

It seems that Ruby nowadays is done in HTML as nested ruby, with no rb, rbc or rtc. The picture painted in https://strictquirks.nl/standards/the-situation-with-ruby-2020.xhtml is damning, and for us to support bizarre partial spans with rbspan is a dead end. The Living HTML spec (https://html.spec.whatwg.org/multipage/ ) has now dropped all ruby child elements but rt and rp.

Btw you were wrong about

they use the <rt> tag for pronunciation, and <rtc> tag for annotated meaning

They didn't differentiate pronunciation and meaning, they just used rtc for tabular markup of ruby, as a way of grouping multiple rt.

There is of course nothing "Living" about the W3C spec, which still advertises the old solutions in https://www.w3.org/International/articles/ruby/markup.en.html , but which has not been updated since 2016.

If we really want to support complex Ruby annotation, involving conflicting hierarchies, like one annotation for (1, 2) and another for (2, 3) (and believe me, we will not), we'll use more generic annotation attributes and bookmarks, rather than this explicit ruby mechanism, which we do need now for JIS. I am making the concession of allowing nested ruby in Semantic XML, which allows an annotation of both a phrase and its components, but not multiple clashing hierarchies.

In this scheme, we are not preserving rtc, rbc or rb. I agree with the Living HTML people that nested ruby is sufficient (although there is no way the W3C's example of nested ruby in https://www.w3.org/International/articles/ruby/markup-data/eg_dbl_mono_nested is semantically correct.)

Word HTML, if it supports ruby at all, is HTML 4, and may require downgrading of Ruby.

opoudjis commented 7 months ago

For Asciidoctor markup:

[annotate,from=tag1,to=tag2,series=河上公]

We're not doing that. We may do that separately later, as part of generic annotation support for very tricky annotations if they ever come up, but this is simply not workable as the first line of Ruby support.

We will do something like:

// [{text}][{reading}][{script code of reading}]
label:日本語[lang=ja,script=Hira,reading=にほんご]

But we need nesting. So:

ruby:{annotation}[lang=ja,script=Hira,type=phonetic|annotation,base-text]

does the basic unnested case, with lang and script optional, and type defaulting to phonetic. So:

Spanning multiple ideograms:

ruby:とうきょう[東京]
ruby:とう | きょう[lang=ja,script=Hira,type=phonetic,東京]
ruby:Tōkyō[script=Latn,東京]
ruby:ライバル[親友] (Japanese for "frenemy": "friend" annotated with English loanword "rival", from https://en.wikipedia.org/wiki/Furigana)

One per ideogram:

ruby:とう[東] ruby:きょう[京]
ruby:Tō[script=Latn,東]ruby:kyō[script=Latn,京]
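
A rough sketch of how the unnested form could be picked apart, purely for illustration. `parse_ruby_macro` is a hypothetical name, not the Asciidoctor macro API or the Metanorma implementation; it assumes the base text is always the last comma-separated item and does not handle nesting or escaped brackets.

```ruby
# Hypothetical parser for the unnested form
#   ruby:{annotation}[key=value,...,base-text]
# Assumptions: the base text is the last comma-separated item, every
# other item is key=value, and there is no nesting or escaping.
def parse_ruby_macro(source)
  match = /ruby:(?<annotation>[^\[]+)\[(?<body>[^\]]*)\]/.match(source)
  return nil unless match

  *options, base = match[:body].split(",")
  attrs = options.to_h { |pair| pair.split("=", 2) }
  {
    annotation: match[:annotation],
    base: base,
    type: attrs.fetch("type", "phonetic"), # default as described above
    lang: attrs["lang"],
    script: attrs["script"],
  }
end

p parse_ruby_macro("ruby:とうきょう[東京]")
p parse_ruby_macro("ruby:Tōkyō[script=Latn,東京]")
```

The nested cases need a real parser that tracks bracket depth, which is why this sketch stops at the flat form.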

Nesting Asciidoc macros is horrific, but that's the only way I see forward with this. So, examples from https://www.w3.org/International/articles/ruby/markup.en.html ; I am changing the markup to markup with nested tags that I can make sense of. Note that the initial W3C recommendation did not allow nested ruby at all.

One character at a time, two annotations:

<ruby>
  <ruby><rb>東</rb><rt>とう</rt></ruby>
  <rt>tou</rt>
</ruby>
<ruby>
  <ruby><rb>南</rb><rt>なん</rt></ruby>
  <rt>nan</rt>
</ruby>
の方角
ruby:とう[ruby:tou[東\]] ruby:なん[ruby:nan[南\]] の方角

One character with annotation, one character with annotation, annotation for the character group:

<ruby>
  <ruby><rb>東</rb><rt>とう</rt></ruby>
  <ruby><rb>南</rb><rt>なん</rt></ruby>
  <rt>たつみ</rt>
</ruby>
の方角
ruby:たつみ[ruby:とう[東\]{blank}ruby:なん[南\]]

Annotation for one character, annotation for a phrase including that character initially:

<ruby>
  <ruby><rb>護</rb><rt>まも</rt>れ</ruby>
  <rt>プロテゴ</rt>
</ruby>!
ruby:プロテゴ[ruby:まも[護\]{blank}れ]!

Annotation for one character, annotation for a phrase including that character finally:

<ruby>
  <ruby>れ<rb>護</rb><rt>まも</rt></ruby>
  <rt>プロテゴ</rt>
</ruby>!
ruby:プロテゴ[れ{blank}ruby:まも[護\]]!

We might introduce a more complex input markup, but the foregoing will work.

opoudjis commented 7 months ago

@ronaldtse I need signoff before proceeding with these, and I really don't want this to get complicated and solve edge cases.

opoudjis commented 7 months ago

Given nested ruby in the syntax, we can just do:

ruby = element ruby {
  (ruby_pronunciation | ruby_annotation),
  (text | ruby)
}

i.e. one annotation per ruby instance.
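
As a sketch of what "one annotation per ruby instance" means for serialisation: the hash-based model and the names `ruby_node` and `to_semantic_xml` below are hypothetical, chosen for illustration only. The model allows a list of children (base text or nested ruby), which is an assumption about how nesting would be represented.

```ruby
# Hypothetical in-memory model: a ruby node carries exactly one annotation
# (type :pronunciation or :annotation) and a list of children, each either
# a String of base text or another ruby node.
def ruby_node(type, value, children, lang: nil, script: nil)
  unless %i[pronunciation annotation].include?(type)
    raise ArgumentError, "unknown annotation type: #{type}"
  end
  { type: type, value: value, lang: lang, script: script, children: children }
end

def escape_xml(text)
  text.gsub("&", "&amp;").gsub("<", "&lt;").gsub(">", "&gt;").gsub('"', "&quot;")
end

# Serialise a node: one annotation element, then the base content.
def to_semantic_xml(node)
  return escape_xml(node) if node.is_a?(String)

  attrs = { "value" => node[:value], "lang" => node[:lang], "script" => node[:script] }
  attr_text = attrs.compact.map { |key, val| %( #{key}="#{escape_xml(val)}") }.join
  body = node[:children].map { |child| to_semantic_xml(child) }.join
  "<ruby><#{node[:type]}#{attr_text}/>#{body}</ruby>"
end

inner = [ruby_node(:pronunciation, "とう", ["東"]),
         ruby_node(:pronunciation, "なん", ["南"])]
puts to_semantic_xml(ruby_node(:pronunciation, "たつみ", inner))
```

A phrase-level annotation over individually annotated characters is then just an outer node whose children are ruby nodes.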

opoudjis commented 7 months ago

Not waiting any more...

opoudjis commented 7 months ago

Finalised markup examples:

      ruby:とうきょう[東京]
      ruby:とうきょう[lang=ja,script=Hira,type=pronunciation,東京]
      ruby:Tōkyō[type=phonetic,script=Latn,東京]
      ruby:ライバル[type=annotation,親友]
      ruby:とう[東] ruby:きょう[京]
      ruby:Tō[script=Latn,東]ruby:kyō[script=Latn,京]

      ruby:とう[ruby:tou[東\]] ruby:なん[ruby:nan[南\]] の方角
      ruby:たつみ[ruby:とう[東\]{blank}ruby:なん[南\]]
      ruby:プロテゴ[ruby:まも[護\]{blank}れ]!
      ruby:プロテゴ[れ{blank}ruby:まも[護\]]!

with Semantic XML:

<ruby><pronunciation value="とうきょう"/>東京</ruby>
<ruby><pronunciation value="とうきょう" lang="ja" script="Hira"/>東京</ruby>
<ruby><pronunciation value="Tōkyō" script="Latn"/>東京</ruby>
<ruby><annotation value="ライバル"/>親友</ruby>
<ruby><pronunciation value="とう"/>東</ruby> <ruby><pronunciation value="きょう"/>京</ruby>
<ruby><pronunciation value="Tō" script="Latn"/>東</ruby><ruby><pronunciation value="kyō" script="Latn"/>京</ruby>

<ruby><pronunciation value="とう"/><ruby><pronunciation value="tou"/>東</ruby></ruby> <ruby><pronunciation value="なん"/><ruby><pronunciation value="nan"/>南</ruby></ruby> の方角
<ruby><pronunciation value="たつみ"/><ruby><pronunciation value="とう"/>東</ruby><ruby><pronunciation value="なん"/>南</ruby></ruby>
<ruby><pronunciation value="プロテゴ"/><ruby><pronunciation value="まも"/>護</ruby>れ</ruby>!
<ruby><pronunciation value="プロテゴ"/>れ<ruby><pronunciation value="まも"/>護</ruby></ruby>!

opoudjis commented 7 months ago

The Presentation XML is going to be HTML 5 Ruby, with rb and rp and rt; in case of double annotations, the nested annotation will be placed under the character via Ruby CSS. If we ever need to deal with vertical scripts (and we won't), we can implement script-specific rendering.
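
A minimal sketch of that rendering step, under stated assumptions: the hash model and `to_html_ruby` are hypothetical names, base text is wrapped in rb, and the rp brackets are the usual fallback for user agents without ruby support. Placement of the nested annotation under the character is left to CSS and is not shown here.

```ruby
# Hypothetical renderer: a ruby node is a hash with :value (the annotation
# text) and :children (Strings of base text or nested ruby nodes).
def escape_html(text)
  text.gsub("&", "&amp;").gsub("<", "&lt;").gsub(">", "&gt;")
end

def to_html_ruby(node)
  # Base text is wrapped in <rb>; nested ruby nodes render recursively.
  return "<rb>#{escape_html(node)}</rb>" if node.is_a?(String)

  base = node[:children].map { |child| to_html_ruby(child) }.join
  "<ruby>#{base}<rp>(</rp><rt>#{escape_html(node[:value])}</rt><rp>)</rp></ruby>"
end

double = { value: "tou", children: [{ value: "とう", children: ["東"] }] }
puts to_html_ruby(double)
```

Which of the two annotations ends up as the outer node depends on how the macro nesting is mapped, so the example above is illustrative only.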

opoudjis commented 7 months ago

Safari:

[Screenshot: rendering in Safari, 2023-12-18]

Chrome:

[Screenshot: rendering in Chrome, 2023-12-18]

Note that Safari does not support W3C ruby-position, so we need to add -webkit-ruby-position.
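
A small sketch of emitting both properties together. `ruby_position_css` is a hypothetical helper and the selector is only an example. It assumes the prefixed property takes the legacy keywords before/after where the standard property takes over/under; that mapping should be checked against the Safari versions being targeted.

```ruby
# Hypothetical helper: emit the standard property plus the prefixed one.
# Assumption: -webkit-ruby-position uses before/after where the standard
# ruby-position uses over/under.
def ruby_position_css(selector, position)
  webkit = { "over" => "before", "under" => "after" }.fetch(position)
  <<~CSS
    #{selector} {
      ruby-position: #{position};
      -webkit-ruby-position: #{webkit};
    }
  CSS
end

# Example selector only: targets ruby nested inside ruby.
puts ruby_position_css("ruby ruby", "under")
```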

opoudjis commented 7 months ago

It's looking like Word does not support double-sided Ruby. Ouch...

In fact, the input menu for Ruby in the Word Japanese language phonetic guide tool explicitly expects base + ruby, i.e. there is no provision for two annotations.

We are forced to do a bracketed workaround, unless we are told otherwise by practitioners.
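
One possible shape for such a bracketed fallback, as a sketch only: every annotation is flattened into brackets after its base text. The hash model, `to_bracketed`, and the choice of brackets are all assumptions for illustration; the actual Word output may well keep one annotation as native ruby and bracket only the second.

```ruby
# Hypothetical fallback: flatten ruby into "base(annotation)" plain text.
# A ruby node is a hash with :value (annotation) and :children (Strings of
# base text or nested ruby nodes).
def to_bracketed(node, open: "(", close: ")")
  return node if node.is_a?(String)

  base = node[:children].map { |child| to_bracketed(child, open: open, close: close) }.join
  "#{base}#{open}#{node[:value]}#{close}"
end

group = { value: "たつみ",
          children: [{ value: "とう", children: ["東"] },
                     { value: "なん", children: ["南"] }] }
puts to_bracketed(group)
```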

I can't work out what else to do here, because I cannot install the East Asian proofing tools for Word. I see that underlyingly these are overstrike fields, but I'm reluctant to get too deep in experimentation here.

opoudjis commented 7 months ago

Word:

[Screenshot: rendering in Word, 2023-12-19]

opoudjis commented 7 months ago

Remaining action: write blog post.