Support Japanese numerals

Intelligent2013 commented 4 days ago

Source issue: https://github.com/metanorma/metanorma-jis/issues/226

Support Japanese numerals in

[x] clause numbers Example:
[x] ordered list items Example:
[x] edition number. Currently, there are two elements in the Presentation XML:
```
<edition language="">1</edition>
<edition language="ja">第1版</edition>
```
[x] publication date. Example: 令和元年七月二十二日. Current Presentation XML: <date type="published">令和元年7月22日</date>

If this task is complicated, then I'll find how to do this via XSLT extensions on Java.

@ronaldtse do we need to support two number formats - Arabic (1, 2, 3, ...) for usual documents and Japanese (一, ...) for vertical layout documents? Or only Japanese numbers?

Note: I don't know the reason, but the notes numbers should be Arabic:

UPDATE after the comment

@Intelligent2013 I just noticed this since @opoudjis raised it. They are meant to be in Japanese numerals too.

[ ] notes, examples numbers

ReesePlews commented 3 days ago

very interesting to see the vertical layout. thanks for all the work on this @Intelligent2013 ! i dont work with vertical layout much but the third image above looks more correct than the second image. the layout of the kanji numbers in the first image appears correct for the main clause numbers, but with the sub-clause numbering, the vertical style of '三・一' etc seems different to me... i guess, in theory, that is the correct style but seems a bit difficult on the eyes; again i dont have enough experience with vertical layout. i suspect that vertical layout is widely used by such agencies as the justice ministry (法務省) and the writing of japanese laws/regulations. i know there is a large legal website that has japanese laws with english translations, but off hand i dont remember the link. they may have samples of printed works online that could be helpful in these cases.

ronaldtse commented 3 days ago

Thank you @ReesePlews ! Yes you are right that the Japanese "e-Gov" website has all the Japanese laws.

For example, this is the Constitution of Japan:

https://laws.e-gov.go.jp/law/321CONSTITUTION

For vertical layout, they have 3 options: 1 column, 2 columns and 4 columns

This is the law that establishes JIS:

https://laws.e-gov.go.jp/law/417M60000F00006

For space savings, this is a screenshot of the 4 column (so it's not too tall to show here).

It uses the list style:

1, 2...
一, 二, ...
イ, ロ, ...
(1), (2)...
(i), (ii)...

The list style only uses a single full width space indentation to separate list levels.

UPDATE: It seems that when Paragraphs are labeled, in the e-Gov website the paragraph label for the first paragraph is omitted, and subsequent paragraph labels exist. Not sure why the list item "1" is missing though. This doesn't seem to be an East Asian tradition.

Intelligent2013 commented 3 days ago

The 1st post updated - added 'edition number'.

opoudjis commented 2 days ago

There's two elements to this.

The first is to support Japanese numerals, and I can do that, sure: that's merely 2.localize(:ja).spellout, using twitter_cldr.

The second is to work out where to use Japanese numerals instead of Arabic numerals. This should not be being done on an ad hoc basis, and it should not be being done independently in HTML and PDF: there needs to be a rule as to where it happens, and it needs to be done in Presentation XML.

I have the bad feeling that this is going to end up as a document attribute.
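For readers who want to see what the spellout step amounts to, here is a minimal pure-Ruby sketch. It is not how twitter_cldr implements it; the module name `KanjiNumeral` and its range limit (0..9999) are assumptions made for illustration only.

```ruby
# Hypothetical helper for illustration; not part of twitter_cldr or isodoc.
module KanjiNumeral
  DIGITS = %w[〇 一 二 三 四 五 六 七 八 九].freeze
  UNITS = [[1000, "千"], [100, "百"], [10, "十"]].freeze

  # Convert an integer in 0..9999 to kanji numerals (e.g. 22 -> 二十二).
  def self.spellout(number)
    raise ArgumentError, "supported range is 0..9999" unless (0..9999).cover?(number)
    return DIGITS[0] if number.zero?

    result = +""
    remainder = number
    UNITS.each do |value, symbol|
      count, remainder = remainder.divmod(value)
      next if count.zero?

      # A leading 一 is conventionally dropped before 十, 百 and 千.
      result << DIGITS[count] if count > 1
      result << symbol
    end
    result << DIGITS[remainder] if remainder.positive?
    result
  end
end

puts KanjiNumeral.spellout(22)
```

The harder question raised above, namely where such numerals should appear, is untouched by this sketch.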

ronaldtse commented 2 days ago

I have the bad feeling that this is going to end up as a document attribute.

You mean the specification of list bullet styles per level being configurable? I'd (everyone would) love that.

opoudjis commented 2 days ago

I don't even know if I can do that in HTML. Not without a lot of pain.

And you need to say a lot more about where Japanese numbers are meant to show up. Numbering is done in code; I can make the xref counter output Japanese instead of Arabic numerals, but that means initialising each counter instance in isodoc, one for every block type and clause (figures, tables, requirements, etc etc etc).

Without a coherent statement, you are not getting anything.

ronaldtse commented 2 days ago

Note: I don't know the reason, but the notes numbers should be Arabic:

@Intelligent2013 I just noticed this since @opoudjis raised it. They are meant to be in Japanese numerals too.

opoudjis commented 2 days ago

You mean the specification of list bullet styles per level being configurable? I'd (everyone would) love that.

PER LEVEL?! No you are not getting random list level specification PER LEVEL. ISO HTML CSS has 30 lines of custom code just to insert ")" after list numbers. https://github.com/metanorma/isodoc/issues/247 has been unactioned for the past four years because of how horrible Word HTML is about custom list numbering.

No, what you're going to get is:

A document attribute specifying whether Japanese or Arabic auto-numbering is to be used in the document. I am not going to be supporting vague notions of new flavours or document types: I am yet to see evidence that there is a coherent mapping of Japanese numbering to document type or organisation at all, and I'm not going to wait for one.
Restriction of Japanese number styling to clauses, ordered lists, and edition numbers. Each and every numbering counter is a separate variable, and if any one of them outputs Arabic, they need to be set individually. I am not at this time going to assume that Japanese numbering is used for all autonumbering in the document, for the simple reason that the sample document does not, and it is not our place to dictate to people what numbers they use universally.

Ordered lists will rely on the Presentation XML feature of //ol/li/@label to tell the consumer what to put in the list. This will only work out of the box for PDF, and there is code from other flavours that can make it work for DOC; HTML would need CSS overriding to make it work.

I am considering this nothing more than a proof of concept.

opoudjis commented 2 days ago

I'm going to realise this with the document attribute

:presentation-metadata-japanese-numbering: true

opoudjis commented 2 days ago

@ronaldtse wants to generalise this to Arabic, Chinese, and Amharic.

I have little inclination to do so, and this does not address the very real problem of what types of block are going to be Arabic and what local.

But:

:presentation-metadata-autonumbering-style: japanese

The nightmare scenario is:

:presentation-metadata-notes-autonumbering-style: arabic
:presentation-metadata-clause-autonumbering-style: japanese
:presentation-metadata-subclause-autonumbering-style: arabic

I will not be implementing that.

opoudjis commented 2 days ago

To make counters more configurable, I'm going to eventually set up configuration of all counters—starting value and style. But for now, I'm only going to expose that for clauses and lists.
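A rough sketch of what "starting value and style" configuration on a counter could look like. `StyledCounter`, its keyword arguments and the ten-entry kanji table are all hypothetical and do not reflect the actual isodoc counter API.

```ruby
# Hypothetical counter; class and option names are illustrative, not isodoc's API.
class StyledCounter
  JAPANESE = %w[一 二 三 四 五 六 七 八 九 十].freeze

  FORMATTERS = {
    arabic: ->(n) { n.to_s },
    # The sketch only covers 1..10; a real implementation needs a full converter.
    japanese: ->(n) { (1..10).cover?(n) ? JAPANESE[n - 1] : raise(RangeError, "sketch covers 1..10 only") },
  }.freeze

  def initialize(start: 0, style: :arabic)
    @value = start
    @formatter = FORMATTERS.fetch(style) # raises KeyError for an unknown style
  end

  # Advance the counter and return the new label.
  def increment
    @value += 1
    label
  end

  # Render the current value in the configured style.
  def label
    @formatter.call(@value)
  end
end

clauses = StyledCounter.new(style: :japanese)
puts clauses.increment
puts clauses.increment
```

Keeping the style as a lookup key rather than a subclass means a new numbering style only needs a new entry in the formatter table.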

opoudjis commented 2 days ago

I've got a problem: I want to assign config to counter classes based on config in the xref class (which knows about numbering styles from the Presentation XML metadata), but I don't want to redefine all the classes invoking them.

So to exploit inheritance, I'm going to have to define these counter classes with methods invoked from the xref class.

opoudjis commented 2 days ago

Not working yet...

Intelligent2013 commented 2 days ago

Also we need to support Japanese numerals in the publication date. I've updated the initial post.
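As an illustration of the date requirement, the sketch below rewrites the ASCII digits in an already-localised date string as kanji numerals. The module `JapaneseDate` is hypothetical, and the converter only covers 1..99, which is enough for era years, months and days but would reject a Gregorian year such as 2019.

```ruby
# Hypothetical helper for illustration; not the Metanorma implementation.
module JapaneseDate
  DIGITS = ["", "一", "二", "三", "四", "五", "六", "七", "八", "九"].freeze

  # Spell out 1..99 in kanji numerals.
  def self.kanji(number)
    raise ArgumentError, "sketch covers 1..99 only" unless (1..99).cover?(number)

    tens, ones = number.divmod(10)
    tens_part = case tens
                when 0 then ""
                when 1 then "十"
                else "#{DIGITS[tens]}十"
                end
    tens_part + DIGITS[ones]
  end

  # Replace every run of ASCII digits in the date with kanji numerals.
  def self.localise(date)
    date.gsub(/[0-9]+/) { |digits| kanji(digits.to_i) }
  end
end

puts JapaneseDate.localise("令和元年7月22日")
```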

opoudjis commented 1 day ago

I am providing Japanese numbering in the Presentation XML, but there is a nightmare scenario where you provide Japanese numbering for page numbers. If you do need them, and if XSL:FO is not clever enough to do that automatically, I'll need to dump the numbers 1–1,000 in the localization strings. Let's not action that yet though... I'd be surprised if XSL:FO doesn't provide that natively somewhere.

Intelligent2013 commented 1 day ago

I am providing Japanese numbering in the Presentation XML, but there is a nightmare scenario where you provide Japanese numbering for page numbers. If you do need them, and if XSL:FO is not clever enough to do that automatically, I'll need to dump the numbers 1–1,000 in the localization strings. Let's not action that yet though... I'd be surprised if XSL:FO doesn't provide that natively somewhere.

@opoudjis Apache FOP has the extension fox:number-conversion-features (https://xmlgraphics.apache.org/fop/2.0/complexscripts.html#source), but it looks like it's not working at all; maybe I am doing something wrong... In any case, let's dump the numbers 1–1,000 in the localization strings when you have time. The page number changes should be applied in IF (Intermediate Format) after XSL-FO generation.

opoudjis commented 1 day ago

We need to localise the clause number delimiter, from half-width to full-width full stop, if Japanese numbering is used.

And I'm going to use this as the opportunity to implement a fix to CJK punctuation called on in https://github.com/relaton/relaton-render/issues/52, which I have not implemented to date because of @ronaldtse's indefensible notion that

Johnson、 A。、 Peters、 B。 1976。 The origins of sound 【series】。 London〯Blackwells

is desirable punctuation.

It is not, I reject with utmost vehemence any claim that it is (and so has Reese) and I am pressing ahead with the correct solution.

Regardless of the document's main language, punctuation localisation will convert punctuation from half-width to full-width only if the characters on either side are CJK.

So:

All clause numbers will now be subject to punctuation localisation.
Regardless of the language of the document, a clause number like "2.1" will ABSOLUTELY NOT be converted to "2。1", because that is insane, and makes me look incompetent.
The clause number "二.一" will however be converted to "二。一", because the dot is surrounded by CJK characters.
Annex number "A.一" will not be converted to "A。一"
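The contextual rule listed above can be sketched with a lookbehind and a lookahead. This is a minimal illustration, not the isodoc code: the module name, the choice of scripts counted as "CJK", and the two-entry punctuation mapping are assumptions (elsewhere in this thread the generated delimiter is reported as U+FF0E rather than "。").

```ruby
# Hypothetical sketch of context-sensitive punctuation localisation.
module PunctuationLocaliser
  # Assumed mapping; the real target characters may differ.
  FULLWIDTH = { "." => "。", "," => "、" }.freeze

  PATTERN = /
    (?<=[\p{Han}\p{Hiragana}\p{Katakana}\p{Hangul}])  # CJK character before
    [.,]                                              # half-width punctuation
    (?=[\p{Han}\p{Hiragana}\p{Katakana}\p{Hangul}])   # CJK character after
  /x

  # Convert punctuation only when both neighbours are CJK characters.
  def self.localise(text)
    text.gsub(PATTERN, FULLWIDTH)
  end
end

["二.一", "2.1", "A.一", "Code (hello, world.)"].each do |sample|
  puts "#{sample} => #{PunctuationLocaliser.localise(sample)}"
end
```

Because the neighbouring characters are only inspected and never consumed, chained delimiters such as "一.二.三" are all converted in a single pass.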

I am also going to bite the bullet and move Japanese number rendering to isodoc for xref counters; they already support Roman at top level.

opoudjis commented 1 day ago

As of this ticket, we are making punctuation localisation (i.e. fullwidth punctuation) apply to Japanese and Korean as well as Chinese, with the proviso of not doing so when the surrounding characters are not CJK.

opoudjis commented 1 day ago

So

Code (hello, world.)

in a Chinese or Japanese document:

Before:

Code （hello， world．）

After:

Code (hello, world.)

opoudjis commented 1 day ago

I've implemented so far:

1–1000 in localized-strings
clause numbers, with full-width delimiter where appropriate
edition number

As a result of extending CJK punctuation localisation to Japanese, we are now removing redundant Roman spaces in Japanese strings.

I'm attaching a simple test document so you can see this working, with Japanese and Arabic autonumbering.

Japanese.zip Arabic.zip

@Intelligent2013 Check them out. The dates and ordered lists will happen tomorrow.

ReesePlews commented 1 day ago

Code (hello, world.)

in a Chinese or Japanese document:

Before:

Code （hello， world．）

After:

Code (hello, world.)

this process is correct in my opinion. the "After" result is expected because there are no CJK characters in that string.

it is very common for the "Before" case to exist in Japanese documents (not programming code); these are mostly just input mistakes, depending on the FEP (front-end processor) used for input, or sometimes the user/editor does not catch the differences in characters due to the font used.

i have tried to follow these discussions, but i could be lacking a clear understanding... when the word "Code" is used does it specifically refer to "programming language code"? if so, the result in "After" is most definitely correct.

i am trying to imagine how this would look in a regular CJK document clause [not a programming "code" block] use. i believe the western/8bit text between the ( )'s would commonly be used with western/8bit punctuation, however the surrounding ( )'s could end up being entered as CJK （　）'s because there is leading and trailing CJK text around the western/8bit text.

i apologize if i have mistaken the crux of the discussion here.

Intelligent2013 commented 1 day ago

Check them out. The dates and ordered lists will happen tomorrow.

@opoudjis thank you. The numbers look OK, except for the dots between digits; I don't know whether that is an issue or not:

I can replace them (U+FF0E, Fullwidth Full Stop) in the XSLT on the fly with U+30FB (Katakana Middle Dot); then it looks as in the source template PDF:

Another issue is the clauses order in a.presentation.xml - the Normative references order is 2:

<references id="_normative_references" normative="true" obligation="informative" displayorder="2">
            <title depth="1">一<tab/>引用規格</title>

but the 1st clause order is 8:

<clause id="_clause" inline-header="false" obligation="normative" displayorder="8">
            <title depth="1">二<tab/>Clause</title>

therefore the Normative references render before the title on the first page (see 1st screenshot).

And I didn't see the edition number in Japanese:

I've implemented so far: ...

edition number

From a.presentation.xml: <edition language="">1</edition><edition language="ja">第1版</edition>

opoudjis commented 17 hours ago

The middle dot is telling me that I must not make a blanket assumption that "." is the subclause number delimiter, which can be localised to full-width. Instead I need to make it a parameter when calling the counter, separate from the number prefix, so that it can be configured independently. So instead of

Counter.new(0, prefix: "#{clausenumber}.")

which will generate "#{clausenumber}.1", "#{clausenumber}.2", "#{clausenumber}.3"...

I need

Counter.new(0, prefix: clausenumber, separator: ".")

and the JIS calls to Counter override separator with middle dot, if the numbering style has been set to Japanese:


IsoDoc::Xref

def initialize(opts)
  @separator = opts[:separator] || "." # default separator
end

def clause_counter(number, opts)
  Counter.new(number, opts)
end

IsoDoc::Xref::JIS

def clause_counter(number, opts)
  opts[:number] ||= @autonumber_style # read from the XML, may be :japanese or :arabic
  @autonumber_style == :japanese and
    opts[:separator] ||= "・" # U+30FB, Katakana Middle Dot
  super
end

That will generate "#{clausenumber}・一", "#{clausenumber}・二", "#{clausenumber}・三" when the numbering is set to Japanese.

(That is implemented in JIS and not globally for Japanese text, because subclause delimiters are a flavour choice: nothing is preventing a different organisation having clause numbers like 1-2 or 一〰二.)

This is a breaking change to isodoc, as I am refactoring all instances of Counter(prefix:).
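A self-contained sketch of the refactoring idea, with the separator split from the prefix and a flavour subclass overriding it. The classes `Counter`, `Xref` and `JisXref` here are simplified stand-ins written for illustration, and their signatures are assumptions rather than the real isodoc ones.

```ruby
# Hypothetical stand-ins for the isodoc classes discussed in this comment.
class Counter
  def initialize(number = 0, prefix: nil, separator: ".", formatter: ->(n) { n.to_s })
    @number = number
    @prefix = prefix
    @separator = separator
    @formatter = formatter
  end

  # Advance the counter; returns self so calls can be chained.
  def increment
    @number += 1
    self
  end

  # Join the optional prefix and the formatted number with the separator.
  def label
    [@prefix, @formatter.call(@number)].compact.join(@separator)
  end
end

class Xref
  def initialize(autonumber_style: :arabic)
    @autonumber_style = autonumber_style
  end

  def clause_counter(number, opts = {})
    Counter.new(number, **opts)
  end
end

class JisXref < Xref
  KANJI = %w[一 二 三 四 五 六 七 八 九 十].freeze

  # Only the flavour decides on the middle dot; callers can still override it.
  def clause_counter(number, opts = {})
    if @autonumber_style == :japanese
      opts = { separator: "・", formatter: ->(n) { kanji(n) } }.merge(opts)
    end
    super(number, opts)
  end

  private

  # The sketch covers 1..10 only.
  def kanji(number)
    raise RangeError, "sketch covers 1..10 only" unless (1..10).cover?(number)

    KANJI[number - 1]
  end
end

counter = JisXref.new(autonumber_style: :japanese).clause_counter(0, prefix: "三")
puts counter.increment.label
puts counter.increment.label
```

Routing counter creation through a method on the xref class is what lets a flavour change the delimiter without redefining every caller.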

opoudjis commented 17 hours ago

@Intelligent2013 The edition numbering works in testing, so I will need to investigate that. The list numbering will also be complicated.

opoudjis commented 17 hours ago

Reese, the point of what I have written is the following:

Automated text generation in Metanorma uses Latin punctuation
Latin punctuation in CJK text needs to be switched to full-width punctuation, if it is automated text
But not if the Latin punctuation is adjacent to Latin text
If users actually want CJK punctuation inside Latin text (which Ronald seems to think they do), then it needs to be set as such in the outset: CJK punctuation will not be converted back to Latin
My use of "Code" is a random example. Try, more to the point:

二.二 => 二。二 (although it looks like I will need to override this with middle-dot anyway)
A.2 => A.2 (unchanged; previously it would have attempted A。2)

ronaldtse commented 17 hours ago

@opoudjis the Japanese "middle dot" delimiter is not the "full stop", they are different symbols.

ronaldtse commented 17 hours ago

If users actually want CJK punctuation inside Latin text (which Ronald seems to think they do), then it needs to be set as such in the outset: CJK punctuation will not be converted back to Latin

No, that's not what I asked for. The default for bibliographic entries is to be rendered in a suitable style, i.e. English in English, Japanese in Japanese. We could have Japanese in English or English in Japanese but that should not be the default.

opoudjis commented 17 hours ago

Bibliographic entries will routinely be mixed-language, with things like Japanese authors and English titles. The notion of a bibliographic entry being "just Japanese" or "just English" is naive and inflexible. It is also a nuisance on top of trying to work out what the language of a bibliographic entry is to begin with. (You think users are going to be marking it up as [lang=ja]? And then mark up titles individually as exceptions? When we can work out the script automatically through Regex?)

That's why working out whether to apply CJK punctuation contextually, rather than based solely on a language tag, has ALWAYS been the right way to proceed, and I am proceeding with it.

Rereading, the default is indeed going to be CJK, but it will be overridden when the immediate context shows that full-width punctuation makes no sense (the surrounding characters are Latin). And I simply cannot trust users to exhaustively mark up references (let alone individual bits of references) to indicate language explicitly.

opoudjis commented 17 hours ago

@opoudjis the Japanese "middle dot" delimiter is not the "full stop", they are different symbols.

As I have just acknowledged, which is why I am doing the refactoring.

opoudjis commented 9 hours ago

From a.presentation.xml: <edition language="">1</edition><edition language="ja">第1版</edition>

You're looking at the wrong file: I am generating

<edition language="">1</edition><edition language="ja">第一版</edition>

in the Japanese numbering version. You'll have a refresh soon.

opoudjis commented 8 hours ago

ordered list items

This is an update to JIS. JIS has Alphabetic numbering on its first level of ordered lists, and Arabic numbering on subsequent levels. I don't know what the provenance of the PDF sample is, and I do not care: I am not overriding JIS list numbering for some unasked-for proof of concept. I am implementing Japanese numbering to replace Arabic numbering in ordered lists ONLY where JIS sanctions that.

opoudjis commented 7 hours ago

As warned: HTML right now has no idea what to do with custom list labels.

@Intelligent2013 The following should have now everything you need for this proof of concept.

Archive.zip

Intelligent2013 commented 7 hours ago

You're looking at the wrong file: I am generating
<edition language="">1</edition><edition language="ja">第一版</edition>
in the Japanese numbering version. You'll have a refresh soon.

OK. Please note that I need just 一, without 第 and 版 around it. We also need to keep the value 第1版 for the current (non-vertical) layout, i.e. like this: <edition language="">1</edition><edition language="ja">第1版</edition><edition language="ja" numberonly="true">一</edition>.

opoudjis commented 6 hours ago

Yuck, that's really ad hoc. OK...

opoudjis commented 6 hours ago

@Intelligent2013 Here you go.

Archive 2.zip

Intelligent2013 commented 5 hours ago

Ordered lists look ok:

Thanks!

Now, testing edition number....

Intelligent2013 commented 5 hours ago

@opoudjis the edition number is ok also. Thanks!

I've updated the initial post for notes, examples numbers:

Note: I don't know the reason, but the notes numbers should be Arabic:

@Intelligent2013 I just noticed this since @opoudjis raised it. They are meant to be in Japanese numerals too.

metanorma / metanorma-jis

Support Japanese numerals #228