metanorma / coradoc

Coradoc is the Core AsciiDoc Parser used by Metanorma
MIT License

reverse_adoc: Clean Unicode whitespace in headers and paragraphs #80

Closed hmdne closed 5 months ago

hmdne commented 5 months ago

This fixes #65 and fixes #67.

I don't necessarily agree with this. A full-width space is semantically similar to an NBSP, i.e. it's not trimmed by web browsers. If anything, I think this should not be a generic feature: while in this particular use case the full-width space has no meaning other than formatting, in other documents it may be crucial.
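
To make the trade-off concrete, here is a minimal Ruby sketch of the kind of cleanup being discussed. The helper name `clean_edge_whitespace` and its `unicode:` keyword are hypothetical and are not Coradoc's actual API; the sketch only illustrates that Ruby's `String#strip` leaves a full-width space (U+3000) or NBSP (U+00A0) in place, while the `[[:space:]]` bracket expression matches both, and that the behaviour could be made opt-in rather than generic.

```ruby
# Hypothetical helper, for illustration only; not taken from Coradoc.
# With unicode: true, Unicode whitespace (including U+3000 and U+00A0)
# is removed from both ends of the text. With unicode: false, only the
# ASCII whitespace handled by String#strip is removed.
def clean_edge_whitespace(text, unicode: true)
  return text.strip unless unicode

  # [[:space:]] is Unicode-aware in Ruby regular expressions.
  text.gsub(/\A[[:space:]]+|[[:space:]]+\z/, "")
end

heading = "\u3000見出し\u3000"
cleaned = clean_edge_whitespace(heading)                 # edges trimmed
kept    = clean_edge_whitespace(heading, unicode: false) # full-width spaces kept
```

Whitespace inside the text is left untouched in both modes, so a full-width space that separates two words is not affected.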

The character still persists in table cells, lists and sections (which are mapped from DIVs):


ronaldtse commented 5 months ago

@hmdne I understand your concern with regard to the full-width space, but the question is actually about the compatibility of "AsciiDoc" (which uses ASCII sequences as control/markup sequences) and CJK in general.

AsciiDoc syntax heavily depends on these control symbols that are not easily accessed/used in CJK:

Retracing our steps, notice that AsciiDoc was designed for ASCII encoding, which allows easy and predictable entry on an English keyboard and, to an extent, on Latin-based keyboards. CJK cannot be written in ASCII, so the consequences of an "easy-to-enter textual semantic syntax" for CJK differ from those for AsciiDoc. We need defined rules on what "AsciiDoc" means for "non-ASCII CJK", with the principle that it should be easy to type on a CJK keyboard.

The comments about "a full width space means something" describe unintended consequences of AsciiDoc's compatibility problems with CJK:

> Removing whitespace from lists will cause certain tests to fail, as they expect a list item to end with " ".

It should not be the case. This is simply a CJK compatibility issue with AsciiDoc.

> Removing them from sections will cause a document reflow (in this particular document, they are used to ensure there's a deeper line break).

In CJK, the initial "full width spaces" (one or more) are a formatting concern. This is to be determined by the rendering template as part of "paragraph initial line indenting"; it plays no part in the textual meaning.

> Removing them from table cells - I have not tested the impact yet.

They should be stripped from the table cells.

ronaldtse commented 5 months ago

If I use the Japanese keyboard and retain the semantics of the equal sign, hyphens, spaces, open/close brackets, comma, I get this. This means I won't need to swap between Japanese/English when entering. Wondering if this is something we should support... "(Ascii)Doc for CJK"

= 日本語

「ソース、ruby」
ーーーー
ソースコード
ーーーー
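
As a rough illustration of what such support could involve, the following Ruby sketch maps the CJK forms in the example above back to their ASCII markup equivalents, one line at a time. Everything here is an assumption made for illustration: the function name `normalize_cjk_markup_line` is hypothetical, the mapping rules are guesses based only on the example, and none of this is existing Coradoc or AsciiDoc behaviour.

```ruby
# Hypothetical sketch only; not part of Coradoc or any AsciiDoc processor.
# Only whole-line patterns are rewritten, because characters such as "ー"
# also occur inside ordinary words (for example "ソースコード").
def normalize_cjk_markup_line(line)
  case line
  when /\Aー{4,}\z/
    # A line made only of prolonged sound marks becomes a block delimiter.
    "-" * line.length
  when /\A「(.*)」\z/
    # A line wrapped in corner brackets becomes an attribute list;
    # the ideographic comma is mapped to an ASCII comma.
    "[" + $1.tr("、", ",") + "]"
  when /\A([=＝]+)[ 　]+(.*)\z/
    # Full-width equals signs and spaces in a heading marker become ASCII.
    $1.tr("＝", "=") + " " + $2
  else
    line
  end
end

source = ["＝　日本語", "「ソース、ruby」", "ーーーー", "ソースコード", "ーーーー"]
normalized = source.map { |l| normalize_cjk_markup_line(l) }
```

Note that the sketch leaves keywords such as "ソース" untranslated; whether they should map to `source` is a separate design question that the rules discussed above would need to settle.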

ReesePlews commented 5 months ago

Is there any status on these updates? Would a workaround be to fill any empty cells in a table with a single character?

hmdne commented 5 months ago

@ReesePlews @ronaldtse

I have pushed an updated version that handles almost all of the leading CJK whitespace in the document while trying to preserve compatibility. The only remaining issue is with sections: as mentioned above, empty paragraphs are collapsed. This is specific to this particular document and may not really be a problem; if it is, please let me know. I have also found another problem, with generation, which I will try to fix shortly.

hmdne commented 5 months ago

This is ready for merge now.

ronaldtse commented 5 months ago

Thanks @hmdne!