Native-language markup for CJK and non-Latin languages

metanorma / coradoc

Coradoc is the Core AsciiDoc Parser used by Metanorma

MIT License

1 stars 2 forks source link

Native-language markup for CJK and non-Latin languages #81

Open ronaldtse opened 4 weeks ago

ronaldtse commented 4 weeks ago

ASCII-based rich-text markup languages like AsciiDoc and Markdown benefit from usability as their command palette can be fully accessible through the English/Latin-based ASCII keyboards.

However, using ASCII-based rich-text markup languages on non-Latin keyboards (CJK and others like Greek) is not entirely convenient, as it requires the user to switch the keyboard back to English before being able to access the necessary ASCII keystrokes.

This causes users to require a context switch between:

command sequences => English keyboard
text => native language keyboard

Which requires the user to switch back and forth, defeating a key advantage of the "plain-text" approach.

CJK cannot be done in ASCII, so the consequences of an "easy-to-enter textual semantic syntax" for CJK are different from AsciiDoc. We need defined rules on what "AsciiDoc" means for "non-ASCII CJK", with the principle that it should be easy to type on a CJK keyboard.

AsciiDoc syntax heavily depends on control symbols that are not easily accessed/used in CJK:

empty ASCII space
- delimiter after clause title specification ==)
- with multiple text lines without 2 sequential newline characters, join those lines into one paragraph replace the newline character with a single empty space
empty line
- empty line denotes new paragraph
- with multiple text lines without 2 sequential newline characters, join those lines into one paragraph replace the newline character with a single empty space
asterisk *, backtick ```
- for inline formatting
brackets [], braces `{}
- for options and attributes
vertical bar |
- for table delimiters
equal sign =
- for clause headings
- for block delimiter
underscore _
- for block delimiter
hyphen -
- for block delimiter
double quotes "
- for options and attributes
comma ,
- for options and attributes
ASCII numbers and characters, 0-9, a-z, A-Z
- for options and attributes

We should come up with a language-native solution.

Originally posted by @ronaldtse in https://github.com/metanorma/coradoc/issues/80#issuecomment-2143703116

ReesePlews commented 4 weeks ago

@ronaldtse interesting ideas here about CJK support. are you proposing that double byte chars (DBCs) would then be intermixed in the actual asciidoc code? if that is the case, wouldn't that introduce a lot of problems? i dont know of any programming or encoding language, even those designed here, that use DBCs for the actual components of the encoding language. conversion issues between JIS/SJIS/EUC/etc still exist today but software today can handle those quite easily depending on the nature of the data and widespread use of UTF.

CJK cannot be done in ASCII, so the consequences of an "easy-to-enter textual semantic syntax" for CJK are different from AsciiDoc. We need defined rules on what "AsciiDoc" means for "non-ASCII CJK", with the principle that it should be easy to type on a CJK keyboard.

do you mean that the actual input switch between entering DBC and ascii introduces errors into the code or makes it difficult to input? i think programmers are very use to input using a front-end processor, when they have to.

i think the beauty of the programming languages are their consistent use of single encoding for statements and the support of other character sets as needed for human readable output content.

perhaps i am misunderstanding what you are proposing. i look forward to more discussion.

ronaldtse commented 4 weeks ago

conversion issues between JIS/SJIS/EUC/etc still exist today but software today can handle those quite easily depending on the nature of the data and widespread use of UTF

Indeed, Unicode does work and is already universal enough.

do you mean that the actual input switch between entering DBC and ascii introduces errors into the code or makes it difficult to input?

I believe the ASCII-assumption makes it difficult to input for CJK. I have encoded content in Chinese, and I found using ASCII to input control sequences in AsciiDoc cumbersome.

There are definitely people who don't feel the same way or find it equally convenient to switch keyboards, but I am not one of them... I find it cumbersome trying to switch around keyboards just to type a control sequence.

opoudjis commented 3 weeks ago

Ronald briefly discussed this with me earlier today, and I did not have time to continue the discussion because I was busy in my day job. I am also busy with my Metanorma job, but:

Keyboard switching is indeed annoying
This issue specifically is not an issue for Greek or Cyrillic, because they share Latin space and punctuation
I live in horror of the time we try to do Arabic or Hebrew Metanorma; it will unleash any number of Lovecroftian horrors with Bidi
There is a limited number of line-initial characters that are used to signal markup in Asciidoc
The way I see of CJK Asciidoc least painfully is to allow CJK characters equivalent to the Ascii Asciidoc markup to Ascii, and then canonically map them to the Ascii counterparts in preprocessing. (Does this reimplementation of Asciidoc even support preprocessing?)
So a line initial 。in 。推奨事項 is mapped to .推奨事項, and treated as a caption. A line containing 〚〚推奨事項〛〛 is mapped to [[推奨事項]], and treated as an anchor.
I don't know how far this gets you, and I have no idea if you can get out of typing Ascii = in CJK; presumably it's U+FF1D ＝ ?
You've also identified the problems that the different treatment of space and line breaks in CJK introduce to a WYSIWYG.
And given that no paying customer is asking for these, I can only say about this what I've already said about https://github.com/metanorma/metanorma/issues/390 ...

ReesePlews commented 3 weeks ago

a very interesting discussion here...

in the example, 。推奨事項 is mapped to .推奨事項and in the .adoc file it would be written as 。推奨事項 is that the correct interpretation?

i am sorry but that just seems so confusing. i am not sure if anyone would do this... and only to save time?

coders, even people writing a lot of documents with MS-WORD easily understand the difference of 。vs.

to clarify, if i understand the proposal is something like:

＝＝＝＝　a_CJK_term
a_CJK_definition

ＮＯＴＥ：a_CJK_note

if that is not the case, i think more examples are needed.

is this the correct understanding? that is really difficult to input. my FEP actully wants to make an 8bit NOTE: and it was very difficult to get the 16bit chars to even come out. also, the space after the ＝＝＝＝ will need to be a dbc space, cannot be a single space in this idea.

I don't know how far this gets you, and I have no idea if you can get out of typing Ascii = in CJK; presumably it's U+FF1D ＝ ?

i did not have to enter any codes to get the characters to appear. i just turned on the kanji input (via a key command (alt ` ) on my keyboard.

if this indeed is being suggested, i dont think anyone will use it, just my opinion. i think coders are more use to coding programming language constructs then adding CJK text inline to a document. having everything in CJK would possibly be a burden...

as a user, if this was put to vote as an enhancement, there are a number of other enhancements i would propose / vote for before this.

however, i do i agree it is an interesting discussion. for more than 30 years i used a japanese keyboard and was very used to it. then for health reasons i switched to kinessis split keyboard that has an english layout. dealing with some kanji input cases is more difficult than the earlier japanese keyboards. however, i think my keyboard would be difficult to use for a native japanese typist, or it would take some getting used to.

i wonder what type of feedback there would be from slack exchange or reddit communities about this proposal?

i think it would really add to the discussion, but a clear set of examples would be needed, in my opinion.

ronaldtse commented 3 weeks ago

Backtracking a bit.

The purpose of AsciiDoc is:

Make it easy for people who use the Latin keyboard to enter keystrokes
Label content semantically

With CJK content, it is unambiguous that the 1st point is not achieved.

In my personal experience, with the 1st point not achieved, it is cumbersome to type Chinese using AsciiDoc in its current form.

The goal of this ticket is to provide a way to achieve the 1st point while not losing the 2nd.

On my Chinese keyboard, these symbols are available.

·～【】「」，。《》/？；：‘“、｜-——=+！@#$%⋯⋯&*（）1234567890

The inner brackets can be entered when you type the same bracket symbol inside a pair of brackets:

「『』」《〈〉》

That's it. Anything else will require me to switch keyboards. My goal as a user is just to be able to use these symbols instead of the ASCII ones.

ReesePlews commented 3 weeks ago

in thinking a bit more, i dont see this as anything related to metanorma but only asciidoc or a programming language character input issue... i still may not be understanding the problem...

what if this "ease of input" was built as a plugin for ms visual studio code? this way it is focused on specific users, not a platform. CJK users who want to save time install the plugin to vsc. i dont know if keeping CJK input mode on continually would work with vsc? i dont know how the resulting file would look... is it always going to be CJK symbols throughout? are they changed back to non CJK symbols? or are they never changed back, and then remain in the adoc files for other platforms to handle (decode?)

i an understand the ease of input aspect of the idea. if the CJK characters were input and then saved (changed back) when you start to edit that file again it would be non-CJK language constructs + any CJK content strings, and you begin making edits in CJK (ease of input). ... to me it seems better outside as a tool, instead of inside the mn platform... i think if there were a mix of CJK and non CJK encoded adoc files in a project, it would only introduce confusion. i do understand the ease of use aspect, i know it is critical and productivity can drastically be reduced with lots of "language input switching" it really slow things down. its a very interesting discussion, thank you for raising it.

hmdne commented 1 day ago

I'd say, supporting those characters in AsciiDoc would make us deviate from AsciiDoc.

But, why not create a new format that would be distinguished by extension, let's say, ExtendedAsciiDoc with extension .eadoc that would support those rules (in addition to existing ones)? Then, let's say, CoraDoc would handle that format if extension is correct.

And, since we have #to_adoc basically ready, tested and working, if added parsing support, it would be possible to convert from said .eadoc document to AsciiDoc regular.