metanorma / coradoc

Coradoc is the Core AsciiDoc Parser used by Metanorma

Native-language markup for CJK and non-Latin languages #81

Open ronaldtse opened 4 weeks ago

ronaldtse commented 4 weeks ago

ASCII-based rich-text markup languages like AsciiDoc and Markdown benefit from usability because their command palette is fully accessible from English/Latin ASCII keyboards.

However, using ASCII-based rich-text markup languages on non-Latin keyboards (CJK and others like Greek) is not entirely convenient, as it requires the user to switch the keyboard back to English before being able to access the necessary ASCII keystrokes.

This causes users to require a context switch between:

- the native-language input method, for entering content
- the English/ASCII input method, for entering markup

The user has to switch back and forth between the two, defeating a key advantage of the "plain-text" approach.

CJK cannot be done in ASCII, so the consequences of an "easy-to-enter textual semantic syntax" for CJK are different from AsciiDoc. We need defined rules on what "AsciiDoc" means for "non-ASCII CJK", with the principle that it should be easy to type on a CJK keyboard.

AsciiDoc syntax heavily depends on control symbols, such as =, *, _, [, ], and |, that are not easily accessed or typed in CJK input modes.

We should come up with a language-native solution.
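One possible shape for such a solution, as a minimal sketch in Ruby: a preprocessing pass that rewrites fullwidth markup tokens at the start of a line into their ASCII AsciiDoc equivalents, and leaves fullwidth punctuation inside CJK prose untouched. The token table and function name here are hypothetical, not a settled design.

# Hypothetical table: fullwidth characters treated as markup only when
# they appear in markup position (start of line).
FULLWIDTH_TOKENS = {
  "＝" => "=", # U+FF1D FULLWIDTH EQUALS SIGN: heading marker
  "＊" => "*", # U+FF0A FULLWIDTH ASTERISK: list/bold marker
}.freeze

def normalize_markup_line(line)
  line
    .sub(/\A[＝＊]+/) { |run| run.chars.map { |ch| FULLWIDTH_TOKENS.fetch(ch, ch) }.join }
    .sub(/\A(=+|\*+)　/, '\1 ')   # ideographic space (U+3000) after a marker
    .sub(/\AＮＯＴＥ：/, "NOTE:") # fullwidth admonition label
end

normalize_markup_line("＝＝＝＝　推奨事項") # => "==== 推奨事項"
normalize_markup_line("ＮＯＴＥ：推奨事項") # => "NOTE:推奨事項"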

Originally posted by @ronaldtse in https://github.com/metanorma/coradoc/issues/80#issuecomment-2143703116

ReesePlews commented 4 weeks ago

@ronaldtse interesting ideas here about CJK support. are you proposing that double-byte chars (DBCs) would then be intermixed in the actual asciidoc code? if that is the case, wouldn't that introduce a lot of problems? i don't know of any programming or encoding language, even those designed here, that uses DBCs for the actual components of the encoding language. conversion issues between JIS/SJIS/EUC/etc. still exist, but modern software can handle those quite easily depending on the nature of the data, given the widespread use of UTF.

CJK cannot be done in ASCII, so the consequences of an "easy-to-enter textual semantic syntax" for CJK are different from AsciiDoc. We need defined rules on what "AsciiDoc" means for "non-ASCII CJK", with the principle that it should be easy to type on a CJK keyboard.

do you mean that the actual input switch between entering DBCs and ascii introduces errors into the code or makes it difficult to input? i think programmers are very used to input using a front-end processor (FEP), when they have to.

i think the beauty of programming languages is their consistent use of a single encoding for statements, with support for other character sets as needed for human-readable output content.

perhaps i am misunderstanding what you are proposing. i look forward to more discussion.

ronaldtse commented 4 weeks ago

conversion issues between JIS/SJIS/EUC/etc still exist today but software today can handle those quite easily depending on the nature of the data and widespread use of UTF

Indeed, Unicode does work and is already universal enough.

do you mean that the actual input switch between entering DBC and ascii introduces errors into the code or makes it difficult to input?

I believe the ASCII assumption makes input difficult for CJK. I have encoded content in Chinese, and I found using ASCII to enter control sequences in AsciiDoc cumbersome.

There are definitely people who don't feel the same way or find it equally convenient to switch keyboards, but I am not one of them... I find it cumbersome trying to switch around keyboards just to type a control sequence.

opoudjis commented 3 weeks ago

Ronald briefly discussed this with me earlier today, and I did not have time to continue the discussion because I was busy in my day job. I am also busy with my Metanorma job, but:

ReesePlews commented 3 weeks ago

a very interesting discussion here...

in the example, 。推奨事項 is mapped to .推奨事項, and in the .adoc file it would be written as 。推奨事項. is that the correct interpretation?

i am sorry but that just seems so confusing. i am not sure if anyone would do this... and only to save time?

coders, and even people writing a lot of documents with MS-WORD, easily understand the difference between the fullwidth 。 and the ASCII . characters.

to clarify, if i understand correctly, the proposal is something like:

==== a_CJK_term
a_CJK_definition

NOTE:a_CJK_note 

if that is not the case, i think more examples are needed.

is this the correct understanding? that is really difficult to input. my FEP actually wants to make an 8-bit NOTE: and it was very difficult to get the 16-bit chars to even come out. also, the space after the ==== will need to be a DBC space, it cannot be a single-byte space in this idea.
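for concreteness, the fullwidth rendering being discussed would presumably look something like this (assuming U+FF1D ＝ as the heading marker, the ideographic space U+3000 after it, and a fullwidth ＮＯＴＥ： label; this is only an illustration):

＝＝＝＝　a_CJK_term
a_CJK_definition

ＮＯＴＥ：a_CJK_note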

I don't know how far this gets you, and I have no idea if you can get out of typing ASCII = in CJK; presumably it's U+FF1D ＝?

i did not have to enter any codes to get the characters to appear. i just turned on the kanji input via a key command (alt `) on my keyboard.

if this indeed is being suggested, i don't think anyone will use it, just my opinion. i think coders are more used to coding programming language constructs than to adding CJK text inline to a document. having everything in CJK would possibly be a burden...

as a user, if this was put to vote as an enhancement, there are a number of other enhancements i would propose / vote for before this.

however, i do agree it is an interesting discussion. for more than 30 years i used a japanese keyboard and was very used to it. then for health reasons i switched to a Kinesis split keyboard that has an english layout. dealing with some kanji input cases is more difficult than on the earlier japanese keyboards. however, i think my keyboard would be difficult for a native japanese typist to use, or it would take some getting used to.

i wonder what type of feedback there would be from the stack exchange or reddit communities about this proposal?

i think it would really add to the discussion, but a clear set of examples would be needed, in my opinion.

ronaldtse commented 3 weeks ago

Backtracking a bit.

The purpose of AsciiDoc is:

1. to be easy to type, with all markup directly accessible from the keyboard
2. to provide semantic, machine-processable markup in plain text

With CJK content, it is unambiguous that the 1st point is not achieved.

In my personal experience, with the 1st point not achieved, it is cumbersome to type Chinese using AsciiDoc in its current form.

The goal of this ticket is to provide a way to achieve the 1st point while not losing the 2nd.

On my Chinese keyboard, these symbols are available.

·~【】「」,。《》/?;:‘“、|-——=+!@#$%⋯⋯&*()1234567890

The inner brackets can be entered when you type the same bracket symbol inside a pair of brackets:

「『』」《〈〉》

That's it. Anything else will require me to switch keyboards. My goal as a user is just to be able to use these symbols instead of the ASCII ones.
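As a purely illustrative sketch (in Ruby, since Coradoc is a Ruby parser), here is one way those symbols could be paired with AsciiDoc control sequences; none of these pairings is a settled proposal:

# Illustrative candidate pairings between symbols available on a Chinese
# keyboard (listed above) and AsciiDoc control characters.
NATIVE_CANDIDATES = {
  "。" => ".",  # block title marker, at the start of a line only
  "、" => ",",  # attribute list separator
  "【" => "[",  # block attribute open, e.g. 【NOTE】
  "】" => "]",  # block attribute close
  "《" => "<<", # cross-reference open, e.g. 《anchor》
  "》" => ">>", # cross-reference close
}.freeze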

ReesePlews commented 3 weeks ago

in thinking a bit more, i don't see this as anything related to metanorma, but only an asciidoc or programming-language character input issue... i still may not be understanding the problem...

what if this "ease of input" was built as a plugin for ms visual studio code? this way it is focused on specific users, not a platform. CJK users who want to save time install the plugin in vsc. i don't know if keeping CJK input mode on continually would work with vsc? i don't know how the resulting file would look... is it always going to be CJK symbols throughout? are they changed back to non-CJK symbols? or are they never changed back, and then remain in the adoc files for other platforms to handle (decode?)

i can understand the ease-of-input aspect of the idea. if the CJK characters were input and then converted back when saved, when you start to edit that file again it would contain non-CJK language constructs plus any CJK content strings, and you could keep making edits in CJK (ease of input). ... to me it seems better outside as a tool, instead of inside the mn platform... i think if there were a mix of CJK and non-CJK encoded adoc files in a project, it would only introduce confusion. i do understand the ease-of-use aspect, i know it is critical and productivity can drastically be reduced with lots of "language input switching", it really slows things down. it's a very interesting discussion, thank you for raising it.
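a minimal sketch of that round trip in Ruby, assuming a hypothetical table of native/ASCII token pairs; only line-leading markup tokens are swapped, so CJK content strings are untouched:

# Hypothetical token pairs; a real tool would need a fuller table.
PAIRS = { "＝＝＝＝" => "====", "ＮＯＴＥ：" => "NOTE:" }.freeze

def to_ascii(line)  # applied when the file is saved
  PAIRS.reduce(line) { |s, (native, ascii)| s.sub(/\A#{Regexp.escape(native)}/, ascii) }
end

def to_native(line) # applied when the file is reopened for editing
  PAIRS.reduce(line) { |s, (native, ascii)| s.sub(/\A#{Regexp.escape(ascii)}/, native) }
end

to_native(to_ascii("ＮＯＴＥ：注記です")) # => "ＮＯＴＥ：注記です" (lossless)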

hmdne commented 1 day ago

I'd say, supporting those characters in AsciiDoc would make us deviate from AsciiDoc.

But why not create a new format, distinguished by extension: let's say ExtendedAsciiDoc, with extension .eadoc, that would support those rules (in addition to the existing ones)? Then, let's say, CoraDoc would handle that format if the extension is correct.

And, since we have #to_adoc basically ready, tested, and working, if we added parsing support it would be possible to convert such an .eadoc document to regular AsciiDoc.
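A minimal sketch of that dispatch in Ruby; ExtendedAsciidocParser is a hypothetical name, while #to_adoc is the existing serializer mentioned above:

# Route a file by extension: .eadoc is parsed with the extended rules and
# serialized back to plain AsciiDoc; .adoc passes through unchanged.
def load_as_adoc(path)
  source = File.read(path, encoding: "UTF-8")
  return source unless File.extname(path) == ".eadoc"
  ExtendedAsciidocParser.parse(source).to_adoc
end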