jgm / djot

A light markup language
https://djot.net
MIT License

CJK text and soft line breaks #131

Open crlf0710 opened 1 year ago

crlf0710 commented 1 year ago

One thing that is annoying, and that in my personal opinion has to some degree kept Markdown/CommonMark from becoming popular in East Asia, is that soft line breaks are generated between lines, which makes the rendered text look different from what the author expects.

For example, in Markdown,

自强不息,厚
德载物

will be rendered as

自强不息,厚 德载物

in browsers, rather than the expected

自强不息,厚德载物

I'd love to see this fixed somehow in Djot. Either automatic removal or manually annotated removal would be OK, I think.

Prior Art

Existing solutions, such as Microsoft's XAML markup language, have the concept of Linefeed Collapsing Characters: when a linefeed has a Linefeed Collapsing Character both before and after it, the linefeed is stripped. This works fine to a degree; the drawback is that it requires Unicode tables.

jgm commented 1 year ago

Pandoc has the east_asian_line_breaks extension:

Extension: east_asian_line_breaks

Causes newlines within a paragraph to be ignored, rather than being treated as spaces or as hard line breaks, when they occur between two East Asian wide characters. This is a better choice than ignore_line_breaks for texts that include a mix of East Asian wide characters and other characters.
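
To make the rule above concrete, here is a minimal Python sketch of dropping a newline only when it sits between two East Asian wide characters. It is not Pandoc's implementation: the function names `is_wide` and `join_soft_breaks` are hypothetical, and treating the Unicode East Asian Width classes "W" and "F" as "wide" is an assumption about what the extension means by that term.

```python
import unicodedata


def is_wide(ch: str) -> bool:
    # Assumption: "wide" means East Asian Width class W (wide) or F (fullwidth).
    return unicodedata.east_asian_width(ch) in ("W", "F")


def join_soft_breaks(paragraph: str) -> str:
    """Join the lines of one paragraph, dropping newlines between wide characters."""
    lines = paragraph.split("\n")
    out = lines[0]
    for line in lines[1:]:
        if out and line and is_wide(out[-1]) and is_wide(line[0]):
            out += line          # newline between two wide characters: ignore it
        else:
            out += " " + line    # otherwise keep the usual soft-break space
    return out
```

With the example from the top of this issue, `join_soft_breaks("自强不息,厚\n德载物")` joins the two lines without a space, while `join_soft_breaks("hello\nworld")` still yields a space between the words.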

matklad commented 1 year ago

:thinking: this seems like an important thing to make just work by default. At the same time, yeah, this requires Unicode tables. But, as far as I understand, only during rendering, not during parsing.

I wonder if maybe we should just allow the use of various tables during HTML translation? That way, you can still implement a fully conforming djot parser with very little machinery, but, if you convert to HTML, you do have to handle East Asian line breaks and emoji conversion.

jgm commented 1 year ago

I think renderers should be allowed to be as fancy as they want to be. As long as we have the softbreak element in the AST, the necessary information is there, and renderers can do what they like with it.
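
As a rough illustration of a renderer making that decision from the AST, the sketch below walks a flat list of inline nodes and omits a softbreak when it falls between two wide characters. The node shape (dicts with a `tag` of `"str"` or `"softbreak"`) and the names `render_inlines` and `is_wide` are hypothetical, not the actual djot AST or renderer API, and the "W"/"F" width test is an assumption.

```python
import unicodedata
from html import escape


def is_wide(ch: str) -> bool:
    # Assumption: East Asian Width classes W and F count as "wide".
    return unicodedata.east_asian_width(ch) in ("W", "F")


def render_inlines(nodes: list) -> str:
    """Render hypothetical inline nodes to HTML text, handling softbreaks."""
    parts = []
    for i, node in enumerate(nodes):
        if node["tag"] == "str":
            parts.append(escape(node["text"]))
        elif node["tag"] == "softbreak":
            prev = nodes[i - 1] if i > 0 else None
            nxt = nodes[i + 1] if i + 1 < len(nodes) else None
            between_wide = bool(
                prev and nxt
                and prev["tag"] == "str" and nxt["tag"] == "str"
                and prev["text"] and nxt["text"]
                and is_wide(prev["text"][-1]) and is_wide(nxt["text"][0])
            )
            if not between_wide:
                parts.append("\n")  # an ordinary softbreak stays a newline in the HTML
    return "".join(parts)
```

The parser stays free of Unicode tables in this arrangement; only the renderer needs the width lookup.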

Eventually it would probably be good to split off the HTML renderer in this package into a separate one. That would require converting the parser tests to use AST output, which is also independently sensible.