Open crlf0710 opened 1 year ago
Pandoc has the east_asian_line_breaks extension:
Extension: east_asian_line_breaks Causes newlines within a paragraph to be ignored, rather than being treated as spaces or as hard line breaks, when they occur between two East Asian wide characters. This is a better choice than ignore_line_breaks for texts that include a mix of East Asian wide characters and other characters.
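To make the rule in that description concrete, here is a minimal sketch in Python. It is not Pandoc's implementation. It assumes that "East Asian wide" means a Unicode East Asian Width of W or F, and the names `is_wide` and `join_soft_breaks` are made up for illustration.

```python
import unicodedata


def is_wide(ch: str) -> bool:
    """True if ch has East Asian Width W (wide) or F (fullwidth).

    Assumption: this is what "East Asian wide character" means here.
    """
    return unicodedata.east_asian_width(ch) in ("W", "F")


def join_soft_breaks(paragraph: str) -> str:
    """Join the lines of a single paragraph (hypothetical helper).

    A newline between two wide characters is dropped; any other
    newline becomes a single space.
    """
    # Trim each line and discard empty ones so that we only compare
    # the visible characters on either side of a line break.
    lines = [line.strip() for line in paragraph.split("\n")]
    lines = [line for line in lines if line]
    if not lines:
        return ""
    out = lines[0]
    for line in lines[1:]:
        if is_wide(out[-1]) and is_wide(line[0]):
            out += line          # wide + wide: ignore the newline
        else:
            out += " " + line    # otherwise: treat it as a space
    return out
```

With this sketch, `join_soft_breaks("你好\n世界")` gives `"你好世界"`, while `join_soft_breaks("hello\nworld")` still gives `"hello world"`. The only table it needs is the East Asian Width property, which Python ships in `unicodedata`.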
:thinking: this seems like an important thing to make just work by default. At the same time, yeah, this requires Unicode tables. But, as far as I understand, only during rendering, not during parsing.
I wonder if maybe we should just allow the use of various tables during HTML translation? That way, you can still implement a fully conforming djot parser with very little machinery, but if you convert to HTML, you do have to handle East Asian line breaks and emoji conversion.
I think renderers should be allowed to be as fancy as they want to be. As long as we have the softbreak element in the AST, the necessary information is there, and renderers can do what they like with it.
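As a rough illustration of that division of labour, the sketch below keeps the soft break as its own node and lets the renderer decide what to emit for it. The node shape (a list of `(tag, text)` tuples with tags `"str"` and `"softbreak"`) and the function `render_inlines` are assumptions made for this example, not djot's actual AST or API.

```python
import html
import unicodedata


def is_wide(ch: str) -> bool:
    """True if ch has East Asian Width W (wide) or F (fullwidth)."""
    return unicodedata.east_asian_width(ch) in ("W", "F")


def render_inlines(nodes):
    """Render a flat list of (tag, text) inline nodes as one HTML paragraph.

    Hypothetical AST: ("str", "some text") or ("softbreak", "").
    The parser never consults a Unicode table; only this renderer does.
    """
    parts = []
    for i, (tag, text) in enumerate(nodes):
        if tag == "str":
            parts.append(html.escape(text))
        elif tag == "softbreak":
            prev_node = nodes[i - 1] if i > 0 else None
            next_node = nodes[i + 1] if i + 1 < len(nodes) else None
            # Last character before and first character after the break,
            # or "" when the neighbour is missing or is not a "str" node.
            before = prev_node[1][-1:] if prev_node and prev_node[0] == "str" else ""
            after = next_node[1][:1] if next_node and next_node[0] == "str" else ""
            if before and after and is_wide(before) and is_wide(after):
                continue  # between two wide characters: emit nothing
            parts.append("\n")  # otherwise keep the usual soft break
    return "<p>" + "".join(parts) + "</p>"
```

For the nodes `[("str", "你好"), ("softbreak", ""), ("str", "世界")]` this produces `<p>你好世界</p>`, whereas the same structure with `"hello"` and `"world"` keeps the newline between the two words. A renderer that does not care about this case can simply emit `"\n"` for every soft break.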
Eventually it would probably be good to split off the HTML renderer in this package into a separate one. That would require converting the parser tests to use AST output, which is also independently sensible.
One thing that is annoying, and that in my personal opinion has to some degree kept Markdown/CommonMark from becoming popular in East Asia, is that soft line breaks are generated between lines, which makes the text render in unexpected ways.
For example, in Markdown:
will be rendered as
in browsers, rather than the expected
I'd love to see this fixed somehow in Djot; either automatic removal or manually annotated removal would be OK, I think.
Prior Art
In existing solutions, like Microsoft's XAML markup language, there's the concept of Linefeed Collapsing Characters: when a linefeed has a Linefeed Collapsing Character both before and after it, the linefeed gets stripped. This works fine to a degree; the downside is that it involves using Unicode tables.