The judgement of CJK characters is incomplete

chawyehsu commented 6 years ago

TEST CASE

expect:

Remove linebreaks between CJK characters
.
你好，
世界
.
<p>你好，世界</p>

actual:

Remove linebreaks between CJK characters
.
你好，
世界
.
<p>你好，
世界</p>

According to the W3C CSS Text Module Level 3 spec (css-text-3/#line-break-transform):

Otherwise, if the East Asian Width property [UAX11] of both the character before and after the segment break is F, W, or H (not A), and neither side is Hangul, then the segment break is removed.

In short: if the East Asian Width property of the characters on both sides is F, W, or H (not A), and neither side is Hangul, then the segment break is removed.

I think we should follow the spec and use the UAX #11 East Asian Width property to decide which characters count as CJK.

AFAIK, there is a package named eastasianwidth that provides this lookup. After that check, if either the left or the right character is Hangul, the segment break should be kept.
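
For illustration, here is a minimal sketch of the rule quoted above. It is not the plugin's actual code. The range tables are a small, hand-picked approximation of the UAX #11 data, and the helper names (`isFWH`, `isHangul`, `shouldRemoveBreak`, `joinCjkLines`) are hypothetical. A real implementation would look up the full East Asian Width property, for example through the eastasianwidth package.

```javascript
// Approximate subset of code point ranges whose East Asian Width is F, W, or H.
// ASSUMPTION: simplified for illustration; the real UAX #11 table has many more
// ranges and some gaps inside the ranges listed here.
const FWH_RANGES = [
  [0x1100, 0x115f], // Hangul Jamo initial consonants (W)
  [0x3000, 0x303e], // CJK symbols and punctuation (F/W)
  [0x3041, 0x30ff], // Hiragana and Katakana (W)
  [0x4e00, 0x9fff], // CJK Unified Ideographs (W)
  [0xac00, 0xd7a3], // Hangul Syllables (W)
  [0xff01, 0xff60], // Fullwidth forms (F)
  [0xff61, 0xffdc], // Halfwidth forms (H)
];

// Approximate subset of Hangul ranges (also an assumption, not exhaustive).
const HANGUL_RANGES = [
  [0x1100, 0x11ff], // Hangul Jamo
  [0x3130, 0x318f], // Hangul Compatibility Jamo
  [0xac00, 0xd7af], // Hangul Syllables
  [0xffa0, 0xffdc], // Halfwidth Hangul
];

function inRanges(ch, ranges) {
  const cp = ch.codePointAt(0);
  return ranges.some(([lo, hi]) => cp >= lo && cp <= hi);
}

const isFWH = (ch) => inRanges(ch, FWH_RANGES);
const isHangul = (ch) => inRanges(ch, HANGUL_RANGES);

// The spec rule: both sides are F/W/H and neither side is Hangul.
function shouldRemoveBreak(before, after) {
  return isFWH(before) && isFWH(after) && !isHangul(before) && !isHangul(after);
}

// Join lines of plain text, dropping only the breaks the rule allows.
function joinCjkLines(text) {
  const lines = text.split("\n");
  let out = lines[0];
  for (let i = 1; i < lines.length; i++) {
    const before = Array.from(lines[i - 1]).pop(); // last code point of previous line
    const after = Array.from(lines[i])[0];         // first code point of this line
    const remove =
      before !== undefined && after !== undefined && shouldRemoveBreak(before, after);
    out += (remove ? "" : "\n") + lines[i];
  }
  return out;
}
```

With these tables, `joinCjkLines("你好，\n世界")` returns `"你好，世界"`, because the fullwidth comma (U+FF0C) is F and 世 is W. A break next to a Hangul syllable or a Latin letter is left in place.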

puzrin commented 6 years ago

Fixed by https://github.com/markdown-it/markdown-it-cjk-breaks/commit/f187b8d49db9ae2a82762b2c297527e46c1cd393, thank you for the details.

chawyehsu commented 6 years ago

Awesome, thank you guys for this amazing markdown parser!

puzrin commented 6 years ago

Note: it was done quickly, without performance optimizations. That may be noticeable if you need to parse billions of documents per second :)
