github / markup

Determines which markup library to use to render a content file (e.g. README) on GitHub
MIT License

Spaces around East Asian punctuations in decorated text should not be required #1076

Open ikedas opened 7 years ago

ikedas commented 7 years ago

Problem Description

In East Asian texts in general, word separators (spaces) are never written explicitly. So

> 前の**文字列**の後

should be rendered as

前の文字列の後

(Image) ex000

and in practice this works as expected.

However, if the text fragment to be decorated ends and/or starts with punctuation:

> 前の**前の「文字列」**の後、前の**「文字列」の後**の後、そのあと。
> 
> 前の**「文字列」**の後。

they won't render as expected:

前の前の「文字列」の後、前の「文字列」の後、そのあと。

前の「文字列」の後。

(Image) ex001

A possible workaround is to insert a space before or after the punctuation ("␣" means space):

> 前の**前の「文字列」**␣の後、前の␣**「文字列」の後**の後、そのあと。
> 
> 前の␣**「文字列」**␣の後。

but it generates ugly text with an extra space before or after the punctuation:

前の前の「文字列」 の後、前の 「文字列」の後の後、そのあと。

前の 「文字列」 の後。

(Image) ex002

Suggested modification

East Asian punctuation should be treated in the same way as normal East Asian characters (Chinese ideographs and so on).

FYI: Almost all East Asian punctuation marks are listed here:

kivikakk commented 7 years ago

:wave: Thanks for the report. Please note that the github/markup repository's issues are really just for issues regarding the github-markup gem itself, which doesn't have anything to do with Markdown processing. You'd be better off contacting our support team with these kinds of issues in future, because we have lots of support staff but only a couple busy engineers who monitor this repo.

For this issue specifically, the root cause is in the CommonMark specification, which we adhere to. The section of the specification on emphasis states:

> A left-flanking delimiter run is a delimiter run that is (a) not followed by Unicode whitespace, and (b) either not followed by a punctuation character, or preceded by Unicode whitespace or a punctuation character. For purposes of this definition, the beginning and the end of the line count as Unicode whitespace.
> 
> A right-flanking delimiter run is a delimiter run that is (a) not preceded by Unicode whitespace, and (b) either not preceded by a punctuation character, or followed by Unicode whitespace or a punctuation character. For purposes of this definition, the beginning and the end of the line count as Unicode whitespace.

"punctuation character" is defined as "an ASCII punctuation character or anything in the Unicode classes Pc, Pd, Pe, Pf, Pi, Po, or Ps", and 「 and 」 are in the Ps and Pe categories respectively.
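The categories of the corner brackets 「 and 」 can be checked directly. The following is a minimal sketch using Python's standard `unicodedata` module; Python is used here only for illustration and is not part of github-markup or of any CommonMark implementation.

```python
import unicodedata

# Corner brackets from the examples in this issue, plus an ordinary kana for contrast.
OPEN_BRACKET = "\u300c"   # 「 LEFT CORNER BRACKET
CLOSE_BRACKET = "\u300d"  # 」 RIGHT CORNER BRACKET
KANA = "\u306e"           # の HIRAGANA LETTER NO

# Map each character to its Unicode general category.
categories = {ch: unicodedata.category(ch) for ch in (OPEN_BRACKET, CLOSE_BRACKET, KANA)}

for ch, cat in categories.items():
    print(f"U+{ord(ch):04X} {ch} -> {cat}")
```

The brackets fall into the punctuation classes named by the specification, while the kana is an ordinary letter (`Lo`), which is why undecorated text between kana or ideographs is unaffected.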

The problem here is that this definition of punctuation character makes sense in the context of the specification if we assume "Unicode whitespace" is a part of the text used (as with most Latin alphabet-derived languages); we expect to see The cat is called "Nodoka". but not 猫は「のどか」という。, where the latter has no space or punctuation character separating the 「」 from the surrounding text.

Hence, when we add emphasis (e.g. around "Nodoka"), we get: The cat is called **"Nodoka"**. but not 猫は**「のどか」**という。

With the English text, the opening ** satisfies the definition of a "left-flanking delimiter run": it is (a) not followed by Unicode whitespace ("), and (b) preceded by Unicode whitespace. The closing ** satisfies the definition of a "right-flanking delimiter run": it is (a) not preceded by Unicode whitespace ("), and (b) followed by a punctuation character (.).

With the Japanese text, however, the opening ** does not satisfy the definition of a "left-flanking delimiter run": it is (a) not followed by Unicode whitespace (「), but (b) it is followed by a punctuation character, and it is not preceded by Unicode whitespace or a punctuation character (は). Likewise, the closing ** does not satisfy the definition of a "right-flanking delimiter run": it is (a) not preceded by Unicode whitespace (」), but (b) it is preceded by a punctuation character, and it is not followed by Unicode whitespace or punctuation (と).
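The two walkthroughs above can be reproduced with a small sketch of the flanking rules as quoted from the specification. This is not the code of any real CommonMark parser: the helper names `is_whitespace`, `is_punctuation`, and `flanking` are hypothetical, and the whitespace test (Zs class plus tab, newline, form feed, carriage return) is an assumption about the specification's definition of Unicode whitespace.

```python
import string
import unicodedata

# Unicode classes the quoted definition treats as punctuation.
PUNCT_CATEGORIES = {"Pc", "Pd", "Pe", "Pf", "Pi", "Po", "Ps"}


def is_whitespace(ch):
    # None stands for the beginning or end of the line, which counts as whitespace.
    # Assumption: Unicode whitespace = Zs class, tab, newline, form feed, carriage return.
    return ch is None or ch in "\t\n\f\r" or unicodedata.category(ch) == "Zs"


def is_punctuation(ch):
    # ASCII punctuation, or anything in the listed Unicode punctuation classes.
    return ch is not None and (
        ch in string.punctuation or unicodedata.category(ch) in PUNCT_CATEGORIES
    )


def flanking(text, start, length=2):
    """Return (left_flanking, right_flanking) for the delimiter run at text[start:start+length]."""
    before = text[start - 1] if start > 0 else None
    end = start + length
    after = text[end] if end < len(text) else None
    left = not is_whitespace(after) and (
        not is_punctuation(after) or is_whitespace(before) or is_punctuation(before)
    )
    right = not is_whitespace(before) and (
        not is_punctuation(before) or is_whitespace(after) or is_punctuation(after)
    )
    return left, right


english = 'The cat is called **"Nodoka"**.'
japanese = "猫は**「のどか」**という。"

for label, text in (("English", english), ("Japanese", japanese)):
    opening = flanking(text, text.index("**"))
    closing = flanking(text, text.rindex("**"))
    print(label, "opening (left, right):", opening, "closing (left, right):", closing)
```

Under these assumptions, the English opening run is left-flanking and the English closing run is right-flanking, whereas the Japanese opening run is not left-flanking and the Japanese closing run is not right-flanking, so neither can delimit strong emphasis.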

In short, this is a deficiency with the CommonMark specification's handling of East Asian text in general, because of the way the specification assumes interaction between punctuation characters and whitespace characters. I'll raise this issue (along with all the above information) in the CommonMark Discussion forum and work toward a solution.

Thanks for your patience and for the report!

kivikakk commented 7 years ago

Thread opened here: https://talk.commonmark.org/t/emphasis-and-east-asian-text/2491

ikedas commented 7 years ago

@kivikakk thanks. I'll comment on the new thread.

kivikakk commented 6 years ago

It's been over a year and we still haven't had movement here; pinging the upstream repo now.