miyuchina / mistletoe

A fast, extensible and spec-compliant Markdown parser in pure Python.
MIT License

Use CommonMark definition of punctuation charset #141

Closed jaredliw closed 2 years ago

jaredliw commented 2 years ago

Something goes wrong...

Given

**你好世界。**Hello world!

mistletoe's output is

<p><strong>你好世界。</strong>Hello world!</p>

while it should be (according to CommonMark's dingus):

<p>**你好世界。**Hello world!</p>

Why does this happen?

Root cause (core_tokens.py, lines 9-11):

punctuation = {'!', '"', '#', '$', '%', '&', '\'', '(', ')', '*', '+', ',',
               '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\',
               ']', '^', '_', '`', '{', '|', '}', '~'}

。 (the Chinese full stop) is not in the set, so the closing ** is recognised as a right-flanking delimiter run (which it shouldn't be).
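To make the effect concrete, here is a minimal sketch of the right-flanking test as I read it from the CommonMark spec. This is not mistletoe's actual code: `is_right_flanking` and its parameters are names made up for illustration, and `str.isspace()` only approximates the spec's notion of Unicode whitespace.

```python
import string


def is_right_flanking(before, after, punctuation):
    """Rough CommonMark right-flanking test for a delimiter run.

    before / after: the character just before / just after the run
    ('' stands for the start or end of the line, which counts as whitespace).
    punctuation: the set of characters treated as punctuation.
    """
    before_is_space = before == "" or before.isspace()
    after_is_space = after == "" or after.isspace()
    if before_is_space:
        return False
    if before not in punctuation:
        return True
    # Preceded by punctuation: must be followed by whitespace or punctuation.
    return after_is_space or after in punctuation


ascii_only = set(string.punctuation)  # same 32 characters as the set above
with_cjk = ascii_only | {"。"}

# The closing ** in "**你好世界。**Hello world!" sits between "。" and "H".
print(is_right_flanking("。", "H", ascii_only))  # True  (can close: the bug)
print(is_right_flanking("。", "H", with_cjk))    # False (cannot close)
```

With the ASCII-only set the run may close the emphasis, which matches the output shown above; once 。 counts as punctuation it no longer can.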

We should use a broader punctuation charset that also includes CJK punctuation (and more).
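One possible way to get such a charset without listing every character is to ask Python's `unicodedata` module for the general category. The sketch below assumes the definition "ASCII punctuation, or any character in a Unicode P* category"; `is_unicode_punctuation` is a hypothetical helper name, not something that exists in mistletoe, and the actual change may be implemented differently.

```python
import string
import unicodedata

ASCII_PUNCTUATION = set(string.punctuation)


def is_unicode_punctuation(ch):
    """True for ASCII punctuation or any character in a Unicode P* category
    (Pc, Pd, Pe, Pf, Pi, Po, Ps)."""
    return ch in ASCII_PUNCTUATION or unicodedata.category(ch).startswith("P")


print(is_unicode_punctuation("。"))  # True  (U+3002, category Po)
print(is_unicode_punctuation("$"))   # True  (category Sc, but ASCII punctuation)
print(is_unicode_punctuation("H"))   # False
print(is_unicode_punctuation("你"))  # False (category Lo)
```

The explicit ASCII set is kept because characters such as $ and + are symbols (S*) rather than punctuation (P*) in Unicode, so a category check alone would drop them.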

CommonMark Spec:

[image: screenshot of the relevant CommonMark spec passage]

See the "Files changed" tab; I left some comments on the modifications I made.

Signed-off-by: jaredliw jaredliw@gmail.com