microsoft / vscode-markdown-tm-grammar

VS Code built-in markdown extension's Textmate grammar
MIT License

Markdown: leading Unicode (non-ASCII) whitespace breaks syntax highlighting #131

Closed: publictheta closed this issue 1 year ago

publictheta commented 2 years ago

Issue Type: Bug

Steps to Reproduce:

  1. Create a Markdown file.
  2. Copy and paste the following:
Highlighted:

[test]() (None)
 [test]() (U+0020, SPACE)
M[test]() (U+004D, LATIN CAPITAL LETTER M)

Not highlighted:

 [test]() (U+2002, EN SPACE)
 [test]() (U+2003, EM SPACE)
 [test]() (U+3000, IDEOGRAPHIC SPACE)
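The leading character in the samples above is easy to lose or normalize when copying. A minimal Python sketch for checking which character actually starts a line, using the standard `unicodedata` module, might look like this. The helper name `describe_leading_char` is hypothetical and is not part of VS Code or the grammar.

```python
import unicodedata


def describe_leading_char(line: str) -> str:
    """Return a label such as 'U+2002, EN SPACE' for the first character of a line."""
    if not line:
        return "None"
    ch = line[0]
    # unicodedata.name raises ValueError for unnamed characters unless a default is given.
    name = unicodedata.name(ch, "UNKNOWN")
    return f"U+{ord(ch):04X}, {name}"


# Sample lines built with explicit escapes so the leading character is unambiguous.
samples = [
    "\u0020[test]()",
    "M[test]()",
    "\u2002[test]()",
    "\u2003[test]()",
    "\u3000[test]()",
]

for sample in samples:
    print(describe_leading_char(sample))
```

Writing the samples with `\u` escapes avoids depending on how an editor or clipboard handles the whitespace.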

This issue was originally reported as publictheta/vscode-japanese-novel#1 (in Japanese). I maintain that extension, but I am not the original reporter.

Note

I have not fully investigated, but this may be caused by inappropriate use of \s and \S in markdown.tmLanguage.json.

From Oniguruma's Documentation (L60-L69):

  \s       whitespace char

           Not Unicode:
             \t, \n, \v, \f, \r, \x20

           Unicode case:
             U+0009, U+000A, U+000B, U+000C, U+000D, U+0085(NEL),
             General_Category -- Line_Separator
                              -- Paragraph_Separator
                              -- Space_Separator
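The difference between the two lists above can be illustrated without Oniguruma. The sketch below uses Python's `re` module purely as a stand-in: its default Unicode-aware `\s` and its `re.ASCII` flag serve here only as an analogy for the "Unicode case" and "Not Unicode" behavior. It is not the engine VS Code uses, and the helper name `classify` is hypothetical.

```python
import re

# For str patterns, \s is Unicode-aware by default in Python.
UNICODE_WS = re.compile(r"\s")
# With re.ASCII, \s is restricted to [ \t\n\r\f\v].
ASCII_WS = re.compile(r"\s", re.ASCII)


def classify(ch: str) -> tuple[bool, bool]:
    """Return (matches Unicode-aware \\s, matches ASCII-only \\s) for one character."""
    return (bool(UNICODE_WS.fullmatch(ch)), bool(ASCII_WS.fullmatch(ch)))


for ch in [" ", "\u2002", "\u2003", "\u3000", "M"]:
    unicode_match, ascii_match = classify(ch)
    print(f"U+{ord(ch):04X}: unicode={unicode_match} ascii={ascii_match}")
```

In this analogy, the three characters from the "Not highlighted" samples match only the Unicode-aware `\s`, which is consistent with the suspicion that the grammar treats them as whitespace.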

According to CommonMark, Unicode (non-ASCII) whitespace characters seem to have no special effect except in the delimiter run rule.

VS Code version: Code 1.68.1 (30d9c6cd9483b2cc586687151bcbcd635f373630, 2022-06-14T12:52:13.188Z)
OS version: Darwin x64 21.5.0
Restricted Mode: No

Extensions (1)

Extension|Author (truncated)|Version
---|---|---
vscode-language-pack-ja|MS-|1.68.6150906