stsewd / tree-sitter-rst

reStructuredText grammar for tree-sitter
https://stsewd.dev/tree-sitter-rst/
MIT License

Scanner should recognize non-ASCII punctuation chars #53

Closed SilverRainZ closed 5 months ago

SilverRainZ commented 6 months ago

Hi stsewd, thanks for your awesome rst parser!

I found that this parser does not work well when parsing documentation written in CJK. For example, :strong:`text`。 (followed by a Chinese full stop 。, the equivalent of the English .) is valid inline markup (rst2pseudoxml accepts it), but tree-sitter-rst does not recognize it correctly.

How to reproduce

$ echo ':strong:`text`。' > example.rst
$ rst2pseudoxml example.rst
<document source="example.rst">
    <paragraph>
        <strong>
            text
        。
$ tree-sitter p example.rst
(document [0, 0] - [1, 0]
  (ERROR [0, 0] - [0, 8]
    (role [0, 0] - [0, 8]))
  (paragraph [0, 8] - [0, 17]))
example.rst        0.03 ms         607 bytes/ms (ERROR [0, 0] - [0, 8])

How to fix

According to the Inline markup recognition rules:

Inline markup start-strings must start a text block or be immediately preceded by

  • whitespace,
  • one of the ASCII characters - : / ' " < ( [ {
  • or a similar non-ASCII punctuation character. [18]

Inline markup end-strings must end a text block or be immediately followed by

  • whitespace,
  • one of the ASCII characters - . , : ; ! ? \ / ' " ) ] } >
  • or a similar non-ASCII punctuation character. [19]
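The two rules above can be sketched as a pair of predicates. This is only an illustration in Python, not the scanner's actual code: the function names are hypothetical, and the use of Unicode general categories (Ps/Pe/Pi/Pf/Pd/Po) to stand in for "similar non-ASCII punctuation" is an assumption that approximates, but may not exactly match, the tables Docutils ships.

```python
import unicodedata

# ASCII characters taken directly from the rules quoted above.
ASCII_START_CHARS = set("-:/'\"<([{")
ASCII_END_CHARS = set("-.,:;!?\\/'\")]}>")

# Assumption: these Unicode general categories approximate
# "a similar non-ASCII punctuation character".
START_CATEGORIES = {"Ps", "Pi", "Pf", "Pd", "Po"}
END_CATEGORIES = {"Pe", "Pi", "Pf", "Pd", "Po"}


def _matches(ch, ascii_chars, categories):
    """Check one character against whitespace, the ASCII set, then categories."""
    if ch.isspace():
        return True
    if ord(ch) < 128:
        return ch in ascii_chars
    return unicodedata.category(ch) in categories


def may_precede_start_string(ch):
    """Hypothetical helper: may `ch` come right before a start-string?"""
    return _matches(ch, ASCII_START_CHARS, START_CATEGORIES)


def may_follow_end_string(ch):
    """Hypothetical helper: may `ch` come right after an end-string?"""
    return _matches(ch, ASCII_END_CHARS, END_CATEGORIES)
```

Under this sketch, `may_follow_end_string("。")` is true because U+3002 has the general category Po, which is the case the reproduction above trips over.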

I made a PR (#10) for this, but it is not a good fix. Docutils provides regular expressions for matching these non-ASCII punctuation characters. As far as I understand, matching them in src/tree_sitter_rst/chars.c::is_{start,end}_char should fix this issue.

SilverRainZ commented 5 months ago

Just FYI, I am working on this by generating C char arrays from docutils.utils.punctuation_chars and using them to replace valid_chars inside the is_{start,end}_char functions.

I will file a PR soon.
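For illustration, a generator along these lines might look like the sketch below. It is not the script used for the PR: the helper names, the array name, and the `int32_t` element type are hypothetical, and the code points are collected with Python's `unicodedata` as a stand-in for the strings in docutils.utils.punctuation_chars, so the sketch runs without Docutils installed.

```python
import sys
import unicodedata


def collect_chars(categories):
    """Return sorted non-ASCII code points whose general category is in `categories`."""
    return [
        cp
        for cp in range(0x80, sys.maxunicode + 1)
        if unicodedata.category(chr(cp)) in categories
    ]


def to_c_array(name, code_points):
    """Render code points as a C array definition, eight values per row."""
    rows = []
    for i in range(0, len(code_points), 8):
        chunk = code_points[i:i + 8]
        rows.append("  " + ", ".join(f"0x{cp:04X}" for cp in chunk) + ",")
    body = "\n".join(rows)
    return (
        f"static const int32_t {name}[] = {{\n{body}\n}};\n"
        f"static const unsigned {name}_len = {len(code_points)};\n"
    )


if __name__ == "__main__":
    # Assumption: these categories approximate the characters allowed
    # after an inline markup end-string.
    end_chars = collect_chars({"Pe", "Pi", "Pf", "Pd", "Po"})
    print(to_c_array("non_ascii_end_chars", end_chars))
```

Because the emitted array is sorted, the C side could look characters up with a binary search instead of a linear scan, though whether that matters depends on how often the scanner calls the check.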