I found that this parser does not work well when parsing documentation written in CJK.
For example, :strong:`text`。 (followed by a Chinese full stop 。; the English equivalent is .) is valid inline markup (rst2pseudoxml accepts it), but tree-sitter-rst does not recognize it correctly.
According to the inline markup recognition rules:
Inline markup start-strings must start a text block or be immediately preceded by
whitespace,
one of the ASCII characters - : / ' " < ( [ {
or a similar non-ASCII punctuation character. [18]
Inline markup end-strings must end a text block or be immediately followed by
whitespace,
one of the ASCII characters - . , : ; ! ? \ / ' " ) ] } >
or a similar non-ASCII punctuation character. [19]
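The two rules above can be read as a pair of predicates. Here is a minimal Python sketch of that reading. It is not the code of Docutils or tree-sitter-rst: the function names are made up for illustration, and treating "similar non-ASCII punctuation" as the Unicode general categories Pd, Po, Pi, Pf plus Ps (start) or Pe (end) is an assumption.

```python
import unicodedata

# ASCII characters listed explicitly in the rules.
ASCII_START = set("-:/'\"<([{")
ASCII_END = set("-.,:;!?\\/'\")]}>")

# Assumption: "similar non-ASCII punctuation" is approximated by these
# Unicode general categories (dash, other, initial/final quote, open/close).
START_CATEGORIES = {"Pd", "Po", "Pi", "Pf", "Ps"}
END_CATEGORIES = {"Pd", "Po", "Pi", "Pf", "Pe"}


def may_precede_start_string(ch: str) -> bool:
    """True if `ch` may come right before an inline markup start-string.

    An empty string stands for the start of the text block.
    """
    if ch == "" or ch.isspace() or ch in ASCII_START:
        return True
    return ord(ch) > 127 and unicodedata.category(ch) in START_CATEGORIES


def may_follow_end_string(ch: str) -> bool:
    """True if `ch` may come right after an inline markup end-string.

    An empty string stands for the end of the text block.
    """
    if ch == "" or ch.isspace() or ch in ASCII_END:
        return True
    return ord(ch) > 127 and unicodedata.category(ch) in END_CATEGORIES
```

With this reading, the Chinese full stop 。 (U+3002) is accepted after an end-string, which matches what rst2pseudoxml does with :strong:`text`。.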
I have made a PR (#10) for this, but it is not a good fix.
Docutils provides some regexes for matching these non-ASCII punctuation characters. As far as I understand, matching them in src/tree_sitter_rst/chars.c::is_{start,end}_char should fix this issue.
Just FYI, I am working on this by generating C char arrays from docutils.utils.punctuation_chars and using them to replace valid_chars inside the is_{start,end}_char functions.
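To illustrate the generation step, here is a rough Python sketch that emits a C table of code point ranges. It is only a sketch under assumptions: it derives the characters from the standard unicodedata module rather than from docutils.utils.punctuation_chars, the table name end_punctuation_ranges is hypothetical, and the real layout of valid_chars in chars.c may differ.

```python
import sys
import unicodedata

# Assumption: stand-in for the character set Docutils uses after end-strings.
END_CATEGORIES = {"Pd", "Po", "Pi", "Pf", "Pe"}


def collect_code_points(categories):
    """Return sorted non-ASCII code points whose category is in `categories`."""
    return [
        cp
        for cp in range(0x80, sys.maxunicode + 1)
        if unicodedata.category(chr(cp)) in categories
    ]


def to_ranges(code_points):
    """Collapse sorted code points into inclusive (first, last) ranges."""
    ranges = []
    for cp in code_points:
        if ranges and cp == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], cp)
        else:
            ranges.append((cp, cp))
    return ranges


def emit_c_table(name, ranges):
    """Render the ranges as a C array of {first, last} pairs."""
    lines = [f"static const int32_t {name}[][2] = {{"]
    lines += [f"  {{0x{first:04X}, 0x{last:04X}}}," for first, last in ranges]
    lines.append("};")
    return "\n".join(lines)


if __name__ == "__main__":
    table = to_ranges(collect_code_points(END_CATEGORIES))
    # Hypothetical table name; paste or include the output on the C side.
    print(emit_c_table("end_punctuation_ranges", table))
```

Storing ranges instead of single characters keeps the table short and sorted, so the C side could use a binary search instead of scanning a long list.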
Hi stsewd, thanks for your awesome rst parser!