stsewd / tree-sitter-rst

reStructuredText grammar for tree-sitter
https://stsewd.dev/tree-sitter-rst/
MIT License
50 stars 7 forks source link

Allow NBSP (`0xA0`) character around `*` and other markers #13

Closed cpkio closed 3 years ago

cpkio commented 3 years ago

Example

This text *should be parsed* — correctly.

= This text should be parsed0xA0— correctly.

This text ``should`` be parsed correctly too.

= This text should0xA0be parsed correctly too.

There a few spacing characters in Unicode, non-breaking space is one most commonly used to avoid breaking word a from dash or two words together.

stsewd commented 3 years ago

Hi, thanks for the report! Sorry for the late reply. I'm able to replicate this, I'll try to find a fix.

cpkio commented 3 years ago

I think this commit is not enough. Probably you should add || scanner->lookahead == CHAR_NBSP to src\tree_sitter_rst\parser.c at line 17, which fixes the issue for me.

stsewd commented 3 years ago

I think this commit is not enough. Probably you should add || scanner->lookahead == CHAR_NBSP to src\tree_sitter_rst\parser.c at line 17, which fixes the issue for me.

Yeah, that probably needs to be updated. Can you provide the text that's failing to parse? So I can add it to the test cases.

cpkio commented 3 years ago

Try this. As you can see, I use 0x2014 after 0xA0.

- *Etiam* — sodales lacus at arcu dapibus pellentesque. Morbi consectetur dictum
  ipsum, consectetur tempus ante venenatis et. Praesent maximus aliquam ex,
  quis luctus ipsum pellentesque vel.

- Donec **sed arcu dolor** — In ex ligula, finibus id justo nec, luctus consectetur
  sapien. Phasellus convallis rutrum nisl ac aliquet. Nulla mollis nec velit
  placerat venenatis. Nunc sit amet arcu vel turpis tristique pellentesque at
  nec nisl. Curabitur non facilisis ligula, nec semper mi.

- Donec vehicula lacus in ``ligula vestibulum`` — a fringilla ligula
  elementum. Donec volutpat auctor nunc vitae fermentum. Suspendisse a felis
  nec ligula porttitor tincidunt. Nam nec dapibus mauris.
stsewd commented 3 years ago

@cpkio it makes sense to me to add that check there. But I wasn't able to reproduce with your example. Actually, github replaces the NBSP chars with spaces, so I'm not sure if I'm testing the same file.

Here is the file where a manually replaced the spaces with NBSP chars before the em-dash.

test.txt

Could you share it as a file so github doesn't replace them?

cpkio commented 3 years ago

Is this ok?

stsewd commented 3 years ago

This is what I get with the file you sent.

Screenshot-20210928222218-1693x300

cpkio commented 3 years ago

If you're using rst parser built from current master (no src\tree_sitter_rst\parser.c fix as I've suggested), that's strange. I've checked once again — made isolated Neovim, cloned my repo, installed rst parser with :TSInstall rst and checked the playground. Still 0xA0 breaks the formatting. I'm on Neovim 0.5.1 release on Windows x64.

(Pay attention there are no rogue rst.* in \start\nvim-treesitter\parser folder, TS plugin seems to load any rst.* it finds first).

stsewd commented 3 years ago

I get the same CST when running outside neovim as well.

$ npm run parse -- nbsp-after-inline-markup.rst

(document [0, 0] - [12, 0]
  (paragraph [0, 0] - [2, 35]
    (emphasis [0, 0] - [0, 7]))
  (paragraph [4, 0] - [7, 56]
    (strong [4, 7] - [4, 25]))
  (paragraph [9, 0] - [11, 55]
    (literal [9, 24] - [9, 45])))

I may try booting into a virtual Windows machine, maybe a problem with the \r\n line termination. But I do have set fileformat=dos on my neovim when opening that file.

You could also try adding a test that fails around here

https://github.com/stsewd/tree-sitter-rst/blob/a9281c2250e0d32750de159413cdd87120407389/test/corpus/inline_markup.txt#L28-L28

and run it with npm run build && npm run test

stsewd commented 3 years ago

So, just for curiosity run the CI on windows https://github.com/stsewd/tree-sitter-rst/pull/15, looks like tests fail on windows for some reason.

stsewd commented 3 years ago

OK, looks like the bug was related to \r\n line termination on Windows. Fixed this on https://github.com/stsewd/tree-sitter-rst/pull/15. Hopefully there aren't \r chars without a \n in the wild... This should improve your experience of the parser on windows :+1:

cpkio commented 3 years ago

Thanks. Now fresh master clone works fine if compiled by me.

Still, if the grammar is installed from Neovim with :TSInstall rst, it is not working. But it is not your problem, obviously the nvim-treesitter build script is wrong, though it is using the same toolchain as my manual builds. I'm building with gcc --std=c99 -o rst.so -shared src/parser.c src/scanner.c -Os -Isrc/ command.

stsewd commented 3 years ago

@cpkio the lockfile needs to be updated in order to use the new version of the parser https://github.com/nvim-treesitter/nvim-treesitter/pull/1895

cpkio commented 3 years ago

This resolves the issue.