tree-sitter / tree-sitter-python

Python grammar for tree-sitter
MIT License

Helping injected languages parse docstrings #241

Closed Wuestengecko closed 1 year ago

Wuestengecko commented 1 year ago

Hi! Currently, the tree-sitter-python parser only emits nodes for a string's beginning, middle and end parts. Consider the following sample Python file:

"""A demo module.

This module has a class that does nothing.
"""

class Demo:
    """A demo class.

    It does nothing.
    """

This produces the following named nodes (as seen by Neovim's :InspectTree command):

Treesitter Nodes for the sample Python file
(expression_statement) ; [1:1 - 4:3] python
 (string) ; [1:1 - 4:3] python
  (string_start) ; [1:1 - 3] python
  (string_content) ; [1:4 - 4:0] python
  (string_end) ; [4:1 - 3] python
(class_definition) ; [7:1 - 11:7] python
 name: (identifier) ; [7:7 - 10] python
 body: (block) ; [8:5 - 11:7] python
  (expression_statement) ; [8:5 - 11:7] python
   (string) ; [8:5 - 11:7] python
    (string_start) ; [8:5 - 7] python
    (string_content) ; [8:8 - 11:4] python
    (string_end) ; [11:5 - 7] python

The problem

This works fine if you only want to highlight the entire string contents as "string", but it causes trouble with injected languages. For example, when rst is injected into docstrings, the rst parser gets confused by the leading four-space indentation and interprets the indented lines as a block quote, which messes with the highlighting quite a bit:

Treesitter Nodes with injected rst
(expression_statement) ; [1:1 - 4:3] python
 (string) ; [1:1 - 4:3] python
  (string_start) ; [1:1 - 3] python
  (string_content) ; [1:4 - 4:0] python
   (paragraph) ; [1:4 - 17] rst
   (paragraph) ; [3:1 - 42] rst
  (string_end) ; [4:1 - 3] python
(class_definition) ; [7:1 - 11:7] python
 name: (identifier) ; [7:7 - 10] python
 body: (block) ; [8:5 - 11:7] python
  (expression_statement) ; [8:5 - 11:7] python
   (string) ; [8:5 - 11:7] python
    (string_start) ; [8:5 - 7] python
    (string_content) ; [8:8 - 11:4] python
     (paragraph) ; [8:8 - 26] rst
     (block_quote) ; [10:5 - 20] rst
      (paragraph) ; [10:5 - 20] rst
    (string_end) ; [11:5 - 7] python

As you can see, the contents of the module docstring were correctly identified as two (paragraph)s. However, the second paragraph of the class docstring was misidentified as a (block_quote) (containing a (paragraph)). This also breaks highlighting for e.g. titles (the underline must start at the beginning of the line), and it causes almost the entire docstring to be highlighted as an rst block quote instead of as a string.

This was also part of the reason why nvim-treesitter removed rst injections from Python docstrings. (The other reason being that rst is not universally used – note however that Markdown docstrings for instance suffer from the exact same issue.)

A possible solution

If my understanding of this machinery is correct (which, admittedly, it may very well not be), there is already a facility that could make the injected language ignore this leading whitespace. When a node that is marked as an injection has child nodes from the original language, the text of those child nodes is not passed to the injected parser. In other words, the Python grammar can "take over" the common leading whitespace by emitting nodes for it, and thus hide it from the injected language.
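To make the idea concrete, here is a minimal Python sketch of the range arithmetic described above. It is only a model, not tree-sitter's actual implementation: `injected_ranges` is a hypothetical helper, the offsets are character offsets into the class docstring written as a standalone string, and the two "indentation" child ranges are nodes the Python grammar does not emit today.

```python
def injected_ranges(parent, children):
    """Return the parts of `parent` that are not covered by any child range.

    Ranges are half-open (start, end) offsets. Children are assumed to lie
    inside the parent and not to overlap each other.
    """
    ranges = []
    cursor = parent[0]
    for start, end in sorted(children):
        if start > cursor:
            ranges.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < parent[1]:
        ranges.append((cursor, parent[1]))
    return ranges


# The class docstring from the sample file, including its indentation.
source = '    """A demo class.\n\n    It does nothing.\n    """\n'

string_content = (7, 47)              # text between the triple quotes
indentation = [(22, 26), (43, 47)]    # hypothetical nodes for the leading spaces

ranges = injected_ranges(string_content, indentation)
visible = "".join(source[start:end] for start, end in ranges)
print(ranges)
print(repr(visible))
```

Under these assumptions the injected parser would only see the unindented text, so the second paragraph would start in the first column.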

This is also how injections nested within markdown block quotes can work seamlessly. The > prefix is correctly identified (and highlighted) as part of the block quote, and the injected Python parser doesn't get confused into producing (ERROR) nodes:

some text

> ```python
> class Injected:
>     python = True
> ```

more text

(Evidently, GitHub is currently not using tree-sitter to parse this.)

:InspectTree output
(section) ; [1:1 - 9:0] markdown
 (paragraph) ; [1:1 - 2:0] markdown
  (inline) ; [1:1 - 9] markdown
 (block_quote) ; [3:1 - 7:0] markdown
  (block_quote_marker) ; [3:1 - 2] markdown
  (fenced_code_block) ; [3:3 - 7:0] markdown
   (fenced_code_block_delimiter) ; [3:3 - 5] markdown
   (info_string) ; [3:6 - 11] markdown
    (language) ; [3:6 - 11] markdown
   (block_continuation) ; [4:1 - 2] markdown
   (code_fence_content) ; [4:3 - 6:2] markdown
    (class_definition) ; [4:3 - 5:19] python
     name: (identifier) ; [4:9 - 16] python
     body: (block) ; [5:7 - 19] python
      (expression_statement) ; [5:7 - 19] python
       (assignment) ; [5:7 - 19] python
        left: (identifier) ; [5:7 - 12] python
        right: (true) ; [5:16 - 19] python
    (block_continuation) ; [5:1 - 2] markdown
    (block_continuation) ; [6:1 - 2] markdown
   (fenced_code_block_delimiter) ; [6:3 - 5] markdown
 (paragraph) ; [8:1 - 9:0] markdown
  (inline) ; [8:1 - 9] markdown

You can see the markdown parser emitting multiple (block_continuation) nodes, which consume the leading > characters (columns 1-2 of lines 4-6). In my opinion, leading whitespace in Python multi-line strings is roughly on the same level. In most cases, if it's there, it's because it looks better (everything neatly indented), and either it doesn't matter (because e.g. docstring tooling cleans it up automatically) or it's stripped explicitly with something like textwrap.dedent(). And if it does end up being significant to the injected language after all, it can still be included by setting the injection.include-children flag on the capture (in Neovim, that is, although I'd be surprised if other editors didn't have something similar).
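As a short illustration of the two kinds of clean-up mentioned above, the following sketch uses only the standard library. `inspect.cleandoc()` is the kind of normalization docstring tooling typically applies, and `textwrap.dedent()` is the explicit variant. The sample strings are made up for this example. Note that `textwrap.dedent()` only helps when the first line is indented as well, which is why the second string starts with a line continuation.

```python
import inspect
import textwrap

# Raw docstring text as it appears between the quotes in the source file:
# the first line is unindented, the following lines carry four spaces.
raw_docstring = "A demo class.\n\n    It does nothing.\n    "

# cleandoc() strips the common indentation of all lines after the first
# and drops leading and trailing blank lines.
cleaned = inspect.cleandoc(raw_docstring)

# dedent() removes whitespace common to *all* lines, so the text has to
# start on its own (indented) line for it to have any effect.
block = """\
    A demo paragraph.

    It does nothing.
    """
dedented = textwrap.dedent(block)

print(repr(cleaned))
print(repr(dedented))
```

In both cases the indentation is gone before the text is interpreted, which is the behaviour an injected parser would ideally see as well.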

Obviously, all this is just wishful thinking on my part so far. What do you think about this idea? Does it make sense in the broader picture to handle this leading whitespace in the python parser? And is this even possible to implement (reasonably efficiently) at all?

amaanq commented 1 year ago

Sorry, but trimming the spaces won't happen here. Perhaps a custom predicate can be used to trim the spaces and set the ranges for parsing accordingly.
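What such a predicate would have to compute can be sketched in Python. This is only the range calculation, under stated assumptions: `dedented_ranges` is a hypothetical helper and not part of any editor's API, offsets are character offsets (equal to byte offsets for ASCII text), and an actual predicate or directive would have to be written in the editor's own extension language.

```python
def dedented_ranges(source, start, end):
    """Split source[start:end] into per-line ranges that skip the common indent.

    The first line is kept as-is because it starts right after the opening
    quotes. Lines that consist only of indentation produce no range.
    """
    lines = source[start:end].splitlines(keepends=True)

    def indent_of(line):
        return len(line) - len(line.lstrip(" \t"))

    # Common indentation of the non-blank continuation lines.
    margin = min((indent_of(l) for l in lines[1:] if l.strip()), default=0)

    ranges = []
    offset = start
    for index, line in enumerate(lines):
        skip = 0 if index == 0 else min(margin, indent_of(line))
        if len(line) > skip:
            ranges.append((offset + skip, offset + len(line)))
        offset += len(line)
    return ranges


# The class docstring from the sample file; (7, 47) is its string_content.
source = '    """A demo class.\n\n    It does nothing.\n    """\n'
ranges = dedented_ranges(source, 7, 47)
trimmed = "".join(source[a:b] for a, b in ranges)
print(ranges)
print(repr(trimmed))
```

The resulting ranges could then be handed to the injected parser in place of the single range of the string contents.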