tlaplus-community / tree-sitter-tlaplus

A tree-sitter grammar for TLA⁺ and PlusCal
MIT License

Support nested block comments - requires external scanner #15

Closed by ahelwer 3 years ago

ahelwer commented 3 years ago

TLA+ has a feature where you can nest block comments:

---- MODULE Test ----
(* this
  (* is
    (* a *)
  *) nested
  (* block *)
comment *)
====

This is useful for PlusCal, and also for commenting out large sections of code that already contain block comments. Tree-sitter can get close to parsing this as follows:

block_comment: $ => seq(
  '(*', repeat(/([^(*]|\([^*]|\*[^)])*/), '*)'
),

where block_comment is an extra. Since arbitrary extras can appear between the tokens of that sequence (which is itself an extra), nested block comments (even with siblings!) parse quite nicely. The key requirement is that the middle regex must not consume the final *) token or a nested (* token. Unfortunately, this sinks the entire ship: no matter how tricky you make the middle regex, it fundamentally needs two characters of lookahead to tell whether it should stop consuming. The regex above is defeated by the strings "text ((*" and "text **)". In the first it consumes (( as a pair and so never sees the (* opener; in the second it consumes ** as a pair and so never sees the *) closer.

Things would be easy if you could just capture the final tokens! Consider this DFA (created with this tool), where ^ stands for any character other than "(", "*", or ")":

[image: DFA diagram]
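The two counterexamples can be checked outside tree-sitter. The following is a minimal sketch in plain JavaScript, assuming that anchoring the middle regex at the start of the input is a fair stand-in for how the lexer consumes a token; the helper name consumed is made up for illustration.

```javascript
// The middle regex from the block_comment rule, anchored at the start of the
// input so that it behaves like a token being consumed from the current position.
const middle = /^([^(*]|\([^*]|\*[^)])*/;

// Hypothetical helper: returns the prefix of `input` that the regex consumes.
// The regex can match the empty string, so match() never returns null here.
function consumed(input) {
  return input.match(middle)[0];
}

// Well-behaved case: the regex stops right before the nested opener.
console.log(JSON.stringify(consumed("text (* inner")));   // "text "

// Failure 1: (( is consumed as a pair, so the (* opener is swallowed.
console.log(JSON.stringify(consumed("text ((* inner")));  // "text ((* inner"

// Failure 2: ** is consumed as a pair, so the *) closer is swallowed.
console.log(JSON.stringify(consumed("text **) after")));  // "text **) after"
```

In both failing cases the regex runs straight past a delimiter that should have ended the token, which is the lookahead problem described above.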

Even though this is only LR(2) instead of LR(1), I believe the only way to accomplish our goal is to move that regex into the external scanner. See discussion: https://github.com/tree-sitter/tree-sitter/discussions/1252
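For illustration, here is roughly the logic such a scanner would need: a depth counter plus a two-character peek at each position. This is only a sketch in plain JavaScript to match the grammar snippet above. A real tree-sitter external scanner is written against tree-sitter's scanner interface rather than like this, and the function name scanBlockComment is hypothetical.

```javascript
// Hypothetical helper, not part of tree-sitter. Returns the index just past
// the block comment that starts at `start`, or -1 if no comment starts there
// or the comment is never closed.
function scanBlockComment(text, start = 0) {
  if (!text.startsWith("(*", start)) return -1;
  let depth = 0;
  let i = start;
  while (i < text.length) {
    if (text.startsWith("(*", i)) {
      // Opener (including the outermost one): go one level deeper.
      depth += 1;
      i += 2;
    } else if (text.startsWith("*)", i)) {
      // Closer: come back up one level; at depth 0 the comment is complete.
      depth -= 1;
      i += 2;
      if (depth === 0) return i;
    } else {
      // Any other character, including a lone ( or *, is comment text.
      i += 1;
    }
  }
  return -1; // ran out of input before the outermost comment closed
}

// The two strings that defeat the regex are handled correctly here.
const source = "(* a ((* b **) c *) tail";
const end = scanBlockComment(source);
console.log(JSON.stringify(source.slice(0, end))); // "(* a ((* b **) c *)"
```

Because the delimiters are checked two characters at a time before falling back to a single-character step, (( and ** no longer hide a delimiter that follows them.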