tlaplus-community / tree-sitter-tlaplus

A tree-sitter grammar for TLA⁺ and PlusCal
MIT License
57 stars 10 forks source link

PlusCal support #45

Closed susliko closed 2 years ago

susliko commented 2 years ago

https://github.com/tlaplus-community/tree-sitter-tlaplus/issues/31

Block comment detection is altered from the external parser to the solution proposed in this comment. This allows for straightforward implementation of pluscal support.

@ahelwer, would be happy to hear from you

If we agree on this approach I'll proceed with docs, tests and syntax highlighting

Example:

---- MODULE simple ----

EXTENDS TLC, Integers

(* *)   
(* comment *)      
     (* comment *)      
     (* comment *)      
     (* comment *)      
(* comment *)      
(* comment *)      
(* comment *)      
(* comment *)      

(* another comment *)
(*
just
text
(* this (* is a nested (* comment *) *)  *)

--algorithm simple
variables alice_account = 10, bob_account = 10, money = 5;
(* comment inside algorithm *)
begin
A: alice_account := alice_account - money;
B: bob_account := bob_account + money;
end algorithm; 
text
after algorithm
*)

====

Tree-sitter-output:

(source_file [0, 0] - [34, 0]
  (module [0, 0] - [32, 4]
    (header_line [0, 0] - [0, 4])
    name: (identifier [0, 12] - [0, 18])
    (header_line [0, 19] - [0, 23])
    (extends [2, 0] - [2, 21]
      (identifier_ref [2, 8] - [2, 11])
      (identifier_ref [2, 13] - [2, 21]))
    (block_comment [4, 0] - [12, 13]
      (block_comment_text [4, 2] - [12, 10]))
    (block_comment [15, 0] - [30, 2]
      (block_comment_text [15, 2] - [18, 4])
      (block_comment [19, 0] - [19, 43]
        (block_comment_text [19, 2] - [19, 7])
        (block_comment [19, 7] - [19, 39]
          (block_comment_text [19, 10] - [19, 22])
          (block_comment [19, 22] - [19, 36]
            (block_comment_text [19, 25] - [19, 33]))))
      (pcal_algorithm [21, 0] - [27, 14]
        algorithm_name: (identifier [21, 12] - [21, 18])
        (pcal_var_decls [22, 0] - [22, 58]
          (pcal_var_decl [22, 10] - [22, 29]
            (pcal_variable [22, 10] - [22, 23]
              (identifier [22, 10] - [22, 23]))
            (nat_number [22, 26] - [22, 28]))
          (pcal_var_decl [22, 30] - [22, 47]
            (pcal_variable [22, 30] - [22, 41]
              (identifier [22, 30] - [22, 41]))
            (nat_number [22, 44] - [22, 46]))
          (pcal_var_decl [22, 48] - [22, 58]
            (pcal_variable [22, 48] - [22, 53]
              (identifier [22, 48] - [22, 53]))
            (nat_number [22, 56] - [22, 57])))
        (block_comment [23, 0] - [23, 30]
          (block_comment_text [23, 2] - [23, 27]))
        (pcal_algorithm_body [24, 0] - [26, 38]
          (pcal_stmt [25, 0] - [25, 42]
            pcal_statement: (identifier [25, 0] - [25, 1])
            (pcal_unlabeled_stmt [25, 3] - [25, 42]
              (pcal_assign [25, 3] - [25, 42]
                (pcal_lhs [25, 3] - [25, 16]
                  (pcal_variable [25, 3] - [25, 16]
                    (identifier [25, 3] - [25, 16])))
                (assign [25, 17] - [25, 19])
                (bound_infix_op [25, 20] - [25, 41]
                  lhs: (identifier_ref [25, 20] - [25, 33])
                  symbol: (minus [25, 34] - [25, 35])
                  rhs: (identifier_ref [25, 36] - [25, 41])))))
          (pcal_stmt [26, 0] - [26, 38]
            pcal_statement: (identifier [26, 0] - [26, 1])
            (pcal_unlabeled_stmt [26, 3] - [26, 38]
              (pcal_assign [26, 3] - [26, 38]
                (pcal_lhs [26, 3] - [26, 14]
                  (pcal_variable [26, 3] - [26, 14]
                    (identifier [26, 3] - [26, 14])))
                (assign [26, 15] - [26, 17])
                (bound_infix_op [26, 18] - [26, 37]
                  lhs: (identifier_ref [26, 18] - [26, 29])
                  symbol: (plus [26, 30] - [26, 31])
                  rhs: (identifier_ref [26, 32] - [26, 37])))))))
      (block_comment_text [27, 14] - [29, 15]))
    (double_line [32, 0] - [32, 4])))
ahelwer commented 2 years ago

I'll take a look at this after the 25th hopefully - on holiday at the moment!

ahelwer commented 2 years ago

This approach looks good! Considering how complicated the _expr rule is it makes sense to include the pluscal grammar in this grammar.

susliko commented 2 years ago

@ahelwer I've stumbled upon 2 failing tests with conj(disj)unction lists, which are implemented via external parser. Do you have any ideas on what should be changed to make this work? image

ahelwer commented 2 years ago

I'll take a look at that failing test. This also brings to mind an edge case, which is that:

  1. Block comments can occur inside jlists
  2. PlusCal sections can occur inside block comments
  3. jlists can occur inside PlusCal sections

So I'll have to figure out a way of ensuring that the jlist in (3) doesn't interact weirdly with the jlist in (1).

ahelwer commented 2 years ago

I left a comment on the scanner.cc changes about how the lexing code for block comment tokens (not the full scan_block_comment_text code, but the other stuff) needs to be left in so the non-block-comment external tokens know to bail out when they see a block comment; perhaps that will fix the failing test?

susliko commented 2 years ago

Hm, can't see your comment on scanner.cc changes for some reason Are you sure you've posted it?

ahelwer commented 2 years ago

Doesn't seem like you can link to a review but try scrolling up in this PR. Here's a screenshot, I left a few other comments as well: image

Maybe the force pushes are causing an issue.

susliko commented 2 years ago

There is a 'Pending' comment status on the screenshot. This usually means that the review is not yet submitted (button in the upper right corner) 🤔

ahelwer commented 2 years ago

Well that's confusing. I think it's now in a weird state where it can't be submitted because of the force pushes overwriting commits or something.

Edit: ah no, I was simply confused by github's UI.

ahelwer commented 2 years ago

Fun update - I'm working to get this grammar enabled for highlighting the TLA+ code on github! Plus fuzzy code navigation and such. Your changes will certainly be important for that.

susliko commented 2 years ago

Fun update - I'm working to get this grammar enabled for highlighting the TLA+ code on github! Plus fuzzy code navigation and such. Your changes will certainly be important for that.

I'd love to have a look at your work on enabling this grammar on github 🔥

ahelwer commented 2 years ago

With the latest UT runs it made it all the way to the corpus tests. This is where I run the parser against every file in the https://github.com/tlaplus/examples repo to see whether it passes. Pretty easy to do locally - looks like the pluscal parsing fails on a number of them.

susliko commented 2 years ago

The latest update seems to parse all the specs in the corpus, except for two cases:

  1. disj_list items consuming semicolon at the end of the line:
    await \/ num[nxt] = 0
      \/ <<num[self], self>> \prec <<num[nxt], nxt>> ; \* should not be consumed
  2. Single line comments nested in block comments when the block comment ends at this same line. Nested line comment consumes all characters to the end of the line:
    (* \* *)

btw, I am not allowed to run github workflows in this repo, since I'm a first-time contributor

ahelwer commented 2 years ago

(1) can probably be fixed by a simple change to the external scanner (just have it recognize a ; character then categorize it as an OTHER token). (2) sounds like a puzzle!

Actually for the semicolon issue, what are the desired semantics? Should the semicolon terminate the entire jlist?

susliko commented 2 years ago

Actually for the semicolon issue, what are the desired semantics? Should the semicolon terminate the entire jlist?

Semicolon terminates a PlusCal statement, which contains the jlist. Therefore, it should terminate the jlist. As far as I understand, the lexer should emit the DEDENT token for jlist termination

ahelwer commented 2 years ago

Ah, okay. In that case you want to tokenize the semicolon as Token_TERMINATOR in the external scanner, which will trigger the appropriate jlist handling logic. You can probably figure out how to do this on your own but if you need help I can make the change.

susliko commented 2 years ago

Managed to fix both cases. All the specs in the corpus should be parsed now

ahelwer commented 2 years ago

Brilliant! Would you say this is ready for full review and such, or would you like to make more stylistic changes, write more tests etc.?

Edit: I see a few corpus tests are failing in the CI

susliko commented 2 years ago

I see a few corpus tests are failing in the CI

Indeed. I'll have a look

make more stylistic changes, write more tests etc.

Yeah, there are some tests to be written. I'd also like to explore the idea of getting builtin library operators (like Head and Len from the Sequences) highlighted only when the module EXTENDS this library (there is a draft in nvim's locals.scm 🤯)

ahelwer commented 2 years ago

That highlighting is certainly a possibility, although we can separate it from these changes.

susliko commented 2 years ago

Although the test suite is not yet fully completed, I think the MR is ready to be reviewed 🎉

ahelwer commented 2 years ago

Looks great! Can't wait to see reference-based highlighting working. @hwayne should be interested to hear of this progress.

susliko commented 2 years ago

The test suite is completed now, all comments fixed

ahelwer commented 2 years ago

Great! Feel free to announce this release on the google group thread.

You can also issue a PR against the nvim-treesitter repo and tag me.

susliko commented 2 years ago

@ahelwer Thank you for assisting throughout this work!

ahelwer commented 2 years ago

@susliko thanks for all the hard work! I'm sure many people will appreciate support for pluscal.

ahelwer commented 2 years ago

@susliko could you please backport the changes to the neovim query files to this repo? Thanks!

susliko commented 2 years ago

Yeah, sure