tree-sitter / tree-sitter-haskell

Haskell grammar for tree-sitter.
MIT License
155 stars 36 forks source link

rewrite the grammar #29

Closed tek closed 3 years ago

tek commented 3 years ago

hello :wave: I rewrote the grammar and it's working quite nicely. There's still some stuff to do, but at this point I'm opening this PR to get some advice on one specific problem that seems impossible to me.

The issue is preprocessor macros, as in:

f = do
  do a
#ifdef foo
  a
#else
    a <- a
  a
#endif

Here the block inside of the #ifdef ends the current rule, but in the #else it starts inside of that rule again. I can keep track of the previous state in the scanner, but I don't know how to deal with that for the grammar.

Is there some way to reset the parser to a previous state? I looked at the C API but didn't find anything suitable.

tek commented 3 years ago

@rewinfrey @maxbrunsfeld @philipturnbull

ahlinc commented 3 years ago

Is there some way to reset the parser to a previous state? I looked at the C API but didn't find anything suitable.

As I understand tree-sitter's behavior there is no such option for manual controlling the parser's state. The parser makes this automatically and there only one option to correctly save the external scanner state through serialization and restore through deserialization. In the deserialization moment the external scanner gets the buffer with a state to which the parser decided to make the restore.

tek commented 3 years ago

hm, bummer. I looked at how the C grammar handles #ifdefs, but it's not apparent to me that there is a similar situation.

tek commented 3 years ago

I naively imagine functionality in tree-sitter.h where I can call get_current_state and receive the state number, then later call shift(old_state) and it will magically continue parsing :innocent: guess there's more to it than that

ahlinc commented 3 years ago

@tek regarding this PR contents:

$ git fetch origin refs/pull/29/head
...................................................................
 * branch            refs/pull/29/head -> FETCH_HEAD

$ git log --stat FETCH_HEAD
commit 5b7947c562b9b2b5d844c04351eacc331a6bbb86

    structure grammar into multiple files

 grammar.js          |   1313 +-
 grammar/basic.js    |     98 +
 grammar/class.js    |     99 +
 grammar/data.js     |    147 +
 grammar/decl.js     |     81 +
 grammar/exp.js      |    241 +
 grammar/id.js       |    126 +
 grammar/import.js   |     30 +
 grammar/misc.js     |     12 +
 grammar/module.js   |     37 +
 grammar/pat.js      |     95 +
 grammar/pattern.js  |     32 +
 grammar/type.js     |    201 +
 grammar/util.js     |     66 +
 src/grammar.json    |   7391 +++--
 src/node-types.json |     17 +-
 src/parser.c        | 298845 +++++++++++++++++++---------------------------------
 17 files changed, 154611 insertions(+), 154220 deletions(-)

commit 62d7ebf592a9491d4d44273fceb1390b85d148b1 (refs/pull/29/head)

    rewrite the grammar

...................................................................

  63 files changed, 296573 insertions(+), 484667 deletions(-)

With such amount of changes it's too hard to understand what is an adaptation of the repo to the current tree-sitter version and what is improvements of the grammar. Also splitting the grammar.js to several files in my opinion is controversial change. For better chances to successful approval of changes to this repo by maintainers I'd suggest to split the work into several PRs:

  1. Remove tree-sitter generated stuff from the repo, there are plans to do this in the tree-sitter repo, issue 930:

    Cleanup - Remove generated files from all the grammar repos in the tree-sitter org

  2. Adapt the repo to the current tree-sitter version, move corpus to test/corpus, etc.
  3. Improvements to the grammar.js and related files.
tek commented 3 years ago

@ahlinc it's a complete rewrite, I wouldn't know how to split this according to your suggestion.

  1. got it!
  2. since there is no overlap with the original tests, how would that look?

Why is it controversial to split the files?

ahlinc commented 3 years ago

Here the block inside of the #ifdef ends the current rule, but in the #else it starts inside of that rule again. I can keep track of the previous state in the scanner, but I don't know how to deal with that for the grammar.

@tek I'm not Haskell expert so if you can give here a minimal correct version of a Haskell program with such preprocessor construction, may be I'll been able to help you with a way how to handle your preprocessor issue.

tek commented 3 years ago
{-# language CPP #-}

f :: IO Int
f = do
  _ <- do
    pure ()
#ifdef foo
  pure 1
#else
    pure ()
  pure 1
#endif

@ahlinc this should compile!

ahlinc commented 3 years ago

2. since there is no overlap with the original tests, how would that look?

Didn't know that, it's hard to understand in the huge PR.

Why is it controversial to split the files?

It's just my opinion, I didn't see such splits in any tree-sitter grammar repos and it's easier for me to navigate in a one file by a word search instead of several files. Also it seems that grammar.js files looks too tricky for IDEs and I don't know any that would completely help with the navigation in them. Use of Typescript for tree-sitter grammars I didn't tried yet.

ahlinc commented 3 years ago

@tek thanks for the program snippet, I'll take a look today or this weekend.

tek commented 3 years ago

I just now split the files after working on the project for a while, because it was quite unergonomic to navigate with your suggested word search, since lots of rules have strings like exp and pat in them. I'm not married to the idea, if it creates problems I'll change it back.

Why do you suspect this will be problematic with IDEs?

What about Typescript? Are you suggesting to use it?

tek commented 3 years ago

@tek thanks for the program snippet, I'll take a look today or this weekend.

awesome, thanks!

tek commented 3 years ago

At least coc-tsserver can navigate to identifiers in the project (though since $ is used everywhere, that's very few)

ahlinc commented 3 years ago

Why do you suspect this will be problematic with IDEs?

For now I use VSCode as the main editor and has good completion in grammar.js for DSL actions but at the same time it has problems in navigation between rules. At this moment I don't know what is the reason, may be I have some IDE misconfiguration or it doesn't understand completely grammar.js constructions.

What about Typescript? Are you suggesting to use it?

I think about it as a way to solve my issues with navigation in the grammar.js files because with Typescript I'll have a way to suggest to IDE about types in a place where it isn't clear for it. As consequence I'll need to call Typscript's tsc compiler before each run of tree-sitter generate.

tek commented 3 years ago

For now I use VSCode as the main editor and has good completion in grammar.js for DSL actions but at the same time it has problems in navigation between rules. At this moment I don't know what is the reason, may be I have some IDE misconfiguration or it doesn't understand completely grammar.js constructions.

Is this a general problem or does it relate to this project, specifically the file separation?

In the former case, I can't imagine how an IDE would be able to navigate to rules since they are universally referenced as attributes of the $ object, which is unknown in the project due to being part of tree-sitter-cli's machinery.

I think about it as a way to solve my issues with navigation in the grammar.js files because with Typescript I'll have a way to suggest to IDE about types in a place where it isn't clear for it. As consequence I'll need to call Typscript's tsc compiler before each run of tree-sitter generate.

So you would give the $ a type? Could you make that refer to top level attributes?

tek commented 3 years ago

What do I need to configure for tree-sitter test finding the language for a highlights test? I placed a haskell file in test/highlight and added this to package.json:

  "tree-sitter": {
    "scope": "source.hs",
    "file-types": [
      "hs"
    ],
    "highlights": [
      "queries/highlights.scm"
    ],
    "injection-regex": "^(hs|haskell)$"
  }

but it says No language found for path.

maxbrunsfeld commented 3 years ago

It might have to do with finding the parser repo itself? https://tree-sitter.github.io/tree-sitter/syntax-highlighting#paths

This UX could probably be made smoother.

tek commented 3 years ago

@maxbrunsfeld I have

  "parser-directories": [
    "/home/tek/code/tek/js"
  ],

in my config, and the repo path is /home/tek/code/tek/js/tree-sitter-haskell. that should be correct, no?

ahlinc commented 3 years ago

About processing preprocessor directives

I was able to break mentioned https://github.com/tree-sitter/tree-sitter-c C language parser with a simple C program where a directive splits function call rule. @tek this looks the same type of problem that you are trying to solve in this Haskell parser. Example:

#include <stdio.h>

#define FOO

int main(int argc, char const *argv[])
{
#ifdef FOO
    printf
#endif
    ("Hello world!\n");
    return 0;
}

With current tree-sitter abilities it looks for me impossible to implement correct tree-sitter parser in a one tree-sitter's language defined by a one grammar.js file that would be able to process all possible combinations of rules and their splits by preprocessor rules. Even languages like C/C++ don't process preprocessor directives in the one parser, they splits such processing into two independent phases. The C/C++ preporcessor has one important feature that it decides that preprocessor directive in everything till the end of line and when preprocessor replaces condition directives it replaces them with \n so it's impossible to make split of the grammar token what simplifies situation a bit, but I can imagine a language where some preprocessor directives may split anything, any token that consist more than a one character.

What I can suggest with current tree-sitter abilities is to:

  1. Consider such Haskell code as multi language code: https://tree-sitter.github.io/tree-sitter/using-parsers#multi-language-documents
  2. Don't try to parse preprocessor directives in a one tree-sitter language definition. It seems impossible for now to write correct parser implementation and external scanners can't help in any way.
  3. Implement preprocessor directives processing in a separate parser like this implemented in real C/C++ compilers.
  4. Separate preprocessor parser may help to make decision what code blocks are valid for some program state or detect code blocks and produce their combinations that may be parsed separately.

All above is just my opinion and understanding how tree-sitter works and what limitations it has for now. Also I believe that it's possible to improve tree-sitter so it will be able to parse even such code with preprocessor constructions.

tek commented 3 years ago

I had the same impression about the C grammar. Thanks for your suggestion with the multi-language feature, I'll try to implement something with that.

Howevery, I'd suggest to not make this part of this PR.

@maxbrunsfeld do you think it would be feasible to add functionality to the scanner API (or grammar) with which a state shift could be transparently enforced in order to reset to a previous preprocessor directive? Any other additional information that might help?

maxbrunsfeld commented 3 years ago

I think we’re always going to have to treat the preprocessor in an approximate way. It’s ok IMO if certain preprocessor patterns cause parse errors. Hopefully the errors will be recoverable and we’ll get useful results the vast majority of the time.

tek commented 3 years ago

@maxbrunsfeld so if the parser would execute a shift from an external stimulus, would it corrupt its overall state, or the parse tree?

tek commented 3 years ago

those named precedences are quite the game changer!

tek commented 3 years ago

question: what is the general policy when one rule can be substituted for another, but is more restrictive? Like in this case, the simpletype rule can be parsed by the btype rule, but not all parts of the latter are legal in the positions where the former is used. The parse tree is identical, so is it praxis to use the more general rule, to avoid conflicts and be more permissive? or is it better to map the language precisely?

maxbrunsfeld commented 3 years ago

what is the general policy when one rule can be substituted for another, but is more restrictive?

I think that it is ok to parse a superset of the language if it helps you to avoid conflicts, or makes the grammar simpler. Sometimes I look at the generated parser's STATE_COUNT also, to see which option will produce a smaller binary.

tek commented 3 years ago

alright, all the TODOs are fixed, the realworld test projects that were included in the original now parse 100%! @maxbrunsfeld Please advise what else is needed for this to be merged.

I implemented the preprocessor directives so that the code in #else blocks is just consumed by the cpp rule. I'll need to look deeper into how solution proposed by @ahlinc woud work, but not for this PR.

No idea what's going on with the travis config, does someone know how it was supposed to work?

tek commented 3 years ago
1. **Unabbreviated Naming**

I saw that it was like that in the original, but I found it quite hard to parse, especially in the tests. There is a huge number of similarly named constructs like qualified_symbolic_type_constructor, is it really ergonomic to have those in parse trees? I haven't used tree-sitter for anything besides trying out the highlighting in neovim, so I can't relate to how it's commonly used.

I have quite a few inconsistencies in the structural node design, for example for promoted types, which can apply to a number of rules, I used the alias promoted as a wrapper for the actual node, as in (promoted (type_literal (con_unit))) – as opposed to promoted_con_unit. Would that be sensible for others like qualified identifiers, as in (qualified (modid) (varid))?

2. **Anonymous Operators**

I see, so those aren't useful to have named?

3. **Avoiding Excessive Wrapping** -
   * We'd often define something like `modid` (which I would suggest renaming to `module_identifier`) 
      as an _alias_ of `conid`, instead of its own parent rule.

Even if the alias has to be used in multiple locations?

   * We would try to structure the grammar so that we only create a `qtyconsym` (`qualified_type_constructor`?)
      if the symbol actually _is_ qualified (e.g. it has a parent module and a `.`).

that's already the case in some rules, I'll fix any that are remaining:

tyconsym: $ => $._tyconsym,
qtyconsym: $ => qualified($, $.tyconsym),
_qtyconsym: $ => choice($.qtyconsym, $.tyconsym),   

thanks for the feedback!

maxbrunsfeld commented 3 years ago

I see, so those aren't useful to have named?

Even if they aren't named, they are still accessible in the tree, and you can still match on them with queries. I think it's just simpler to have fewer total names. There's something self-describing about the node's type just being =.

Even if the alias has to be used in multiple locations?

Yeah, good question. To deal with this, we sometimes use a pattern like this:


inlines $ => [
  $._module_identifier,
],

rules: $ => {
  // Create an "inlined" rule that expresses the alias
  _module_identifier: $ => alias($.constructor_identifier, $.module_identifier),

  // Then, everywhere else, we refer to it by ☝️  this rule:
  rule_a: $ => seq(
    // ...
    $._module_identifier,
    // ...
  ),

  rule_b: $ => choice(
    // ...
    $._module_identifier,
    // ...
  )
}
tek commented 3 years ago

tests now run fine on github actions!

tek commented 3 years ago

@maxbrunsfeld regarding the identifiers:

  1. Is the goal to have tree nodes that are understandable for an end user looking at the tree? If not, what is the intended UX of the javascript identifiers for a developer?
  2. Some of the hidden rules, like _aexp, are named exactly like in the reference doc, in order to easily find the corresponding production rules there when debugging. Would you agree that this helpful to keep? https://www.haskell.org/onlinereport/haskell2010/haskellch10.html
  3. Would a mix of "using names from the spec" and "using natural language" be acceptable? Something like:
  _tyconid: $ => alias($.conid, $.type_constructor),
  qualified_type_constructor: $ => qualified($, $._tyconid),
  _qtycon: $ => choice($.qualified_type_constructor, $._tyconid),

  type_operator: $ => $._tyconsym,
  qualified_type_operator: $ => qualified($, $.type_operator),
  _qtyconsym: $ => choice($.qualified_type_operator, $.type_operator),
  1. In a case like import Foo (..), the two dots represent "all names", as opposed to import Foo(), "only instances"; but omitting the dots from the tree would result in identical nodes. I would alias the dots as all_names, is that acceptable?
  2. Personally I would consider pattern_list, pattern_strict to be more readable than list_pattern, strict_pattern, since it allows me to parse the "category" of the nodes first, especially when there are many of them in succession, since it forms a regular pattern. What are your thoughts?
googleson78 commented 3 years ago

This is just my personal opinion, but the long names would only be useful for me when I'm only beginning with the grammar.

Anything past that, they would harm readability - when I start working with something my brain falls into "pattern recognition mode", instead of "I'm genuinely reading and understanding what type_family_instance_declaration means every time", and the long names hurt this - it takes more time to parse and get to the actually significant of the name.

Furthermore, the generated trees can get quite big and hard to trace, at least for me personally.

I think some abbreviated forms with a "abbreviations guide" intended for newcomers is a good compromise in this area.

(notably I haven't had much experience, so maybe I'm missing some crucial benefit to the long names)

tek commented 3 years ago

I added some aliases and renames for tyconid, conid and top level declarations. Is this pattern more desirable?

patrickt commented 3 years ago

@tek I’m a +1 on more human-targeted names for the public interface and using the Report’s grammar names for the hidden fields. Nice one 👍🏻

tek commented 3 years ago

sounds like a compromise!

patrickt commented 3 years ago

Interestingly enough, I can’t get this to build via npm install on macOS… mysterious errors coming out of the C++ toolchain, even when specifying CXX=g++-9 or -10. Curiously, tree-sitter test and tree-sitter parse work fine.

patrickt commented 3 years ago

Really thrilled that this parses code written with -XBlockArguments. Amazing!

tek commented 3 years ago

Really thrilled that this parses code written with -XBlockArguments. Amazing!

that was a tricky one :sweat_smile:

tek commented 3 years ago

Interestingly enough, I can’t get this to build via npm install on macOS… mysterious errors coming out of the C++ toolchain, even when specifying CXX=g++-9 or -10. Curiously, tree-sitter test and tree-sitter parse work fine.

I had to move to GA because I couldn't get it to build on Travis. I'll try adding a macos action.

patrickt commented 3 years ago

@tek Moving to GA is the right move anyway, since Travis is imposing OSS limits, and GA is better anyway.

patrickt commented 3 years ago

I think the problem with npm install on macOS may be a result of libc++ vs libstdc++, but I’m not sure.

tek commented 3 years ago

it also happens in GA:

g++-9: error: unrecognized command line option '-stdlib=libc++'

any idea who sets this option?

patrickt commented 3 years ago

I think that’s coming from somewhere inside Node or node-gyp. What’s really confusing is that using clang++ --std=c++17 produces, on my machine, errors that come from inside the functional header itself:

/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/functional:1478:53: error: use of overloaded operator '!' is ambiguous
      (operand type 'const function<bool (State)>')
bool __not_null(function<_Fp> const& __f) { return !!__f; }
                                                    ^~~~
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/functional:1772:25: note: in instantiation of function template specialization
      'std::__1::__function::__not_null<bool (State)>' requested here
        if (__function::__not_null(__f))
                        ^
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/functional:1795:11: note: in instantiation of function template specialization
      'std::__1::__function::__value_func<bool (const State &)>::__value_func<std::__1::function<bool (State)>, std::__1::allocator<std::__1::function<bool (State)> > >' requested here
        : __value_func(std::forward<_Fp>(__f), allocator<_Fp>()) {}
          ^
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/functional:2491:50: note: in instantiation of function template specialization
      'std::__1::__function::__value_func<bool (const State &)>::__value_func<std::__1::function<bool (State)>, void>' requested here
function<_Rp(_ArgTypes...)>::function(_Fp __f) : __f_(_VSTD::move(__f)) {}
                                                 ^
../src/scanner.cc:685:69: note: in instantiation of function template specialization 'std::__1::function<bool (const State &)>::function<std::__1::function<bool (State)>, void>' requested here
Parser either(bool c, Parser match, Parser nomatch) { return either(const_<State>(c), match, nomatch); }
                                                                    ^
../src/scanner.cc:304:11: note: candidate function
Condition operator!(const Condition & c) { return [=](auto state) { return !c(state); }; }
          ^
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/functional:1478:53: note: built-in candidate operator!(_Bool)
bool __not_null(function<_Fp> const& __f) { return !!__f; }
                                                    ^
tek commented 3 years ago

:dizzy_face:

tek commented 3 years ago

oh, because I overload operator!

tek commented 3 years ago

guess it only works for c++14 like that :(

tek commented 3 years ago

I'll use a function instead

patrickt commented 3 years ago

This isn’t in your code though, right? It looks like it’s my headers’ problem

patrickt commented 3 years ago

But I can use <functional> from an example test file without my headers dying… hmm 🤔

tek commented 3 years ago

I'd say that when I override operator!(function<>), it will also be used in the stdlib when using a template from my code

tek commented 3 years ago

like an orphan instance colliding with the default :smile: