rewrite the grammar

tek commented 3 years ago

hello :wave: I rewrote the grammar and it's working quite nicely. There's still some stuff to do, but at this point I'm opening this PR to get some advice on one specific problem that seems impossible to me.

The issue is preprocessor macros, as in:

f = do
  do a
#ifdef foo
  a
#else
    a <- a
  a
#endif

Here the block inside of the #ifdef ends the current rule, but in the #else it starts inside of that rule again. I can keep track of the previous state in the scanner, but I don't know how to deal with that for the grammar.

Is there some way to reset the parser to a previous state? I looked at the C API but didn't find anything suitable.

tek commented 3 years ago

try again now

patrickt commented 3 years ago

@tek Works perfectly! Wow, C++ is really something.

tek commented 3 years ago

yeah, I had a blast

tek commented 3 years ago

question. given a type a (A "a"), I get the tree

  (type_apply (type_name (type_variable)) (type_parens (type_apply (type_name (type)) (type_literal (string)))))

Is it helpful to have the wrappers type_name and type_literal? My reasoning behind it was that both can have multiple variants, like a literal can be Symbol or Nat, but a (string) can also occur in expressions etc. So is it useful to be able to disambiguate those, or should that be done via the parent node, which can be type_apply, type_parens and many more?

maxbrunsfeld commented 3 years ago

I think especially for type_literal, it makes sense to add some wrapping. For (type_name (type)), it's less clear to me that the extra wrapping is needed, but I don't fully understand that part of the grammar.

On another note - with rules like these:

  _larrow: _ => '<-',
  _carrow: _ => '=>',
  _lambda: _ => '\\',

☝️ that will actually make the <- token completely invisible from the tree (as opposed to just showing up as an anonymous node).

The current behavior is that when you give a name to a single token like that, Tree-sitter doesn't create a wrapper node (with the named _larrow node containing the anonymous "<-" node). It unifies them, so that _larrow is the terminal. Since underscore-prefixed nodes are hidden, that will make it impossible to target the <- node with a query for e.g. syntax highlighting purposes. What we usually do is just directly use "<-" in the grammar (in place of having any name like _larrow). I think that'll work better for syntax highlighting, and any use case where you'd want to identify the individual operators.

tek commented 3 years ago

> I think especially for type_literal, it makes sense to add some wrapping. For (type_name (type)), it's less clear to me that the extra wrapping is needed, but I don't fully understand that part of the grammar.

So both type and type_variable can occur either in a signature or in their declaration, and I added type_name to signify that the node is in a signature (it's not even consistent, I think, I would have to improve that). I don't know how feasible it would be to query a node based on the wrapping signature and the type_variable name, if the node is nested in other type constructs. Or maybe that isn't necessary at all – as I said, I have no practical experience with querying.

> ☝️ that will actually make the <- token completely invisible from the tree (as opposed to just showing up as an anonymous node).

oh, good to know, thanks!

maxbrunsfeld commented 3 years ago

I'd say leave the type_name stuff as it is; this PR is already a major improvement, and we can always come back and tweak the structure later.

tek commented 3 years ago

sounds good!

maxbrunsfeld commented 3 years ago

I do think that before merging, it'd be worth changing those invisible tokens like _larrow though (to just use the strings directly), but let me know if you feel otherwise.

tek commented 3 years ago

absolutely

maxbrunsfeld commented 3 years ago

I think the way you broke the grammar into distinct files for each section is actually pretty neat and tidy. I might try that on some other big grammars.

patrickt commented 3 years ago

Modulo any suggestions @maxbrunsfeld might have, I think this looks good to me. Thank you very much for all your work on this, @tek, especially the lexer, which is delightfully sophisticated. I propose giving you a commit bit to this repo, unless @maxbrunsfeld objects.

tek commented 3 years ago

very kind, thank you! I'd be happy to keep maintaining the project.

tek commented 3 years ago

I'll ping you once I'm done with the finishing touches

maxbrunsfeld commented 3 years ago

Also, just curious - At a high level, why is the external scanner's state a vector<vector<uint16_t>>, as opposed to a flat vector<uint16_t>?

tek commented 3 years ago

@maxbrunsfeld I added that when I was dealing with the preprocessor directives. Since the indentations would have to be reset on an #else, I just pushed a copy onto the stack on an #if. But I only realized afterwards that the same would have to be done with the external parser state. Thanks for reminding me, this can now be reverted!
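
For readers following along, here is a minimal sketch of the snapshot idea described in this comment. It is not the actual scanner code: the `LayoutState` struct and all of its member names are hypothetical, and it assumes the scanner tracks open layouts as a stack of indentation columns. It covers only the scanner side. As the comment notes, the parser's own state would need the same treatment, which is why the nested vector could be reverted.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical model of the scanner's layout tracking. The names are
// illustrative and not taken from the tree-sitter-haskell scanner.
struct LayoutState {
  // Indentation columns of the currently open layouts, innermost last.
  std::vector<std::uint16_t> indents;
  // One snapshot of `indents` per enclosing #if/#ifdef. This is the
  // vector-of-vectors shape asked about in the thread.
  std::vector<std::vector<std::uint16_t>> saved;

  // A layout keyword such as `do` opens a new layout at `column`.
  void open_layout(std::uint16_t column) { indents.push_back(column); }

  // A line starting at `column` closes every layout indented deeper.
  void dedent_to(std::uint16_t column) {
    while (!indents.empty() && indents.back() > column) indents.pop_back();
  }

  // #if / #ifdef: remember which layouts were open before the branch.
  void on_if() { saved.push_back(indents); }

  // #else: the second branch starts from the same layouts as the first.
  void on_else() {
    if (!saved.empty()) indents = saved.back();
  }

  // #endif: drop the snapshot and keep the state left by the last branch.
  void on_endif() {
    if (!saved.empty()) saved.pop_back();
  }
};
```

With layouts open at columns 2 and 5, calling `on_if()`, then `dedent_to(2)` for the first branch, then `on_else()` brings both layouts back for the second branch. That is the reset an #else needs on the scanner side.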

tek commented 3 years ago

@maxbrunsfeld @patrickt invisible tokens are inlined, double vector is removed, I renamed lots of user-facing nodes and all tests green. if you're satisfied, please merge!

tek commented 3 years ago

:rocket:

maxbrunsfeld commented 3 years ago

Just FYI, I've been doing squash merges on these grammar repos lately, since they contain generated files, to avoid the repo size growing too fast.

Thanks for the awesome work @tek!

tek commented 3 years ago

makes sense. it's been a pleasure!

patrickt commented 3 years ago

Huge thanks, @tek! This is a real step forward for the Haskell ecosystem at large, since this is (I think) the only working GHC Haskell parser outside of GHC itself!

tek commented 3 years ago

omg :joy:

tek commented 3 years ago

@maxbrunsfeld @patrickt Am I supposed to be committing further changes to master?

tek commented 3 years ago

thanks!

rewinfrey commented 3 years ago

Thank you so much @tek for the amazing rewrite! This is huge for the Haskell community, wonderful work 👏 ❤️

tek commented 3 years ago

@rewinfrey very kind, thank you! :heart:

patrickt commented 3 years ago

@tek Re. master: it’s your call. No one’s consuming this repository as of yet, so I don’t see any huge problem with pushing small fixes directly to master. Bigger features etc. are nice to have as PRs.

tek commented 3 years ago

@patrickt sure thing, I was mainly asking whether I'm permitted!

patrickt commented 3 years ago

Yup! I’ve given you maintainer privileges, so you should be able to do most things. Give me a shout if you need anything.

tek commented 3 years ago

will do, thanks!

felixroos commented 10 months ago

Hello everyone, I have just found this thread after unsuccessfully trying to use the https://www.npmjs.com/package/tree-sitter-haskell package, which had its latest publish 5 years ago. Would it be possible to publish a new version containing the changes in this PR? Or would this involve additional work that is specific to the npm package?

tek commented 10 months ago

@maxbrunsfeld you wanna add my account to that package's maintainers?

felixroos commented 10 months ago

@tek thanks!

For anyone landing here: There is also a prebuilt wasm file in this repo. It can be re-built via tree-sitter build-wasm . When I build it on my machine, the result is 4.5MB, while the prebuilt version is 3MB, so the prebuilt file might not be the newest version?

FYI I built this visualizer: https://felixroos.github.io/haskell-tree-sitter-playground/

tek commented 10 months ago

very nice!

lancejpollard commented 9 months ago

@felixroos So as per the linked issue above, can we somehow npm install this package these days? Thanks!

tree-sitter / tree-sitter-haskell

rewrite the grammar #29