tek closed this 6 months ago
I would also prefer to change the workflow so that parser.c is only generated on a release branch, not for every commit, like some other grammars do it.
In this case, please at least commit the grammar.json, which is much smaller and has meaningful diffs. This will allow people to generate the parser without having to use node.
Not sure about the wasm artifact – is it still necessary to use that patch included in the Makefile?
You can use tree-sitter build --wasm with the latest version. Please upload that as a release artifact as well.
Important question: will queries remain compatible?
In this case, please at least commit the grammar.json, which is much smaller and has meaningful diffs. This will allow people to generate the parser without having to use node.
Ah, didn't know that. How do you generate the parser from that file?
You can use tree-sitter build --wasm with the latest version. Please upload that as a release artifact as well.
That's how we're building it, but @wenkokke added a patch for web-tree-sitter for some reason I don't recall.
I regenerated the .wasm file and included it in the PR. Are you talking about Github release artifacts?
Important question: will queries remain compatible?
Some of them will, but how contained the breakage is will depend on the results of the survey. I think most expression queries will still work, but types and top-level structures are likely to be affected.
Ah, didn't know that. How do you generate the parser from that file?
Just tree-sitter generate src/grammar.json (little known fact).
That's how we're building it, but @wenkokke added a patch for web-tree-sitter for some reason I don't recall.
Possibly because the output dir has changed? But that should be fixed upstream.
I regenerated the .wasm file and included it in the PR. Are you talking about Github release artifacts?
Yes, the Github release artifacts (only; it shouldn't be committed to the repo).
Some of them will, but how contained the breakage is will depend on the results of the survey. I think most expression queries will still work, but types and top-level structures are likely to be affected.
Thanks for the heads-up. I hope the queries here can serve as a guide for nvim-treesitter.
Wow @tek this sounds incredible, clearly a labour of love, and I'm hugely grateful that you undertook such a huge amount of work for the community. The sheer quantity of listed improvements is astounding.
I'll try my best to look over it, but given the large amount of grammar changes, and that I've mostly just dealt with the scanner in the past, you're clearly the one who needs to decide when this is ready :)
🚀 🚀 🚀
That's how we're building it, but @wenkokke added a patch for web-tree-sitter for some reason I don't recall.
The reason is that web-tree-sitter doesn't support all of libc, and there's a file in the web-tree-sitter source that determines exactly what symbols are supported.
Please test the WASM build before merging. Tests using only the standard testing infrastructure sadly aren't reliable.
That is an upstream issue, surely? The point of wasm is to be independent of both the source and the consumer. If something isn't working that should be working, tree-sitter should know about it rather than band-aiding it?
@414owen thanks so much! I definitely went a bit overboard with the obsession over the last few weeks 😁 I don't think it's necessary to review the scanner in detail – running that fuzzer should be enough to complement the tests.
Please test the WASM build before merging. Tests using only the standard testing infrastructure sadly aren't reliable.
Ah, I'll make sure to bring the wasm build and release up to speed then
FYI @wenkokke it appears that the file with the exports was refactored in https://github.com/tree-sitter/tree-sitter/commit/d2900510f61b814e71ab5e3e0d8b59be2d276a6e, now not containing standard library symbols anymore
edit: Ah! It's in here now, and appears to contain realloc, which your patch added. So we're good?
That is an upstream issue, surely? The point of wasm is to be independent of both the source and the consumer. If something isn't working that should be working, tree-sitter should know about it rather than band-aiding it?
This isn't a WASM issue, but rather an emscripten feature that lets you not ship the entirety of libc and libc++ for your web app.
That means that tree-sitter, as the entry point, bundles libc and libc++ and must pick which parts to include. They generally pick only those parts they need for their library itself to run.
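For illustration only – a minimal, hypothetical sketch (Buffer and buffer_push are made-up names, not code from this repository) of the kind of scanner helper that creates such a dependency: a serialization buffer grown with realloc compiles fine natively, but in the wasm build the call can only be resolved if the host module (web-tree-sitter) exports that libc symbol.

```c
// Hypothetical sketch; not the actual tree-sitter-haskell scanner code.
#include <stdlib.h>
#include <string.h>

typedef struct {
  char *data;
  size_t len;
  size_t cap;
} Buffer;

// Grows the buffer with realloc; in the wasm build this only links if the
// embedding module exports `realloc` to the grammar's wasm module.
static void buffer_push(Buffer *buf, const char *bytes, size_t n) {
  if (buf->len + n > buf->cap) {
    size_t new_cap = (buf->len + n) * 2;
    char *grown = realloc(buf->data, new_cap);
    if (grown == NULL) return;  // sketch: real code would handle allocation failure
    buf->data = grown;
    buf->cap = new_cap;
  }
  memcpy(buf->data + buf->len, bytes, n);
  buf->len += n;
}
```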
FYI @wenkokke it appears that the file with the exports was refactored in https://github.com/tree-sitter/tree-sitter/commit/d2900510f61b814e71ab5e3e0d8b59be2d276a6e, now not containing standard library symbols anymore
edit: Ah! It's in here now, and appears to contain realloc, which your patch added. So we're good?
Probably, but you've rewritten the C code, so if you depend on anything new we might need the infrastructure again. Though it's more advisable to just remove functions that aren't supported.
They generally pick only those parts they need for their library itself to run.
Not true; they also expose (more and more) for external scanners to use. Hence my recommendation to raise an upstream issue.
Important question: will queries remain compatible?
@clason Since I've worked on a lot of the queries in nvim-treesitter, and also maintain neotest-haskell, which also uses tree-sitter queries, I'll be taking part in the survey. Personally, I'm quite excited about some of the proposed changes and wouldn't mind backward compatibility breaking. And of course, I'd be happy to adjust the queries in nvim-treesitter.
@mrcjkb Thank you, then I don't need to worry :) (To be clear, I didn't want to argue for not changing the queries; I just wanted a heads-up on how much breakage I should expect so I can brace for it.)
Probably, but you've rewritten the C code, so if you depend on anything new we might need the infrastructure again. Though it's more advisable to just remove functions that aren't supported.
@wenkokke I fiddled with the test runner to get it to work for wasm, and it appears that everything is working – when I run test/parse-libs wasm, all libraries are parsed successfully!
@414owen I'll add comments to the scanner later, so you won't have to keep guessing 😄
@amaanq can you tell me whether the scanner is ever executed concurrently? I've made the state pointer a global variable that I assign in tree_sitter_haskell_external_scanner_scan to avoid having it as a parameter of every function, so it's not thread safe.
@amaanq can you tell me whether the scanner is ever executed concurrently? I've made the state pointer a global variable that I assign in tree_sitter_haskell_external_scanner_scan to avoid having it as a parameter of every function, so it's not thread safe.
Would this result in problems if the library is used, e.g., to parse the contents of multiple tabs in VSCode concurrently? I think it likely will.
Would this result in problems if the library is used, e.g., to parse the contents of multiple tabs in VSCode concurrently? I think it likely will.
I don't see any obvious uses of mutexes in https://github.com/tree-sitter/tree-sitter/blob/master/lib/src/parser.c#L437, so I guess it's up to the editor ☹️
ah well, probably easier to just add all the function arguments back. if we could just use C11 those pointers could get thread-local storage class... 🙄
ah well, probably easier to just add all the function arguments back. if we could just use C11 those pointers could get thread-local storage class... 🙄
I support passing the state along explicitly, as a parameter.
It requires significant support from the linker (ld), dynamic linker (ld.so), and system libraries (libc.so and libpthread.so), so it is not available everywhere
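To make the trade-off concrete, here is a minimal sketch of the two options being discussed – a file-level global assigned on entry to scan versus an explicit parameter on every helper. State, at_layout_column and the helper bodies are hypothetical stand-ins, not the actual scanner.

```c
// Sketch only: `State` and the helpers are invented for illustration.
#include <stdbool.h>
#include "tree_sitter/parser.h"

typedef struct { unsigned indent; } State;

// Option 1: a file-level global assigned on entry to `scan`. Keeps helper
// signatures short, but two parses running concurrently in one process would
// race on this pointer (unless it were made _Thread_local, which needs C11).
static State *global_state;

static bool at_layout_column_global(TSLexer *lexer) {
  return lexer->get_column(lexer) == global_state->indent;
}

// Option 2: pass the state explicitly. Noisier signatures, but no shared
// mutable data between parses, so concurrent parsing is unaffected.
static bool at_layout_column(State *state, TSLexer *lexer) {
  return lexer->get_column(lexer) == state->indent;
}

bool tree_sitter_haskell_external_scanner_scan(void *payload, TSLexer *lexer,
                                               const bool *valid_symbols) {
  (void)valid_symbols;
  State *state = payload;

  global_state = state;                             // option 1: assign once per call
  bool via_global = at_layout_column_global(lexer);

  bool via_param = at_layout_column(state, lexer);  // option 2: explicit parameter

  (void)via_global; (void)via_param;
  return false;  // sketch only: no external token is produced here
}
```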
@clason
Just tree-sitter generate src/grammar.json (little known fact).
If you cloned the repo and used tree-sitter generate, would it make much of a difference whether the input was the javascript grammar or grammar.json?
Yes, the Github release artifacts (only; it shouldn't be committed to the repo).
I added a CI job that publishes the wasm artifact to the releases.
FYI @mrcjkb I copied the queries from nvim-treesitter and adapted them, if you want to give them a once-over.
If you cloned the repo and used tree-sitter generate, would it make much of a difference whether the input was the javascript grammar or grammar.json?
Depends; for some grammars it is. The big benefit is if the javascript needs to be interpreted with npm (e.g., if you import other grammars); this won't be needed if you generate from the json.
Depends; for some grammars it is. The big benefit is if the javascript needs to be interpreted with npm (e.g., if you import other grammars); this won't be needed if you generate from the json.
ah, interesting
@amaanq can you tell me whether the scanner is ever executed concurrently? I've made the state pointer a global variable that I assign in tree_sitter_haskell_external_scanner_scan to avoid having it as a parameter of every function, so it's not thread safe.
Do not do this; it can be executed concurrently in the sense that two different parses might be running, which would affect this global variable. TLS is also not recommended, for the reasons Owen mentioned.
@amaanq
it can be executed concurrently in the sense that two different parses might be running
thanks! Is it impossible for two threads to concurrently run the scanner on the same tree (i.e. same serialized state)?
There are real performance concerns to be taken into account.
FYI: I'm seeing a 7.5% increase in parse time with thread-local!
hey @wenkokke I posted a question for you on the discourse, in case you don't check your notifications there ☺️
hey @wenkokke I posted a question for you on the discourse, in case you don't check your notifications there ☺️
I extremely don't, thanks for the heads up.
Unrelated point, but using field names really really simplifies queries. Does the new grammar use field names?
Unrelated point, but using field names really really simplifies queries. Does the new grammar use field names?
I have already added many and plan to provide full coverage!
If you have any opinions about naming or some aspect I don't know about, please let me know
From practice, it's great if fields that serve a similar purpose have the same name, but that kind of forces you to correctly guess which things are similar from the perspective of structured editing, which may not generalize.
Friendly reminder to keep the grammar.json in the repo, or nvim-treesitter will not be able to install this parser.
Friendly reminder to keep the grammar.json in the repo, or nvim-treesitter will not be able to install this parser.
yep on my list 😅 I'm postponing this till the end to avoid cluttering my dev history
alright, I think we can merge tomorrow if there's nothing else
here goes nothing
So this was merged to the main branch -- will that become the default branch at some point?
@clason not sure – for now I would leave master as default so that parser.c is present when the repo is cloned. If you have suggestions for a good system I'm all ears.
(in case it's unclear – the release workflow will generate parser.c and create a commit on master whenever a tag is pushed)
Ah, master is your "release branch". Some parsers do that, but I never understood the point. Isn't it easier to have a release workflow that makes an actual release with artifacts (tarball and wasm) for downloading, and keep the repo as "pure source"?
I would prefer that, but as far as I could tell there are consumers who want to have a parser.c in the source tree.
(I've also created artifact release workflows btw.)
I would prefer that, but as far as I could tell there are consumers who want to have a parser.c in the source
Yes, but does it have to be in a git branch? You can just generate the parser.c in-tree and then zip the whole thing up; the output will be identical from cloning the repo (ignoring the .git folder).
my motivation was that if someone uses the plain github URL in their application to pull the repo, they'll get the generated parser, as used to be the case in the past.
When I researched this issue I got the impression that some projects depend on this, but if you have better insight into the ecosystem I'd be happy to scrap that workflow.
Well, I understand not wanting to make sudden breaking changes on master, that's completely fine. I am just honestly curious why people depend on this specific setup. If you are only committing the parser to master on manual tags/releases, then there should be zero difference from a downstream pov between an archive/<tag>.tar.gz link and a releases/download/<tag>/tree-sitter-haskell.tar.gz link (where the latter is created from the full repo download plus generated parser.c)?
(Unless I misunderstood and you are pushing on every commit to main -- but then you're back at square one with master bloating?)
no, you understood that fine. I don't like it either, but that's what I learned 🤷🏻 I saw that tree-sitter-swift had multiple requests to include the generated parser, even though they have a separate release branch! so my reasoning was that keeping master as a legacy branch will avoid having to respond to one of those once a month 😅 at least until it has become the new standard.
Of course you could argue that not playing that game would accelerate adoption...
You should just check parser.c into master for the time being, and not have a "main" branch + "release" branch. In the future, we will be getting rid of the need for this, but for now this is the way to go. I also wish you had waited till I checked this out - I'm not too familiar with Haskell, so I wouldn't be particularly useful in commenting on the grammar, but I would have liked to check the changes around everything else in this PR.
I could understand that if people wanted to pull in that branch as a git submodule (in which case 🤢)...
I do think the best strategy is "keep parser.c out of repo (unless it's small), and create release artefacts for 1. the full repo tree plus the generated parser.c and 2. the generated wasm". (You certainly shouldn't commit the generated wasm to the repo...) Downstream users who can't handle generating their own parser shouldn't YOLO track master, either.
@amaanq no reason why you shouldn't be able to check the changes – nothing has been released yet.
@clason understandable or not, people will open issues without reading the docs. I'd rather avoid having to deal with it over and over again
People will open issues without reading the docs no matter what you do ;)
But I am not trying to convince you of anything; I just wanted to know the plan so nvim-treesitter can adapt. In this case, we won't have to do anything for a long while since we track the default branch.
People will open issues without reading the docs no matter what you do ;)
well they certainly won't open any about parser.c missing in the default branch if I keep it there!
But I am not trying to convince you of anything; I just wanted to know the plan so nvim-treesitter can adapt. In this case, we won't have to do anything for a long while since we track the default branch.
I would have expected you to be eager to consume the tags or artifacts instead, irrespective of the existence of a legacy branch.
We are not set up for that yet (because the ecosystem isn't) but that is definitely where we want to be going.
(But that will require the release tarballs to have the exact same layout as the repo.)
oh I see. well in any case I'd assume that at some point all repos will be recreated from scratch in the tree-sitter-grammars org or something, without committing generated files going forward
I decided to redesign the scanner and parts of the grammar due to the large number of issues without an obvious fix that have accumulated over the years. The new implementation provides a significant set of improvements, listed below.
Since the repo has become uncloneable in some situations, I would also prefer to change the workflow so that parser.c is only generated on a release branch, not for every commit, like some other grammars do it. However, since the current situation is already bad, it seems that the only way out would be to reset the history, which would break consumers like nixpkgs who rely on being able to access older commits. An alternative could be to use a new Github location, but that would also be awkward. In any case, it's probably better to handle this later.
I've overhauled the tree structure a bit, mostly for higher-level nodes, since I found some parts to be lacking or badly named. For example, I added header/imports/declarations for the top-level structure. Since I don't have much experience using the grammar via the tree-sitter API directly, I've launched a survey on the Haskell discourse to get some more feedback, so I can use the opportunity of introducing breaking changes to improve the grammar for users.
Not sure about the wasm artifact – is it still necessary to use that patch included in the Makefile?
I'd appreciate opinions and feedback!
Please take a look, @414owen @amaanq @wenkokke
Parses the GHC codebase!
I'm using a trimmed set of the source directories of the compiler and most core libraries in this repo.
This used to break horribly in many files because explicit brace layouts weren't supported very well.
Faster in most cases! Here are a few simple benchmarks to illustrate the difference, not to be taken too seriously, using the test codebases in test/libs:
Old:
New:
GHC's compiler directory takes 3000ms, but is among the fastest repos for per-line and per-character times! To get more detailed info (including new codebases I added, consisting mostly of core libraries), run test/parse-libs. I also added an interface for running hyperfine, exposed as a Nix app – execute nix run .#bench-libs -- stm mtl transformers with the desired set of libraries in test/libs or test/libs/tsh-test-ghc/libraries.
Smaller size of the shared object: tree-sitter generate produces a haskell.so with a size of 4.4MB for the old grammar and 3.0MB for the new one.
Smaller size of the generated parser.c: 24MB -> 14MB.
Significantly faster time to generate, and slightly faster build: on my machine, generation takes 9.34s vs 2.85s, and compiling takes 3.75s vs 3.33s.
All terminals now have proper text nodes when possible, like the . in modules. Fixes #102, #107, #115 (partially?).
in modules. Fixes #102, #107, #115 (partially?).Semicolons are now forced after newlines even if the current parse state doesn't allow them, to fail alternative interpretations in GLR conflicts that sometimes produced top-level expression splices for valid (and invalid) code. Fixes #89, #105, #111.
Comments aren't pulled into preceding layouts anymore. Fixes #82, #109. (Can probably still be improved with a few heuristics for e.g. postfix haddock)
Similarly, whitespace is kept out of layout-related nodes as much as possible. Fixes #74.
Hashes can now be operators in all situations, without sacrificing unboxed tuples. Fixes #108.
Expression quotes are now handled separately from quasiquotes and their contents parsed properly. Fixes #116.
Explicit brace layouts are now handled correctly. Fixes #92.
Function application with multiple block arguments (e.g. several do or case blocks passed in one application) is handled correctly.
Unicode categories for identifiers now match GHC, and the full unicode character set is supported for things like prefix operator detection.
Haddock comments have dedicated nodes now.
Use named precedences instead of closely replicating the GHC parser's productions.
Different layouts are tracked and closed with their special cases considered. In particular, multi-way if now has layout.
Fixed a CPP bug where a mid-line #endif would be a false positive.
CPP only matches legal directives now.
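As an illustration of the kind of check this implies (is_cpp_directive and the directive list are hypothetical and abridged; this is not the actual scanner code): a directive is only recognized when the # starts the line and is followed by a known directive name.

```c
// Simplified sketch of line-start CPP directive detection; not the real code.
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

static const char *const cpp_directives[] = {
  "ifdef", "ifndef", "if", "elif", "else", "endif",
  "define", "undef", "include",  // abridged; the real set is larger
};

// `line` points at the first character of a physical line.
static bool is_cpp_directive(const char *line) {
  if (*line != '#') return false;            // a mid-line `#endif` is not a directive
  const char *name = line + 1;
  while (*name == ' ' || *name == '\t') name++;
  for (size_t i = 0; i < sizeof cpp_directives / sizeof *cpp_directives; i++) {
    size_t n = strlen(cpp_directives[i]);
    if (strncmp(name, cpp_directives[i], n) == 0 && !isalnum((unsigned char)name[n]))
      return true;
  }
  return false;
}
```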
Generally more lenient parsing than GHC, and in the presence of errors: if/then
List comprehensions can have multiple sets of qualifiers (ParallelListComp).
Deriving clauses after GADTs don't require layout anymore.
Newtype instance heads are working properly now.
Escaping newlines in CPP works now.
One remaining issue is that qualified left sections that contain infix ops are broken: (a + a A.+). I haven't managed to figure out a good strategy for this – my suspicion is that it's impossible to correctly parse application, infix and negation without lexing all qualified names in the scanner. I will try that out at some point, but for now I'm planning to just accept that this one thing doesn't work. For what it's worth, none of the codebases I use for testing contain this construct in a way that breaks parsing. (Edit: solved this by implementing qualified operator lookahead.)
The repo now includes a Haskell program that generates C code for classifying characters as belonging to some sets of Unicode categories, using bitmaps. I might need to change this to write them all to a shared file, so the set of source files stays the same.
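As a rough illustration of the bitmap technique mentioned above (category_bits_latin1 and in_category_set are invented names and the table contents are placeholders, not the generated code): one bit per code point, indexed by word and bit offset.

```c
// Sketch of generated-style lookup code; table values here are placeholders.
#include <stdbool.h>
#include <stdint.h>

// One bit per code point for the first 256 characters; a set bit means the
// character belongs to the targeted group of Unicode categories.
static const uint64_t category_bits_latin1[4] = {0, 0, 0, 0};

static bool in_category_set(uint32_t c) {
  if (c < 256) return (category_bits_latin1[c >> 6] >> (c & 63)) & 1;
  // Higher code points would be covered by further generated tables or ranges.
  return false;
}
```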
I haven't managed to figure out a good strategy for this – my suspicion is that it's impossible to correctly parse application, infix and negation without lexing all qualified names in the scanner. I will try that out at some point, but for now I'm planning to just accept that this one thing doesn't work. For what it's worth, none of the codebases I use for testing contain this construct in a way that breaks parsing.Repo now includes a Haskell program that generates C code for classifying characters as belonging to some sets of Unicode categories, using bitmaps. I might need to change this to write them all to a shared file, so the set of source files stays the same.