Rewrite the grammar once again

tek commented 5 months ago

I decided to redesign the scanner and parts of the grammar due to the large number of issues without an obvious fix that have accumulated over the years. The new implementation provides a significant set of improvements, listed below.

Since the repo has become uncloneable in some situations, I would also prefer to change the workflow so that parser.c is only generated on a release branch, not for every commit, as some other grammars do. However, since the current situation is already bad, it seems that the only way out would be to reset the history, which would break consumers like nixpkgs who rely on being able to access older commits. An alternative could be to use a new GitHub location, but that would also be awkward. In any case, it's probably better to handle this later.

I've overhauled the tree structure a bit, mostly for higher-level nodes, since I found some parts to be lacking or badly named. For example, I added header / imports / declarations for the top-level structure. Since I don't have much experience using the grammar via the tree-sitter API directly, I've launched a survey on the Haskell Discourse to get some more feedback, so I can use the opportunity of introducing breaking changes to improve the grammar for users.

Not sure about the wasm artifact – is it still necessary to use that patch included in the Makefile?

I'd appreciate opinions and feedback!

Please take a look, @414owen @amaanq @wenkokke

Parses the GHC codebase!

I'm using a trimmed set of the source directories of the compiler and most core libraries in this repo.

This used to break horribly in many files because explicit brace layouts weren't supported very well.

Faster in most cases! Here are a few simple benchmarks to illustrate the difference, not to be taken too seriously, using the test codebases in test/libs:

Old:
```
effects: 32ms
postgrest: 91ms
ivory: 224ms
polysemy: 84ms
semantic: 1336ms
haskell-language-server: 532ms
flatparse: 45ms
```
New:
```
effects: 29ms
postgrest: 64ms
ivory: 178ms
polysemy: 70ms
semantic: 692ms
haskell-language-server: 390ms
flatparse: 36ms
```
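
To put the two lists side by side, the relative change can be computed directly from the numbers above. This is only a small illustrative sketch in Python; the function names `speedup` and `report` are made up here and are not part of the repo's benchmark tooling.

```python
# Timings in milliseconds, copied from the "Old" and "New" lists above.
old = {"effects": 32, "postgrest": 91, "ivory": 224, "polysemy": 84,
       "semantic": 1336, "haskell-language-server": 532, "flatparse": 45}
new = {"effects": 29, "postgrest": 64, "ivory": 178, "polysemy": 70,
       "semantic": 692, "haskell-language-server": 390, "flatparse": 36}


def speedup(name):
    """Old time divided by new time for one codebase (hypothetical helper)."""
    return old[name] / new[name]


def report():
    """One summary line per codebase: both timings, ratio, and time saved."""
    lines = []
    for name in old:
        saved = 100 * (old[name] - new[name]) / old[name]
        lines.append(f"{name}: {old[name]}ms -> {new[name]}ms "
                     f"({speedup(name):.2f}x, {saved:.0f}% less time)")
    return lines


if __name__ == "__main__":
    print("\n".join(report()))
```

The biggest change in this set is semantic, which goes from 1336ms to 692ms, a bit under a 2x speedup; the smaller codebases improve less.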
GHC's compiler directory takes 3000ms, but is among the fastest repos for per-line and per-character times! To get more detailed info (including new codebases I added, consisting mostly of core libraries), run test/parse-libs. I also added an interface for running hyperfine, exposed as a Nix app: execute nix run .#bench-libs -- stm mtl transformers with the desired set of libraries in test/libs or test/libs/tsh-test-ghc/libraries.

Smaller size of the shared object.

tree-sitter generate produces a haskell.so with a size of 4.4MB for the old grammar, and 3.0MB for the new one.

Smaller size of the generated parser.c: 24MB -> 14MB.

Significantly faster time to generate, and slightly faster build.

On my machine, generation takes 2.85s instead of 9.34s, and compiling takes 3.33s instead of 3.75s.
All terminals now have proper text nodes when possible, like the . in modules. Fixes #102, #107, #115 (partially?).
Semicolons are now forced after newlines even if the current parse state doesn't allow them, to fail alternative interpretations in GLR conflicts that sometimes produced top-level expression splices for valid (and invalid) code. Fixes #89, #105, #111.
Comments aren't pulled into preceding layouts anymore. Fixes #82, #109. (Can probably still be improved with a few heuristics for e.g. postfix haddock)
Similarly, whitespace is kept out of layout-related nodes as much as possible. Fixes #74.
Hashes can now be operators in all situations, without sacrificing unboxed tuples. Fixes #108.
Expression quotes are now handled separately from quasiquotes and their contents parsed properly. Fixes #116.
Explicit brace layouts are now handled correctly. Fixes #92.

Function application with multiple block arguments is handled correctly. Example:

a = do \ a -> a
  case a of
    a -> a
  do a + a
  if
 | a -> a
 | a -> a
*
do a

Unicode categories for identifiers now match GHC, and the full unicode character set is supported for things like prefix operator detection.
Haddock comments have dedicated nodes now.
Use named precedences instead of closely replicating the GHC parser's productions.
Different layouts are tracked and closed with their special cases considered. In particular, multi-way if now has layout.
Fixed a CPP bug where a mid-line #endif would be a false positive.
CPP only matches legal directives now.
Generally more lenient parsing than GHC, and in the presence of errors:
- Missing closing tokens at EOF are tolerated for:
  - CPP
  - Comment
  - TH Quotation
- Multiple semicolons in some positions like if/then
- Unboxed tuples and sums are allowed to have arbitrary numbers of filled positions
List comprehensions can have multiple sets of qualifiers (ParallelListComp).
Deriving clauses after GADTs don't require layout anymore.
Newtype instance heads are working properly now.
Escaping newlines in CPP works now.
One remaining issue was that qualified left sections that contain infix ops were broken: (a + a A.+). I hadn't managed to figure out a good strategy for this; my suspicion was that it's impossible to correctly parse application, infix and negation without lexing all qualified names in the scanner, and I was planning to just accept that this one thing doesn't work. For what it's worth, none of the codebases I use for testing contain this construct in a way that breaks parsing. Update: solved this by implementing qualified operator lookahead.
Repo now includes a Haskell program that generates C code for classifying characters as belonging to some sets of Unicode categories, using bitmaps. I might need to change this to write them all to a shared file, so the set of source files stays the same.
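
The bitmap idea from the last item can be sketched roughly as follows. To be clear, this is not the repo's generator (which is a Haskell program emitting C); it is a small Python illustration of the technique, and the category set, the code point limit, and all function names are assumptions made for the example.

```python
import unicodedata

# Hypothetical category set for illustration; the grammar's real sets may differ.
LETTER_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo"}


def build_bitmap(categories, limit):
    """Pack one bit per code point below `limit`: set if its category is in `categories`."""
    bits = bytearray((limit + 7) // 8)
    for cp in range(limit):
        if unicodedata.category(chr(cp)) in categories:
            bits[cp >> 3] |= 1 << (cp & 7)
    return bits


def in_bitmap(bits, cp):
    """Membership test by byte index and bit mask; out-of-range code points are False."""
    index = cp >> 3
    return index < len(bits) and bool(bits[index] & (1 << (cp & 7)))


def to_c_array(name, bits):
    """Render the bitmap as a C array literal, roughly as a generator might emit it."""
    body = ", ".join(f"0x{b:02x}" for b in bits)
    return f"static const uint8_t {name}[{len(bits)}] = {{{body}}};"


if __name__ == "__main__":
    letters = build_bitmap(LETTER_CATEGORIES, 0x500)
    print(in_bitmap(letters, ord("a")), in_bitmap(letters, ord("+")))
```

The appeal of this layout is that the lookup is a shift, an index, and a mask, so a scanner can classify a character without a chain of range comparisons. Writing all tables into one shared file, as mentioned above, would only change where the output of something like `to_c_array` ends up.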

tek commented 4 months ago

hrm, somehow rebasing broke the working tree 🤔

clason commented 4 months ago

> oh I see. well in any case I'd assume that at some point all repos will be recreated from scratch in the tree-sitter-grammars org or something, without committing generated files going forward

Not really, time is finite and the number of parsers is seemingly infinite. The org was just offering a home to a number of parsers looking for help with automated maintenance.

But, yes, the upstream guidance is to omit parser.c (but not grammar.json) from the repo in favor of versioned release artifacts with generated files, and we are very much looking forward to being able to rely on the latter more widely. Upstream is actively working on adding tooling to make this easier.

For now, my curiosity is satisfied that there will not be a breaking update to the master branch (which we track and so any breaking changes there force us to act) anytime soon, so I can wait and see how things play out. If anybody jumps the gun and makes a PR switching the branch, I'll now know what's what.

tree-sitter / tree-sitter-haskell

Rewrite the grammar once again #120