Closed by tek 6 months ago
hrm, somehow rebasing broke the working tree
oh I see. well in any case I'd assume that at some point all repos will be recreated from scratch in the tree-sitter-grammars org or something, without committing generated files going forward
Not really, time is finite and the number of parsers is seemingly infinite. The org was just offering a home to a number of parsers looking for help with automated maintenance.
But, yes, the upstream guidance is to omit parser.c (but not grammar.json) from the repo in favor of versioned release artifacts with generated files, and we are very much looking forward to being able to rely on the latter more widely. Upstream is actively working on adding tooling to make this easier.
For now, my curiosity is satisfied that there will not be a breaking update to the master branch (which we track and so any breaking changes there force us to act) anytime soon, so I can wait and see how things play out. If anybody jumps the gun and makes a PR switching the branch, I'll now know what's what.
I decided to redesign the scanner and parts of the grammar due to the large number of issues without obvious fix that have accumulated over the years. The new implementation provides a significant set of improvements, listed below.
Since the repo has become uncloneable in some situations, I would also prefer to change the workflow so that `parser.c` is only generated on a release branch, not for every commit, like some other grammars do it. However, since the current situation is already bad, it seems that the only way out would be to reset the history, which would break consumers like nixpkgs who rely on being able to access older commits. An alternative could be to use a new GitHub location, but that would also be awkward. In any case, it's probably better to handle this later.
I've overhauled the tree structure a bit, mostly for higher-level nodes, since I found some parts to be lacking or badly named. For example, I added `header`/`imports`/`declarations` for the top-level structure. Since I don't have much experience using the grammar via the `tree-sitter` API directly, I've launched a survey on the Haskell discourse to get some more feedback, so I can use the opportunity of introducing breaking changes to improve the grammar for users.
Not sure about the wasm artifact: is it still necessary to use that patch included in the `Makefile`?
I'd appreciate opinions and feedback!
Please take a look, @414owen @amaanq @wenkokke
Parses the GHC codebase!
I'm using a trimmed set of the source directories of the compiler and most core libraries in this repo.
This used to break horribly in many files because explicit brace layouts weren't supported very well.
Faster in most cases! Here are a few simple benchmarks to illustrate the difference, not to be taken too seriously, using the test codebases in `test/libs`:
Old:
New:
GHC's `compiler` directory takes 3000ms, but is among the fastest repos for per-line and per-character times! To get more detailed info (including new codebases I added, consisting mostly of core libraries), run `test/parse-libs`. I also added an interface for running `hyperfine`, exposed as a Nix app: execute `nix run .#bench-libs -- stm mtl transformers` with the desired set of libraries in `test/libs` or `test/libs/tsh-test-ghc/libraries`.
Smaller size of the shared object.
`tree-sitter generate` produces a `haskell.so` with a size of 4.4MB for the old grammar, and 3.0MB for the new one.
Smaller size of the generated `parser.c`: 24MB -> 14MB.
Significantly faster time to generate, and slightly faster build.
On my machine, generation takes 9.34s (old) vs 2.85s (new), and compiling takes 3.75s (old) vs 3.33s (new).
All terminals now have proper text nodes when possible, like the `.` in modules. Fixes #102, #107, #115 (partially?).
Semicolons are now forced after newlines even if the current parse state doesn't allow them, to fail alternative interpretations in GLR conflicts that sometimes produced top-level expression splices for valid (and invalid) code. Fixes #89, #105, #111.
Comments aren't pulled into preceding layouts anymore. Fixes #82, #109. (Can probably still be improved with a few heuristics for e.g. postfix haddock)
Similarly, whitespace is kept out of layout-related nodes as much as possible. Fixes #74.
Hashes can now be operators in all situations, without sacrificing unboxed tuples. Fixes #108.
Expression quotes are now handled separately from quasiquotes and their contents parsed properly. Fixes #116.
Explicit brace layouts are now handled correctly. Fixes #92.
Function application with multiple block arguments is handled correctly. Example:
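A minimal sketch of the kind of code this refers to, assuming GHC's `BlockArguments` extension. The helper `combine` and the value `total` are hypothetical names chosen for illustration and are not taken from the repo's test suite:

```haskell
{-# LANGUAGE BlockArguments #-}
module Main where

-- Hypothetical helper that takes two arguments; each call site below
-- passes a `do` block for both of them.
combine :: Maybe Int -> Maybe Int -> Maybe Int
combine a b = (+) <$> a <*> b

-- Two consecutive block arguments. The second `do` is indented less than
-- the statements of the first block, so the first layout is closed before
-- the second argument starts.
total :: Maybe Int
total =
  combine
    do x <- Just 1
       pure (x + 1)
    do y <- Just 10
       pure (y * 2)

main :: IO ()
main = print total
```

The point for the parser is that the first block's layout has to end exactly where the second `do` begins, without the second block being swallowed into the first.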
Unicode categories for identifiers now match GHC, and the full unicode character set is supported for things like prefix operator detection.
Haddock comments have dedicated nodes now.
Use named precedences instead of closely replicating the GHC parser's productions.
Different layouts are tracked and closed with their special cases considered. In particular, multi-way if now has layout.
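For reference, a small multi-way if as it might appear in user code, assuming the `MultiWayIf` extension. `classify` is a hypothetical function used only for illustration:

```haskell
{-# LANGUAGE MultiWayIf #-}
module Main where

-- Each guard sits at the same column; with layout, that column is what
-- delimits the alternatives of the multi-way if.
classify :: Int -> String
classify n =
  if | n < 0 -> "negative"
     | n == 0 -> "zero"
     | otherwise -> "positive"

main :: IO ()
main = mapM_ (putStrLn . classify) [-1, 0, 1]
```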
Fixed CPP bug where a mid-line `#endif` would be a false positive.
CPP only matches legal directives now.
Generally more lenient parsing than GHC, and in the presence of errors:
if/then
List comprehensions can have multiple sets of qualifiers (`ParallelListComp`).
Deriving clauses after GADTs don't require layout anymore.
Newtype instance heads are working properly now.
Escaping newlines in CPP works now.
One remaining issue was that qualified left sections that contain infix ops were broken: `(a + a A.+)`
I hadn't managed to figure out a good strategy for this; my suspicion was that it's impossible to correctly parse application, infix and negation without lexing all qualified names in the scanner. I was planning to just accept that this one thing didn't work, since none of the codebases I use for testing contain this construct in a way that breaks parsing. In the end I solved this by implementing qualified operator lookahead.
Repo now includes a Haskell program that generates C code for classifying characters as belonging to some sets of Unicode categories, using bitmaps. I might need to change this to write them all to a shared file, so the set of source files stays the same.
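To illustrate the bitmap idea, here is a rough sketch of how such a generator could work. It is not the program from the repo: the names `bitmap`, `member`, `renderC`, `isUpperLetter` and the C identifier `bitmap_upper` are all hypothetical, and the category set is only an example.

```haskell
module Main where

import Data.Bits (setBit, testBit)
import Data.Char (GeneralCategory (..), chr, generalCategory, ord)
import Data.List (foldl', intercalate)
import Data.Word (Word8)
import Numeric (showHex)

-- Pack a predicate over the code points [0, limit) into bytes, one bit per
-- code point, least significant bit first.
bitmap :: Int -> (Char -> Bool) -> [Word8]
bitmap limit p = map byteAt [0 .. (limit - 1) `div` 8]
  where
    byteAt i =
      foldl' setBit 0 [b | b <- [0 .. 7], let c = i * 8 + b, c < limit, p (chr c)]

-- Look a character up in a bitmap produced by 'bitmap'.
member :: [Word8] -> Char -> Bool
member bits c = case drop (ord c `div` 8) bits of
  (w : _) -> testBit w (ord c `mod` 8)
  [] -> False

-- Render the bitmap as a C array definition (assumed naming scheme).
renderC :: String -> [Word8] -> String
renderC name bits =
  "static const unsigned char " ++ name ++ "[] = {"
    ++ intercalate ", " (map hex bits)
    ++ "};\n"
  where
    hex w = "0x" ++ showHex w ""

-- Example category set, chosen for illustration only.
isUpperLetter :: Char -> Bool
isUpperLetter c = generalCategory c `elem` [UppercaseLetter, TitlecaseLetter]

main :: IO ()
main = putStr (renderC "bitmap_upper" (bitmap 0x80 isUpperLetter))
```

On the C side, a lookup would then be a shift and mask on the generated array, mirroring `member`. Writing every set through one `renderC`-style function into a single file is one way the set of source files could be kept stable.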