tek closed this 6 months ago
I would also prefer to change the workflow so that parser.c is only generated on a release branch, not for every commit, like some other grammars do it.
In this case, please at least commit the grammar.json, which is much smaller and has meaningful diffs. This will allow people to generate the parser without having to use node.
Not sure about the wasm artifact – is it still necessary to use that patch included in the Makefile?
You can use tree-sitter build --wasm with the latest version. Please upload that as a release artifact as well.
Important question: will queries remain compatible?
In this case, please at least commit the grammar.json, which is much smaller and has meaningful diffs. This will allow people to generate the parser without having to use node.
Ah, didn't know that. How do you generate the parser from that file?
You can use tree-sitter build --wasm with the latest version. Please upload that as a release artifact as well.
That's how we're building it, but @wenkokke added a patch for web-tree-sitter for some reason I don't recall.
I regenerated the .wasm file and included it in the PR. Are you talking about Github release artifacts?
Important question: will queries remain compatible?
Some of them will, but how contained the breakage is will depend on the results of the survey. I think most expression queries will still work, but types and top-level structures are likely to be affected.
Ah, didn't know that. How do you generate the parser from that file?
Just tree-sitter generate src/grammar.json (little known fact).
That's how we're building it, but @wenkokke added a patch for web-tree-sitter for some reason I don't recall.
Possibly because the output dir has changed? But that should be fixed upstream.
I regenerated the .wasm file and included it in the PR. Are you talking about Github release artifacts?
Yes, the Github release artifacts (only; it shouldn't be committed to the repo).
Some of them will, but how contained the breakage is will depend on the results of the survey. I think most expression queries will still work, but types and top-level structures are likely to be affected.
Thanks for the heads-up. I hope the queries here can serve as a guide for nvim-treesitter.
Wow @tek this sounds incredible, clearly a labour of love, and I'm hugely grateful that you undertook such a huge amount of work for the community. The sheer quantity of listed improvements is astounding.
I'll try my best to look over it, but given the large amount of grammar changes, and that I've mostly just dealt with the scanner in the past, you're clearly the one who needs to decide when this is ready :)
🚀 🚀 🚀
That's how we're building it, but @wenkokke added a patch for web-tree-sitter for some reason I don't recall.
The reason is that web-tree-sitter doesn't support all of libc, and there's a file in the web-tree-sitter source that determines exactly what symbols are supported.
Please test the WASM build before merging. Tests using only the standard testing infrastructure sadly aren't reliable.
That is an upstream issue, surely? The point of wasm is to be independent of both the source and the consumer. If something isn't working that should be working, tree-sitter should know about it rather than band-aiding it?
@414owen thanks so much! I definitely went a bit overboard with the obsession over the last few weeks 😁 I don't think it's necessary to review the scanner in detail – running that fuzzer should be enough to complement the tests.
Please test the WASM build before merging. Tests using only the standard testing infrastructure sadly aren't reliable.
Ah, I'll make sure to bring the wasm build and release up to speed then
FYI @wenkokke it appears that the file with the exports was refactored in https://github.com/tree-sitter/tree-sitter/commit/d2900510f61b814e71ab5e3e0d8b59be2d276a6e, now not containing standard library symbols anymore
edit: Ah! It's in here now, and appears to contain realloc, which your patch added. So we're good?
That is an upstream issue, surely? The point of wasm is to be independent of both the source and the consumer. If something isn't working that should be working, tree-sitter should know about it rather than band-aiding it?
This isn't a WASM issue, but rather an emscripten feature that lets you not ship the entirety of libc and libc++ for your web app.
That means that tree-sitter, as the entry point, bundles libc and libc++ and must pick which parts to include. They generally pick only those parts they need for their library itself to run.
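For illustration only – a minimal, hypothetical sketch (Buffer and buffer_push are made-up names, not code from this repository) of the kind of scanner helper that creates such a dependency: a serialization buffer grown with realloc compiles fine natively, but in the wasm build the call can only be resolved if the host module (web-tree-sitter) exports that libc symbol.

```c
// Hypothetical sketch; not the actual tree-sitter-haskell scanner code.
#include <stdlib.h>
#include <string.h>

typedef struct {
  char *data;
  size_t len;
  size_t cap;
} Buffer;

// Grows the buffer with realloc; in the wasm build this only links if the
// embedding module exports `realloc` to the grammar's wasm module.
static void buffer_push(Buffer *buf, const char *bytes, size_t n) {
  if (buf->len + n > buf->cap) {
    size_t new_cap = (buf->len + n) * 2;
    char *grown = realloc(buf->data, new_cap);
    if (grown == NULL) return;  // sketch: real code would handle allocation failure
    buf->data = grown;
    buf->cap = new_cap;
  }
  memcpy(buf->data + buf->len, bytes, n);
  buf->len += n;
}
```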
FYI @wenkokke it appears that the file with the exports was refactored in https://github.com/tree-sitter/tree-sitter/commit/d2900510f61b814e71ab5e3e0d8b59be2d276a6e, now not containing standard library symbols anymore
edit: Ah! It's in here now, and appears to contain realloc, which your patch added. So we're good?
Probably, but you've rewritten the C code, so if you depend on anything new we might need the infrastructure again. Though it's more advisable to just remove functions that aren't supported.
They generally pick only those parts they need for their library itself to run.
Not true; they also expose (more and more) for external scanners to use. Hence my recommendation to raise an upstream issue.
Important question: will queries remain compatible?
@clason Since I've worked on a lot of the queries in nvim-treesitter, and also maintain neotest-haskell, which also uses tree-sitter queries, I'll be taking part in the survey. Personally, I'm quite excited about some of the proposed changes and wouldn't mind backward compatibility breaking. And of course, I'd be happy to adjust the queries in nvim-treesitter.
@mrcjkb Thank you, then I don't need to worry :) (To be clear, I didn't want to argue for not changing the queries; I just wanted a heads-up on how much breakage I should expect so I can brace for it.)
Probably, but you've rewritten the C code, so if you depend on anything new we might need the infrastructure again. Though it's more advisable to just remove functions that aren't supported.
@wenkokke I fiddled with the test runner to get it to work for wasm, and it appears that everything is working – when I run test/parse-libs wasm, all libraries are parsed successfully!
@414owen I'll add comments to the scanner later, so you won't have to keep guessing 😄
@amaanq can you tell me whether the scanner is ever executed concurrently? I've made the state pointer a global variable that I assign in tree_sitter_haskell_external_scanner_scan to avoid having it as a parameter of every function, so it's not thread safe.
@amaanq can you tell me whether the scanner is ever executed concurrently? I've made the state pointer a global variable that I assign in tree_sitter_haskell_external_scanner_scan to avoid having it as a parameter of every function, so it's not thread safe.
Would this result in problems if the library is used, e.g., to parse the contents of multiple tabs in VSCode concurrently? I think it likely will.
Would this result in problems if the library is used, e.g., to parse the contents of multiple tabs in VSCode concurrently? I think it likely will.
I don't see any obvious uses of mutexes in https://github.com/tree-sitter/tree-sitter/blob/master/lib/src/parser.c#L437, so I guess it's up to the editor ☹️
ah well, probably easier to just add all the function arguments back. if we could just use C11 those pointers could get thread-local storage class... 🙄
ah well, probably easier to just add all the function arguments back. if we could just use C11 those pointers could get thread-local storage class... 🙄
I support passing the state along explicitly, as a parameter.
It requires significant support from the linker (ld), dynamic linker (ld.so), and system libraries (libc.so and libpthread.so), so it is not available everywhere
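To make the trade-off concrete, here is a minimal sketch of the two options being discussed – a file-level global assigned on entry to scan versus an explicit parameter on every helper. State, at_layout_column and the helper bodies are hypothetical stand-ins, not the actual scanner.

```c
// Sketch only: `State` and the helpers are invented for illustration.
#include <stdbool.h>
#include "tree_sitter/parser.h"

typedef struct { unsigned indent; } State;

// Option 1: a file-level global assigned on entry to `scan`. Keeps helper
// signatures short, but two parses running concurrently in one process would
// race on this pointer (unless it were made _Thread_local, which needs C11).
static State *global_state;

static bool at_layout_column_global(TSLexer *lexer) {
  return lexer->get_column(lexer) == global_state->indent;
}

// Option 2: pass the state explicitly. Noisier signatures, but no shared
// mutable data between parses, so concurrent parsing is unaffected.
static bool at_layout_column(State *state, TSLexer *lexer) {
  return lexer->get_column(lexer) == state->indent;
}

bool tree_sitter_haskell_external_scanner_scan(void *payload, TSLexer *lexer,
                                               const bool *valid_symbols) {
  (void)valid_symbols;
  State *state = payload;

  global_state = state;                             // option 1: assign once per call
  bool via_global = at_layout_column_global(lexer);

  bool via_param = at_layout_column(state, lexer);  // option 2: explicit parameter

  (void)via_global; (void)via_param;
  return false;  // sketch only: no external token is produced here
}
```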
@clason
Just tree-sitter generate src/grammar.json (little known fact).
If you cloned the repo and used tree-sitter generate, would it make much of a difference whether the input was the javascript grammar or grammar.json?
Yes, the Github release artifacts (only; it shouldn't be committed to the repo).
I added a CI job that publishes the wasm artifact to the releases.
FYI @mrcjkb I copied the queries from nvim-treesitter and adapted them, if you want to give them a once-over.
If you cloned the repo and used tree-sitter generate, would it make much of a difference whether the input was the javascript grammar or grammar.json?
Depends; for some grammars it is. The big benefit is if the javascript needs to be interpreted with npm (e.g., if you import other grammars); this won't be needed if you generate from the json.
Depends; for some grammars it is. The big benefit is if the javascript needs to be interpreted with npm (e.g., if you import other grammars); this won't be needed if you generate from the json.
ah, interesting
@amaanq can you tell me whether the scanner is ever executed concurrently? I've made the state pointer a global variable that I assign in tree_sitter_haskell_external_scanner_scan to avoid having it as a parameter of every function, so it's not thread safe.
Do not do this; it can be executed concurrently in the sense that two different parses might be running, which would affect this global variable. TLS is also not recommended, for the reasons Owen mentioned.
@amaanq
it can be executed concurrently in the sense that two different parses might be running
thanks! Is it impossible for two threads to concurrently run the scanner on the same tree (i.e. same serialized state)?
There are real performance concerns to be taken into account.
FYI: I'm seeing a 7.5% increase in parse time with thread-local!
hey @wenkokke I posted a question for you on the discourse, in case you don't check your notifications there ☺️
hey @wenkokke I posted a question for you on the discourse, in case you don't check your notifications there ☺️
I extremely don't, thanks for the heads up.
Unrelated point, but using field names really really simplifies queries. Does the new grammar use field names?
Unrelated point, but using field names really really simplifies queries. Does the new grammar use field names?
I have already added many and plan to provide full coverage!
If you have any opinions about naming or some aspect I don't know about, please let me know
From practice, it's great if fields that serve a similar purpose have the same name, but that kind of forces you to correctly guess which things are similar from the perspective of structured editing, which may not generalize.
Friendly reminder to keep the grammar.json in the repo, or nvim-treesitter will not be able to install this parser.
Friendly reminder to keep the grammar.json in the repo, or nvim-treesitter will not be able to install this parser.
yep on my list 😅 I'm postponing this till the end to avoid cluttering my dev history
alright, I think we can merge tomorrow if there's nothing else
here goes nothing
So this was merged to the main branch -- will that become the default branch at some point?
@clason not sure – for now I would leave master as default so that parser.c is present when the repo is cloned. If you have suggestions for a good system I'm all ears.
(in case it's unclear – the release workflow will generate parser.c and create a commit on master whenever a tag is pushed)
Ah, master is your "release branch". Some parsers do that, but I never understood the point. Isn't it easier to have a release workflow that makes an actual release with artifacts (tarball and wasm) for downloading, and keep the repo as "pure source"?
I would prefer that, but as far as I could tell there are consumers who want to have a parser.c in the source tree.
(I've also created artifact release workflows btw.)
I would prefer that, but as far as I could tell there are consumers who want to have a parser.c in the source
Yes, but does it have to be in a git branch? You can just generate the parser.c in-tree and then zip the whole thing up; the output will be identical from cloning the repo (ignoring the .git folder).
my motivation was that if someone uses the plain github URL in their application to pull the repo, they'll get the generated parser, as used to be the case in the past.
When I researched this issue I got the impression that some projects depend on this, but if you have better insight into the ecosystem I'd be happy to scrap that workflow.
Well, I understand not wanting to make sudden breaking changes on master, that's completely fine. I am just honestly curious why people depend on this specific setup. If you are only committing the parser to master on manual tags/releases, then there should be zero difference from a downstream pov between an archive/<tag>.tar.gz link and a releases/download/<tag>/tree-sitter-haskell.tar.gz link (where the latter is created from the full repo download plus generated parser.c)?
(Unless I misunderstood and you are pushing on every commit to main -- but then you're back at square one with master bloating?)
no, you understood that fine. I don't like it either, but that's what I learned 🤷🏻 I saw that tree-sitter-swift had multiple requests to include the generated parser, even though they have a separate release branch! so my reasoning was that keeping master as a legacy branch will avoid having to respond to one of those once a month 😅 at least until it has become the new standard.
Of course you could argue that not playing that game would accelerate adoption...
You should just check parser.c into master for the time being, and not have a "main" branch + "release" branch. In the future, we will be getting rid of the need for this, but for now this is the way to go. I also wish you had waited till I checked this out - I'm not too familiar with Haskell, so I wouldn't be particularly useful in commenting on the grammar, but I would have liked to check the changes around everything else in this PR.
I could understand that if people wanted to pull in that branch as a git submodule (in which case 🤢)...
I do think the best strategy is "keep parser.c out of repo (unless it's small), and create release artefacts for 1. the full repo tree plus the generated parser.c and 2. the generated wasm". (You certainly shouldn't commit the generated wasm to the repo...) Downstream users who can't handle generating their own parser shouldn't YOLO track master, either.
@amaanq no reason why you shouldn't be able to check the changes – nothing has been released yet.
@clason understandable or not, people will open issues without reading the docs. I'd rather avoid having to deal with it over and over again
People will open issues without reading the docs no matter what you do ;)
But I am not trying to convince you of anything; I just wanted to know the plan so nvim-treesitter can adapt. In this case, we won't have to do anything for a long while since we track the default branch.
People will open issues without reading the docs no matter what you do ;)
well they certainly won't open any about parser.c missing in the default branch if I keep it there!
But I am not trying to convince you of anything; I just wanted to know the plan so nvim-treesitter can adapt. In this case, we won't have to do anything for a long while since we track the default branch.
I would have expected you to be eager to consume the tags or artifacts instead, irrespective of the existence of a legacy branch.
We are not set up for that yet (because the ecosystem isn't) but that is definitely where we want to be going.
(But that will require the release tarballs to have the exact same layout as the repo.)
oh I see. well in any case I'd assume that at some point all repos will be recreated from scratch in the tree-sitter-grammars org or something, without committing generated files going forward
I decided to redesign the scanner and parts of the grammar due to the large number of issues without an obvious fix that have accumulated over the years. The new implementation provides a significant set of improvements, listed below.
Since the repo has become uncloneable in some situations, I would also prefer to change the workflow so that parser.c is only generated on a release branch, not for every commit, like some other grammars do it. However, since the current situation is already bad, it seems that the only way out would be to reset the history, which would break consumers like nixpkgs who rely on being able to access older commits. An alternative could be to use a new Github location, but that would also be awkward. In any case, it's probably better to handle this later.
I've overhauled the tree structure a bit, mostly for higher-level nodes, since I found some parts to be lacking or badly named. For example, I added header/imports/declarations for the top-level structure. Since I don't have much experience using the grammar via the tree-sitter API directly, I've launched a survey on the Haskell discourse to get some more feedback, so I can use the opportunity of introducing breaking changes to improve the grammar for users.
Not sure about the wasm artifact – is it still necessary to use that patch included in the Makefile?
I'd appreciate opinions and feedback!
Please take a look, @414owen @amaanq @wenkokke
Parses the GHC codebase!
I'm using a trimmed set of the source directories of the compiler and most core libraries in this repo.
This used to break horribly in many files because explicit brace layouts weren't supported very well.
Faster in most cases! Here are a few simple benchmarks to illustrate the difference, not to be taken too seriously, using the test codebases in test/libs:
Old:
New:
GHC's compiler directory takes 3000ms, but is among the fastest repos for per-line and per-character times! To get more detailed info (including new codebases I added, consisting mostly of core libraries), run test/parse-libs. I also added an interface for running hyperfine, exposed as a Nix app – execute nix run .#bench-libs -- stm mtl transformers with the desired set of libraries in test/libs or test/libs/tsh-test-ghc/libraries.
Smaller size of the shared object: tree-sitter generate produces a haskell.so with a size of 4.4MB for the old grammar and 3.0MB for the new one.
Smaller size of the generated parser.c: 24MB -> 14MB.
Significantly faster time to generate, and slightly faster build: on my machine, generation takes 9.34s vs 2.85s, and compiling takes 3.75s vs 3.33s.
All terminals now have proper text nodes when possible, like the . in modules. Fixes #102, #107, #115 (partially?).
in modules. Fixes #102, #107, #115 (partially?).Semicolons are now forced after newlines even if the current parse state doesn't allow them, to fail alternative interpretations in GLR conflicts that sometimes produced top-level expression splices for valid (and invalid) code. Fixes #89, #105, #111.
Comments aren't pulled into preceding layouts anymore. Fixes #82, #109. (Can probably still be improved with a few heuristics for e.g. postfix haddock)
Similarly, whitespace is kept out of layout-related nodes as much as possible. Fixes #74.
Hashes can now be operators in all situations, without sacrificing unboxed tuples. Fixes #108.
Expression quotes are now handled separately from quasiquotes and their contents parsed properly. Fixes #116.
Explicit brace layouts are now handled correctly. Fixes #92.
Function application with multiple block arguments (e.g. several do or case blocks passed in one application) is handled correctly.
Unicode categories for identifiers now match GHC, and the full unicode character set is supported for things like prefix operator detection.
Haddock comments have dedicated nodes now.
Use named precedences instead of closely replicating the GHC parser's productions.
Different layouts are tracked and closed with their special cases considered. In particular, multi-way if now has layout.
Fixed a CPP bug where a mid-line #endif would be a false positive.
CPP only matches legal directives now.
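As an illustration of the kind of check this implies (is_cpp_directive and the directive list are hypothetical and abridged; this is not the actual scanner code): a directive is only recognized when the # starts the line and is followed by a known directive name.

```c
// Simplified sketch of line-start CPP directive detection; not the real code.
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

static const char *const cpp_directives[] = {
  "ifdef", "ifndef", "if", "elif", "else", "endif",
  "define", "undef", "include",  // abridged; the real set is larger
};

// `line` points at the first character of a physical line.
static bool is_cpp_directive(const char *line) {
  if (*line != '#') return false;            // a mid-line `#endif` is not a directive
  const char *name = line + 1;
  while (*name == ' ' || *name == '\t') name++;
  for (size_t i = 0; i < sizeof cpp_directives / sizeof *cpp_directives; i++) {
    size_t n = strlen(cpp_directives[i]);
    if (strncmp(name, cpp_directives[i], n) == 0 && !isalnum((unsigned char)name[n]))
      return true;
  }
  return false;
}
```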
Generally more lenient parsing than GHC, and in the presence of errors: if/then
List comprehensions can have multiple sets of qualifiers (ParallelListComp).
Deriving clauses after GADTs don't require layout anymore.
Newtype instance heads are working properly now.
Escaping newlines in CPP works now.
One remaining issue is that qualified left sections that contain infix ops are broken: (a + a A.+). I haven't managed to figure out a good strategy for this – my suspicion is that it's impossible to correctly parse application, infix and negation without lexing all qualified names in the scanner. I will try that out at some point, but for now I'm planning to just accept that this one thing doesn't work. For what it's worth, none of the codebases I use for testing contain this construct in a way that breaks parsing. (Edit: solved this by implementing qualified operator lookahead.)
The repo now includes a Haskell program that generates C code for classifying characters as belonging to some sets of Unicode categories, using bitmaps. I might need to change this to write them all to a shared file, so the set of source files stays the same.
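As a rough illustration of the bitmap technique mentioned above (category_bits_latin1 and in_category_set are invented names and the table contents are placeholders, not the generated code): one bit per code point, indexed by word and bit offset.

```c
// Sketch of generated-style lookup code; table values here are placeholders.
#include <stdbool.h>
#include <stdint.h>

// One bit per code point for the first 256 characters; a set bit means the
// character belongs to the targeted group of Unicode categories.
static const uint64_t category_bits_latin1[4] = {0, 0, 0, 0};

static bool in_category_set(uint32_t c) {
  if (c < 256) return (category_bits_latin1[c >> 6] >> (c & 63)) & 1;
  // Higher code points would be covered by further generated tables or ranges.
  return false;
}
```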
I haven't managed to figure out a good strategy for this – my suspicion is that it's impossible to correctly parse application, infix and negation without lexing all qualified names in the scanner. I will try that out at some point, but for now I'm planning to just accept that this one thing doesn't work. For what it's worth, none of the codebases I use for testing contain this construct in a way that breaks parsing.Repo now includes a Haskell program that generates C code for classifying characters as belonging to some sets of Unicode categories, using bitmaps. I might need to change this to write them all to a shared file, so the set of source files stays the same.