Open tamasvajk opened 1 year ago
We could reconsider whether parser.c should be committed at all. It is auto-generated with tree-sitter generate, so I am not sure it should be in the repository.
Not committing it would be even better. But I was under the impression that this is a de facto requirement of a tree sitter grammar repo, because this way all you have to do is check out the grammar’s git repository into a well-known location^1. I'm not sure if this is a hard requirement or if we can deviate from it.
@damieng I just checked the recent size increases of src/parser.c
. It looks like https://github.com/tree-sitter/tree-sitter-c-sharp/commit/168390ca8fd6415dcbd976af8c6df833912683f6 increased it from 30 to 50MB. It's most probably because of the [$.assignment_expression, $._expression],
conflict. If I remove it, this is the message I get from tree-sitter:
❯ tree-sitter generate
Unresolved conflict for symbol sequence:
'*' _lvalue_expression • '=' …
Possible interpretations:
1: '*' (_expression _lvalue_expression) • '=' …
2: '*' (assignment_expression _lvalue_expression • assignment_operator _expression) (precedence: 0, associativity: Right)
Possible resolutions:
1: Specify a higher precedence in `assignment_expression` than in the other rules.
2: Specify a higher precedence in `_expression` than in the other rules.
3: Specify a left or right associativity in `_expression`
4: Add a conflict for these rules: `assignment_expression`, `_expression`
Why is there a conflict? Why isn't '*' _lvalue_expression
reduced to _pointer_indirection_expression
which has higher precedence than assignment_expression
?
EDIT: For future reference, I asked the same question in https://github.com/tree-sitter/tree-sitter/discussions/2024.
I don't know enough about how tree-sitter resolves these precedence rules to be helpful here unfortunately (every time I read the docs I feel like I've got a good handle on it but then once I dig into our non-trivial rule set it escapes me)
Whatever we do here it might be worth adding some kind of file size echo to a text file so that we can see at a glance on PR's how a change affects large generated file sizes.
Re: parser.c
and friends being part of the repository
I think nvim-treesitter and Emacs both have logic in them to compile shared libraries for parsers from files that live under src
(src/parser.c
and friends) fetched from grammar repositories.
In nvim-treesitter's case I think there may be logic to call tree-sitter generate
under certain circumstances but that means a user needs to have the tree-sitter
cli installed IIUC. Not sure of the specifics but I believe @theHamsta knows.
In Emacs 29+'s case I don't think there is any logic ATM to run the tree-sitter
's generate
subcommand: https://github.com/emacs-mirror/emacs/blob/f4f30ff4c44dcfdf780f1981aa541af713f2805f/lisp/treesit.el#L2684-L2696
Re:
parser.c
and friends being part of the repositoryI think nvim-treesitter and Emacs both have logic in them to compile shared libraries for parsers from files that live under
src/parser.c
and friends fetched from grammar repositories.In nvim-treesitter's case I think there may be logic to call
tree-sitter generate
under certain circumstances but that means a user needs to have thetree-sitter
cli installed IIUC. Not sure of the specifics but I believe @theHamsta knows.In Emacs 29+'s case I don't think there is any logic ATM to run the
tree-sitter
'sgenerate
subcommand: https://github.com/emacs-mirror/emacs/blob/f4f30ff4c44dcfdf780f1981aa541af713f2805f/lisp/treesit.el#L2684-L2696
We mostly use tar-balls of individual commits and download them via curl. With LFS, I don't know whether src/parser.c
will end up in the tar archive (https://github.com/nvim-treesitter/nvim-treesitter/blob/master/lua/nvim-treesitter/shell_command_selectors.lua#L215-L216). If src/parser.c
is missing, then we would mark the parser as requires_generate_from_grammar
and users then need to have the tree-sitter-cli installed, which is already the case for some parsers.
It looks like difftastic uses tree-sitter-c-sharp and assumes src/parser.c
is available.
Not really sure though...but I would be surprised if @Wilfred didn't know :)
I'm not sure how cursorless does its building, but I think it also transitively(?) uses tree-sitter-c-sharp.
There's a custom web-tree-sitter
mentioned so may be Emscripten is involved in compiling parser.c
and friends to produce .wasm
.
May be @pokey knows the details.
Here's where we build the wasm:
https://github.com/cursorless-dev/vscode-parse-tree/blob/main/Makefile#L43
So, as long as parser.c
is in the npm package, I think we're ok. Fwiw we do intend to migrate all our tree-sitter language dependencies to use GitHub commit sha's instead of npm versions, because the npm versions tend to be stale. I believe that means we'd run into some issues if you stopped including parser.c in the repo
That being said, I think we could probably add a step to generate parser.c before we compile the wasm, so should be ok.
But I do believe every other tree-sitter grammar repo keeps the generated parser.c
in source control, so it may come as a surprise to other downstream consumers if it does not
I think this should be a discussion across tree-sitter grammars in general. Having tree-sitter-c-sharp be different/unusual from other grammars in this respect will not help adoption.
Unfortunately we don't really have a broad tree-sitter discussion channel. Perhaps we should start one.
I've create a Discord one if anyone is interested https://discord.gg/KrRa5ATb
Agreed. A lot of discussion seems to happen on https://github.com/tree-sitter/tree-sitter/discussions, so it's probably worth asking there. Agreed a Discord server could be useful, though. I just asked to see if there already was one (https://github.com/tree-sitter/tree-sitter/discussions/2027)
There is an option to include Git LFS in archives, which would maybe a possibility for everyone using tree-sitter-c-sharp via tarball. When anyone is using Git, the Git LFS requirement would propagate to the consuming repo.
I found this related discussion: https://github.com/tree-sitter/tree-sitter/discussions/1243.
I tried tree-sitter generate
for tree-sitter-c-sharp and it really takes quite a while doesn't it:
$ time tree-sitter generate
real 4m2.854s
user 3m52.457s
sys 0m10.372s
Memory usage exceeded 50GB here too...
Yes, it has grown quite considerably lately. We're going to have to make some trade-offs between handling every single scenario possible in C# vs performance I suspect.
Somewhat related, I noticed this issue at the tree-sitter-swift repository: https://github.com/alex-pinkus/tree-sitter-swift/issues/149
That's a pretty interesting solution, I'd be up for doing that here. I wonder if we can proactively reach out to dependencies we know will be affected.
The issue of it being non-trivial to contact users of one's grammar seems to be the sort of thing that many grammars will / do face.
I wonder if there is a good approach to this.
We started a sticky issue where we have began to announce potential upcoming changes and asked the users we know about to subscribe to it and let us know if things we're planning on could be a problem: https://github.com/sogaiu/tree-sitter-clojure/issues/33
We tried to ascertain who was using us and then went around to the folks we came up with, but later it turned out there were other folks.
From the perspective of a "user" (e.g. nvim-treesitter, Emacs 29+, difftastic, helix-editor, etc.) though, I would imagine that there would be a preference to not have different methods of being up-to-date about upcoming changes. Some kind of unified way seems like a better deal.
I wonder if there's something that can be done at the tree-sitter repository...like a dedicated thread(?) / discussion per grammar? (I haven't found the Discussions UI / UX to be that great though -- stuff seems to get easily hidden and hard to search -- though possibly that has changed over time.)
I wonder then if we flip what tree-sitter-swift did on its head and instead move development to a new dev
branch that PR's need to be made against that has no generated files.
We'd then adopt the tree-sitter-swift approach that when a release is tagged we generate the files and merge them into master
(rather than with-generated-files
)
That way everyone who is relying on master and the generated artefacts today is unaffected other than lagging behind active development.
Given there's only a small handful of contributors to the repo the work in us changing from developing against master to developing against dev
or develop
would be quite small.
I started receiving the below warning on pushes:
I'm not sure what would be the impact of moving to LFS. Maybe some documentation would need to be updated.