Move `src/parser.c` to git LFS

tamasvajk commented 1 year ago

I started receiving the below warning on pushes:

remote: warning: See http://git.io/iEPt8g for more information.
remote: warning: File src/parser.c is 50.90 MB; this is larger than GitHub's recommended maximum file size of 50.00 MB
remote: warning: GH001: Large files detected. You may want to try Git Large File Storage - https://git-lfs.github.com.

I'm not sure what would be the impact of moving to LFS. Maybe some documentation would need to be updated.

Sjord commented 1 year ago

We could reconsider whether parser.c should be committed at all. It is auto-generated with tree-sitter generate, so I am not sure it should be in the repository.

tamasvajk commented 1 year ago

Not committing it would be even better. But I was under the impression that this is a de facto requirement of a tree sitter grammar repo, because this way all you have to do is check out the grammar’s git repository into a well-known location^1. I'm not sure if this is a hard requirement or if we can deviate from it.

tamasvajk commented 1 year ago

@damieng I just checked the recent size increases of src/parser.c. It looks like https://github.com/tree-sitter/tree-sitter-c-sharp/commit/168390ca8fd6415dcbd976af8c6df833912683f6 increased it from 30 to 50MB. It's most probably because of the [$.assignment_expression, $._expression], conflict. If I remove it, this is the message I get from tree-sitter:

❯ tree-sitter generate                 
Unresolved conflict for symbol sequence:

  '*'  _lvalue_expression  •  '='  …

Possible interpretations:

  1:  '*'  (_expression  _lvalue_expression)  •  '='  …
  2:  '*'  (assignment_expression  _lvalue_expression  •  assignment_operator  _expression)  (precedence: 0, associativity: Right)

Possible resolutions:

  1:  Specify a higher precedence in `assignment_expression` than in the other rules.
  2:  Specify a higher precedence in `_expression` than in the other rules.
  3:  Specify a left or right associativity in `_expression`
  4:  Add a conflict for these rules: `assignment_expression`, `_expression`

Why is there a conflict? Why isn't '*' _lvalue_expression reduced to _pointer_indirection_expression which has higher precedence than assignment_expression?

EDIT: For future reference, I asked the same question in https://github.com/tree-sitter/tree-sitter/discussions/2024.

damieng commented 1 year ago

I don't know enough about how tree-sitter resolves these precedence rules to be helpful here unfortunately (every time I read the docs I feel like I've got a good handle on it but then once I dig into our non-trivial rule set it escapes me)

Whatever we do here it might be worth adding some kind of file size echo to a text file so that we can see at a glance on PR's how a change affects large generated file sizes.

sogaiu commented 1 year ago

Re: parser.c and friends being part of the repository

I think nvim-treesitter and Emacs both have logic in them to compile shared libraries for parsers from files that live under src (src/parser.c and friends) fetched from grammar repositories.

In nvim-treesitter's case I think there may be logic to call tree-sitter generate under certain circumstances but that means a user needs to have the tree-sitter cli installed IIUC. Not sure of the specifics but I believe @theHamsta knows.

In Emacs 29+'s case I don't think there is any logic ATM to run the tree-sitter's generate subcommand: https://github.com/emacs-mirror/emacs/blob/f4f30ff4c44dcfdf780f1981aa541af713f2805f/lisp/treesit.el#L2684-L2696

theHamsta commented 1 year ago

Re: parser.c and friends being part of the repository

I think nvim-treesitter and Emacs both have logic in them to compile shared libraries for parsers from files that live under src/parser.c and friends fetched from grammar repositories.

In nvim-treesitter's case I think there may be logic to call tree-sitter generate under certain circumstances but that means a user needs to have the tree-sitter cli installed IIUC. Not sure of the specifics but I believe @theHamsta knows.

In Emacs 29+'s case I don't think there is any logic ATM to run the tree-sitter's generate subcommand: https://github.com/emacs-mirror/emacs/blob/f4f30ff4c44dcfdf780f1981aa541af713f2805f/lisp/treesit.el#L2684-L2696

We mostly use tar-balls of individual commits and download them via curl. With LFS, I don't know whether src/parser.c will end up in the tar archive (https://github.com/nvim-treesitter/nvim-treesitter/blob/master/lua/nvim-treesitter/shell_command_selectors.lua#L215-L216). If src/parser.c is missing, then we would mark the parser as requires_generate_from_grammar and users then need to have the tree-sitter-cli installed, which is already the case for some parsers.

sogaiu commented 1 year ago

It looks like difftastic uses tree-sitter-c-sharp and assumes src/parser.c is available.

Not really sure though...but I would be surprised if @Wilfred didn't know :)

sogaiu commented 1 year ago

I'm not sure how cursorless does its building, but I think it also transitively(?) uses tree-sitter-c-sharp.

There's a custom web-tree-sitter mentioned so may be Emscripten is involved in compiling parser.c and friends to produce .wasm.

May be @pokey knows the details.

pokey commented 1 year ago

Here's where we build the wasm:

https://github.com/cursorless-dev/vscode-parse-tree/blob/main/Makefile#L43

So, as long as parser.c is in the npm package, I think we're ok. Fwiw we do intend to migrate all our tree-sitter language dependencies to use GitHub commit sha's instead of npm versions, because the npm versions tend to be stale. I believe that means we'd run into some issues if you stopped including parser.c in the repo

That being said, I think we could probably add a step to generate parser.c before we compile the wasm, so should be ok.

But I do believe every other tree-sitter grammar repo keeps the generated parser.c in source control, so it may come as a surprise to other downstream consumers if it does not

damieng commented 1 year ago

I think this should be a discussion across tree-sitter grammars in general. Having tree-sitter-c-sharp be different/unusual from other grammars in this respect will not help adoption.

Unfortunately we don't really have a broad tree-sitter discussion channel. Perhaps we should start one.

I've create a Discord one if anyone is interested https://discord.gg/KrRa5ATb

pokey commented 1 year ago

Agreed. A lot of discussion seems to happen on https://github.com/tree-sitter/tree-sitter/discussions, so it's probably worth asking there. Agreed a Discord server could be useful, though. I just asked to see if there already was one (https://github.com/tree-sitter/tree-sitter/discussions/2027)

theHamsta commented 1 year ago

grafik

There is an option to include Git LFS in archives, which would maybe a possibility for everyone using tree-sitter-c-sharp via tarball. When anyone is using Git, the Git LFS requirement would propagate to the consuming repo.

tamasvajk commented 1 year ago

I found this related discussion: https://github.com/tree-sitter/tree-sitter/discussions/1243.

sogaiu commented 1 year ago

I tried tree-sitter generate for tree-sitter-c-sharp and it really takes quite a while doesn't it:

$ time tree-sitter generate

real    4m2.854s
user    3m52.457s
sys 0m10.372s

Memory usage exceeded 50GB here too...

damieng commented 1 year ago

Yes, it has grown quite considerably lately. We're going to have to make some trade-offs between handling every single scenario possible in C# vs performance I suspect.

sogaiu commented 1 year ago

Somewhat related, I noticed this issue at the tree-sitter-swift repository: https://github.com/alex-pinkus/tree-sitter-swift/issues/149

damieng commented 1 year ago

That's a pretty interesting solution, I'd be up for doing that here. I wonder if we can proactively reach out to dependencies we know will be affected.

sogaiu commented 1 year ago

The issue of it being non-trivial to contact users of one's grammar seems to be the sort of thing that many grammars will / do face.

I wonder if there is a good approach to this.

We started a sticky issue where we have began to announce potential upcoming changes and asked the users we know about to subscribe to it and let us know if things we're planning on could be a problem: https://github.com/sogaiu/tree-sitter-clojure/issues/33

We tried to ascertain who was using us and then went around to the folks we came up with, but later it turned out there were other folks.

From the perspective of a "user" (e.g. nvim-treesitter, Emacs 29+, difftastic, helix-editor, etc.) though, I would imagine that there would be a preference to not have different methods of being up-to-date about upcoming changes. Some kind of unified way seems like a better deal.

I wonder if there's something that can be done at the tree-sitter repository...like a dedicated thread(?) / discussion per grammar? (I haven't found the Discussions UI / UX to be that great though -- stuff seems to get easily hidden and hard to search -- though possibly that has changed over time.)

damieng commented 1 year ago

I wonder then if we flip what tree-sitter-swift did on its head and instead move development to a new dev branch that PR's need to be made against that has no generated files.

We'd then adopt the tree-sitter-swift approach that when a release is tagged we generate the files and merge them into master (rather than with-generated-files)

That way everyone who is relying on master and the generated artefacts today is unaffected other than lagging behind active development.

Given there's only a small handful of contributors to the repo the work in us changing from developing against master to developing against dev or develop would be quite small.

tree-sitter / tree-sitter-c-sharp

Move `src/parser.c` to git LFS #273