paul-gauthier / grep-ast

Grep source code and see useful code context about matching lines
Apache License 2.0

`py-tree-sitter-languages` is unmaintained #7

Open jvmncs opened 2 months ago

jvmncs commented 2 months ago

Hi @paul-gauthier, thanks for your work on aider. I've been having a blast using it.

This project uses https://github.com/grantjenks/py-tree-sitter-languages, but that project is unmaintained and has been for several months. This forces grep-ast to be stuck on an old tree-sitter version (0.21) and also limits the number of parsers that can be used by upstream projects (including aider). There is a hacky way to install new language parsers, but that dependency will seemingly be stuck on tree-sitter 0.21 indefinitely, which seems bad.

Another project has sprung up called tree-sitter-language-pack, however it's got a slightly different intention (large collection of grammar binaries, as opposed to small/focused one for the most popular languages only). That project is mainly an integration of this unmerged tree-sitter-languages PR with a bunch of new grammar binaries added. There's probably space for a minimal version that bundles just the top N languages and natively allows users to install their own binaries at will (so, essentially, just a version of tree-sitter-languages with that PR merged, and some different grammar binaries).

If you want to replace tree-sitter-languages with tree-sitter-language-pack, I'd be happy to open a PR. Note that the source binary size is quite a bit larger:

AdjectiveAllison commented 1 month ago

I don't have a big investment in this decision, so my opinion might not be worth much but I'm sharing it anyway:

I agree that there is likely space for a best of both worlds option. Maintained, smaller binary, adaptable, the dream! I am personally on board with the large language pack route over the tree sitter version being locked until that gap is filled.

Also ditto what @jvmncs said on aider, both it and this repo are solid tools.

Goldziher commented 1 month ago

Hi, author of tree-sitter-language-pack here. PRs are welcome. Also, it's fine by me to have multiple packages built in the same repo - we can have a minimal build and a more comprehensive build.

greg-hellings commented 1 month ago

This does make packaging aider, which I am working on, something of a sticky bit. It is possible currently to get around the build issues for py-tree-sitter-languages by pinning tree-sitter to version 0.21.x and explicitly including distutils in the package. But having this whole tree depend on a maintained package would be better, overall, rather than requiring introduction of an abandoned package into a new tool. That workaround probably will not last forever, so it would be fantastic to have a version of grep-ast that did not require the older/abandoned dependencies.

greg-hellings commented 1 month ago

Ultimately this is resulting in aider not really being able to package for Python 3.12 easily, as tree-sitter 0.21 doesn't like newer versions of Python.

jmehnle commented 3 weeks ago

@greg-hellings, I don't quite understand the Python 3.12 concern. According to https://github.com/tree-sitter/py-tree-sitter/commit/ce1af663b4dd5b933a81dc893f36cabbef266ac4, tree-sitter 0.21.1 and above do build on Python 3.12. Is that somehow not the case?

greg-hellings commented 3 weeks ago

@jmehnle Indeed tree-sitter did come out with a 0.21.2 that supported Python 3.12. But most people who are consuming this outside of a pip install are going to use their system libraries. py-tree-sitter-languages is incompatible with tree-sitter 0.22+ which most Linux distributions have moved to because of its improved support for Python 3.12 and because it's the latest. And, since tree-sitter-languages is abandonware, it will not likely ever be updated. It would be better for anyone consuming this if an updated dependency was leveraged, instead. The issue isn't the transitive dep on tree-sitter itself. It's the dependency on tree-sitter-languages which has been abandoned and therefore doesn't support a pip install nor have good support in packaging distributions.

jmehnle commented 3 weeks ago

Ok, so there's not a specific major problem with Python 3.12. I understand that recent versions of the tree-sitter package >=0.22 have improved support for 3.12, but >=0.21.1,<0.22 should still run on 3.12. I'm also aware of the other issues you mentioned, and I'm very much interested in us building a successor package that is both future-compatible and smaller than tree-sitter-language-pack. The author of the latter seems to be open to building a version of the package that's limited to a subset of languages.
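As a rough illustration of the `>=0.21.1,<0.22` range mentioned above, here is a minimal Python sketch that checks whether a version string falls inside it. The helper names (`parse_version`, `in_pinned_range`) are made up for this example, and the parser only handles plain numeric releases such as `0.21.3`; a real project would more likely lean on the `packaging` library.

```python
from importlib import metadata


def parse_version(text):
    """Turn '0.21.3' into (0, 21, 3). Only plain numeric releases are handled."""
    return tuple(int(part) for part in text.strip().split("."))


def in_pinned_range(version, lower="0.21.1", upper="0.22"):
    """Return True if lower <= version < upper, compared as integer tuples."""
    return parse_version(lower) <= parse_version(version) < parse_version(upper)


if __name__ == "__main__":
    # Report on the locally installed tree-sitter distribution, if there is one.
    try:
        installed = metadata.version("tree-sitter")
    except metadata.PackageNotFoundError:
        print("tree-sitter is not installed")
    else:
        print(installed, "in pinned range:", in_pinned_range(installed))
```

The defaults mirror the range quoted in the comment; nothing here says which exact releases build on which Python versions.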

greg-hellings commented 3 weeks ago

Correct, the issue is that the dep makes moving forward with Python versions and grep-ast more tedious. Not that there is directly a problem with 3.12 but that the issue is with a stale dep.

gohanlon commented 2 weeks ago

I've just opened PR #8 (Draft) to migrate grep-ast to Goldziher/tree-sitter-language-pack, significantly expanding language support and (hopefully) resolving the maintenance concerns discussed here.

Please take a look and let me know your thoughts!
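For readers wondering what such a migration can look like from the caller's side, here is a hedged Python sketch of a small import shim. It is not taken from PR #8. It assumes only that both packages are importable as `tree_sitter_language_pack` and `tree_sitter_languages` and that each exposes a `get_parser` function; the helper names `pick_backend` and `load_get_parser` are invented for this example.

```python
import importlib
import importlib.util

# Preferred backend first, legacy backend second.
CANDIDATES = ("tree_sitter_language_pack", "tree_sitter_languages")


def pick_backend(candidates=CANDIDATES):
    """Return the first importable top-level module name, or None if none is installed."""
    for name in candidates:
        if importlib.util.find_spec(name) is not None:
            return name
    return None


def load_get_parser(candidates=CANDIDATES):
    """Import the chosen backend and return its get_parser function."""
    name = pick_backend(candidates)
    if name is None:
        raise ImportError("none of these backends is installed: " + ", ".join(candidates))
    return importlib.import_module(name).get_parser
```

With either package installed, `load_get_parser()("python")` would be expected to hand back a parser for Python source, though the exact return type depends on the installed tree-sitter version.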

paul-gauthier commented 2 weeks ago

The PR looks great, thanks for preparing it.

Any thoughts on how the pip install of -language-pack compares to -languages? On my mac, -pack took 4 minutes and ~130MB whereas -languages takes <2 seconds and 80MB.

It seems like -languages had pre-built wheels and -pack is building it locally on my machine?

But more than the time difference, will -pack install cleanly in roughly the same set of environments that -languages did?

The main reason I adopted -languages as a dependency was because it reliably installed in a wide range of environments.
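The wheel-versus-source distinction raised above shows up in the published file names. The Python sketch below classifies a distribution file name as a pre-built wheel or a source distribution and, for wheels, pulls out the three compatibility tags that pip matches against the local environment. The function name `classify_dist` and the file names in the usage note are made up for illustration; they are not the actual files published by either package.

```python
def classify_dist(filename):
    """Classify a distribution file name as a wheel, an sdist, or unknown.

    Wheel names end in -{python tag}-{abi tag}-{platform tag}.whl, so the
    last three dash-separated fields describe where the wheel can be installed.
    """
    if filename.endswith(".whl"):
        parts = filename[: -len(".whl")].split("-")
        if len(parts) < 5:
            return {"kind": "unknown"}
        python_tag, abi_tag, platform_tag = parts[-3:]
        return {
            "kind": "wheel",
            "python": python_tag,
            "abi": abi_tag,
            "platform": platform_tag,
        }
    if filename.endswith((".tar.gz", ".zip")):
        # Source distributions are built on the user's machine at install time.
        return {"kind": "sdist"}
    return {"kind": "unknown"}
```

For example, `classify_dist("example_pkg-1.0-cp312-cp312-macosx_11_0_arm64.whl")` reports a wheel for CPython 3.12 on Apple Silicon macOS, while `classify_dist("example_pkg-1.0.tar.gz")` reports an sdist. If a package publishes only an sdist, every install falls back to a local build.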

greg-hellings commented 2 weeks ago

It sounded like @Goldziher was open to improving the experience with -pack, up above. It's generally considered bad form to have a derived file, like a wheel, included in your source tree but it sounds like for -languages it was a huge performance boost to ship them in that manner.

gohanlon commented 2 weeks ago

I do think size and build time are issues needing careful consideration. I too built tree-sitter-language-pack locally. For me, the size and build time, despite being substantial, aren't all that significant compared to the benefits. But, I'm sure we can (and should) do better:

@Goldziher I see that the published files on pypi.org/tree-sitter-language-pack don't include any wheels, but that you've worked on some infrastructure to build and publish wheels. This seems like it'd be a non-trivial undertaking. Can you comment on the status/challenges of that work?

@paul-gauthier If tree-sitter-language-pack builds and publishes wheels with broad enough compatibility, how would that impact your evaluation of (the draft) PR #8?

Besides adding wheels, we could implement a modular system for language support. However, that'd be a much larger undertaking and it's probably better to focus first on the immediate benefits of migrating to a maintained package with expanded language coverage, despite the increased size and build time.

(For anyone curious, here are the files for grantjenks's pack on PyPI, including wheels, and here're its GitHub Actions workflow and build script.)

paul-gauthier commented 2 weeks ago

The PR looks great. I would love to support all those languages.

My only hesitation is the end user pip install experience:

  1. If it takes 4 minutes to pip install the tree-sitter-language-pack dependency, that's a lot to ask of users.
  2. How often will the pip install fail when the wheel is being built on demand on the user's machine? I'm not sure what's involved in that step, but it feels like there is potential for problems given the diversity of end user build environments.

gohanlon commented 2 weeks ago

> 1. If it takes 4 minutes to pip install the tree-sitter-language-pack dependency, that's a lot to ask of users.

@paul-gauthier Does adding pre-built wheels to tree-sitter-language-pack address your install time concerns?

> 2. How often will the pip install fail when the wheel is being built on demand on the user's machine? I'm not sure what's involved in that step, but it feels like there is potential for problems given the diversity of end user build environments.

The user build failure rate would likely increase somewhat due to the larger number of grammar projects, but the extent is hard to predict. I suspect the increase would be small, as build environments are often generally broken rather than failing on specific projects. (Importantly, if the new pack's pre-built wheels cover the same targets as the old pack's, the fallback rate to source builds should be identical.)

With the unmaintained language pack's lack of ongoing support for newer systems, we should expect increases in both fallbacks to user builds and user build failures over time.

A system for modular language packs could be ideal, e.g.:

pip install tree-sitter-language-pack[core,gleam,zig]

This would install the expected "core" languages plus Gleam and Zig. Other language grammars could be added without concern for bloat or risking breaking user builds. However, I'm less sure that the time and effort required for this modular approach is the best way forward now.
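To make the proposed extras syntax concrete, here is a small Python sketch that splits a requirement such as `tree-sitter-language-pack[core,gleam,zig]` into the package name and its extras, then expands them into a list of languages. These extras do not exist in the package today. The contents of the `core` group and the helper names `split_extras` and `languages_for` are purely hypothetical.

```python
import re

# Matches "name" or "name[extra1,extra2]"; version specifiers are out of scope here.
_REQUIREMENT = re.compile(r"^\s*([A-Za-z0-9][A-Za-z0-9._-]*)\s*(?:\[([^\]]*)\])?\s*$")

# Hypothetical grouping: which languages a "core" extra might bundle.
HYPOTHETICAL_GROUPS = {"core": ["python", "javascript", "c"]}


def split_extras(requirement):
    """Split 'pkg[a,b]' into ('pkg', ['a', 'b']); extras may be absent."""
    match = _REQUIREMENT.match(requirement)
    if match is None:
        raise ValueError("unsupported requirement string: " + repr(requirement))
    name, extras = match.group(1), match.group(2) or ""
    return name, [item.strip() for item in extras.split(",") if item.strip()]


def languages_for(extras):
    """Expand group extras; treat any other extra as a single language name."""
    languages = []
    for extra in extras:
        for language in HYPOTHETICAL_GROUPS.get(extra, [extra]):
            if language not in languages:
                languages.append(language)
    return languages
```

Under these assumptions, the example requirement above would resolve to the `core` languages followed by `gleam` and `zig`, and a bare `tree-sitter-language-pack` would resolve to no extras at all.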