sourcegraph / langur

Programming language detection library & CLI
Apache License 2.0
0 stars 0 forks source link

Go API #14

Open varungandhi-src opened 9 months ago

varungandhi-src commented 9 months ago

Related: https://github.com/sourcegraph/sourcegraph/issues/56379

There are a few problems with go-enry.

It would be nice to have a single point for language detection in Go which can be used by code in the monorepo and in other repos like Zoekt. There are five main options here:

  1. Status quo: Slowly grow lib/codeintel/languages and continue using go-enry as today.
  2. Upstream: Submit changes to go-enry upstream (optionally maintain lib/codeintel/languages on top): For example, if the list of all languages were exposed in the public API, then in downstream code, we could re-create the mapping of file extension -> language list (not ideal, as go-enry already has this available, but do-able).
  3. Fork: Fork go-enry (optionally maintain lib/codeintel/languages on top), and make the changes we need there (i.e. skip the upstreaming step in 2.)
  4. Codegen: Move lib/codeintel/languages into this repo + generate some extra code from Linguist.
  5. CGo: Replace usage of go-enry entirely with CGo-based bindings to the Rust code that we already have (or maybe a mix of bindings + some codegen).

Here is a rough comparison table with my current thinking:

Consideration Status quo Upstream Fork Codegen CGo
API Restricted - We can only build on top of what go-enry exposes Restricted-ish - We might be able to expose more things, but upstream may have resistance to adding some APIs for our use cases Somewhat flexible - We probably don't want to deviate too much from upstream for ease of maintenance Flexible - We can do whatever we want Flexible - We can do whatever we want
Maintenance Mostly version bumps, fix issues one-by-one when they come up Mostly version bumps, might require occasionally submitting patches upstream & waiting Need to update the fork from time to time, likely after upstream updates to a new version of Linguist Bump the version of Linguist directly, and then update Sourcegraph to use new version of generated code Bump the version of Linguist directly, and then update Sourcegraph to use new version of generated code. Initially, there may be extra issues related to CGo
Complexity Nothing extra Nothing extra Depends on the extent of changes we make Depends on the code generation steps High - CGo generally has worse link times, poor debugger support, complex cross-compilation
Tech (more is worse) Go Go Go Go + Bazel (+ maybe a little bit of Rust if we keep the codegen for both Rust and Go in Rust) Go + Bazel + Rust
Performance OK OK OK OK Likely 10x - I have not benchmarked it, but basing this off the numbers in the upstream hyperglotglot repo, and the core logic here is mostly unchanged

Performance is mentioned only at the very end, as we currently already have a workaround where we set a 2048 byte size limit for the file contents slice passed to go-enry to avoid spending a lot of time on language detection. We have not done any A/B tests to know whether increasing this limit would help improve language detection in practice (and if so, by how much).

varungandhi-src commented 9 months ago

Given that most of our backend engineers are much more comfortable with Go compared to Rust, and that CGo has many downsides, I'm currently leaning towards option 4: Codegen.

varungandhi-src commented 5 months ago

API requirement: