There are a few problems with go-enry.

~It has a narrow API surface, which doesn't expose a lot of the data available in Linguist. For example, it doesn't expose the maps for file extension -> language, or the list of all languages~ (Correction: I think most of the data is exposed through https://pkg.go.dev/github.com/go-enry/go-enry/v2@v2.8.8/data). It does not expose the sample files for various languages (from Linguist) either, which would be useful for writing test cases.
Practical example where we run into this: The `ALL_LANGUAGES` constant in our TypeScript code needs to be updated manually by running a shell command, which does some text manipulation on one of go-enry's source files. If this list were exposed as an API, we could write a small Go binary that writes out a TypeScript file with the constant and re-run it whenever go-enry is updated (or check the generated file in with a golden test, so a PR would fail when it goes stale). (This specific issue could also be worked around by adding a small TypeScript package to this repo which exposes some utilities for client-side code in Sourcegraph.)
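To make the idea concrete, here is a minimal sketch of such a generator. The placeholder `allLanguages` slice and the output file name are assumptions; the real input would come from go-enry if the list were exposed (or from Linguist's `languages.yml`).

```go
// gen_all_languages.go: sketch of a generator for the TypeScript ALL_LANGUAGES constant.
package main

import (
	"fmt"
	"os"
	"strings"
)

// writeAllLanguagesTS writes a TypeScript module declaring ALL_LANGUAGES.
func writeAllLanguagesTS(languages []string, path string) error {
	var b strings.Builder
	b.WriteString("// Code generated from go-enry data; DO NOT EDIT.\n")
	b.WriteString("export const ALL_LANGUAGES = [\n")
	for _, lang := range languages {
		fmt.Fprintf(&b, "  %q,\n", lang)
	}
	b.WriteString("] as const;\n")
	return os.WriteFile(path, []byte(b.String()), 0o644)
}

func main() {
	// Placeholder input; the real generator would read this from go-enry
	// (if the full language list were exposed) or from Linguist's languages.yml.
	allLanguages := []string{"Go", "Rust", "TypeScript"}
	if err := writeAllLanguagesTS(allLanguages, "allLanguages.ts"); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```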
The API is a bit error-prone to use. Some examples (see the sketch below):

- Languages are plain strings, so typos can't be caught at compile time.
- Some functions return an empty language string instead of a proper error.
- It's easy to forget to handle the case where the language for a file is ambiguous.
We currently have a wrapper package lib/codeintel/languages in the Sourcegraph monorepo to work around these to some extent.
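Here is a minimal sketch of how these pitfalls show up when calling go-enry directly (the file names and contents are made up for illustration):

```go
package main

import (
	"fmt"

	enry "github.com/go-enry/go-enry/v2"
)

func main() {
	// 1. Languages are plain strings, so a typo like "Typescript" (instead of
	//    "TypeScript") compiles fine and silently never matches.
	lang := enry.GetLanguage("main.ts", []byte("const x = 1"))
	if lang == "Typescript" { // typo: never true
		fmt.Println("got TypeScript")
	}

	// 2. "No language detected" is signalled by an empty string, not an error,
	//    so every caller has to remember to check for "".
	if lang == "" {
		fmt.Println("no language detected")
	}

	// 3. Detection can be ambiguous: GetLanguages may return several candidates
	//    (e.g. for .h files), which is easy to forget to handle.
	candidates := enry.GetLanguages("matrix.h", []byte("// header"))
	fmt.Println(candidates) // possibly more than one entry
}
```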
It would be nice to have a single point for language detection in Go which can be used by code in the monorepo and in other repos like Zoekt. There are five main options here:
1. Status quo: Slowly grow `lib/codeintel/languages` and continue using go-enry as today.
2. Upstream: Submit changes to go-enry upstream (optionally maintaining `lib/codeintel/languages` on top). For example, if the list of all languages were exposed in the public API, downstream code could re-create the mapping of file extension -> language list (not ideal, as go-enry already has this available, but doable).
3. Fork: Fork go-enry (optionally maintaining `lib/codeintel/languages` on top) and make the changes we need there, i.e. skip the upstreaming step in 2.
4. Codegen: Move `lib/codeintel/languages` into this repo and generate some extra code from Linguist (see the sketch after this list).
5. CGo: Replace usage of go-enry entirely with CGo-based bindings to the Rust code that we already have (or maybe a mix of bindings + some codegen).
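To give a rough idea of what the Codegen option could look like, here is a sketch of a generator that reads a vendored copy of Linguist's `languages.yml` and emits a Go extension -> languages map. The file paths, package name, and output shape are placeholders, not a finished design.

```go
// gen_from_linguist.go: rough sketch of the "Codegen" option.
package main

import (
	"fmt"
	"os"
	"sort"
	"strings"

	"gopkg.in/yaml.v3"
)

// languageEntry models the subset of Linguist's languages.yml used here.
type languageEntry struct {
	Extensions []string `yaml:"extensions"`
}

func main() {
	raw, err := os.ReadFile("languages.yml") // vendored copy of Linguist's data file
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	var languages map[string]languageEntry
	if err := yaml.Unmarshal(raw, &languages); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	// Invert to extension -> list of candidate languages.
	byExt := map[string][]string{}
	for name, entry := range languages {
		for _, ext := range entry.Extensions {
			key := strings.ToLower(ext)
			byExt[key] = append(byExt[key], name)
		}
	}

	var b strings.Builder
	b.WriteString("// Code generated from Linguist's languages.yml; DO NOT EDIT.\n\n")
	b.WriteString("package languages\n\nvar LanguagesByExtension = map[string][]string{\n")
	exts := make([]string, 0, len(byExt))
	for ext := range byExt {
		exts = append(exts, ext)
	}
	sort.Strings(exts)
	for _, ext := range exts {
		langs := byExt[ext]
		sort.Strings(langs)
		quoted := make([]string, len(langs))
		for i, l := range langs {
			quoted[i] = fmt.Sprintf("%q", l)
		}
		fmt.Fprintf(&b, "\t%q: {%s},\n", ext, strings.Join(quoted, ", "))
	}
	b.WriteString("}\n")

	if err := os.WriteFile("languages_by_extension.go", []byte(b.String()), 0o644); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

Updating Linguist would then just mean refreshing `languages.yml` and re-running the generator.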
Here is a rough comparison table with my current thinking:

| Consideration | Status quo | Upstream | Fork | Codegen | CGo |
| --- | --- | --- | --- | --- | --- |
| API | Restricted - we can only build on top of what go-enry exposes | Restricted-ish - we might be able to expose more things, but upstream may have resistance to adding some APIs for our use cases | Somewhat flexible - we probably don't want to deviate too much from upstream for ease of maintenance | Flexible - we can do whatever we want | Flexible - we can do whatever we want |
| Maintenance | Mostly version bumps; fix issues one-by-one as they come up | Mostly version bumps; fix issues one-by-one as they come up | Need to update the fork from time to time, likely after upstream updates to a new version of Linguist | Bump the version of Linguist directly, then update Sourcegraph to use the new version of the generated code | Bump the version of Linguist directly, then update Sourcegraph to use the new version of the generated code; initially, there may be extra issues related to CGo |
| Complexity | Nothing extra | Nothing extra | Depends on the extent of changes we make | Depends on the code generation steps | High - CGo generally has worse link times, poor debugger support, and complex cross-compilation |
| Tech (more is worse) | Go | Go | Go | Go + Bazel (+ maybe a little bit of Rust if we keep the codegen for both Rust and Go in Rust) | Go + Bazel + Rust |
| Performance | OK | OK | OK | OK | Likely ~10x faster - I have not benchmarked it, but this is based on the numbers in the upstream hyperpolyglot repo, and the core logic here is mostly unchanged |
Performance is mentioned only at the very end because we already have a workaround: we cap the file contents slice passed to go-enry at 2048 bytes to avoid spending too much time on language detection. We have not done any A/B tests to know whether increasing this limit would improve language detection in practice (and if so, by how much).
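For reference, the shape of that workaround looks roughly like this (the function and constant names here are placeholders, not the actual call site):

```go
package languages

import enry "github.com/go-enry/go-enry/v2"

// detectionByteLimit matches the current 2048-byte cap mentioned above.
const detectionByteLimit = 2048

// detectLanguage truncates content before handing it to go-enry so
// classification stays cheap. Whether a larger window measurably improves
// detection accuracy is untested.
func detectLanguage(path string, content []byte) string {
	if len(content) > detectionByteLimit {
		content = content[:detectionByteLimit]
	}
	return enry.GetLanguage(path, content)
}
```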
Given that most of our backend engineers are much more comfortable with Go than with Rust, and that CGo has many downsides, I'm currently leaning towards option 4 (Codegen).
Related: https://github.com/sourcegraph/sourcegraph/issues/56379