There are a few problems with go-enry.

~It has a narrow API surface, which doesn't expose a lot of the data available in Linguist. For example, it doesn't expose the maps for file extension -> language, or the list of all languages~ (Correction: I think most of the data is exposed through https://pkg.go.dev/github.com/go-enry/go-enry/v2@v2.8.8/data). It does not expose the sample files for various languages (from Linguist) either, which would be useful for writing test cases.
Practical example where we run into this: The `ALL_LANGUAGES` constant in our TypeScript code needs to be updated manually by running a shell command, which does some text manipulation on one of go-enry's source files. If this list were exposed as an API, we could write a small Go binary that writes out a TypeScript file with the constant and re-run it whenever go-enry is updated (or check the generated file in with a golden test, so a PR would fail when it goes stale). (This specific issue could also be worked around by adding a small TypeScript package to this repo which exposes some utilities for client-side code in Sourcegraph.)
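To make the idea concrete, here is a minimal sketch of such a generator. The placeholder `allLanguages` slice and the output file name are assumptions; the real input would come from go-enry if the list were exposed (or from Linguist's `languages.yml`).

```go
// gen_all_languages.go: sketch of a generator for the TypeScript ALL_LANGUAGES constant.
package main

import (
	"fmt"
	"os"
	"strings"
)

// writeAllLanguagesTS writes a TypeScript module declaring ALL_LANGUAGES.
func writeAllLanguagesTS(languages []string, path string) error {
	var b strings.Builder
	b.WriteString("// Code generated from go-enry data; DO NOT EDIT.\n")
	b.WriteString("export const ALL_LANGUAGES = [\n")
	for _, lang := range languages {
		fmt.Fprintf(&b, "  %q,\n", lang)
	}
	b.WriteString("] as const;\n")
	return os.WriteFile(path, []byte(b.String()), 0o644)
}

func main() {
	// Placeholder input; the real generator would read this from go-enry
	// (if the full language list were exposed) or from Linguist's languages.yml.
	allLanguages := []string{"Go", "Rust", "TypeScript"}
	if err := writeAllLanguagesTS(allLanguages, "allLanguages.ts"); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```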
The API is a bit error-prone to use. Some examples (see the sketch below):

- Languages are plain strings, so typos can't be caught at compile time.
- Some functions return an empty language string instead of a proper error.
- It's easy to forget to handle the case where the language for a file is ambiguous.
We currently have a wrapper package lib/codeintel/languages in the Sourcegraph monorepo to work around these to some extent.
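Here is a minimal sketch of how these pitfalls show up when calling go-enry directly (the file names and contents are made up for illustration):

```go
package main

import (
	"fmt"

	enry "github.com/go-enry/go-enry/v2"
)

func main() {
	// 1. Languages are plain strings, so a typo like "Typescript" (instead of
	//    "TypeScript") compiles fine and silently never matches.
	lang := enry.GetLanguage("main.ts", []byte("const x = 1"))
	if lang == "Typescript" { // typo: never true
		fmt.Println("got TypeScript")
	}

	// 2. "No language detected" is signalled by an empty string, not an error,
	//    so every caller has to remember to check for "".
	if lang == "" {
		fmt.Println("no language detected")
	}

	// 3. Detection can be ambiguous: GetLanguages may return several candidates
	//    (e.g. for .h files), which is easy to forget to handle.
	candidates := enry.GetLanguages("matrix.h", []byte("// header"))
	fmt.Println(candidates) // possibly more than one entry
}
```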
It would be nice to have a single point for language detection in Go which can be used by code in the monorepo and in other repos like Zoekt. There are five main options here:
1. Status quo: Slowly grow `lib/codeintel/languages` and continue using go-enry as today.
2. Upstream: Submit changes to go-enry upstream (optionally maintaining `lib/codeintel/languages` on top). For example, if the list of all languages were exposed in the public API, downstream code could re-create the mapping of file extension -> language list (not ideal, as go-enry already has this available, but doable).
3. Fork: Fork go-enry (optionally maintaining `lib/codeintel/languages` on top) and make the changes we need there, i.e. skip the upstreaming step in 2.
4. Codegen: Move `lib/codeintel/languages` into this repo and generate some extra code from Linguist (see the sketch after this list).
5. CGo: Replace usage of go-enry entirely with CGo-based bindings to the Rust code that we already have (or maybe a mix of bindings + some codegen).
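To give a rough idea of what the Codegen option could look like, here is a sketch of a generator that reads a vendored copy of Linguist's `languages.yml` and emits a Go extension -> languages map. The file paths, package name, and output shape are placeholders, not a finished design.

```go
// gen_from_linguist.go: rough sketch of the "Codegen" option.
package main

import (
	"fmt"
	"os"
	"sort"
	"strings"

	"gopkg.in/yaml.v3"
)

// languageEntry models the subset of Linguist's languages.yml used here.
type languageEntry struct {
	Extensions []string `yaml:"extensions"`
}

func main() {
	raw, err := os.ReadFile("languages.yml") // vendored copy of Linguist's data file
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	var languages map[string]languageEntry
	if err := yaml.Unmarshal(raw, &languages); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	// Invert to extension -> list of candidate languages.
	byExt := map[string][]string{}
	for name, entry := range languages {
		for _, ext := range entry.Extensions {
			key := strings.ToLower(ext)
			byExt[key] = append(byExt[key], name)
		}
	}

	var b strings.Builder
	b.WriteString("// Code generated from Linguist's languages.yml; DO NOT EDIT.\n\n")
	b.WriteString("package languages\n\nvar LanguagesByExtension = map[string][]string{\n")
	exts := make([]string, 0, len(byExt))
	for ext := range byExt {
		exts = append(exts, ext)
	}
	sort.Strings(exts)
	for _, ext := range exts {
		langs := byExt[ext]
		sort.Strings(langs)
		quoted := make([]string, len(langs))
		for i, l := range langs {
			quoted[i] = fmt.Sprintf("%q", l)
		}
		fmt.Fprintf(&b, "\t%q: {%s},\n", ext, strings.Join(quoted, ", "))
	}
	b.WriteString("}\n")

	if err := os.WriteFile("languages_by_extension.go", []byte(b.String()), 0o644); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

Updating Linguist would then just mean refreshing `languages.yml` and re-running the generator.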
Here is a rough comparison table with my current thinking:

| Consideration | Status quo | Upstream | Fork | Codegen | CGo |
| --- | --- | --- | --- | --- | --- |
| API | Restricted - we can only build on top of what go-enry exposes | Restricted-ish - we might be able to expose more things, but upstream may have resistance to adding some APIs for our use cases | Somewhat flexible - we probably don't want to deviate too much from upstream for ease of maintenance | Flexible - we can do whatever we want | Flexible - we can do whatever we want |
| Maintenance | Mostly version bumps; fix issues one-by-one as they come up | Mostly version bumps; fix issues one-by-one as they come up | Need to update the fork from time to time, likely after upstream updates to a new version of Linguist | Bump the version of Linguist directly, then update Sourcegraph to use the new version of the generated code | Bump the version of Linguist directly, then update Sourcegraph to use the new version of the generated code; initially, there may be extra issues related to CGo |
| Complexity | Nothing extra | Nothing extra | Depends on the extent of changes we make | Depends on the code generation steps | High - CGo generally has worse link times, poor debugger support, and complex cross-compilation |
| Tech (more is worse) | Go | Go | Go | Go + Bazel (+ maybe a little bit of Rust if we keep the codegen for both Rust and Go in Rust) | Go + Bazel + Rust |
| Performance | OK | OK | OK | OK | Likely ~10x faster - I have not benchmarked it, but this is based on the numbers in the upstream hyperpolyglot repo, and the core logic here is mostly unchanged |
Performance is mentioned only at the very end because we already have a workaround: we cap the file contents slice passed to go-enry at 2048 bytes to avoid spending too much time on language detection. We have not done any A/B tests to know whether increasing this limit would improve language detection in practice (and if so, by how much).
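For reference, the shape of that workaround looks roughly like this (the function and constant names here are placeholders, not the actual call site):

```go
package languages

import enry "github.com/go-enry/go-enry/v2"

// detectionByteLimit matches the current 2048-byte cap mentioned above.
const detectionByteLimit = 2048

// detectLanguage truncates content before handing it to go-enry so
// classification stays cheap. Whether a larger window measurably improves
// detection accuracy is untested.
func detectLanguage(path string, content []byte) string {
	if len(content) > detectionByteLimit {
		content = content[:detectionByteLimit]
	}
	return enry.GetLanguage(path, content)
}
```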
Given that most of our backend engineers are much more comfortable with Go than with Rust, and that CGo has many downsides, I'm currently leaning towards option 4 (Codegen).
Related: https://github.com/sourcegraph/sourcegraph/issues/56379