`linguist-fallback-language` attribute

Sainan commented 1 year ago

Currently, people using a newer language on GitHub either have to accept a total lack of support for the language, or use .gitattributes with linguist-language to select a syntax highlighter that's "good enough". However, what if the language is then added to Linguist? Now the .gitattributes files that have accumulated would override the actual, correct detection. Therefore, I suggest a linguist-fallback-language attribute that would still let Linguist classify the file if it does know the language, but would provide a fallback in case it doesn't.

Alhadis commented 1 year ago

The scenario you've described is an extremely specific one, and users are assumed to understand the caveats of (ab)using another language's highlighting to improve the readability of their project's source code. Moreover, introducing yet another attribute will only serve to confuse users further: our overrides system is already over-engineered as it is[^1], so much so that a table is necessary to disambiguate the specifics of each attribute.

[^1]: A simpler way to implement overrides might be linguist-include=diff,stats (and linguist-exclude=), where the value assigned is a list of keywords denoting which GitHub-internal systems are applied to the affected files (where the defaults are determined by Linguist).

Sainan commented 1 year ago

I don't think this is really that specific. For example, we are currently working on a Lua fork called Pluto, which uses the .pluto extension. However, because this language is not supported by Linguist (and is way too unpopular for it to be so), we add *.pluto linguist-language=Lua to .gitattributes to achieve the next-best result, which is having Linguist just count it as Lua and at least syntax-highlight it somewhat — and many users of the language do this too.

Of course, maybe I'm being a bit too hopeful thinking Pluto will ever reach the criteria to be included in Linguist, but let's assume it does. Then we have a lot of projects using such a .gitattributes file, and it would basically invalidate the inclusion of the language. So, any way to solve this, either via linguist-fallback-language=Lua or linguist-language=Pluto,Lua, would be appreciated!

Alhadis commented 1 year ago

This did happen with the V programming language, whose projects classified *.v files as Go, as it was both syntactically similar and a direct influence on V's design. This was arguably more justified than simply having pleasing-looking source code, though, as *.v files were being classified as System Verilog.

When GitHub did end up supporting V, the .gitattributes hacks in V projects quickly vanished, no doubt thanks to the project's up-front transparency about (mis)classifying an unsupported language as an existing one: it was mentioned in the project's readme and acknowledged from the start as a transient workaround.

I recommend following a similar approach: it's more future-proof, and it doesn't involve dragging feature creep into Linguist (something we can't get rid of once it's part of the codebase). I understand you're between a rock and a hard place right now, as many prospective users of your language might be discouraged by seeing *.pluto files unsupported on GitHub. The issue of supporting lesser-known languages is a bit of a contentious one, as it lies with GitHub's own policies on how/when to support new technologies.

Sainan commented 1 year ago

I have to be honest, I've never seen a project be so against adding a feature that would barely require maintenance while addressing real user problems, but I can respect the conviction to leanness.

I guess in the "worst case scenario", it just means that projects using a new language with such a .gitattributes setup will stay wrongly classified for a while after Linguist supports the language, or possibly forever if they're abandonware. So, not an insane issue, just one that I think is easily avoidable.

Alhadis commented 1 year ago

> I have to be honest, I've never seen a project be so against adding a feature that would barely require maintenance, while addressing real user problems

When I speak of "maintenance", it's less about technical or implementation details, and more about the user-facing aspects Linguist maintainers have to deal with in the long-term. Here's a plausible scenario: say that a user encounters a .gitattributes file in the wild with this as its only contents:

    # make github highlight our files
    *.yalisp linguist-fallback-language=Lisp

Now, with no further understanding of context, the user copy+pastes the override into their own project, but tweaks it so that it fixes a misclassified header file that Linguist categorises as "Objective-C":

    *.h linguist-fallback-language=C

Assume this user has no further understanding of Linguist's mechanics, so when the override fails to take effect, they interpret it as a bug. They submit an issue here, and Linguist maintainers need to patiently explain that the override they should've used was linguist-language=C. Naturally, we have to explain the difference between the two attributes, and that linguist-fallback-language is used if and only if no currently supported language recognises the file-extension being targeted. Already, you can see how convoluted this would be to a newer user who only wishes to correct a common misclassification.
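To make the distinction concrete, here is a minimal Python sketch of how the two attributes would interact under the semantics described above. It is a hypothetical model, not Linguist's actual code: `KNOWN_EXTENSIONS`, `resolve_language` and its parameters are names invented for illustration, and the one-entry-per-extension table is a deliberate simplification.

```python
# Hypothetical model of the two attributes; NOT Linguist's real implementation.
from typing import Optional

# Toy stand-in for Linguist's extension table (assumed and heavily simplified).
KNOWN_EXTENSIONS = {".h": "Objective-C", ".lua": "Lua", ".lisp": "Lisp"}


def resolve_language(extension: str,
                     override: Optional[str] = None,
                     fallback: Optional[str] = None) -> Optional[str]:
    """Return the language a file with this extension would be classified as.

    override models linguist-language: it always wins.
    fallback models the proposed linguist-fallback-language: it is used
    only when the extension is not recognised at all.
    """
    if override is not None:
        return override
    detected = KNOWN_EXTENSIONS.get(extension)
    if detected is not None:
        return detected
    return fallback


# Unknown extension: the fallback applies.
print(resolve_language(".yalisp", fallback="Lisp"))  # Lisp
# Known extension: detection wins and the fallback is silently ignored.
print(resolve_language(".h", fallback="C"))          # Objective-C
# The attribute the user actually needed.
print(resolve_language(".h", override="C"))          # C
```

The second call is the one that would generate the bug report: nothing fails, the attribute simply has no visible effect.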

Our overrides docs already document two functionally similar attributes (linguist-documentation and linguist-vendored), which are used in different scenarios but ultimately fulfil the same role of preventing certain files from skewing a language's statistics graph. Nonetheless, they're documented separately. A new override would require its own section, and to less experienced (or patient) users navigating our docs for a solution, that's more material to sift through, meaning a greater chance they'll take the impatient route and just submit an issue/discussion.

Does this make sense…?

Sainan commented 1 year ago

I guess. What about the other approach of *.pluto linguist-language=Pluto,Lua? I don't see how this kind of fallback would confuse the end user, and it could be explained in a sentence in the docs.

spenserblack commented 1 year ago

Side note: this search should help you find projects setting Pluto to Lua, which you could use in a script to raise issues announcing Linguist's official support of Pluto and asking them to remove that override. It's not uncommon to bulk create issues when there's a significant change to a project's ecosystem that requires action by repo maintainers.

Alhadis commented 1 year ago

@Sainan That would actually involve a fundamental change to the way language names and aliases are parsed and normalised. Various forms like git-config, GitConfig and .gitconfig are resolved to gitconfig by logic that needs to stay consistent between Linguist and other moving parts in GitHub's system that need to interpret human-readable language names (e.g., code search).

Markdown:

~~~markdown
```git-config
[core]
	foo = true
```

```GitConfig
[core]
	foo = true
```

```.gitconfig
[core]
	foo = true
```
~~~

Rendered output:

```git-config
[core]
	foo = true
```

```GitConfig
[core]
	foo = true
```

```.gitconfig
[core]
	foo = true
```
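A rough Python sketch of this kind of normalisation follows. The rule used here (lowercase the name, then drop everything except letters, digits, `+` and `#`) is an assumption chosen only to reproduce the gitconfig example; Linguist's real alias handling is not reproduced, and `normalise_language_name` is a hypothetical name.

```python
import re


def normalise_language_name(name: str) -> str:
    """Collapse case and punctuation so spelling variants compare equal.

    Assumed rule for illustration only: lowercase, then keep letters,
    digits, '+' and '#' (so that names like C++ and C# stay distinct).
    """
    return re.sub(r"[^a-z0-9+#]", "", name.lower())


for variant in ("git-config", "GitConfig", ".gitconfig"):
    print(variant, "->", normalise_language_name(variant))  # each -> gitconfig

# Under this assumed rule a comma is just more punctuation to discard,
# so a list value would have to be split before normalisation.
print(normalise_language_name("Pluto,Lua"))  # plutolua
```

Under a rule like this, every consumer of language names would need the same "split first, then normalise" step added, which is the consistency concern raised above.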

This might also complicate attempts to extend tagged code-blocks in future, such as to support specific languages in diff blocks:

~~~diff,js
 let boolean = true;
-this.is.a.diff(true);
+this.is.a.diff({withJSHighlighting: false});
 // Comments, etc
~~~

In other words, while it might be easy on Linguist's part to make the surface-level changes, having consistent behaviour across the site is a different issue entirely.

Sainan commented 1 year ago

I can appreciate wanting consistent behaviour across the board, but I'm sure there will be some separator character that will be appropriate for each use case, e.g. just picking from the popular ones (`,`, `;`, `:`, `@`, `#`, `%`), we could have diff@javascript for the future diff expansion, and maybe V;Go for the fallback behaviour.

Sainan commented 7 months ago

I've recently been introduced to Twig, and what's interesting is that Twig is kind of a meta-language and editors like IntelliJ add their support for Twig on top of whatever the lower-level language is such as HTML or Markdown.

So, I think this layer-based approach is an interesting way of looking at it, so you could say twig,html, diff,JS, V,Go, or Pluto,Lua. This could accomplish highlighting for diffs (as well as universal highlighting for Twig), and if unknown "layers" are ignored, it could also be used for fallbacks.
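As a rough illustration of the "ignore unknown layers" idea, here is a small Python sketch. Everything in it is hypothetical: `KNOWN_LANGUAGES` stands in for whatever set of names the highlighter recognises, and `resolve_layers` is not an existing Linguist function.

```python
from typing import List, Set

# Hypothetical set of language names the highlighter recognises.
KNOWN_LANGUAGES: Set[str] = {"diff", "go", "html", "javascript", "lua", "twig"}


def resolve_layers(value: str, known: Set[str] = KNOWN_LANGUAGES) -> List[str]:
    """Split a comma-separated value into layers and drop unknown ones."""
    layers = [part.strip().lower() for part in value.split(",")]
    return [layer for layer in layers if layer in known]


print(resolve_layers("twig,html"))        # ['twig', 'html']
print(resolve_layers("diff,javascript"))  # ['diff', 'javascript']
print(resolve_layers("Pluto,Lua"))        # ['lua']  (Pluto unknown, so Lua is used)

# Once "pluto" becomes a known name, the same attribute resolves differently.
print(resolve_layers("Pluto,Lua", KNOWN_LANGUAGES | {"pluto"}))  # ['pluto', 'lua']
```

For the pure fallback use case, a consumer would presumably take only the first surviving layer; for Twig or diff, it would stack them. That choice is left open here.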

github-linguist / linguist

`linguist-fallback-language` attribute #6395