Allow overriding with custom language name and/or implementation name.

davidvontamar commented 5 years ago

I have read the following issues entirely: #2627, #2360, #2598. However, this feature request addresses a different problem with a couple of entirely different solutions that also address the concerns mentioned in these issues.

Preliminary Steps

Please confirm you have...

[x] reviewed How Linguist Works,
[X] reviewed the Troubleshooting docs,
[X] considered implementing an override,
[X] verified an issue has not already been logged for your issue (linguist issues).

Problem Description

Different implementations or specifications (specifically major language versions) of the same language may sometimes result in incompatibilities such as different syntax, new/deprecated language features, functions, builtin libraries, etc.

Some projects are forced to maintain legacy code which was written in old iterations of the same language, but is still supported, or otherwise, code that targets a specific implementation that ended up being a language fork and is no longer conforming to the standardized specification.

Notable examples include Fortran (.F vs .f90, almost two different languages), Lua (LuaJIT vs Lua 5.3; this case involves mutually incompatible features and entirely different interpreters for different purposes), C# (or any CLR language, really: C# 6.0 in Mono vs C# 8.0 in .NET Core vs older C# versions that targeted older iterations of the .NET Framework), and Python (2 vs 3). Any language that was forked in some way (for any reason) is also subject to this problem.

The repository could declare which exact versions or implementations of a language are being used in what files/directories to clarify and monitor that in the language statistics, therefore the problem is cosmetic in nature.

Possible Solutions

This issue has 3 different possible solutions in terms of feature request:

Allow the repository to declare a custom language-implementation attribute in .gitattributes as an arbitrary string, and display this value in parentheses next to the name of the language that is already detected by Linguist.

Hypothetical examples:

65.00% C++ 15.00% C 10.00% Lua 10.00% Other

Would appear as:

20.00% C++ 45.00% C++ (C++98) 15.00% C 10.00% Lua (LuaJIT) 10.00% Other

Another example with Fortran:

100.00% Fortran

Would appear as:

40.00% Fortran 20.00% Fortran (FORTRAN 77) 20.00% Fortran (FORTRAN 66) 20.00% Fortran (FORTRAN IV)

The same as above, but when the name of an implementation/version is given, it replaces the label entirely:

20.00% C++ 45.00% C++98 15.00% C 10.00% LuaJIT 10.00% Other

And:

40.00% Fortran 20.00% FORTRAN 77 20.00% FORTRAN 66 20.00% FORTRAN IV
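The two labelling schemes above can be sketched in Python as follows. This is only an illustration of the intended output, not how Linguist computes its statistics: the function name `language_breakdown`, the `(language, implementation, size)` tuples and the `replace_label` flag are all assumptions made for this example.

```python
from collections import defaultdict


def language_breakdown(files, replace_label=False):
    """Return {label: percent} for (language, implementation, size) tuples.

    `implementation` stands for the hypothetical per-path attribute proposed
    in this issue; None means no implementation was declared for that file.
    """
    totals = defaultdict(int)
    for language, implementation, size in files:
        if implementation is None:
            label = language
        elif replace_label:
            label = implementation                    # second scheme: "FORTRAN 77"
        else:
            label = f"{language} ({implementation})"  # first scheme: "Fortran (FORTRAN 77)"
        totals[label] += size
    grand_total = sum(totals.values())
    return {label: round(100 * size / grand_total, 2) for label, size in totals.items()}


# Made-up file sizes chosen to reproduce the Fortran example above.
files = [
    ("Fortran", None, 400),
    ("Fortran", "FORTRAN 77", 200),
    ("Fortran", "FORTRAN 66", 200),
    ("Fortran", "FORTRAN IV", 200),
]
```

Calling `language_breakdown(files)` yields the parenthesised labels, and `language_breakdown(files, replace_label=True)` yields the replaced labels.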

Allow defining a custom language with a name, a color and, optionally, a syntax highlighter already known to Linguist (since you can already instruct Linguist that a Java file was actually a misidentified C# file, or vice versa). This solution is the most desirable because it covers the problem mentioned above and also allows other developers to associate their files with their own private languages or forks of known languages, and to indicate this to other users of GitHub.

*.cc linguist-language=C(name: "C with Classes", color: #abcdef)

This would group all files that end with .cc and show them as C with Classes using the specified color and the already existing syntax highlighter for C that is provided by Linguist. In global search results you could either count it as plain C (because the user chose the C highlighter) or ignore it altogether.
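To make the proposed line format concrete, here is a minimal Python sketch that extracts the pieces from such a line. The syntax itself is hypothetical (Linguist does not accept it today), and the regular expression and the function name `parse_custom_language` are assumptions made for this illustration.

```python
import re

# Hypothetical syntax taken from the proposal above; not supported by Linguist.
CUSTOM_LANGUAGE = re.compile(
    r'^(?P<pattern>\S+)\s+linguist-language=(?P<base>[^\s(]+)'
    r'\(name:\s*"(?P<name>[^"]+)",\s*color:\s*(?P<color>#[0-9a-fA-F]{6})\)$'
)


def parse_custom_language(line):
    """Return the path pattern, base language, display name and color, or None."""
    match = CUSTOM_LANGUAGE.match(line.strip())
    return match.groupdict() if match else None


entry = parse_custom_language(
    '*.cc linguist-language=C(name: "C with Classes", color: #abcdef)'
)
```

Here `entry["base"]` would select the existing C highlighter, while `entry["name"]` and `entry["color"]` would only affect how the group is displayed.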

Regarding #2360:

If the user specified that the custom language uses the syntax highlighter of an existing language, then you could treat the custom language as the language it derives its syntax from when narrowing GitHub search results or computing global language trends, or ignore it altogether.
If the user didn't specify a syntax highlighter at all, then exclude it completely from any search results outside of the repository, if there is concern about keeping search results clean of obscure or unknown user-defined/forked languages.

Regarding both #2627, #2598:

The solutions I have proposed do not require the execution of any code (such as one may require to define a whole new syntax highlighter), but only the input of custom string values from the .gitattributes file, thus they don't pose any potential security vulnerabilities or legal issues with licensing.

Alhadis commented 5 years ago

I understand where you're coming from, but this would quickly become problematic for lesser-known languages which evolve much quicker, and/or with less noticeable changes to syntax or semantics.

Moreover, how would this benefit users aside from (possibly) improved highlighting? Even syntax highlighting grammars can be improved to accommodate for implementation-specific discrepancies, either with creative TextMate hacks or simply revising scope-name choices (which affect the colours used to highlight code on GitHub).

We're already in the process of disambiguating what Linguist considers to be a "language group" (something which was ill-defined to start with), and introducing a new categorical tier is going to make it hard, if not impossible, to tell where the boundaries lie between "group", "language", and "implementation". For C/Fortran, the distinction is obvious, but much less so for entries like Assembly, which cover a vast multitude of dialects, revisions, and what one might call "implementations".

davidvontamar commented 5 years ago

I understand where you're coming from, but this would quickly become problematic for lesser-known languages which evolve much quicker, and/or with less noticeable changes to syntax or semantics.

I don't see how this could harm the popularity of young and evolving languages, given that the custom names a user may set are relevant only within the local repository.

Could you provide a brief example?

Moreover, how would this benefit users aside from (possibly) improved highlighting? Even syntax highlighting grammars can be improved to accommodate for implementation-specific discrepancies, either with creative TextMate hacks or simply revising scope-name choices (which affect the colours used to highlight code on GitHub).

I wasn't suggesting user-defined custom highlighting because it was already dismissed in the past several times as "prone to Turing-complete vulnerabilities" if I inferred that correctly (that arbitrary code may run on other machines and cause unpredictable side effects). My suggestion was far more simple than that with no potential side effects at all, because all that it does is merely grouping files under a new name & color.

We're already in the process of disambiguating what Linguist considers to be a "language group" (something which was ill-defined to start with), and introducing a new categorical tier is going to make it hard, if not impossible, to tell where the boundaries lie between "group", "language", and "implementation". For C/Fortran, the distinction is obvious, but much less so for entries like Assembly, which cover a vast multitude of dialects, revisions, and what one might call "implementations".

I've seen that effort. I think it addresses something quite different. The main differences are:

#4291 attempts to change the way languages are classified on GitHub by default, whereas my issue addresses the situation where a user insists on indicating that they're using a specific language implementation, version, fork or dialect which the maintainers of Linguist may not even be aware of, or which they simply disagree should be classified as a separate language, thus leaving the user without any ability to classify their own repository as they wish.
#4291 doesn't address a situation where a user decided to define their own dedicated language for their specific project needs (#4291 either assumes that the language must be widely used, or otherwise nonexistent).
#4291 is agnostic to language implementations (as you have noted), whereas this is at the center of my issue. If I create a new project in LuaJIT, I most definitely don't want it to be classified or associated with the standard PUC Lua 5.3, or with eLua. LuaJIT and Lua 5.3 are literally incompatible, not just syntactically but also functionally; the sole reason they're both "Lua" for Linguist is that it cannot make a distinction without false positives in this case. Those three target absolutely different platforms and use cases, and you'd also notice different programming styles in each of them due to the nature of their implementations. And yet they're all under one language in Linguist. And that's the main problem with it. Linguist can help a user to classify a repository at first look, but it should not enforce itself upon a repository when the user knows better.

Alhadis commented 5 years ago

Could you provide a brief example?

JavaScript. New features are being added every year, with each year witnessing a new (formally defined and named) implementation of the language. Things get even messier when you consider precompilers like TypeScript (which are effective supersets of JavaScript) and JSX extensions (which blur the lines between non-standard features and user-submitted proposals).

Note how many "Presets" are listed by Babel's REPL. Those have only come into existence in the last ~5 years: this is the level of fragmentation we need to consider.

My suggestion was far more simple than that with no potential side effects at all, because all that it does is merely grouping files under a new name & color.

This is the real deal-breaker:

… under a new name & color.

User-defined languages aren't a possibility, given the mechanics of Linguist and GitHub's indexing engine, and although I don't purport to know the exact logic behind it all, I can tell you this feature would involve full-blown overhaul of GitHub's internals, affecting everything from language searches to trending repository listings.

In the end, we're benefiting only a minority of users, whilst impacting millions of others. We can't cater to everybody, and the classification system we have in place at the moment is the end-result of years of feedback and refinement. I don't think I've seen this feature suggested before, so I'm inclined to think most users wouldn't use/need it.

If two implementations of a language are decidedly different enough to be considered distinct, then they should be considered separate languages (e.g., Perl / Perl 6).

Alhadis commented 5 years ago

I should point out that topics are a helpful way of classifying repositories with author-defined details. For example, there are 262 repositories tagged with luajit, 424 repositories tagged with mono, 67 repos tagged with fortran90, and so forth.

This is arguably a better solution for making implementation details visible to users, and you aren't limited to defining implementation-related keywords either.

davidvontamar commented 5 years ago

Could you provide a brief example?

JavaScript.

Oh, it's ECMAScript! This time the name of the implementation actually won. (I'm not trying to make any point with it, I just mentioned a fact, that's all.)

This is the real deal-breaker:

… under a new name & color.

Cosmetic options are deal breakers. I see.

User-defined languages aren't a possibility, given the mechanics of Linguist and GitHub's indexing engine, and although I don't purport to know the exact logic behind it all, I can tell you this feature would involve full-blown overhaul of GitHub's internals, affecting everything from language searches to trending repository listings.

If I recall correctly, users may actually define new 'topics' for their repositories. Those topics are later used by GitHub's search engine and other internals as well.

Shouldn't the same interface apply to programming languages at some point?

The classification system we have in place at the moment is the end-result of years of feedback and refinement.

Those "years of feedback" also included desperate requests from users who wanted to classify their own repositories with their own shell dialects or obscure Domain Specific Languages.

I don't think I've seen this feature suggested before, so I'm inclined to think most users wouldn't use/need it.

I've linked at least 3 issues that included almost identical feature requests by other users.

If two implementations of a language are decidedly different enough to be considered distinct, then they should be considered separate languages (e.g., Perl / Perl 6).

If you'd want me to classify Lua this way I'd end up with at least four distinct dialects (<5, 5.1, 5.3, JIT) and it keeps changing. Code from 5.1 is incompatible with Lua 5.3, for example. Same goes for JIT which is not 5.1 exactly nor 5.2, but something in between, and has its own innovations too (like FFI Semantics as the author puts it).

Other problems in trying to classify Lua dialects with Linguist is their syntactic similarity, the same file extensions, but they function quite differently with many features being added & deprecated.

But if Fortran's grouping is acceptable to you, then how could I even make a case for Lua? I'm also puzzled how did Fortran end up being a group of languages while ECMAScript diverged into separate implementations?

Alhadis commented 5 years ago

Those "years of feedback" also included desperate requests from users who wanted to classify their own repositories with their own shell dialects or obscure Domain Specific Languages.

I believe the missing feature is support for user-defined languages, and how a user decides to define a language is arbitrary. They might be authoring a new language, or, like you, have a wish to differentiate between dialects and major language revisions. Your suggestion specifically concerns the latter, and would be adequately addressed by the addition of user-defined language support. Which, yes, is a well-acknowledged limitation of ~~Linguist~~ GitHub in general.

I've linked at least 3 issues that included almost identical feature requests by other users.

Almost identical? That's quite a leap from the OP:

However, this feature request addresses a different problem with a couple of entirely different solutions

… how did Fortran end up being a group of languages while ECMAScript diverged into separate implementations?

Are you seriously comparing a 62-year-old, pioneering language with one that evolved in barely two decades and started life as a proprietary scripting language?

Shouldn't the same interface apply to programming languages at some point?

Why? How is the topics feature inadequate?

If you'd want me to classify Lua this way I'd end up with at least four distinct dialects (<5, 5.1, 5.3, JIT) and it keeps changing.

If you're unsatisfied with the way Lua and Fortran are currently classified, then I recommend submitting a pull-request to break them into separate languages. Changing site-wide mechanics to benefit a handful of languages is neither feasible nor practical.

If I recall correctly, users may actually define new 'topics' for their repositories. Those topics are later used by GitHub's search engine and other internals as well.

That feature was added more recently, and should already be adequate for declaring things like dialects or language versions.

pchaigno commented 5 years ago

@david-tamar Please don't be discouraged and close this issue before anyone other than @Alhadis has had a chance to look into it and give their opinion. I often agree with @Alhadis on these issues, but here, I'm not sure I fully understand what you want, and I'd prefer to understand before I make up my mind.

given that the custom names a user may set are relevant only within the local repository.

Are you proposing that the custom language name only be taken into account inside the repository? So users couldn't search for that language on the whole GitHub.com? Is the idea only to give more detailed information in the language bar (e.g., C++98 vs. C++)?

I wasn't suggesting user-defined custom highlighting because it was already dismissed in the past several times as "prone to Turing-complete vulnerabilities" if I inferred that correctly (that arbitrary code may run on other machines and cause unpredictable side effects). My suggestion was far more simple than that with no potential side effects at all, because all that it does is merely grouping files under a new name & color

As I understand your suggestion, Linguist would still be in charge of selecting grammars and the users would just choose between them? If there's a better grammar for a language (even a dialect), we usually welcome pull requests to apply that grammar, even in cases where it requires breaking a language down into its dialects to apply a different grammar to each (whilst still grouping these dialects under the same parent language). Did I misunderstand your proposal? As I understand it, I'm not sure when it would be useful (?).

davidvontamar commented 5 years ago

@david-tamar Please don't be discouraged and close this issue before anyone other than @Alhadis has had a chance to look into it and give their opinion. I often agree with @Alhadis on these issues, but here, I'm not sure I fully understand what you want, and I'd prefer to understand before I make up my mind.

OK. I'll reopen this issue for more feedback then. I lost nearly all enthusiasm once it was made clear that GitHub's current limitations render this feature request unfeasible. On top of being dismissed as a redundant feature request by @Alhadis.

I wanted to stress that 'topics' are not a solution because topics cannot track language statistics within the repository per file like Linguist does (resulting in an up-to-date % breakdown of the entire repository according to its actual contents).

Are you proposing that the custom language name only be taken into account inside the repository? So users couldn't search for that language on the whole GitHub.com? Is the idea only to give more detailed information in the language bar (e.g., C++98 vs. C++)?

At first I proposed that, yes, because I wanted to take the path of least resistance, in the hope that it wouldn't lead to concerns such as "it'll pollute GitHub's language statistics with duplicate or nonexistent languages".

However then I realized that people are already polluting GitHub's search results & statistics with duplicate topic tags anyway, so it might even make language tags such as "C++98", "CPP98", "cpp98" or "cxx98" as equally legitimate search terms as their equivalent topics would otherwise be.

At the moment having the names of the implementations or dialects indicated within the local repository only would suffice too. My desire is to have precise language statistics in my repository so I can let other people differentiate between source files that belong to different dialects or implementations within my own repository.

As I understand your suggestion, Linguist would still be in charge of selecting grammars and the users would just choose between them?

Yes, I think Linguist does a good job at providing grammar & syntax highlighting, but that shouldn't prevent the user from grouping files under different dialects that may utilize the same grammar (both files grouped under C++98 and C++17 could use the same generic/default C++ grammar for the most part, so it's not a major problem as long as you just want to see two separate groups in your language statistics).

If there's a better grammar for a language (even a dialect), we usually welcome pull requests to apply that grammar, even in cases where it requires breaking a language down into its dialects to apply a different grammar to each (whilst still grouping these dialects under the same parent language). Did I misunderstand your proposal? As I understand it, I'm not sure when it would be useful (?).

I'm aware that I may suggest new grammar for languages that I view as distinct dialects. But dialects don't always have drastically different grammar from each other.

This may cause false-positive classifications very often, since the major differences between them are not grammatical but lie in the way they function or are compiled (especially in Lisp dialects or Lua dialects).

Therefore grammar is not necessarily the reason one may want to separate between source files of the same language into different groups.

Features that may differ between specifications and implementations include:

Changes in the standard library.
Altered semantics that are not easily differentiable in the text itself.
Implicit type casting in weakly typed languages.

For example, Lua 5.3 has both integers and doubles, whereas previous implementations such as Lua 5.1 had only doubles. This cannot be determined from the text itself, because Lua is dynamically typed and all of these files are .lua and share a similar grammar.

Generally speaking, code written for 5.1 won't work on 5.3 or vice versa, and the same goes for the JIT dialect and other old Lua dialects. Those differences are not easily seen in the text itself until you attempt to execute the sources while targeting the wrong interpreter or implementation.

Alhadis commented 5 years ago

It's worth mentioning that the language stat-bar won't always be visible depending on the viewer's device and/or platform; e.g., no statistics are displayed on mobile, where only a "View code" button is offered.

This means information the author intends to display to users won't always be available, depending on how they're browsing your repository.

davidvontamar commented 5 years ago

It's worth mentioning that the language stat-bar won't always be visible depending on the viewer's device and/or platform; e.g., no statistics are displayed on mobile, where only a "View code" button is offered.

This means information the author intends to display to users won't always be available, depending on how they're browsing your repository.

Well, that has more to do with the shortcomings of the mobile interface than with this issue in particular then.

Anyway, this is not entirely accurate. I've opened up GitHub on my phone right now and search results do show the name of the language used in each repository.

This information is taken from the language statistics that are calculated by Linguist.

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had activity in a long time. If this issue is still relevant and should remain open, please reply with a short explanation (e.g. "I have checked the code and this issue is still relevant because ___."). Thank you for your contributions.

stale[bot] commented 5 years ago

This issue has been automatically closed because it has not had activity in a long time. Please feel free to reopen it or create a new issue.

francis94c commented 4 years ago

Since I found my way here, I probably have to implement syntax highlighting in Atom by writing a package for it. It's okay if it's plain when rendered in Markdown...

ghost commented 4 years ago

Has there been any progress on this? For tiny homebrew languages this would indeed be a really useful and motivating change. As it stands, I'd even turn off the language statistics entirely if I could, because if they're plain wrong, what's the point? And as I think everyone agrees, including every tiny language that might soon be discontinued in Linguist's GitHub-wide language list isn't sensible, but on a per-repo basis it absolutely might be. Surely it can't be that hard to allow a differently colored entry on the bar based on some .github-linguist.yml in a repo which could override a file extension?

Also can't this issue be marked such that the stale bot stops messing with it? It's not like it'll magically solve itself.

Edit: as for @francis94c, it seems you possibly took a wrong turn; maybe try here? Atom isn't really related to this issue.

lildude commented 4 years ago

Has there been any progress on this?

Nope, because this requires more than just changes in Linguist and thus requires buy-in and "product sponsorship" for the GitHub.com engineering side of things first.

I opened an issue in the private GitHub org repo for this back in 2018 and regularly update it with new requests.

Also can't this issue be marked such that the stale bot stops messing with it? It's not like it'll magically solve itself.

As this is dependent on changes outside of Linguist, I don't think there's any value in keeping it open here, hence I've allowed it to auto-close.

ghost commented 3 years ago

As this is dependent on changes outside of Linguist

@lildude can you explain what you mean?

I'll elaborate why what you say confuses me: doesn't the site frontend just display the final percentages & colors & names as-is exactly as they are handed over by linguist? How would this change have any UI impact except that (1) bad words could show up in 100 rather than previously 99 places (in the language list in addition to description, title, every repo file, every issue title, ...) and (2) there would now be slightly more correct and complete language listings for the repos that make use of this? How would even a single additional button be needed on the GitHub UI end for this, and so how does it depend on "changes outside of Linguist"?

Like I get this is still effort to implement on the linguist side and therefore maybe not high on the priority list, I'm just very confused about the "site"/product management end of things you're hinting at. This makes it sound like even a linguist pull request adding this wouldn't be welcome before lengthy internal discussions about the front end, and some sort of "corporate approval" first as if this were a significantly risky change. I honestly find that quite confusing to follow and maybe it's just me, but I'm simply curious. As a result, I also don't get how just adding this to linguist wouldn't already do the whole job, and why therefore a ticket here isn't the right place.

lildude commented 3 years ago

@lildude can you explain what you mean?

I'll elaborate why what you say confuses me: doesn't the site frontend just display the final percentages & colors & names as-is exactly as they are handed over by linguist?

Nope, and it's important to remember the frontend is not the only consumer of this information; Linguist doesn't even run on the frontend servers. A lot of the info from Linguist is cached and/or stored in the database. In the case of the language name, most (if not all) of the places that display a language name query a single table in the database for the language information which is populated from the languages.yml file and not the repos themselves. This provides a single reliable source for this information for all consumers (frontend, search, APIs etc), removes the need for every component (written in multiple languages) to know about Linguist, reduces unnecessary round-trips to the repo or repo tables and also allows for uniformity in language information across the site.

Implementing the ability to set a custom name at a repo level would require changes on the GitHub side to store this potentially unique information somewhere, and then query this additional location every time a language name is needed, e.g. for the frontend, code search, the APIs etc. This is the part that requires buy-in and "product sponsorship".
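The difference between the two models described above can be sketched roughly as follows. This is not GitHub's actual code or schema: the dictionaries, the repository name `example/repo`, the color values and both function names are invented for illustration only.

```python
# Stand-in for the single site-wide table populated from languages.yml.
# The color values are illustrative only.
LANGUAGE_TABLE = {
    "Lua": {"color": "#000080"},
    "C++": {"color": "#f34b7d"},
}

# Hypothetical per-repository override store; nothing like this exists today.
REPO_OVERRIDES = {
    ("example/repo", "Lua"): {"name": "LuaJIT", "color": "#00007c"},
}


def language_info(name):
    """Current model as described: one lookup in the shared table."""
    return {"name": name, **LANGUAGE_TABLE[name]}


def language_info_with_override(repo, name):
    """Proposed model: each consumer must also consult per-repository data."""
    info = language_info(name)
    info.update(REPO_OVERRIDES.get((repo, name), {}))
    return info
```

In this sketch, every caller that previously needed only a language name now also needs a repository identifier and a second lookup, which corresponds to the extra storage and querying described above.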

Nakilon commented 3 years ago

The "stale issues" automation is just a cancer. So many issues are left unresolved because of it.

Bump. I made a language and I want stats for code written in it to be shown in the sidebar of its repository.

Justin712 commented 3 years ago

I'm pretty incredulous that this hasn't been implemented; this feature has been requested for over 5 years. Developers writing small languages just want the extra motivation of seeing that percentage bar labeled with their project name.

ghost commented 3 years ago

Please, this should be implemented.

I know it would mean tampering with existing code, but it would still make this project more popular, and I believe it would be a good idea.

rowan-sl commented 2 years ago

I hope this will be implemented eventually, as I am now writing a program with a custom language, and it will never become popular, but it would be so nice to have it show up on GitHub as a language.

omdxp commented 2 years ago

@rowan-sl same here, I hope they see it

IsaccBarker commented 1 year ago

A solution, albeit a very hacky one, would be to add a "fake language" to languages.yml, and allow something like this in .gitattributes files:

*.foo linguist-language=Unseen linguist-detectable linguist-color-override=0xff00ff linguist-name-override=FooLang

This, of course, requires the introduction of the linguist-color-override and linguist-name-override Git attributes, but (as far as I know), wouldn't require any changes to GitHub's backend. Those should probably be locked down to only the "Unseen" language, though.
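A rough Python sketch of how such a line might be interpreted is shown below. The `linguist-color-override` and `linguist-name-override` attributes and the "Unseen" placeholder language are the hypothetical ones from this comment, not existing Linguist features, and the function names are assumptions made for the example.

```python
def parse_attributes(line):
    """Split a .gitattributes-style line into (path pattern, attribute dict)."""
    pattern, *tokens = line.split()
    attrs = {}
    for token in tokens:
        key, _, value = token.partition("=")
        attrs[key] = value if value else True  # bare attributes are simply "set"
    return pattern, attrs


def display_entry(attrs, placeholder="Unseen"):
    """Honor the hypothetical overrides only for the placeholder language."""
    language = attrs.get("linguist-language")
    if language != placeholder:
        # Overrides are ignored for real languages; None means "use the default color".
        return {"name": language, "color": None}
    return {
        "name": attrs.get("linguist-name-override", placeholder),
        "color": attrs.get("linguist-color-override"),
    }


pattern, attrs = parse_attributes(
    "*.foo linguist-language=Unseen linguist-detectable "
    "linguist-color-override=0xff00ff linguist-name-override=FooLang"
)
entry = display_entry(attrs)
```

Restricting the overrides to the placeholder keeps established languages from being renamed or recolored, which is the lock-down suggested above.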

plotfi commented 1 year ago

Why is this issue closed? I am working on a custom language and can't get GH to classify my code properly. My file extensions end in .mu but I can't get it to not classify as mupad.

ell1e commented 11 months ago

For what it's worth, gitea has implemented this now which makes it available on gitea-based code hosting platforms like codeberg.org. So clearly it's a doable feature.

github-linguist / linguist