github-linguist / linguist

Language Savant. If your repository's language is being reported incorrectly, send us a pull request!
MIT License
12.27k stars 4.25k forks

Revisiting language groups #4291

Closed: Alhadis closed this issue 4 years ago

Alhadis commented 6 years ago

This issue is a continuation of what @pchaigno started with github/linguist#3093:

This pull request makes SASS a language of its own, distinct from CSS. This change was requested several times in #2933, #3084, #2585 and #2650.

There are several languages on GitHub which presently fall under the usage statistics of another, "parent" language, which certainly deserve reconsideration — or at the very least, some public discussion for highlighting the reasons why they fall under another language's umbrella.

To start, here are the languages which I believe are valid candidates for degrouping. I'll extend this list over time as discussion from other users confirms other candidates:

Languages which should be degrouped

| Candidate | Currently grouped under | Reason(s) for decoupling |
|---|---|---|
| Svelte | HTML | https://github.com/github/linguist/issues/4291#issuecomment-569400535 |
| Sass/SCSS | CSS | Extremely different syntax and semantics. Sass has programmatic features and some "object-oriented" features; CSS is strictly declarative. |
| Less | CSS | See above. Less's syntax is much closer to "pure" CSS than Sass/SCSS, but it's still programmatic in nature and considerably different enough to warrant separation. |
| JSON (fixed in #4345) | JavaScript | JSON is a general-purpose data serialisation language, and virtually every modern programming language has support for reading and parsing JSON syntax to some extent (either natively or via a library). It's the closest thing we have to a universally interoperable data exchange format. |

Moreover, there's little point in retaining a connection with JavaScript. JSON is classed as a data language, so it won't appear in usage statistics anyway.

I've refrained from bringing up any languages I've never worked with or lack familiarity with (such as PostCSS and Stylus), each of which might be candidates as well. Comments are welcome.

Good examples of language groups

Here are some languages which are justifiable in having a parent language:

/cc @pchaigno, @lildude, @controversial, @nazar-pc, @EmmaRamirez, @plibither8

Footnotes

  1. Regarding an argument @arfon made in #3093:

My rationale for not doing this is that SASS is almost always used to generate CSS (please correct me if I'm wrong here) and so it makes sense (to me at least!) to have this listed under CSS for the general repository stats.

Pic is interesting because it can be compiled to other languages that aren't Roff (like SVG or TeX), but the language itself is based upon Roff syntax and even permits low-level Roff constructs to be used inline. In other words, it's not so cleanly separated, and demonstrates why transpilation targets are fallacious reasoning w.r.t. whether Sass should be distinguished from CSS or not.

plibither8 commented 6 years ago

Analogous to the SCSS-CSS argument, Pug (Jade) and other templating languages that are radically different from HTML should also be considered, as they currently fall in the HTML group.

Alhadis commented 6 years ago

Agreed. Personally, I think most (if not all) templating languages should be decoupled from their target output. There's a reason they're templating languages, after all... and it isn't "just HTML" if I open a Pug template in a browser and see a weird mix of half-empty tags and loops. 😉

I think if a parent language is unambiguous and well-specified (as HTML and CSS are), a child language should be either a subset or a hybrid of different languages. Conversely, Assembly and Shell are umbrella terms of sorts which cover numerous dialects and implementations, so having them as parent languages makes more sense, IMHO.

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had activity in a long time. If this issue is still relevant and should remain open, please reply with a short explanation (e.g. "I have checked the code and this issue is still relevant because ___."). Thank you for your contributions.

pchaigno commented 5 years ago

I think this is a pretty good list to start with and I doubt we'll be able to have a comprehensive list (we don't know and use all languages Linguist supports ourselves). To move forward, should we agree on a short guideline to decide if languages should be grouped in the future (so that we can better handle future cases we missed here)?

I think that comment by @Alhadis is a pretty good starting point:

I think if a parent language is unambiguous and well-specified (as HTML and CSS are), a child language should be either a subset or a hybrid of different languages. Conversely, Assembly and Shell are umbrella terms of sorts which cover numerous dialects and implementations, so having them as parent languages makes more sense, IMHO.

@lildude What's your opinion on this?

Alhadis commented 5 years ago

I've pulled JSX from the list. For a start, it's not as clear-cut as TypeScript is (Flow typing and JSX tags both fall under the umbrella of "JSX", more or less). Plus the distinction itself is problematic for reasons I've explained here.

lildude commented 5 years ago

Whoops, lost this in my inbox at some point and was just reminded by @Alhadis in https://github.com/github/linguist/issues/4353.

I think that comment by @Alhadis is a pretty good starting point:

I think if a parent language is unambiguous and well-specified (as HTML and CSS are), a child language should be either a subset or a hybrid of different languages. Conversely, Assembly and Shell are umbrella terms of sorts which cover numerous dialects and implementations, so having them as parent languages makes more sense, IMHO.

@lildude What's your opinion on this?

Seems reasonable to me.

Languages which should be degrouped

... as does this. Do it.

Alhadis commented 5 years ago

... as does this. Do it.

I'm gonna enjoy this...

Alhadis commented 5 years ago

Just an FYI: this might take a while because of conflicting colour proximities. 😅

lildude commented 5 years ago

Just an FYI: this might take a while because of conflicting colour proximities.

We might be able to get rid of that soon 🤞 I had a chat with a colleague and your suggestion at https://github.com/github/linguist/pull/4331#issuecomment-443419513 may become a thing 🔜.

Alhadis commented 5 years ago

Holy shit. 😮 🎉 🎉 ❤️

Alhadis commented 5 years ago

Guys, I've pushed a WIP branch for the degrouped languages I'm familiar with, but I'll hold off from submitting a PR until some time has elapsed (or until the potential changes have been reified).

In the meantime, feel free to push any changes you think are missing or necessary. 👍

Alhadis commented 5 years ago

... of course, when pushing topic branches, it'd help if I actually had commits to go with them.

Remind me not to leave changes staged for several hours, because my crap memory will have me believing they've already been committed. 😁

Alhadis commented 5 years ago

@lildude I realised another reason why the colour-proximity thing is strangling us — Language authors gravitate toward vibrant colours when deciding their project's logo/branding/colour-scheme. So over time, more and more languages will be added to Linguist with clashing colours: bright red, dark blue, bright blue, purple, warm yellow, etc.

So the remaining "available colours" we can assign them will inevitably be sickly shades of pale green, washed out red, white-ish pink, etc. The current constellation of colour choices is already proving this: when adding Asymptote, I noticed its official colour was #FF0000 (bright-red). That clashed with PostScript, Mercury, Red (the language, lol), Ruby, and several others which were likely "pushed" away from their official colour shades due to the colour-proximity requirements.

Having said that, there's no way I'm gonna add 12 uncoloured/grey languages that were degrouped from their parent languages, most of which have branding with vibrant, distinctive colour choices. Nor do I want to drop 12 grossly inaccurate colour-choices into the language bar to represent Less, SASS, etc.
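To make the clash described above concrete, here is a rough sketch of what a colour-proximity check might look like. It is not Linguist's actual test: the Euclidean RGB distance, the `THRESHOLD` value, the helper names, and every hex value other than Asymptote's #FF0000 are assumptions chosen for illustration.

```python
from itertools import combinations
from typing import Dict, List, Tuple

# Hypothetical threshold; Linguist's real test used its own metric and limit.
THRESHOLD = 40.0

def hex_to_rgb(colour: str) -> Tuple[int, ...]:
    """Convert '#RRGGBB' into an (r, g, b) tuple of ints."""
    colour = colour.lstrip("#")
    return tuple(int(colour[i:i + 2], 16) for i in (0, 2, 4))

def distance(a: str, b: str) -> float:
    """Plain Euclidean distance between two colours in RGB space."""
    return sum((x - y) ** 2 for x, y in zip(hex_to_rgb(a), hex_to_rgb(b))) ** 0.5

def clashes(colours: Dict[str, str], threshold: float = THRESHOLD) -> List[Tuple[str, str]]:
    """Return every pair of languages whose colours are closer than the threshold."""
    return [
        (m, n)
        for (m, a), (n, b) in combinations(colours.items(), 2)
        if distance(a, b) < threshold
    ]

# Illustrative values only: just Asymptote's #FF0000 comes from the thread.
palette = {
    "Asymptote": "#FF0000",
    "Made-up red language": "#F01010",
    "Made-up blue language": "#0000FF",
}
print(clashes(palette))
```

Any fixed threshold of this kind shrinks the usable palette as more languages are registered, which is the effect the comment complains about.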

lildude commented 5 years ago

@Alhadis I hear you, and hopefully we can remove this once https://github.com/github/linguist/issues/4291#issuecomment-447378743 happens. It's on a team's radar, just need to see it come to fruition.

lildude commented 5 years ago

Especially for you @Alhadis 😘

[Image]

Changelog entry

Alhadis commented 5 years ago

This is the happiest day of my life, holy shit. 😀

What should we do about the colour-proximity check?

wopian commented 5 years ago

Would this mean existing languages will be able to get their official colour after the proximity changes now that there's a separator?

Alhadis commented 5 years ago

Yes!

pchaigno commented 5 years ago

Should we keep some semblance of color proximity detection though? If only to prevent all colors from becoming blue... I was thinking we could simply relax our proximity constraint?

Alhadis commented 5 years ago

If only to prevent all colors from becoming blue...

That's a non-issue, and only likely to be noticed in repositories which contain multiple languages that incidentally use almost identical colours.

pchaigno commented 5 years ago

repositories which contain multiple languages that incidentally use almost identical colours

Isn't this a kind of birthday paradox and therefore the probability of that happening is actually higher than one might expect? :p

@Alhadis Do you think we should remove the constraint on colors entirely?
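The birthday-paradox intuition can be checked with a quick calculation. The sketch below assumes, purely for illustration, that language colours are drawn uniformly and independently from a fixed number of visually distinct "buckets"; neither the bucket count nor the uniformity assumption comes from Linguist.

```python
def collision_probability(languages: int, buckets: int) -> float:
    """Probability that at least two of `languages` colours share a bucket,
    assuming each colour is drawn uniformly and independently from `buckets` options."""
    if languages > buckets:
        return 1.0  # pigeonhole: a collision is guaranteed
    p_all_distinct = 1.0
    for i in range(languages):
        p_all_distinct *= (buckets - i) / buckets
    return 1.0 - p_all_distinct

# Hypothetical numbers: 5 languages in a repository, 50 distinguishable colours.
print(f"{collision_probability(5, 50):.3f}")
```

With these made-up numbers the chance of at least one clash in a single repository comes out at roughly 19%, noticeably higher than the 1-in-50 figure a naive guess might suggest.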

Alhadis commented 5 years ago

Yes, I do. There's no good reason for policing colour choices anymore, and I can't see any reason why confusingly-similar/adjacent colours in a language bar could pose any sort of a problem.

Also, it enables us to restore colours to data/prose formats, which is relevant now that we have the language-detectable attribute. Users who override it are seeing grey bars.

pchaigno commented 5 years ago

Perhaps you're right and we should just get rid of it. It certainly would be nicer for contributors. I'm just a bit wary of making a change that will be hard to rollback, without knowing how language colors will be used in the future on github.com.

/cc @lildude What's your opinion? Any reason we shouldn't get rid of that color proximity test?

Alhadis commented 5 years ago

without knowing how language colors will be used in the future on github.com.

That's up to GitHub's design team to decide and deal with, not us. 😉 Designers are used to working within constraints (the limitations imposed on the use of a company's logo are more burdensome than making random colours stand out).

It's also impossible to justify the colour-proximity check to future contributors who don't expect a new language's official colour to be "taken". And the logic for discerning similar colours was far from infallible to begin with...

Alhadis commented 5 years ago

@lildude Any official word from GitHub concerning the removal of the colour-proximity tests?

lildude commented 5 years ago

@lildude Any official word from GitHub concerning the removal of the colour-proximity tests?

Ooops. Not yet as I forgot to open the issue 😊. Just opened an issue seeking feedback from our design team. Will let you know more when I do.

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had activity in a long time. If this issue is still relevant and should remain open, please reply with a short explanation (e.g. "I have checked the code and this issue is still relevant because ___."). Thank you for your contributions.

BenEmdon commented 5 years ago

Hey y'all 👋 Thought I'd add my thoughts on this too.

After using Ruby .erbs in a number of non-HTML ways (like code generation), I wonder if we should consider reclassifying it? What options are available?

CC: @Alhadis @pchaigno

Alhadis commented 5 years ago

Could you post a code sample of what you mean? I can't fathom how HTML markup could be making code generation easier...

In any case, unusual use-cases of a language benefit from a linguist-language override for the affected files... =)

BenEmdon commented 5 years ago

.erbs just seem like a means to template text (of any kind). There doesn't appear to be anything HTML specific about them.

ERB filenames have a preceding filetype in their name. A HTML ERB would have the filename name.html.erb, while a Java ERB would have the filename name.java.erb, and a conf ERB would have the filename name.conf.erb.

It seems like we could infer the language type of an ERB file from its preceding filename. How do you feel about this? 👍 👎
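As a minimal sketch of the inference proposed here, the function below pulls out the extension that sits in front of a trailing `.erb`. The function name and sample filenames are hypothetical, and this is not something Linguist is known to do; it only illustrates the naming convention described above.

```python
from pathlib import PurePosixPath
from typing import Optional

def templated_extension(filename: str) -> Optional[str]:
    """Return the extension that precedes a trailing '.erb', or None if there is none."""
    suffixes = PurePosixPath(filename).suffixes
    if len(suffixes) >= 2 and suffixes[-1].lower() == ".erb":
        return suffixes[-2].lower()
    return None

# Hypothetical filenames following the name.<type>.erb convention.
for name in ("index.html.erb", "Model.java.erb", "app.conf.erb", "plain.erb"):
    print(name, "->", templated_extension(name))
```

A bare `plain.erb` yields `None`, which is exactly the case where the convention gives no hint about the templated language.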

Examples of ERBs being used for code generation

Alhadis commented 5 years ago

Ah, I see. So it's really more of a generic templating system that (naturally) lends itself well to server-side HTML rendering? If it isn't HTML-centric, it might make sense to rename it to Embedded Ruby instead (as well as degrouping it).

However, that'd still be of minimal benefit to syntax highlighting and language classification. Because Linguist is limited to classifying languages that've been registered ahead of time, it'd be impossible to classify files as, say, Java+ERB, INI+ERB. So, the best we can do is rename it to something more appropriate and/or make it a child-language of Ruby.

I'm really not the right person to be discussing anything Ruby-related, though. Since I've no knowledge of what ERB files are really used for, I can't confidently assert my suggestions are suitable (are these code-generation cases only 10% of ERB-using repositories? A third? ~50%?). @lildude would be the right person to ask about this, but since he's currently @busydude, it's probably safer to leave this matter be for now. =)

BenEmdon commented 5 years ago

The idea of renaming it to Embedded Ruby, a child language of Ruby, seems acceptable to me.

I would speculate that HTML+ERB is the most dominant variant due to the popularity of Ruby on Rails. Should the HTML+ERB variant stay classified as an HTML-like language, since HTML rendering is still a major use case for ERBs?

Alhadis commented 5 years ago

Yes, I think so. Exceptions can always use a .gitattributes override to flag it as another language (which affects syntax highlighting too). Granted, this means they're limited to either Ruby or whatever language is being templated... but it's better than (mis)classing it as partly HTML.

BenEmdon commented 5 years ago

I don’t mind proposing the change in a PR 😄 Are there other PRs which did something similar? Pointing me to another PR would help me get a head start!

Alhadis commented 5 years ago

Renaming a language is a simple procedure (though that wasn't always the case…). You can use #4171 as an example. Basically, it's just:

  1. Rename entry in ./lib/linguist/languages.yml. Then,
    1. Keep the list alphabetised. Ordering is case-sensitive (so sorted in binary order: uppercase before lowercase).
    2. Remove the entry's group: HTML field. If you want the entry to contribute to the usage statistics of Ruby, replace the line with group: Ruby instead.
  2. Rename samples/ directory: ./samples/Old Name/ → ./samples/New Name/
  3. Scan the following files/directories for mentions of the old name. Update each file accordingly (remember, language names are case-sensitive):
  4. Run bundle exec rake samples to update samples database.
  5. Run script/list-grammars to regenerate the grammars list.
  6. Run bundle exec rake test to run Linguist's test suite. Anything you've missed will display loud hairy feedback: you'll know when you've covered everything. 👍
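The ordering rule in step 1.1 can be sanity-checked before running the test suite. The sketch below assumes "binary order" means comparing names by code point, which is what Python's default string comparison does; the helper name and sample names are illustrative, and this is not the check Linguist itself runs.

```python
from typing import List, Optional

def first_out_of_order(names: List[str]) -> Optional[str]:
    """Return the first name that sorts before its predecessor, or None if sorted.

    Python compares str values by code point, so every uppercase ASCII letter
    sorts before every lowercase one ('Z' < 'a').
    """
    for previous, current in zip(names, names[1:]):
        if current < previous:
            return current
    return None

# Illustrative names only, not the real contents of languages.yml.
print(first_out_of_order(["HTML", "HTML+ERB", "Haml", "eC"]))  # None: already in order
print(first_out_of_order(["Haml", "HTML", "eC"]))              # HTML
```

Note that "HTML" sorts before "Haml" because `T` precedes `a` in code-point order, which is the uppercase-before-lowercase behaviour the step describes.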

Sidenote

I just realised we could always keep HTML+ERB and add Embedded Ruby as a separate language. The extensions of HTML+ERB could target .html.erb and .html.erb.deface, whilst the new Embedded Ruby language could simply target .erb more broadly. This is much more complicated, and would necessitate the addition of heuristics and regression tests to disambiguate... however, this feels to me like it might be the winning solution.
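One way to picture the split described in this sidenote is a longest-suffix match, sketched below. The language names and extensions come from the comment above; the matching rule, the `EXTENSIONS` table, and the `classify` helper are assumptions for illustration and say nothing about how Linguist actually resolves overlapping extensions (the comment itself expects heuristics to be needed).

```python
from typing import Dict, List, Optional

# Extensions as proposed in the comment; "Embedded Ruby" is the hypothetical new entry.
EXTENSIONS: Dict[str, List[str]] = {
    "HTML+ERB": [".html.erb", ".html.erb.deface"],
    "Embedded Ruby": [".erb"],
}

def classify(filename: str) -> Optional[str]:
    """Pick the language whose registered extension is the longest matching suffix."""
    best_language, best_length = None, 0
    lowered = filename.lower()
    for language, extensions in EXTENSIONS.items():
        for extension in extensions:
            if lowered.endswith(extension) and len(extension) > best_length:
                best_language, best_length = language, len(extension)
    return best_language

for name in ("index.html.erb", "menu.html.erb.deface", "Model.java.erb", "README.md"):
    print(name, "->", classify(name))
```

Under this rule a more specific extension always wins over the generic `.erb`, so HTML templates keep their current classification while everything else falls through to the broader language.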

Again, I'd wait for @lildude's input before rushing off to submit a PR. Should my solution be found preferable, well, your PR will have been in vain. 😉

BenEmdon commented 5 years ago

I just realised we could always keep HTML+ERB and add Embedded Ruby as a separate language. The extensions of HTML+ERB could target .html.erb and .html.erb.deface, whilst the new Embedded Ruby language could simply target .erb more broadly. This is much more complicated, and would necessitate the addition of heuristics and regression tests to disambiguate... however, this feels to me like it might be the winning solution.

I agree with this 👍 I'll wait on @lildude's input before tackling this.

ObserverOfTime commented 4 years ago

Svelte should be removed from the HTML group. It's similar to Vue which is already on its own.

hilder-vitor commented 4 years ago

The language Sage is really built on top of Python and their syntaxes are almost the same, but they are not 100% equal. For example, y^3 computes the cube of y in Sage instead of y XOR 3, as in Python. Moreover, R.<t> = QQ['x'] is a valid line of code in Sage, while in Python it raises a SyntaxError.

Besides the syntax, the other features are very different. For instance, Python treats mathematical expressions numerically, while Sage treats them symbolically, thus, 1/3 is 0.3333 in Python, but it is a fraction in Sage, and sqrt(2) is 1.4142 in Python, but it is, well, sqrt(2) in Sage.
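The Python side of this comparison is easy to verify with the standard library alone. The snippet below is plain Python, not Sage; the value chosen for `y` is arbitrary, and `Fraction` is shown only as the nearest standard-library analogue of an exact rational.

```python
import math
from fractions import Fraction

y = 5
print(y ^ 3)    # bitwise XOR in Python, not exponentiation
print(y ** 3)   # Python spells "cube of y" with **

print(1 / 3)            # a float approximation
print(Fraction(1, 3))   # an exact rational needs an explicit type
print(math.sqrt(2))     # also a float approximation

# Sage's generator syntax is rejected by Python's parser.
try:
    compile("R.<t> = QQ['x']", "<sage-snippet>", "exec")
    parsed = True
except SyntaxError:
    parsed = False
print(parsed)
```

The same source text therefore means different things, or fails to parse at all, depending on which of the two languages reads it.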

Even if one has never declared x anywhere, the following is a valid Sage script which prints -1:

sage: f = cos(x)
sage: f(x = pi)
-1

All that said, I would like to invite you to consider degrouping Python and Sage.

Alhadis commented 4 years ago

@hilder-vitor You should submit a pull-request to degroup them; this thread is chiefly for discussing languages whose "independence" is ambiguous and open to debate. There's clearly no ambiguity or room for subjectivity in what you've described.

Alhadis commented 4 years ago

@ObserverOfTime I missed your comment when you posted it. I've added Svelte to the list.