github / innovationgraph

GitHub Innovation Graph
https://innovationgraph.github.com/
Creative Commons Zero v1.0 Universal

Consider separating "programming languages" into logical high level groups #10

Open brphelps opened 11 months ago

brphelps commented 11 months ago

Generally I think it's harder for me to wrap my head around the graph we currently have because it's mixing things that are used for different purposes.

For example, Makefile / Shell / Dockerfile / Batchfile / PowerShell all seem to be a pretty different logical grouping than Java / C / C# / TypeScript / Ruby etc. It's hard to make sense of what seeing the plots side by side means. Have you considered showing labels or groups that would let the viewer split the graph along an axis that gives more interesting views? Like "imperative programming languages"? "Shell" languages? Build scripts?

Mostly I'm interested in being able to see things visualized in categories that are "apples to apples".

mlinksva commented 11 months ago

It's something others could experiment with using the raw data, would love to see that.

We could consider adding this to the site, especially if there's an external set of classifications we could rely on. Linguist (which GitHub uses to detect languages) does have a "group" field for some languages, e.g. many shell languages are in group shell. Wikidata is another potential source, though it has perhaps too many classifications, e.g. look at the "instance of" and "programming paradigm" properties for C# https://www.wikidata.org/wiki/Q2370

I welcome other ideas.

brphelps commented 11 months ago

I think a dropdown that corresponds to whatever grouping mechanism seems useful (the Linguist groups you mention, for example) would be pretty helpful!

mlinksva commented 11 months ago

Will have to verify this, but looking at https://github.com/github-linguist/linguist/blob/7ca3799b8b5f1acde1dd7a8dfb7ae849d3dfb4cd/lib/linguist/languages.yml#L30-L31 it appears the linguist languages in a group are already coalesced:

# group                 - Name of the parent language. Languages in a group are counted
#                         in the statistics as the parent language.

Will probably have to look for other ways of classifying. Love the idea though, not only to compare apples to apples, but also (by rolling up) to see apples-vs-oranges trends, such as "system" vs "application" vs "scripting" languages.

Or even (currently of interest to policymakers, always a relevant audience for this project) memory-safe vs non-memory-safe languages. 😄
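
For anyone who wants to experiment with the raw data, the roll-up idea could look roughly like the sketch below. Everything in it is an assumption for illustration: the category mapping is hypothetical (it is not a Linguist or Wikidata classification), the sample rows are made up, and the field names `language`, `year`, `quarter`, and `num_pushers` are assumed rather than checked against the published data files.

```python
from collections import defaultdict

# Hypothetical mapping from language name to a high-level category.
# These assignments are illustrative only, not an agreed classification.
CATEGORY_BY_LANGUAGE = {
    "Java": "application",
    "C#": "application",
    "TypeScript": "application",
    "Ruby": "scripting",
    "C": "system",
    "Shell": "shell/build",
    "Makefile": "shell/build",
    "Dockerfile": "shell/build",
    "Batchfile": "shell/build",
    "PowerShell": "shell/build",
}


def roll_up(rows, mapping, default="uncategorized"):
    """Sum num_pushers per (year, quarter, category).

    Languages missing from the mapping fall into the default category.
    """
    totals = defaultdict(int)
    for row in rows:
        category = mapping.get(row["language"], default)
        key = (int(row["year"]), int(row["quarter"]), category)
        totals[key] += int(row["num_pushers"])
    return dict(totals)


# Made-up sample rows; field names are assumed, values are not real data.
SAMPLE_ROWS = [
    {"language": "Java", "year": "2023", "quarter": "1", "num_pushers": "100"},
    {"language": "C", "year": "2023", "quarter": "1", "num_pushers": "40"},
    {"language": "Shell", "year": "2023", "quarter": "1", "num_pushers": "30"},
    {"language": "Makefile", "year": "2023", "quarter": "1", "num_pushers": "20"},
    {"language": "Zig", "year": "2023", "quarter": "1", "num_pushers": "5"},
]

if __name__ == "__main__":
    for key, total in sorted(roll_up(SAMPLE_ROWS, CATEGORY_BY_LANGUAGE).items()):
        print(key, total)
```

One caveat with this kind of roll-up: if the per-language figure counts distinct developers, summing across languages in a category would count anyone who pushes in several of those languages more than once, so the totals are better read as a rough trend than as a headcount.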