github-linguist / linguist

Language Savant. If your repository's language is being reported incorrectly, send us a pull request!
MIT License

Visual programming language files are incorrectly not classed as programming #4508

Closed mxmilkiib closed 5 years ago

mxmilkiib commented 5 years ago

Preliminary Steps

yes/done

Problem Description

PD files aren't detected and listed

URL of the affected repository:

Examples:

https://github.com/MikeMorenoAudio/EP-MK1

https://github.com/danomatika/BangYourHead

Last modified on:

15th Feb

29th Jan

Expected language:

Pure Data

Detected language:

Nothing

pchaigno commented 5 years ago

Looks like these .pd files are detected as Pure Data. They don't show up in the language bar because GitHub doesn't display "data languages" by default. You can use overrides to change that.

mxmilkiib commented 5 years ago

A-ha, thank you for the clarification.

I've just checked through my GH settings, but I'm not sure what you mean by overrides? Or do you mean if I run/host linguist myself?

I guess this repo isn't for making GH policy suggestions, but might you be able to advise if GH notes the definition of a "data language" somewhere? Is this somehow related to the notion of dataflow languages?

Alhadis commented 5 years ago

@mxmilkb Your questions are already answered by the steps you neglected to follow in the issue template:

Preliminary Steps

yes/done

Here they are again for your reference:

Please confirm you have...

Please review these preliminary steps before logging your issue. You may find the information referenced may answer or explain the behaviour you are seeing. It'll help us to know you've reviewed this information.

mxmilkiib commented 5 years ago

Thank you for the prompt reply, but possibly you fail to see the ambiguity here?

GitHub the service could have per-user hooks into linguist overrides, but I take it that it doesn't.

I've just double checked, but none of the four steps say why .pd files are defined as a "data" language.

Alhadis commented 5 years ago

The data type is a bit of a catchall, but generally used for languages which aren't written by hand (or those which are, but aren't programming, document markup, or lightweight markup languages; e.g., most configuration files).

Pure Data's homepage states (emphasis mine):

Pd enables musicians, visual artists, performers, researchers, and developers to create software graphically without writing lines of code.

For program-generated formats like this, the data type is an apt description. If you want it considered as part of a project's statistics, we recommend using the linguist-detectable override.
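For reference, a minimal sketch of that override, assuming a `.gitattributes` file at the repository root and that the patches use the `.pd` extension (the glob is an illustration; adjust it to the project's layout):

```
# .gitattributes
# Ask Linguist to count Pure Data patches in the language statistics
*.pd linguist-detectable
```

Older Linguist documentation wrote the same attribute as `linguist-detectable=true`; as far as I know both spellings are treated alike.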

mxmilkiib commented 5 years ago

Other visual dataflow programming languages like Max (created by the same person as Pure Data) patches are displayed, also are LabVIEW. Pure Data is certainly Turing complete if that matters. I would contend that data is not an apt description (that the definition of data is 'undefined' is problematic and worthy of its own issue IMHO).

Alhadis commented 5 years ago

Other visual dataflow programming languages like Max (created by the same person as Pure Data) patches are displayed, also are LabVIEW.

That might need investigation then, especially if the formats are simply JSON and XML. It's possible these should have been added as file-extensions of JSON and XML, respectively (unless there's an obvious syntactic difference).

I would contend that data is not an apt description (that the definition of data is 'undefined' is problematic and worthy of its own issue IMHO).

No, the issue here is that Linguist and GitHub have no concept of a "dataflow" or "visual" programming language. We've experienced a similar issue to this one when Eagle/KiCad projects were redefined as data instead of markup. While the change was justified, it was also disruptive for KiCad/Eagle users, whose projects were suddenly lacking any classification whatsoever. Like Pure Data, these weren't files that were created by humans, but through a graphical interface.

The solution to the KiCad crisis was the introduction of the linguist-detectable override, which is our best possible solution to machine-generated formats that users expect to be part of a project's classification.

Alhadis commented 5 years ago

@lildude @pchaigno Any thoughts on what to do with Max and LabVIEW? It's clear they shouldn't be programming, or even registered as languages (assuming they're compatible with JSON and XML, respectively).

mxmilkiib commented 5 years ago

Massive apologies to Max and LabVIEW users reading this later if it comes to pass that this weird subjective classification of visual programming language source files affects your user experience of GitHub. Please don't hate me, for I will not have been the one making the change.

I can see the rationale for KiCad being classed as data, but I don't think any of the points raised on that issue relate to this issue.

mxmilkiib commented 5 years ago

Ah, this is interesting - https://github.com/proteusvacuum/KlattSynth - does list Pure Data files in the language bar.

lildude commented 5 years ago

@mxmilkb that'll be because that repo hasn't been touched in over 3 years. In the time since, Pure Data was switched to being classified as data by @Alhadis in https://github.com/github/linguist/pull/3751. Repositories are not reanalysed when Linguist is updated, only when changes are pushed to the repo, hence that repo is still showing the analysis from before @Alhadis's changes were merged and rolled out.

mxmilkiib commented 5 years ago

I would now summarise that the issues at hand are:

The data type is a bit of a catchall, but generally used for languages which aren't written by hand (or those which are, but aren't programming, document markup, or lightweight markup languages; e.g., most configuration files).

"languages which aren't written by hand"

Can I request a rationale for that distinction?

Yes, you edit them with an IDE of sorts, but it's still a symbolic programming language at the IDE level. A related example could be Smalltalk, where the concept of an IDE is generally integral to the use of the language.

I'm also confused as to why the feedback on the PR wasn't replied to.

Alhadis commented 5 years ago

Yes, you edit them with an IDE of sorts, but it's still a symbolic programming language at the IDE level.

And where exactly would you draw the line between "visual programming" and "flowchart / mind-mapping software"? If we classify PD as "programming", how do we justify to users why Xcode storyboards and Android layouts (both XML) are "data", despite subjective overlap? See #2818 and #3125 for prior discussion about the latter.

GitHub's userbase spans 28 million people with projects spanning every fathomable interest and background. We can't hope to please everybody, so we give them the option of overriding aspects of Linguist they disagree with, or which simply don't fit the projects in question.

mxmilkiib commented 5 years ago

Well, it has inputs, outputs, performs computations to process data between those inputs and outputs. As mentioned before, it uses the dataflow paradigm, similar to the functional reactive paradigm. It has control flow mechanisms, data types and data structures. It has a standard library as well as external modules to add more functions. Aside from the history of use of the word "programming" already being associated with both Pure Data and the general concept of visual programming languages, Pure Data can do "programming" things, from hello world to advanced DSP, making sequenced/synthesised and generative music, for livecoding, as a UI toolkit, for processing shaders or controlling lighting rigs.

Alhadis commented 5 years ago

making sequenced/synthesised and generative music, for livecoding, as a UI toolkit, for processing shaders or controlling lighting rigs

Yeah, that's where the problems begin. There are many programs which can do that... Xquartz, AudioMulch, Substance, countless procedural texture generators, and even compositing/post-editing suites like Flame and After Effects have railroad/"flow"-like diagrams to programmatically control input/output. These are all programs I've worked with, too, and I know how technical and complex they can become. How are they any different? Because of how the project's data is represented on disk?

One of Linguist's advertised responsibilities is identification of machine-generated formats:

This library is used on GitHub.com to detect blob languages, ignore binary or vendored files, suppress generated files in diffs, and generate language breakdown graphs.

"Machine-generated" here basically meaning "something you didn't write by hand using a text-editor", which leaves no room for ambiguity. Most users consider this desirable, and those who don't have the overrides facility to address projects on a case-by-case basis. The system is simple, but it works.

Now, the credibility of a "visual programming language" is irrelevant here. The key factor is that Pure Data is a program-generated format not designed to be written by hand. This is precisely what Linguist classifies as a "generated file", and if we don't enforce consistency, everything leads to a slippery slope of vague excuses to the next user who tries "lawyering" their favourite program's file-format because it was authored using a similar-looking interface.

mxmilkiib commented 5 years ago

Touching on the key factor first;

suppress generated files

The key factor is that Pure Data is a program-generated format not designed to be written by hand. This is precisely what Linguist classifies as a "generated file"

Where does Linguist define that precisely to be a "generated file"? Does that not mean the like of build artefacts generated by package managers, frameworks, preprocessing and compilation?

A .pd file is something one edits, saves, and then edits again, and can run in various ways (see below).

detect blob languages

Why have you highlighted this? In the context of linguist, does 'blob' not refer to an unclassified file? Or actually the traditional definition of blob, i.e., binary or bytecode blobs (as Assembly and LLVM are classed as languages)?

Assembly and LLVM bytecode are, very much for the most part, not written by hand.

Yeah, that's where the problems begin

I think a problem might better be framed as something that can be actioned against, i.e., not having stronger definitions of the categories involved, rather than framed against the existence of various forms of software that are fairly generally accepted to be on a different level to visual programming languages.

Xquartz

Which is X for Mac? A display server, so you write code in a language that uses X protocols, so it's appropriate for that code to be classed as the language it's written in, no?

AudioMulch, Substance, countless procedural texture generators, and even compositing/post-editing suites like Flame and After Effects

AudioMulch I know of but have not used, though here's a thread with some folk experienced in both saying that the like of Max and Pd are really on a different level to AudioMulch.

In general, visual programming languages are on a different and lower level to the like of texture generation or video compositing software.

program's file-format

There are multiple implementations of Pure Data available to use patches with, so it is not a one-program format.

Alhadis commented 5 years ago

Does that not mean the like of build artefacts generated by package managers, frameworks, preprocessing and compilation?

Yes, those files would normally be included in project statistics, so we purposefully exclude them. For a language/format that's entirely generated, it makes no sense to register it as "programming" and then mark each and every single file as "generated". Therefore, we classify it as data which excludes it from classification.
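The rule described here can be sketched roughly as follows. This is an illustrative model only, not Linguist's actual code (which is Ruby): the `Blob` class, its field names, and `counts_toward_stats` are hypothetical, and other exclusions such as vendored or documentation files are left out for brevity.

```python
from dataclasses import dataclass
from typing import Optional

# Language types counted in the statistics by default, per the thread's
# description that "data" languages are left out. Illustrative constant.
DEFAULT_COUNTED_TYPES = {"programming", "markup"}


@dataclass
class Blob:
    """Hypothetical stand-in for one file in a repository."""
    path: str
    language: Optional[str]            # e.g. "Pure Data"; None if undetected
    language_type: Optional[str]       # "programming", "markup", "data", "prose"
    generated: bool = False            # build artefacts and the like
    detectable: Optional[bool] = None  # linguist-detectable override, if set


def counts_toward_stats(blob: Blob) -> bool:
    """Return True if the file would appear in the language breakdown."""
    # Undetected and generated files are never counted.
    if blob.language is None or blob.generated:
        return False
    # An explicit override wins over the language's default type.
    if blob.detectable is not None:
        return blob.detectable
    # Otherwise only the default types are counted, so "data" is skipped.
    return blob.language_type in DEFAULT_COUNTED_TYPES


if __name__ == "__main__":
    patch = Blob("synth.pd", "Pure Data", "data")
    print(counts_toward_stats(patch))   # data type, no override
    patch.detectable = True
    print(counts_toward_stats(patch))   # same file with the override set
```

Under this model a `.pd` file is detected as Pure Data yet left out of the breakdown until the override is set, which matches the behaviour reported at the top of the thread.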

Which is X for Mac?

I meant *Quartz composer, sorry. Currently working with X at the moment… guess that was an easy brainfart to make…

A .pd file is something one edits, saves, and then edits again, and can run in various ways (see below).

Yes, but the contents of the file weren't written by hand, and aren't supposed to be. That's what we define as "generated". I've explained the solution to you (repeatedly), and fine, you disagree. If you continue to argue I'm going to have to lock this thread.

Assembly and LLVM bytecode are, very much for the most part, not written by hand.

GitHub hosts hundreds (if not thousands) of historic codebases from the pre-Unix/C era, some of which are quite famous. Moreover, "Assembly" covers many languages, both modern and historic, auto-generated and human-written, so it's not fair to make the assumption it's "for the most part not written by hand".

In general, visual programming languages are on a different and lower level to the like of texture generation or video compositing software.

Given that After Effects embeds an entire JavaScript engine to run embedded keyframe programs, I'm inclined to disagree with you. In fact, an interactive animation I rendered was keyframed for another language using After Effects' JS-generated output.

I digress.