Improve support for non-code languages

seppestas commented 7 years ago

Currently there are four language types: data, programming, markup and prose (or nil). When a language is not considered a programming, markup or prose language, it is typically classified as data. Recently there have been a lot of changes to languages that had been incorrectly classified as programming or markup languages. E.g. KiCad files used to be classified as a programming language (PR #3743), Eagle files used to be classified as markup (PR #3751), ...

However, when a language is classified as something other than a programming or markup language, it is not included in the repository's language statistics.

While the new classification is probably correct (depending on your interpretation of what a programming or markup language is), these languages are considered to be the "primary" language in a lot of the repositories that contain them. This causes a lot of confusion and criticism about these files not being recognized, especially for languages that used to be reported because they were misclassified before. See e.g. #3784, #3795, #3484.

It would be nice if Linguist and GitHub could improve support for languages that can be considered primary languages but are not programming or markup languages. E.g. PCB design / EDA files, files for specialised image manipulation software, files used to encode sheet music, ... The open source world extends beyond just code!

This could be done by e.g.:

- Adding new types for languages like EDA tools and other languages that can be considered "primary"
- Making Linguist's definition of programming languages and/or markup languages very loose so these languages can be classified as one of these types
- Adding a more explicit "include in language stats"/detectable field to all languages and using it instead of, or together with, the type to decide whether or not to include the language in the language stats. This would have the added bonus of making it easily configurable on a per-repo basis (e.g. to allow repos with primarily XML files).

seppestas commented 7 years ago

/cc @5N44P @MauroMombelli

MauroMombelli commented 7 years ago

There are already exceptions to the rule; XML is the main example. Those exceptions have to be corrected, or encoded in the definition, to avoid sterile discussion, and a space should be designated for discussing such exceptions officially.

Until then, I think the original exception should be re-implemented and removed only with the permission of the original author or by an official GitHub decision.

Alhadis commented 7 years ago

@seppestas I've removed the part of your post that expressed personal bias towards me. Keep things civil here. Thanks.

valerionew commented 7 years ago

@Alhadis I think you're the most qualified person here to answer this question, given your huge amount of work on Linguist. Which new category would best fit all those "exceptions" that are currently classified as data, but can easily be considered "primary", given their role in the repos? I guess this problem is not limited to EDA tools... (I'm talking only about this specific hypothesis because I think the others don't require such specific discussion.)

seppestas commented 7 years ago

Hey @Alhadis, I'm sorry if a part of my post came across as me attacking you personally. My intent was merely to attract your attention to this issue, which I see as a continuation of our previous discussions with a broader scope, and to bring up the fact that you already voiced concerns about classifying non-code languages as programming or markup languages. I will try to choose my words more carefully; I understand I might have come across as a bit passive-aggressive.

Anyhow, I'm personally most in favour of the third solution I proposed, namely to add a field to explicitly mark a language as being considered "primary", meaning it should be "exposed" and included in the language statistics. I think the fact that this would enable users to configure what languages are considered primary on a per repository basis while still allowing for sensible defaults is a very nice feature.

Would this approach be feasible? From my understanding, the only thing being reported to the rest of GitHub is the language statistics, so I don't think adding a field to the language configurations should have a big impact outside Linguist.

To allow for an easy transition, I suggest keeping the current "check if a language is a programming or markup language" logic and adding the new flag as a way to override it. This avoids having to add the flag to each and every language in the languages.yml file.
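
The decision logic proposed here could be sketched roughly as follows. This is a minimal illustration in Python, not Linguist's actual code (Linguist is written in Ruby): the `Language` class, the `repo_overrides` mapping and the `included_in_stats` helper are hypothetical names, and `detectable` is the field name floated in this thread, treated as an optional flag that falls back to the existing type check when unset.

```python
from dataclasses import dataclass
from typing import Mapping, Optional

# Types counted in the language stats by default, as described in this issue.
DEFAULT_STATS_TYPES = {"programming", "markup"}


@dataclass(frozen=True)
class Language:
    """Hypothetical stand-in for one entry in languages.yml."""
    name: str
    type: Optional[str]                # "data", "programming", "markup", "prose" or None
    detectable: Optional[bool] = None  # proposed flag; None means "not set"


def included_in_stats(
    language: Language,
    repo_overrides: Optional[Mapping[str, bool]] = None,
) -> bool:
    """Decide whether a language should count toward the language stats.

    Precedence (an assumption of this sketch):
    1. a per-repository override, if one exists for this language;
    2. the language's own `detectable` flag, if it is set;
    3. the current rule: only programming and markup languages count.
    """
    overrides = repo_overrides or {}
    if language.name in overrides:
        return overrides[language.name]
    if language.detectable is not None:
        return language.detectable
    return language.type in DEFAULT_STATS_TYPES


if __name__ == "__main__":
    kicad = Language("KiCad", "data", detectable=True)
    xml = Language("XML", "data")
    print(included_in_stats(kicad))              # flag set on the language itself
    print(included_in_stats(xml))                # falls back to the type check
    print(included_in_stats(xml, {"XML": True})) # repo opts in to XML
```

Because the flag defaults to "not set", only the languages that need different behaviour would have to carry it, and a repository could still opt in or out per language.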

John-Nagle commented 7 years ago

The present situation is worse than it was before. My KiCad project was identified as KiCad until recently. The "fix" caused it to be misidentified as SourcePawn, of all things, apparently because an encrypted LTSpice device definition was badly misidentified. When I deleted that file, the project was identified as AGSscript. See #3812 for details.

Even adding a .gitattributes file didn't help.

This is embarrassing. If you can't fix it, at least provide a way of turning language recognition off.

lildude commented 7 years ago

> Even adding a .gitattributes file didn't help.

It did help, as I pointed out and explained why in https://github.com/github/linguist/issues/3812

> This is embarrassing.

Linguist is not perfect and doesn't claim to be. It relies on data provided by members of the open source community as it is impossible to know about every programming language there is. As this project relies on community contributions to help with the identification, things can and will be misidentified. It's like trying to identify a friend in the dark... it looks like Bob, but it could also be Dave. You guess Bob, Fred comes along with a torch and shows it was actually Dave, and points out that his left ear sticks out more than his right. Now you know this, next time you encounter someone in the dark that looks like Bob or Dave, you'll consider their left ears when trying to work out who it is. Linguist is very similar. If you don't tell us about the sticky-out left ears, it's going to keep identifying Bob instead of Dave.

So if you spot a problem, raise a pull request to help resolve the misclassification.

> If you can't fix it, at least provide a way of turning language recognition off.

You can do this already: add a manual override to mark all the files in your repo as vendored, as per https://github.com/github/linguist#vendored-code. Vendored files, like data files, don't count toward the language statistics but are still recognised for syntax highlighting and search.

lildude commented 7 years ago

I should also point out that the language bar doesn't identify your project language, it merely shows the programming languages identified within the project. GitHub doesn't have a concept of a project/repo language as a whole and people often miss this.

Benidorm may be in Spain but I can guarantee if you looked at their language stats for large parts of the year, Spanish would not have been the number one language. 🙂

seppestas commented 7 years ago

> Basically, we're trying to get some of the functionality you want (the ability for your projects to have KiCad files show up in the language bar), without making significant changes to linguist (adding detectable to the languages.yml), because we'd need to further investigate the impact of this on internal github systems, and also it could open up a common vector for disputes (i.e. some folks want XML files to be detectable by default and some don't). Long story short we're trying to be cautious and make sure that we've fully thought out the repercussions before acting. That's why I think the project-level overrides solution is a good interim step.

-- @rafer in #3807

What about adding (a) new language type(s) to the languages.yml file? E.g. adding a CAD language type that gets included in language stats would also work great for the KiCad and Eagle use cases (it might be a bit weird for languages that fit into multiple categories, like OpenSCAD, though). Could this have a smaller impact, or would it be pretty much the same?

seppestas commented 7 years ago

> Authors of TextMate themes, WiX websites, and X3D models would probably prefer that degree of recognition too.

-- @Alhadis in #3806, on the topic of including Eagle files in the language stats.

In that case we should include them in this conversation, maybe that will bring us closer to a solution that works for more languages and users.

For now, it seems like KiCad users (like me) are the main ones asking. That is probably because Eagle files are still included in the language stats (for now), and because KiCad projects are typically fairly big and contain other source files that are often misclassified, making KiCad projects appear as something else (e.g. #3812).

From my perspective, having the KiCad files included in the language stats graph was awesome because:

- It made it possible to quickly see the difference between pure (firmware) source code repos, repos containing both PCB design files and source code, and repos containing only PCB design files
- The language stats graph is a handy way to navigate to the different file types, especially in repos with an unfamiliar directory structure
- It made it possible to see at a glance which CAD package(s) are used in a project

seppestas commented 7 years ago

On the classification of Eagle files as XML: I believe Eagle is indeed vendor-specific XML, but I would like to make the following points:

- Including Eagle as an extension of XML would prevent methods like adding a "detectable" field for languages or adding a CAD language type from working. I would really like Eagle files to stay included in the language stats.
- XML and JSON files are not the only types of files being used in a "vendor-specific" way, and some XML and JSON files still have their own language. Examples include:
  - BitBake .bb files, basically SH shell / Python, being recognized as the BitBake language
  - Sublime Text Config files still have their own language even though it's just JSON AFAIK.
  - Probably some other things, but it's time for lunch now

Alhadis commented 7 years ago

> Sublime Text Config files still have their own language even though it's just JSON AFAIK.

See #3268 and the related PRs. =) To summarise: Sublime configs are indeed JSON data, but with added support for JavaScript-style comments. This resulted in JSON comments being marked with angry red highlighting on GitHub, as such constructs are otherwise invalid JSON syntax. On the other hand, these files weren't "real" JavaScript either, so we couldn't lump them under the latter. A subgroup had to be created especially for Sublime configuration files... 😞

lildude commented 6 years ago

The changes introduced in https://github.com/github/linguist/pull/3807 are now live on GitHub.com so I think we can close this issue.

Happy overriding peeps 🎉

github-linguist / linguist

Improve support for non-code languages #3805