seppestas closed 6 years ago
/cc @5N44P @MauroMombelli
There are already exceptions to the rule; XML is the main example. Those exceptions have to be corrected, or codified in the definition, to avoid sterile discussion, and a place should be designated where such exceptions can be discussed officially.
Until then, I think the original exception should be re-implemented and removed only with permission of the original author or GitHub official decision.
@seppestas I've removed the part of your post that expressed personal bias towards me. Keep things civil here. Thanks.
@Alhadis I think you're the most qualified here to answer this question, given your huge amount of work on Linguist. Which new category would best fit all those "exceptions" that are currently classified as data but can easily be considered "primary", given their role in their repos? I guess this problem is not limited to EDA tools... (I'm talking only about this specific hypothesis because I think the others don't require such specific discussion.)
Hey @Alhadis, I'm sorry if a part of my post came across as me attacking you personally. My intent was merely to attract your attention to this issue, which I see as a continuation of our previous discussions with a broader scope, and to bring up the fact that you already voiced concerns about classifying non-code languages as programming or markup languages. I will try to choose my words more carefully; I understand I might have come across as a bit passive-aggressive.
Anyhow, I'm personally most in favour of the third solution I proposed, namely to add a field to explicitly mark a language as being considered "primary", meaning it should be "exposed" and included in the language statistics. I think the fact that this would enable users to configure what languages are considered primary on a per repository basis while still allowing for sensible defaults is a very nice feature.
Would this approach be feasible? From my understanding, the only thing being reported to the rest of GitHub is the language statistics, so I don't think adding a field to the language configurations should have a big impact outside Linguist.
To accommodate an easy transition, I suggest keeping the current "check if a language is a programming or markup language" logic and adding the new flag as a way to override it. This avoids having to add the flag to each and every language in the languages.yml file.
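To make the proposal concrete, here is a minimal Python sketch of the decision described above. It is not Linguist's actual code (Linguist is written in Ruby). The `Language` class, the `detectable` attribute and the `included_in_stats` helper are hypothetical names used only for illustration. The sketch assumes an unset flag means "fall back to the existing type check".

```python
from dataclasses import dataclass
from typing import Optional

# Types that count toward the language stats today, per the discussion above.
DEFAULT_STAT_TYPES = {"programming", "markup"}


@dataclass
class Language:
    name: str
    type: str  # "data", "programming", "markup" or "prose"
    # Hypothetical override flag; None means "not set, use the type check".
    detectable: Optional[bool] = None


def included_in_stats(lang: Language) -> bool:
    """Return True if the language should appear in the language stats."""
    if lang.detectable is not None:
        # An explicit flag always wins over the type-based default.
        return lang.detectable
    # No flag set: keep the current behaviour.
    return lang.type in DEFAULT_STAT_TYPES


# Illustrative entries only; not copied from languages.yml.
kicad = Language("KiCad", "data", detectable=True)
xml = Language("XML", "data")
ruby = Language("Ruby", "programming")

for lang in (kicad, xml, ruby):
    print(lang.name, included_in_stats(lang))
```

Because the flag is optional, existing entries would not need to change. Only the exceptions (or a per-repository override) would have to set it.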
The present situation is worse than it was before. My KiCad project was identified as KiCad until recently. The "fix" caused it to be misidentified as SourcePawn, of all things, apparently because an encrypted LTSpice device definition was badly misidentified. When I deleted that file, the project was identified as AGSscript. See #3812 for details.
Even adding a .gitattributes file didn't help.
This is embarrassing. If you can't fix it, at least provide a way of turning language recognition off.
> Even adding a .gitattributes file didn't help.
It did help, as I pointed out and explained why in https://github.com/github/linguist/issues/3812
> This is embarrassing.
Linguist is not perfect and doesn't claim to be. It relies on data provided by members of the open source community as it is impossible to know about every programming language there is. As this project relies on community contributions to help with the identification, things can and will be misidentified. It's like trying to identify a friend in the dark... it looks like Bob, but it could also be Dave. You guess Bob, Fred comes along with a torch and shows it was actually Dave, and points out that his left ear sticks out more than his right. Now you know this, next time you encounter someone in the dark that looks like Bob or Dave, you'll consider their left ears when trying to work out who it is. Linguist is very similar. If you don't tell us about the sticky-out left ears, it's going to keep identifying Bob instead of Dave.
So if you spot a problem, raise a pull request to help resolve the misclassification.
> If you can't fix it, at least provide a way of turning language recognition off.
You can do this already... add a manual override to mark all the files in your repo as vendored as per https://github.com/github/linguist#vendored-code. Vendored files, like data files, don't count toward the language statistics but will still be recognised when it comes to syntax highlighting and search.
I should also point out that the language bar doesn't identify your project language, it merely shows the programming languages identified within the project. GitHub doesn't have a concept of a project/repo language as a whole and people often miss this.
Benidorm may be in Spain but I can guarantee if you looked at their language stats for large parts of the year, Spanish would not have been the number one language. 🙂
Basically, we're trying to get some of the functionality you want (the ability for your projects to have KiCad files show up in the language bar), without making significant changes to linguist (adding detectable to the languages.yml), because we'd need to further investigate the impact of this on internal github systems, and also it could open up a common vector for disputes (i.e. some folks want XML files to be detectable by default and some don't). Long story short we're trying to be cautious and make sure that we've fully thought out the repercussions before acting. That's why I think the project-level overrides solution is a good interim step. -- @rafer in #3807
What about adding (a) new language type(s) to the languages.yml file? E.g adding a CAD language type that gets included in language stats would also work great for the KiCAD and Eagle use-cases (it might be a bit weird for languages that fit into multiple categories like OpenSCAD though). Could this have a smaller impact, or would it be pretty much the same?
Authors of TextMate themes, WiX websites, and X3D models would probably prefer that degree of recognition too. -- @Alhadis in #3806 on the topic of including Eagle files in the language stats.
In that case we should include them in this conversation, maybe that will bring us closer to a solution that works for more languages and users.
For now, it seems like KiCad users (like me) are the main asking party. Probably because Eagle files are still included in the language stats (for now), and because KiCad projects are typically fairly big and contain other source files that are typically (mis-)qualified, making KiCad projects appear as something else (e.g #3812).
From my perspective having the KiCad files included in the language stats graph was awesome because:
On the classification of Eagle files as XML: I believe Eagle is indeed vendor-specific XML, but I would like to make the following cases:
Sublime Text Config files still have their own language even though it's just JSON AFAIK.
See #3268 and the related PRs. =) To summarise: Sublime configs are indeed JSON data, but with added support for JavaScript-style comments. This resulted in JSON comments being marked with angry red highlighting on GitHub, as such constructs are otherwise invalid JSON syntax. On the other hand, these files weren't "real" JavaScript either, so we couldn't lump them under the latter. A subgroup had to be created especially for Sublime configuration files... 😞
The changes introduced in https://github.com/github/linguist/pull/3807 are now live on GitHub.com so I think we can close this issue.
Happy overriding peeps 🎉
Currently there are 4 different language types: data, programming, markup and prose (or nil). When a language is not considered a programming, markup or prose language, it is typically classified as data. Recently there have been a lot of changes to languages that had been incorrectly classified as programming or markup languages. E.g. KiCad files used to be classified as a programming language (PR #3743), Eagle files used to be classified as markup (PR #3751), ...
However, when a language is classified as something other than a programming or markup language, it is not included in the repository's language statistics.
While the new classification is probably correct (depending on your interpretation of what a programming and markup language is), these languages are considered to be "primary" languages used in a lot of repositories that contain them. This causes a lot of confusion and criticism on these files not being recognized, especially for languages that used to be reported because they were misclassified before. See e.g #3784, #3795, #3484.
It would be nice if Linguist and GitHub could improve support for languages that can be considered primary languages but are not programming or markup languages, e.g. PCB design / EDA files, files for specialised image manipulation software, files used to encode sheet music, ... The open source world extends beyond just code!
This could be done by e.g:
Adding a detectable field to all languages and using it instead of, or together with, the type to decide whether or not to include the language in the language stats. This would have the added bonus of making it easily configurable on a per-repo basis (e.g. to allow repos with primarily XML files).