github-linguist / linguist

Language Savant. If your repository's language is being reported incorrectly, send us a pull request!
MIT License

Kicad #3784

Closed valerionew closed 7 years ago

valerionew commented 7 years ago

I've recently had some issues getting one of my repositories recognised as KiCad by GitHub. Then I started looking around in Linguist and found some recent updates to the KiCad handling. See #3765 and #3743.

So, as a KiCad user: what's the point of doing that? It doesn't make much sense. Kicad files usually travel together. While it's possible to have a repository made out of only layouts or schematics, that usually doesn't make much sense.

Moreover there are some important kicad files that are at the moment unclassified or misclassified:

But, as i said, all those files make sense together, and they all form a single kicad project.

So, in conclusion, I think that all the KiCad files should be classified under a single, plain KiCad language. Adding or moving extensions within the current system doesn't make any sense: a KiCad project is made of all these files, together. Would you classify the .c, .cpp and .h files of a C/C++ library differently?

Alhadis commented 7 years ago

Hey @5N44P,

A repository's classification is drawn from the languages of the files it's comprised of. It's not (currently?) possible to manually declare the language of a repository, although several users have requested such a feature in the past, I believe.

I understand this doesn't play well with KiCad projects, which are comprised of several different file formats, most of which are now classed as data-only files (for reasons I explained in #3795). However, this really isn't something we can change, because it's simply the way GitHub categorises projects.

.pretty folders: unclassified. These are footprint libraries. But I don't know if it will be possible to classify a folder.

You're right: it isn't possible to classify files based on their directory.

.kicad_mod files: misclassified. These are the single footprints. Usually these files are the content of a .pretty folder and they are the components available from the library. Now they are classified as "KiCad Layout", which isn't really the best option.

What would you suggest instead? Bear in mind, I used KiCad Layout to mean "any KiCad-related project file", so it could just as easily have been named KiCad instead.

.pro files: misclassified. These are project files, currently they are recognized as INI files.

From what I can see, the files use a configuration format indistinguishable from INI, so it seems reasonable to mark them as such. Otherwise, you're implying that there exists a configuration format defined by KiCad for .pro files that just happens to be extremely similar to INI files...

Now, as for the other formats... well, I added only what I could find in KiCad's file format docs, and only those which met the in-the-wild usage criteria required by Linguist for new filetype additions.

valerionew commented 7 years ago

Hi @Alhadis, thank you again for your response, but to me it is still unclear why KiCad projects are split into three different languages. I understand that Linguist's work is mainly concerned with the form of the files and not their significance, but I'm not sure that this is giving the best results. Isn't a language based on both form and significance?

By the way, I can't understand the in-the-wild thing. These files are generated with each and every KiCad project, and are required for it to operate correctly.

Is GitHub trying to discourage us from sharing our OSHW projects through their website?

Thank you for your time

Alhadis commented 7 years ago

I should stress that I'm not GitHub staff, so any explanation I give shouldn't be misconstrued as authoritative.

but to me it is still unclear why KiCad projects are split into three different languages.

Well, you tell me: how would you have handled this? How many "languages" do you believe are necessary to accurately and all-inclusively accommodate every hosted KiCad project file?

valerionew commented 7 years ago

Sorry, I'm not a programmer and I can't really understand how Linguist works on the inside. But wouldn't one single language, "KiCad", be enough to make a classification?

seppestas commented 7 years ago

@Alhadis I think the main point @5N44P (or at least me) is trying to make is that splitting KiCad source files has very little benefit, while classifying all files under one language does make a lot of sense because:

On the topic of classifying the files as data / code / layout: even if it makes more sense to classify these files as data (which I still don't agree with), we would very much like to have the KiCad files show up in the language stats, since for a lot of our repo's they are the most important files. I think user-friendliness is more important than semantical correctness.

MauroMombelli commented 7 years ago

@Alhadis answering here some point from the other issue #3795 as that issue is about something else.

Being able to search using a data-type language is probably a bug with GitHub's UI... :confused:

you THINK, but at the same time you broke the functionality before asking. I THINK that searching by data-type has been coded on purpose, so it is clearly a wanted mechanism. Anyway, you are actually breaking a user experience; at least some data on HOW MUCH it is used should have been collected. I don't know if this was fully understood at the time of the merge, considering your answer suggests it is a side effect that hadn't been considered.

its definition of markup is quite literally document-specific

so schematics, layout and such are NOT document? where i can read a hard definition on what is a "document"?

Some of them, like the sparkfun eagle-libraries (https://github.com/sparkfun/SparkFun-Eagle-Libraries) are literally libraries of symbols. Killing the classification would make them at least hard to dig out.

I think the underlying issue here is not being able to override the classification of a repository

please explain, that is not clear for me.

we would very much like to have the KiCad files show up in the language stats, since for a lot of our repo's they are the most important files

This.

As a definition of "data" I propose "everything that still does not have a parser". After all, with XML you know the container but not the content (unless through some non-obvious and possibly wrong interpretation), so it can't have a "data" parser; Markdown, schematics and code are directly usable by the end user (the developer), so they can have a parser. The day someone writes a parser to understand whether the contents of the XML are telephone numbers or cooking recipes, then they will have their own parser and category.

seppestas commented 7 years ago

Actually, the more I think about it the more I agree that KiCad files are not markup files. To answer my own question: the textual contents of e.g. a KiCad layout are not typically rendered, while the text of an HTML file is rendered; the "markup" is just added for semantics.

Either way, having KiCad files being ignored in language statistics is a shame. I think the proper solution is to come up with a "type" that is considered a "main" language (i.e counted in the language statistics). I'm going to create a separate issue for this.

Alhadis commented 7 years ago

@MauroMombelli, your tone is neither warranted nor welcome. Please calm down, or excuse yourself from the discussion.

so schematics, layout and such are NOT document? where i can read a hard definition on what is a "document"?

I mean "document" in the human sense of the word. So the dictionary will give you the hard definition: "Any material substance on which the thoughts of people are represented by any species of conventional mark or symbol."

Either way, having KiCad files being ignored in language statistics is a shame. I think the proper solution is to come up with a "type" that is considered a "main" language (i.e counted in the language statistics). I'm going to create a separate issue for this.

Yes, I understand this is the real issue. You wouldn't be the first to complain about it. We've had users voice complaints over "missing" language bars for Markdown- or XML-only repositories; one user even had a repository full of .editorconfig files. Needless to say, GitHub can't hope to please every user...

MauroMombelli commented 7 years ago

@Alhadis first, sorry if you consider my tone harsh; it is not meant that way. I think if you exclude the "tone" you can see those are legitimate questions, only partially answered.

"Any material substance on which the thoughts of people are represented by any species of conventional mark or symbol."

so electrical schematics are exactly this; component symbols and layouts are actually more "conventional" than many spoken languages, as they are understood all over the world. Yes, reading them without the proper editor can be quite hard, as you have to know the specific underlying representation, but the same can be said for a text document without a text editor (maybe with some non-standard ASCII characters) or an HTML file without a browser. Or PDF, "the" format for official human documentation, which AFAIK does not have its own special bar.

Needless to say, GitHub can't hope to please every user...

but we are comparing borderline projects against the ability to look for a quite common project type, especially now that IoT and open hardware are becoming more and more common.

valerionew commented 7 years ago

I'm sorry @Alhadis but I can't agree. You can't equate a KiCad repository to a Markdown repository, to an XML repository or even a .editorconfig. They are not even close. A Markdown repository is a generic repo of text with some markup. An XML file is the same; who knows what goes inside. With KiCad it is totally different. First of all, KiCad files refer to a unique use and a unique piece of software. In Markdown, XML or a dotfile you can write whatever you want. KiCad is far more specific, and more common. Moreover, a KiCad project is something huge, compared to a markdown or a XML file. I mean... it's an entire hardware project!

I think that this is not the same case, and something has to be done to fix this situation; it can't be dismissed with a "GitHub can't hope to please every user". I hope we'll get your and GitHub's collaboration to fix it...

Alhadis commented 7 years ago

@5N44P That's beside the point. The point is, all of these languages are either data or prose types, and you two haven't been the only users inconvenienced by the way GitHub represents such things.

Moreover, a KiCad project is something huge, compared to a markdown or a XML file. I mean... it's an entire hardware project!

FYI, Android apps are written using XML, and there are many documentation-only projects on GitHub that're written entirely in Markdown. But never mind, that's beside the point...

If there's a solution to be had, it would probably be to add a "Language:" field to a repository's settings (exactly the way BitBucket does). This would allow users to pick languages for their repository even if the selected language is data-typed, whilst not skewing the usage stats of existing projects.

(I hope this makes sense)

seppestas commented 7 years ago

If there's a solution to be had, it would probably be to add a "Language:" field to a repository's settings (exactly the way BitBucket does). This would allow users to pick languages for their repository even if the selected language is data-typed, whilst not skewing the usage stats of existing projects.

Personally I prefer the way Github handles multi-language projects, since my use-case (or at least the use-case of the company I work for) is mostly repositories containing both KiCad design files and firmware source files.

@Alhadis Do you think it would be acceptable to re-classify PCB design project files (like KiCad and Eagle projects) as "programming" or "layout" files? I agree it's semantically not correct, but as far as I understand it's the easiest way to have these types of repositories recognized in a usable way until a better approach is found. I also preferred having all KiCad files recognized as a single language.

Also, sorry for the harsh comments in this thread, but try to understand that you kind of broke the language recognition for the primary source files in a lot of the affected repositories because of the changes in #3743. I must admit you also triggered a minor nerd rage in me when I looked at the PR (but I worked it out on a colleague instead XD). You don't seem to actively use KiCad yourself, so you are not yourself impacted, meaning:

You might consider KiCad as some outskirt of the languages.yml file no one has touched in years, but there is a large and diverse community using it every day for different purposes. Imagine your JavaScript projects suddenly being recognized as Makefile projects.

Also please try to keep comments on PRs on topic, and maybe try to involve someone using the language (like the people who made the open source projects you used for testing) when making these types of changes.

/rant

Alhadis commented 7 years ago

You don't have a clear reason for making the changes apart from some semantic details

*sigh* Guys, I'm a regular contributor to this repository, and I was simply fixing what I *knew* were errors.

It was a mistake to classify KiCad as "programming" in the first place. You're all upset because the "correct" way happens to inconvenience you. If we knowingly make an exception to you guys and leave KiCad tagged as a "programming" language, what are we to say to the next user who complains their platform-specific data files are being ignored by GitHub? "Sorry, but the KiCad guys were angrier, tough luck"...?

Now, instead of whining, how about submitting a pull-request - or at the very least, giving a comprehensive list of what's currently missing?

Honestly, I didn't expect angsty snark when I submitted that PR. You're welcome for the spiffy syntax highlighting, BTW.

seppestas commented 7 years ago

I understand you're just a contributor, and I'm very grateful you are trying to fix things and maintain proper support for KiCad in Linguist, but things used to work fine for us and now they don't, that's why we're a bit upset...

I agree, KiCad is neither a programming language nor markup language, but having it classified as data prevents it from being recognized as primary language. I think the only proper fix requires more fundamental changes I plan to propose, but for now classifying the files incorrectly would work as a hack.

Thanks for the syntax highlighter, I'm planning something even cooler that might one day work on Github, but first https://github.com/isaacs/github/issues/1005 needs to be addressed ^^.


Alhadis commented 7 years ago

but things used to work fine for us and now they don't, that's why we're a bit upset...

Yes, because you were all used to it. Ask yourselves: had KiCad been correctly classified in the first place, would you still be as upset as you are now...?

I agree, KiCad is neither a programming language nor markup language, but having it classified as data prevents it from being recognized as primary language.

Just curious, how exactly does that impact your project...? Don't forget, there's a topics feature you can use to mark a project as being KiCad-related.

valerionew commented 7 years ago

The main doubt is this: When things were formally incorrect, they worked. Now things are formally correct, but they don't work anymore. Which one of these is the right one? I'd say the working one... That's what we all are trying to say: usability and working features should come first, according to us 😄

BTW I wish I could help with a PR, but as I said I'm not a programmer. I can help by answering questions on KiCad, or by doing the layout of electronic boards... Maybe help with a little bit of firmware, nothing more...

Alhadis commented 7 years ago

Okay @5N44P, I want you to envision the following hypothetical scenario:

  1. A new CAD program is recently developed that gains traction across the FOSS community, and saves its data in a plain-text format similar to a KiCad schematic.
  2. A user submits the CAD program's project files to Linguist as a new language addition. The language is added as data.
  3. It gets accepted, and goes live on GitHub.
  4. User submits issue to Linguist asking why their projects aren't being categorised as "CadXYZ" projects.
  5. Contributors explain what I finished explaining earlier, about GitHub's 4 language types, and which types are considered when calculating usage stats.
  6. User asks why KiCad is added as "programming" when it's fundamentally equal to CadXYZ, which is added as "data".
  7. Maintainers now have to think of some contrived rhetoric to justify the discrepancy.

Don't you see how many complications this could lead to...? How are we to justify to another crowd why KiCad is considered special, but another similar format isn't?

valerionew commented 7 years ago

I understand that, but why not suggest to GitHub that it make some changes to its 4 language types? (I think that we all agree that the system has some serious limitations, especially if its rules are applied so strictly.) And in the meantime patch the things by considering CADs like kicad as programming, just for the limited period of time before github changes its language policy.

And when a new CAD comes around, it will benefit from the new GitHub language system. If new CADs pop up in the meantime, they won't be added as programming because it's just a temporary patch.

Alhadis commented 7 years ago

… just for the limited period of time before github changes its language policy. … they won't be added as programming because it's just a temporary patch.

You're assuming they ever will. It's up to site staff to make that decision, not us.

If you have further concerns or suggestions about how GitHub indexes repositories, take it up with site support and give them your feedback.

valerionew commented 7 years ago

Of course it's up to the staff; it is a hypothesis. I think it's pretty evident that the current system has some weak points, especially considering that the OSHW community keeps growing and there will be more and more hardware+software projects. And GitHub is called on to make a decision: either embrace hardware development as well, or completely remove the support for PCBs from the website. A half-kinda-sorta support doesn't seem to make much sense.

But as I said, I'm just hypothesising, trying to figure out a solution that makes everyone happy.

Alhadis commented 7 years ago

Of course things aren't perfect. There's a lot that can be improved on GitHub that is, sadly, beyond our ability to fix. We'll just have to have faith that things will only continue to get better in time with user feedback.

valerionew commented 7 years ago

So, can we open a dialogue with someone from Github staff here? Maybe we should first agree, between us, on what would be the best solution to address this problem, and then ask Github's staff what they think about it

MauroMombelli commented 7 years ago

User asks why KiCad is added as "programming" when it's fundamentally equal to CadXYZ, which is added as "data"

if it is a "new CAD program is recently developed that gains traction across the FOSS community" i would add it. Like if it is the new cool language like javascript or a joke language like Brainfuck.

As long as the thing has some traction and user, what would be the PROBLEM to tag it, aside from a formalist definition (open to interpretation)

You're assuming they ever will. It's up to site staff to make that decision, not us.

and

sadly, beyond our ability to fix

but in the PR you broke it... That means someone added it, multiple times, and for different editors. You may not agree that KiCad files are markup, but at least there is a gray area there.

Alhadis commented 7 years ago

So, can we open a dialogue with someone from Github staff here?

/cc @lildude @vmg Could one of you gents please deal with these guys?

@MauroMombelli commented an hour ago

Mauro, I'm actually ignoring your responses because you're not making a word of sense, and I get the feeling you've only joined in on this conversation to point fingers and play the blame-game. The fact you use the word "break" with relation to recent changes proves you don't actually know what Linguist is or how it even integrates with the site...

valerionew commented 7 years ago

@Alhadis i think that the problem is that me and Mauro are not native English speakers. English is not our first language, so our command is not the best. This can result in some misunderstanding, especially about the tone, which you can't really "learn" in school... Sorry about that!

seppestas commented 7 years ago

To respond to the hypothetical scenario of a new CAD package gaining traction on Github: yes, I would like to see the files of that package being supported properly by Linguist as well. Hopefully the developers of the CAD package would make some effort to make it at least a bit distinguishable from KiCad schematic files, but if they don't... nobody said Linguist's job is easy.

I still don't see the motivation for classifying source files like the ones used in CAD packages as data apart from puristic semantical reasons. If we are going that route, you might also classify languages like VHDL and Verilog as data because they are technically not really programming languages but hardware description languages (please don't do this, just trying to make a point). In most of our repos, the KiCad source files are the "primary" source file, not a supporting data file like e.g YML in Ruby projects or XML in Android projects.

What would be the negative impact of having CAD source files being (by express mis)classified as programming or layout languages? Because the positive impact is pretty big IMO:

seppestas commented 7 years ago

Also, I think we can all agree that this is sad to see as a KiCad user:

[Screenshot, 2017-08-29 09:20:29]

We want something like this:

[Screenshot, 2017-08-29 09:20:52]

This also shows why I think it makes sense to qualify all KiCad source files (apart from maybe library files) as the same type.

seppestas commented 7 years ago

Wait, actually, this seems to work:

[Screenshot, 2017-08-29 09:29:29]

It might make sense to flag "KiCad Board" as the old, deprecated layout file type.

Alhadis commented 7 years ago

Alright... I want everybody to momentarily forget about what is and what isn't visible to users in GitHub's interface, and to focus only on the strict, formal definition of the languages we index and classify. Remember too, that GitHub's indexing and classification strategies aren't set in stone... there is a possibility they refine their logic and calculate usage statistics with logic that accounts for previously "neglected" languages like XML and KiCad. This could happen, or it might not.

Linguist's responsibility is to identify files as being of a specific language, and selecting a grammar to provide syntax highlighting. How that classification is used by the site is beyond the scope of this project. All Linguist needs to do is identify files accurately and factually... and certainly without any bias towards a subset of users, as has been implied in this thread:

And in the meantime patch the things by considering CADs like kicad as programming, just for the limited period of time before github changes its language policy.

We don't play favourites or give special treatment, because that would clearly be unfair. ;)

If a format or language is realistically data, and can by no stretch of the imagination be considered a "programming language" or "markup", then Linguist should list it as such. Simple as that. Its role in the bigger picture is actually quite small: how classifications can affect the visibility of a project's files is very much outside the scope of its responsibility.

valerionew commented 7 years ago

Okay @Alhadis, but i wasn't asking for a special treatment for kicad. I was asking for a temporary fix, for all the CADs, in case that Github staff would have agreed with us that the current classification has some problems that need to be addressed.

That said, wouldn't you agree that the current split into three kicad sub-languages is not the best, given that factually all the files are used as one project?

seppestas commented 7 years ago

There could actually be quite some advantages to separating the language definitions of the different KiCad file formats:

That being said, I still think it makes more sense to be able to search for projects using KiCad instead of searching for projects that just use e.g KiCad layouts.

My main gripe is the fact that the KiCad (and Eagle) files have been classified as data, preventing them from showing up in the language statistics.

Alhadis commented 7 years ago

You all need to stop with this one project thing. Please. I understand there are two different worlds colliding here - software developers and hardware developers, but try to understand GitHub doesn't categorise repositories using a generalisation. It does so by evaluating the language of each separate file, irrespective of whatever else it happens to share a directory with. Imagine the file being taken aside one at a time, and evaluated without any knowledge of where it came from. That's ultimately what Linguist does for each repository.

It then averages those results and uses whatever programming- or markup-type language has the highest usage ratio to identify the project. Of course that's not 100% foolproof. Heck, if I were the one to decide how GitHub calculates repository stats, it'd be a whole different story (starting with the ability to override the calculated language type with a user-assigned one, a la BitBucket). As I'm not, there's no use complaining to me. Voice your concerns to GitHub site support, as I suggested.
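
As a rough illustration of the per-file evaluation described above, here is a minimal Python sketch. It is not Linguist's actual code: the extension table, the byte-size weighting, and the names `LANGUAGES`, `COUNTED_TYPES`, `classify`, `language_stats`, `primary_language` and `REPO` are all assumptions made for the example. It only shows why a repository dominated by data-typed files can end up labelled by a small amount of firmware code.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# Hypothetical extension table: extension -> (language name, type).
# Illustrative only; not copied from Linguist's languages.yml.
LANGUAGES = {
    ".c": ("C", "programming"),
    ".h": ("C", "programming"),
    ".md": ("Markdown", "prose"),
    ".kicad_pcb": ("KiCad Layout", "data"),
    ".kicad_mod": ("KiCad Layout", "data"),
    ".sch": ("KiCad Schematic", "data"),
}

# Assumption taken from the thread: only these types count towards
# the repository's language statistics.
COUNTED_TYPES = {"programming", "markup"}


def classify(path):
    """Classify one file by its extension alone, ignoring its directory."""
    return LANGUAGES.get(PurePosixPath(path).suffix)


def language_stats(files):
    """Sum file sizes per language, keeping only the counted types."""
    totals = defaultdict(int)
    for path, size in files.items():
        entry = classify(path)
        if entry is not None and entry[1] in COUNTED_TYPES:
            totals[entry[0]] += size
    return dict(totals)


def primary_language(files):
    """Return the counted language with the largest share, or None."""
    stats = language_stats(files)
    return max(stats, key=stats.get) if stats else None


# A made-up hardware + firmware repository: path -> size in bytes.
REPO = {
    "board/main.kicad_pcb": 500_000,
    "board/main.sch": 120_000,
    "lib.pretty/R_0603.kicad_mod": 2_000,
    "firmware/main.c": 8_000,
    "README.md": 1_500,
}
```

Under these assumptions `primary_language(REPO)` picks C, even though the KiCad files are far larger, and a repository holding only KiCad files gets no primary language at all. The `.pretty` directory name plays no part: only the `.kicad_mod` file inside it is looked at.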

Alhadis commented 7 years ago

Also understand that changing the way projects are currently classified to benefit hardware developers would have a negative impact on many existing projects. GitHub is huge and caters to a lot of different audiences publishing projects of vastly different natures. Some of which are a better fit for the way it hosts data than others... as it is, KiCad projects aren't such an easy fit...

seppestas commented 7 years ago

What negative impact would classifying KiCad projects as "programming" or "markup" have on other projects? It's just a matter of semantics isn't it?

Alhadis commented 7 years ago

@seppestas Please read my earlier response regarding unbiased identification and classification of languages. A PCB project's files are clearly not executable source code, and the amount of "lawyering" I'm seeing here from users trying to twist the site's definition of language types is bringing me close to closing and locking this issue.

We don't play favourites, and yes, I admit the current logic used by the site's language analysis won't suit everybody. This won't change by complaining here. I recommend directing your complaints to site staff.

Otherwise, yeah, it can be considered a case of "semantics"... though I prefer "factual accuracy".

MauroMombelli commented 7 years ago

I don't have time today, but from the KiCad docs:

Mauro:

File Formats

KiCad writes all files in human readable ASCII. This makes manipulation by hand and scripting very easy. The following is a listing on what the different files are used for.

So why should they not be considered markup? Also, they should all be "KiCad", like header and source files are both C or C++.

If someone can't understand what I say, please ask for clarification instead of ignoring me, thanks.


seppestas commented 7 years ago

I understand your earlier response, and I agree: KiCad and Eagle source files are not programming or markup languages. However, it is clear that the Github UI handles languages classified as data in a special way, namely it hides it from a project's language overview.

To my knowledge there is no formal definition as to what Github / Linguist considers as a programming or markup languages. These definitions are already subject of much debate (I mean, XML stands for Extensible Markup Language, and is classified by Linguist as data, and Hardware description languages are considered programming languages) so I think there is no real black-and-white definition here.

We are (or at least I am) not asking to create a special exception for KiCad or even CAD design / EDA packages. We just want source files that can be considered primary source files of a project, like KiCad files and Eagle files, to show up in the language statistics.

If these files can be easily distinguished from other data files (which is the case for e.g KiCad layouts, but maybe not for the XML used in Android projects) and if this does not cause harmful behaviour for other projects, I don't see the problem in classifying these types of files as programming or markup languages, until a more semantically correct solution is provided (which will probably take a long time).

Alhadis commented 7 years ago

Okay, for the last time: What you're referring to is not what this repository is responsible for. This is an issue with site UI that's best taken up elsewhere. It has nothing to do with Linguist. This is only a library for categorising and identifying languages. How that classification is used by GitHub is handled elsewhere, and has nothing to do with this codebase.

This is going in circles, and the points being brought up have already been addressed in earlier responses. I'm going to lock this, at least for now, in order to usher relevant discussion where it's needed: see github/linguist/pull/3799.

Those of you with additional complaints are advised to voice your concerns to site support, as I tried explaining earlier.

@MauroMombelli See https://en.wikipedia.org/wiki/Markup_language#Types