github-linguist / linguist

Language Savant. If your repository's language is being reported incorrectly, send us a pull request!
MIT License

jBPM Project marked as Visual Basic #4355

Closed vinischeidegger closed 5 years ago

vinischeidegger commented 5 years ago

Projects like sample-dashboard-thymeleaf, which use several jBPM file types (.bpmn2, .frm and .wid), are being marked as Visual Basic instead of Java

Preliminary Steps

Please confirm you have...

Problem Description

I forked the repo to post a correction (I also added an override to mark it as Java - which worked fine). But I believe the logic of the linguist should be enhanced to consider these files as part of jBPM when used together with Maven POM for instance.

URL of the affected repository:

https://github.com/business-applications/sample-dashboard-thymeleaf

Last modified on:

2018-10-10

Expected language:

Java

Detected language:

Visual Basic

lildude commented 5 years ago

That repo is being identified as predominantly Visual Basic because of the .frm files, which account for far more of the repo, by bytes of code, than any other language:

Visual Basic   9.57 KB
HTML           3.83 KB
Shell          3.61 KB
Batchfile      2.33 KB
Java           1.97 KB

The only language associated with that extension is Visual Basic:

https://github.com/github/linguist/blob/fa493000a594f5fbd457bbd473ce791d95b227cc/lib/linguist/languages.yml#L5062-L5069

… hence the classification. The content of the files themselves actually looks like JSON, so even if we were to extend Linguist to support this extension with another language, it would be JSON, not Java, and JSON isn't counted towards the language stats by default because it is considered data. This would then make the repo predominantly HTML, not Java. GitHub doesn't have a concept of a "repo language", only "this repo is made up of these languages and this is the predominant language".
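The arithmetic behind that explanation can be sketched in a few lines of Python. This is only an illustration, not Linguist's implementation (Linguist is written in Ruby): the helper name `predominant` and its `excluded` parameter are hypothetical, and the sizes are the ones from the breakdown above.

```python
# Byte counts (in KB) taken from the language breakdown quoted above.
sizes_kb = {
    "Visual Basic": 9.57,
    "HTML": 3.83,
    "Shell": 3.61,
    "Batchfile": 2.33,
    "Java": 1.97,
}


def predominant(sizes, excluded=()):
    """Return the language with the most bytes, ignoring excluded ones.

    Hypothetical helper: `excluded` stands in for languages that are not
    counted towards the stats (for example, data languages such as JSON).
    """
    counted = {lang: kb for lang, kb in sizes.items() if lang not in excluded}
    return max(counted, key=counted.get)


# As things stand, the .frm bytes are attributed to Visual Basic.
current = predominant(sizes_kb)

# If the same bytes were attributed to JSON instead, and JSON is not counted,
# the remaining languages decide the result.
reclassified = dict(sizes_kb)
reclassified["JSON"] = reclassified.pop("Visual Basic")
after = predominant(reclassified, excluded={"JSON"})

print(current)
print(after)
```

With these numbers the first result is Visual Basic and the second is HTML, which matches the point above: reclassifying .frm as JSON would not make the repo show up as Java.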

But I believe the logic of the linguist should be enhanced to consider these files as part of jBPM when used together with Maven POM for instance.

Unfortunately, that would require a complete rewrite of the way Linguist works. Linguist currently considers the language of each file in isolation. Having greater repo-level context would be nice but it's not a simple matter to implement and would definitely lead to more contentious discussions than the current behaviour does.

So all things considered, short of associating .frm with JSON, which would require a heuristic to differentiate it from VB, the only other option is to implement an override as you've done.

pchaigno commented 5 years ago

Thanks for looking into this @vinischeidegger!

Projects like sample-dashboard-thymeleaf, which use several jBPM file types (.bpmn2, .frm and .wid), are being marked as Visual Basic instead of Java

At the moment, these file extensions are not associated with Java. Only .frm is associated with a language, Visual Basic. We'd welcome a pull request to add these extensions to Java, if they meet the in-the-wild usage requirement.

tsurdilo commented 5 years ago

The use of .frm is specific to the Red Hat Process Automation Manager (workbench), where it is used as the extension for process/task forms. @vinischeidegger, I agree it's unfortunate that these are marked as Visual Basic, but I think we have to live with that probably ill-chosen extension for now (until it's possibly changed in the future). Thanks for checking on this tho!

vinischeidegger commented 5 years ago

Thank you all for looking into this. I agree it may have been a poor decision from JBoss to use *.frm, when they probably could have used .json.

I agree with @lildude that we could also associate the extension with JSON and differentiate it, through heuristic analysis, from VB.

I just don't have much of an idea of how to do it myself - I would probably have to learn some Ruby first :)

If there is a way to submit an enhancement request, please let me know - and if by any chance learning Ruby crosses my path I would be more than happy to contribute.

Thank you all once again.

pchaigno commented 5 years ago

The contribution guidelines are here and don't require you to write much, if any, Ruby code. You will need a Ruby development environment though, to run a few helper scripts and the tests.

Alhadis commented 5 years ago

I just don't have much of an idea of how to do it myself - I would probably have to learn some Ruby first :)

From the top of CONTRIBUTING.md:

The majority of contributions won't need to touch any Ruby code at all.

Which, BTW, is 100% true. =) Most of the time, you'll just be editing YAML and configuration files, and even then it's easy and straightforward. Linguist was designed to be contributor-friendly, and @pchaigno and I are here to help explain anything which appears confusing. ;-)

The only part which requires a bit of effort is getting a local checkout of Linguist set up the first time. You may need to install some dependencies first; after that, just run script/bootstrap and wait for everything to finish installing. =) (It might take a while over a slow or busy connection, so grab a coffee or something.)

It's a good idea to keep your local Linguist checkout handy, even after you've finished your submission, to save the hassle of going through the bootstrapping process the next time you have something to contribute. 👍

Alhadis commented 5 years ago

but I think we have to live with that probably ill-chosen extension for now

@tsurdilo This happens more than you'd think; for some odd reason, JSON winds up with all sorts of arbitrary file extensions (XML too), which surface in search results when I've been looking up a completely unrelated filetype...

So this isn't an isolated case, and while we can add a new file extension, it needs to have widespread enough usage first. Think of the mess we'd be bogged down with if each and every single JSON file without a .json extension had to be catalogued... 😭

Moral-of-the-story: Stick to the standard file-extension for a file format if you're going to use it. 😉

vinischeidegger commented 5 years ago

So I started downloading the linguist repo to give it a try (apparently not Windows friendly, right?).

Anyway, the idea was to classify the Red Hat jBPM files that have the .frm extension (sample here) as JSON. This would be done by also listing .frm under JSON and then, through heuristic analysis, differentiating them from the classic VB .frm files (sample here).
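As a rough illustration of what such content-based differentiation could look like, here is a minimal Python sketch. It is not how Linguist implements its heuristics, and it rests on two assumptions: that classic Visual Basic form files begin with a "VERSION x.yy" header line, and that the jBPM forms parse as a JSON object or array. The function name `classify_frm` and the sample strings are made up for the example.

```python
import json
import re

# Assumption: classic Visual Basic form files start with a header such as
# "VERSION 5.00" on their first line.
VB_HEADER = re.compile(r"\s*VERSION\s+\d+\.\d+", re.IGNORECASE)


def classify_frm(text):
    """Guess the language of a .frm file from its content (illustrative only)."""
    if VB_HEADER.match(text):
        return "Visual Basic"
    try:
        parsed = json.loads(text)
    except ValueError:
        # json.JSONDecodeError is a subclass of ValueError.
        return "unknown"
    # Only objects and arrays count as JSON documents here; a bare number
    # or string that happens to parse is not treated as a form definition.
    return "JSON" if isinstance(parsed, (dict, list)) else "unknown"


# Made-up samples, not taken from any real repository.
vb_sample = "VERSION 5.00\nBegin VB.Form Form1\nEnd\n"
json_sample = '{"id": "example", "fields": []}'

print(classify_frm(vb_sample))
print(classify_frm(json_sample))
```

Checking the cheap header pattern first avoids parsing every file as JSON; anything that matches neither rule is left as "unknown" rather than forced into one of the two languages.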

Before spending any time on it, I would like to know whether it would be useful (by useful I mean whether the contribution makes sense and would be accepted), or if you guys think JSON files with different extensions should not be reclassified (even when they are part of a big solution such as jBPM).

By the way, jBPM documentation stating the file extension for its JSON forms (that can be either .form or .frm, the default being the latter) is here.

Please let me know how to proceed - or whether to close the topic by doing the .gitattributes workaround in my repos and living with wrong VB project tags all around 😄

Thanks!

tsurdilo commented 5 years ago

@Alhadis Agreed. Would it be hard to allow projects to define their own custom languages.yml, which could be used to override the Linguist defaults?

Alhadis commented 5 years ago

@tsurdilo Linguist already offers users a mechanism to override certain semantics about the way their repository's files are indexed and classified. In your case, you would be adding this to your project's .gitattributes file:

*.frm linguist-language=JSON

@vinischeidegger Alright, I'm afraid I've bad news. Out of ~722,920 indexed .frm files, nearly all of the ~3,000 samples I'd collected were anything but JSON data. The summary of my findings is here:

Based on the fact that only 0.006% of the samples were even JSON, the fact that the .frm files you speak of appear to be machine-generated artefacts instead of human-written source code, and the fact that the docs you linked to even describe this extension as something of a last-resort fallback when users generate a form ("frm"):

Sets the type of process/task forms to be generated/edited. If not set Designer will ask users to choose the type (".form", ".frm"). By setting this property you declare to use one of these two form types and users will no longer be asked to choose

😉 I'm sorry, but this is pretty much the total opposite of what we do accept (and require) for a new extension to be recognised/supported on GitHub.com. So, yes, I'm afraid you'll need to mark these files as JSON and/or as generated files (which they indeed appear to be), which is the same procedure expected of every user for correcting things on a per-project basis.
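For anyone who wants to script that per-project correction across several repositories, here is a minimal Python sketch. The helper name `add_overrides` is hypothetical; the two rules use the `linguist-language` override shown above and Linguist's `linguist-generated` attribute, and whether to apply one or both is a per-project choice.

```python
from pathlib import Path

# Override rules for a repository's .gitattributes file.
RULES = [
    "*.frm linguist-language=JSON",
    "*.frm linguist-generated=true",
]


def add_overrides(repo_dir, rules):
    """Append any missing rules to <repo_dir>/.gitattributes.

    Returns the list of rules that were actually added, so calling the
    function a second time with the same rules changes nothing.
    """
    path = Path(repo_dir) / ".gitattributes"
    existing = path.read_text(encoding="utf-8").splitlines() if path.exists() else []
    present = {line.strip() for line in existing}
    added = [rule for rule in rules if rule not in present]
    if added:
        path.write_text("\n".join(existing + added) + "\n", encoding="utf-8")
    return added
```

Calling `add_overrides(".", RULES)` from the root of a checkout would create or extend the file; the change still has to be committed and pushed before it can affect the repository's language stats.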

I hope this hasn't dampened your enthusiasm about contributing in future. 👍 If there's a bright-side to this, it's that you helped point out a pretty glaring omission in GitHub's contribution docs:

So I started downloading the linguist repo to give it a try (apparently not Windows friendly, right?).

... which couldn't be more ironic now that GitHub are owned by Microsoft, and we're not even acknowledging the existence of their own users yet. 😀 I'll fix this myself; I'm pretty handy with batchscript. 😉


Side-note: If you're curious about the mechanism I used to collect the usage samples, you can find it here. Currently, GitHub lacks a user-friendly mechanism for collecting the URLs of search results en masse, which is why I wrote something which did. Be sure to read Harvester's readme before using it.

Alhadis commented 5 years ago

Ouch. Looks like this won't be easy for Windows users after all. 😞 Linguist has a hard dependency on a library called charlock_holmes, which expects its dependencies to be located in build directories which are standard locations for Unix-like systems, but don't exist on Windows.

I was about to roll my sleeves up and start getting my hands dirty when I realised no sane Windows user would be bothering at this point, and filing an issue on GitHub would be preferable to, say, studying 500 lines of setup instructions which would be turning our readme into a compiler's instruction manual...

Guess you'll need to ask GitHub's new CEO very nicely to burn his company's operating system... erh, publish an MSI for github-linguist or something. 😁

vinischeidegger commented 5 years ago

@Alhadis, thanks for taking the time to check the relevance of the change. I had already made the change to the repo (which was not mine, but as I ended up contributing some other things, I included this fix/workaround as well).

But hey, 18 out of 3000 is 0.6% - not that the math really matters, but let's not diminish jBPM so much in front of VB - or whichever other language uses the .frm extension haha.
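For the record, the arithmetic is easy to check. A small Python sketch, using the 18 and 3000 quoted above (the sample size is approximate in the thread); presumably the earlier 0.006 was the raw fraction rather than the percentage:

```python
json_samples = 18      # JSON samples, as quoted in the comment above
total_samples = 3000   # approximate number of collected samples

fraction = json_samples / total_samples   # share as a plain ratio
percent = fraction * 100                  # the same share as a percentage

print(f"fraction: {fraction:.3f}")
print(f"percent:  {percent:.1f}%")
```

This prints a fraction of 0.006 and a percentage of 0.6%.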

As for Windows, I'm a big fan of the Linux family myself (it is really good as a server and for dedicated jobs). I've spent more than 20 years using all sorts of programming languages and environments, and this is something I always have good conversations about with my friends and other developers I've met hahaha. Anyway, I still stick with Windows for the availability of general software and its knowledge base. While working at IBM I got a giant push towards Unix and AIX, and the whole company was being pushed to use Ubuntu on their notebooks (this was back in 2006). I said that maybe by 2015 Linux would be massive... Well, 2015 has come and gone, and although Linux is massive now, I still do not see enough reasons to change. Maybe in 5 or 10 years, who knows... (If really, really needed, a Linux VM can be set up with a couple of clicks, and with the advent of Docker even that is becoming less and less necessary.)

Side talks aside, thanks once again!