Fixing C/C++/Objective-C classifications

arfon commented 9 years ago

A fair fraction of the outstanding issues with Linguist mis-classifications are to do with C, C++, Objective-C and any other languages that use .m or .h as extension.

While it's possible we can see further improvements by increasing our samples for the Bayesian classifier I'm pretty convinced we need to craft a few reliable heuristics to shortcut the classifier.

We currently have the following in heuristics.rb but it's not being called.

def self.disambiguate_c(data, languages)
  matches = []
  matches << Language["Objective-C"] if data.include?("@interface")
  matches << Language["C++"] if data.include?("#include <cstdint>")
  matches
end

@DX-MON had a good go at this in #1036 but at the time we didn't have a good way to benchmark the effects of these changes. With the new benchmarks we are now in a good place to tackle this problem.

The current heuristic is deliberately limited in scope, that's because any heuristics we define should be accurate, simple to understand and fast.

Basically I'm asking for some help with this. I'm more than happy to grab a bunch (1000s) of files to test any potential changes on. Questions for you all:

How do we feel about #1036 - I'd love to see a case for C in that disambiguate_c method too if possible.
1036 is matching on a bunch of keywords - are all of these necessary? Are we potentially sacrificing accuracy for completeness?
Are there any good heuristics for pure C code?

dragonmux commented 9 years ago

@arfon The keywords have been carefully selected as being unique to the specific language they are detecting for - for example, you cannot use @ syntax outside of comments in C++ as this is a syntax violation, so we can use @protocol, for example, to help detect Obj-C as protocol is an invalid C and C++ keyword and @ syntax is invalid for both too

More testing would be a good way to verify that nobody is doing silly things in comments though, as I believe there is scope to include extra matching based on not being in comments to select out people using //! @class and such for documentation.

Also, looking specifically for std:: is a good benchmark for C++ as this syntax is unique to the STL.

arfon commented 9 years ago

Thanks @DX-MON - I've made a new PR with all of your commits in here: #1627. I made the Objective-C matchers a little more strict in this commit as @end was matching lines like these in a valid C file (and therefore classifying the file as Objective-C).

More testing would be a good way to verify that nobody is doing silly things in comments though, as I believe there is scope to include extra matching based on not being in comments to select out people using //! @class and such for documentation.

Yeah, excluding lines that begin with // might be a good idea.

dragonmux commented 9 years ago

The other blocks of text to exclude if going after comments are lines enclosed in matching /* and */ pairs (hence the "and such").

I didn't know about \b so thank you for that :)

rofl0r commented 9 years ago

i'm re-iterating my assessment from #332 : imo the only halfways easy way to differentiate between C,C++,and Obj-C headers is by looking at the extensions of other files in the repo (i.e. context). if there are .m files and no .c files, the header is definitely obj-c, and if there are only .c files, it's cleary C. for C++, the possibilities are .C, .cpp, .cc, etc...

dragonmux commented 9 years ago

@rofl0r Unfortunately the way Linguist's framework is set up makes such a comparison impossible. Also, for many instances of misclassification, the header files are in a completely separate folder to the source files, making such a comparison impractical too - Linguist would need to be able to correlate unrelated directories of files for this to be truely effective. also, .C for C++ code is not always true - it is the DOS style file extension for C code (MS-DOS only had upper-case file names and extensions after all).

rofl0r commented 9 years ago

well, DOS-C is not really C anyway, as there is no single ANSI-C compliant compiler available which runs on DOS... and i would suspect this affects no more than 1 out of 1000 C repos, so the "DOS-C" niche is negligible.

dragonmux commented 9 years ago

I didn't say DOS-C, I said DOS style file naming. All sane compilers I know of treat .C and .c the same, same as they treat .CPP and .cpp the same. FAT32's primary fall-back mode is DOS file names, same with NTFS. Because these file systems are case insensitive and do that, the tools have to tread ice the exact same way. An example of the problem is #1054 and the 2D_dos repository's code reads as complete valid ISO C.

Ignoring that for a sec, though, and getting fully back on topic, #1202 also asked about extension comparison to aid with .h file disambiguation and the answer is no different from then - sorry.

khoek commented 9 years ago

I think that "typedef struct" is a pretty clear flag for C-only code.

dragonmux commented 9 years ago

Hi @escortkeel I am intrigued as to how you figure this seeing as it is not invalid C++ or Obj-C syntax. It is also used in C++ syntax headers for C interoperability. If these headers contain classes and other non-C constructs then they should not be marked as C headers. This is why I specifically went after keywords in Obj-C and C++ that occur in only those languages in complete isolation.

khoek commented 9 years ago

@DX-MON I haven't seen much C++ code with typedef structs personally. I totally agree about the interoperability stuff, though. Good point.

ShabbyX commented 9 years ago

I think first of all it's a good idea to make sure comments are never looked at! That should reduce some false positives, although I can't tell if it's actually a problem or not without statistics.

Nevertheless, a couple of other things to look out for are (I don't know Objective-C, so this would be C vs C++):

\<template\> -> C++
#include\s*[<"][^.]*[>"] -> C++ (it's not illegal to #include <header_no_dot> in C, but I doubt anyone actually does it)
std:: -> C++ (as mentioned before)
\<delete\s*[] -> C++
(header only) \<extern "C" -> C (1)
(header only) #ifdef\s*__cplusplus\> -> C (1)

(1) these cases show a C header that tries to be nice and be able to be included in a C++ project. With the check for __cplusplus, the header may be adding more C++ functionality also, but it is clear that the header is for a C project, even if it plays nice with (or even adds features for) C++.

dragonmux commented 9 years ago

@ShabbyX I intentionally did not make assumptions based on kinds of includes because it is not illegal to specify arbitrary extensions to the preprocessor - only C++ specific STL includes are heuristically detected. extern "C" is actually also a C++-sim as is #ifdef __cplusplus - feeding the former to a C compiler results in errors and the latter clearly demarks C++ specific code. They could be added to the C++ heuristic though as that might help

ShabbyX commented 9 years ago

Ok, the extern "C" on itself (if not surrounded by #ifdef __cplusplus) is C++ only, but I would argue against #ifdef __cplusplus marking the file as C++.

It is true that #ifdef __cplusplus is adding C++ code, but a header that contains such a test is definitely not written for C++, but just supports it. If it was a header for a C++ project, it wouldn't need #ifdef __cplusplus. I believe only header files of C libraries have that test, just to play nice with C++. It would be unfair to call them C++ headers, since they really belong to a C project.

dragonmux commented 9 years ago

I've actually written several libraries which are internally C++ with externally a C API using that #ifdef trick - I do not think it is a reliable heuristic because it is for inter-language interoperability only.

ntessore commented 9 years ago

Due to the fact that C++/Obj-C intentionally use similar (or even compatible) headers, there is no way this issue can be decided for all cases. But I think the following one rule might be a starting point:

If a header works in C, mark it as C.

Since C is the progenitor of the other languages, I think it is reasonable to mark files as C when possible. The rule immediately suggests a way to find heuristics for the other languages: Parse a large sample of input files using C language rules, and record the errors that are produced when given C++/Obj-C code. The most common errors would be good candidates for a heuristic to detect what language a header is written in.

(In the end, the above rule could even have additional value, showing compatible headers in a C++/Obj-C object.)

permcody commented 9 years ago

Our repository was previously correctly classified but recent changes have definitely broken the C/C++ differentiation in our repository. I thought I'd fix just by using the .gitattributes file, but I wasn't able to get that to ever work (even in the linguist repo) so I don't know what I'm doing wrong.

If you are interested, about half of our C++ files are now showing up under C. https://github.com/idaholab/moose

dragonmux commented 9 years ago

That would be due to the following logic errors in the regex's in the heuristic that I noticed as a result of this.. @arfon please could you fix these in a new PR:

Line 45 of heuristics.rb: /^\s*template\s*</.match(data) or /^[^@]class\s+\w+/.match(data) or /^[^@](private|public|protected):$/.match(data) or /std::.+$/.match(data))

needs changing to: /^\s*template\s*</.match(data) or /^[:blank:]*try/.match(data) or /^[:blank:]*catch\s*\(/.match(data) or /^[:blank:]*(class|namespace)\s+\w+/.match(data) or /^[:blank:]*(private|public|protected):$/.match(data) or /std::\w+/.match(data))

This adds checking for C++ try-catch and fixes the presence checks of @ in front of class, etc - they aren't needed because the regex already clamps to start-of-line - just junk excess white-space from the line and get on with it :).

lutoma commented 9 years ago

To chip in another repository where C is wrongly detected, https://github.com/lutoma/xelix has had wrong detections ever since it was created 4 years ago, and right now ~13%/2% are detected as C++/Obj-C respectively even though it doesn't have any. It's an osdev project though, so it might make use of weird patterns that may not occur in "normal" C.

dragonmux commented 9 years ago

@lutoma I've taken a look and this seems to only be because the classifier is getting it wrong, not because of the heuristics. I would suggest making a few of those headers (longer ones for preference) samples for the Bayesian classifier.

arfon commented 9 years ago

@DX-MON - does your earlier comment still hold? i.e. Does the regex still need fixing up?

dragonmux commented 9 years ago

@arfon Yes, the regex fix should cover @permcody's problem. @lutoma's is down to the classifier screwing up.

arfon commented 9 years ago

:+1: thanks for the clarification @DX-MON

lutoma commented 9 years ago

@DX-MON Ah, thanks for checking.

I don't know how exactly linguist works internally, but maybe it might also be worth a shot to try and figure out the language based on the other files in the repository. So if there's a .h file and the repository exclusively contains C, but no C++/Obj C files, it should probably be counted as C.

As for submitting samples for the Bayesian classifier, sure. What's the best way to go about that?

dragonmux commented 9 years ago

@lutoma The way you've done it is perfect :) any and all samples for the classifier help so don't worry it's a large number.

As for how Linguist works, it can and does only perform classification on a per-file basis, with the type that crops up most marking the repo as that type, weighted by lines of code.

ShabbyX commented 9 years ago

Looking at some headers classified as C, while they obviously shouldn't be, here are a couple more things to look out for:

If a non-comment contains ::, it's definitely not C
If a function definition contains &, it's definitely not C
If there is using namespace, it's definitely not C

I understand that some of these checks are harder, and may require some basic parsing, but they may probably be worth it.

dragonmux commented 9 years ago

@ShabbyX using namespace is a good point. an optional "(using[:blank:])?" to the namespace check would suffice there. @arfon if you would be so kind.

Testing for & like that is not possible with just a regex, I don't think Linguist is not about to get the complexity of a C++ parser - and it's certainly not worth it for the bugs it'd introduce - and same for :: - but this is where more samples for the classifier helps. It can be trained to do this sort of work with good samples, and lots of them.

arfon commented 9 years ago

From https://github.com/github/linguist/issues/2035 both of these files are being incorrectly identified as C++:

I haven't dug into why yet but wanted to report this.

We also have another new C/C++ issue here: https://github.com/github/linguist/issues/1993

phillipberndt commented 9 years ago

I have another example for a C header incorrectly classified as C++: https://github.com/phillipberndt/pqiv/blob/469fe63df4b27463f8accf8a45ae515a17c202e7/pqiv.h

I didn't follow this issue so far, but noticed that there are only 30 C headers, 21 C++ headers, and 9 ObjC headers in the samples directory. Are those the only files used for training the classifier? If so, then while I agree that a Bayesian classifier probably won't reach 100% accuracy, that's IMHO still a pretty small sample to distinguish languages that are this similar. Did anyone ever try to solve this by just throwing a couple of 100s (or 1000s) more header files at the learner?

dragonmux commented 9 years ago

@phillipberndt Please open a pull request with that header. The // based comments are probably what are making the classifier think C++.

Throwing more samples at the classifier is exactly what needs to be done but the problem is that permission needs to be given for the files to be included in the classifier samples, and hence why a data mining operation has not been done to fill out the classifier's C, C++ and Obj-C directories.

vitaut commented 9 years ago

Here's a whole bunch of mostly C++ files misclassified as C:

https://github.com/search?utf8=%E2%9C%93&q=language%3Ac+namespace&type=Code&ref=searchresults

There are 8,016,195 matches although there are multiple matches per file.

I think a good heuristic would be to check if a file contains namespace, typename, static_cast or other fairly unique C++ keyword.

vitaut commented 9 years ago

Here are two C++ files incorrectly identified as C:

https://github.com/cppformat/cppformat/blob/74169e4b5deb805a03ebe7db0c90c3ea196edbb1/test/posix-test.h https://github.com/cppformat/cppformat/blob/f2c9df8e9f1ecd824bb3fdb99a0761602c0be729/test/util.h

ntessore commented 9 years ago

@DX-MON There is a list of public domain libraries at http://unlicense.org. I think you could find a couple of hundred files there that would not need permission, as far as I understand (IANAL).

ShabbyX commented 9 years ago

@vitaut One interesting point with the github search results you mentioned is that it seems that most of the files have a .C extension which already means the file should be classified as C++.

vitaut commented 9 years ago

@ShabbyX Yes, although GitHub shows only 1000 results so it's not clear what's going on in the rest.

vitaut commented 9 years ago

Sorting in a different order shows some misclassified .h and even .c files (why are they putting C++ code in .c files?). There are correctly classified C cases too, but mostly because the search is not very specific.

dragonmux commented 9 years ago

@ShabbyX a ".C" file does not a C++ file make.. actually you shouldn't be using .C for C++ as it's the DOS form of a C file. likewise .H. Valid C++ implementation extensions, regardless of case, are .cc, .cpp, cxx. While .C is accepted by many compilers as C++, that behaviour is inconsistent with case insensitive file systems and DOS originating repositories.

vitaut commented 9 years ago

@DX-MON Well, according to Wikipedia, here's a list of C++ extensions:

.cc .cpp .cxx .C .c++ .h .hh .hpp .hxx .h++

Moreover, .C files are used for C++ code as can be seen from the search query I provided. So I think the detection should be based not on the file extension in this case, but on the file content.

dragonmux commented 9 years ago

@vitaut Your suggestions for typename and static_cast I like. We already do classifications on the std:: namespace as that is unique to C++, but in order to not accidentally match in comments, I did not write the heuristic to directly match the namespace keyword, only namespace std:: anchored to the start of a line.

dragonmux commented 9 years ago

@vitaut I cannot disagree about file content for detection, point I was making is that there were a large number of repos being incorrectly marked as C++ because of case sensitivity - .C for C++ is inherently broken in too many ways. Modifying languages.yml to include .C in both the C and C++ sections should do the trick to pass such files through the heuristic and classifier systems.

Also, if you look closely at your wiki link you will notice that .C is only valid as C++ on Unix/Linux machines. Case in point, Visual Studio ignores .C as C++ as do all other windows C++ compilers.

vitaut commented 9 years ago

@DX-MON Sure. Including .C both in C and C++ sections sounds reasonable.

lutoma commented 9 years ago

To chip in on this again, I just quickly ran

find ~/code -name "*.c" -exec cp {} samples/C/ \;
find samples/C/ -execdir sh -c "if ! isutf8 -q {}; then rm -v {}; fi" \;

which copied all C files from my code directory to the samples folder and pruned those that had invalid UTF-8 in them (surprisingly many, which made the tokenizer throw exceptions).

This added some ~31000 C sample files to the classifier (and increased the runtime of bundle exec rake samples to about 7 minutes for me, so there's that).

With the additional samples, the C++ misdetections decreased from 3.3% to 1.11%, so I guess throwing even more files at the problem should make it accurate enough to pull the C++ percentage below the display threshold for GitHub.

If wanted, I could probably collect a few 10k C files from various projects in the public domain (and maybe under non-copyleft open source licenses), but I would like some input from GitHub first if that works for you legally. Also, I'm not sure if storing all that in a git repository makes sense, since at that point we're basically storing huge "binary" blobs that will probably never be edited again.

dragonmux commented 9 years ago

I like it, the same would be good to have done with some representative C++ code too as that should drop the error rate even further.

You make a fine point about having so many files in repository as essentially a binary blob kind of deal. I wonder what GitHub staff or @arfon @bkeepers have to say on the matter. This weekend I plan to put together a patch set for introducing .C and .H back into the mix as an extension that must go through the detection machinery too.

arfon commented 9 years ago

This added some ~31000 C sample files to the classifier (and increased the runtime of bundle exec rake samples to about 7 minutes for me, so there's that).

With the additional samples, the C++ misdetections decreased from 3.3% to 1.11%, so I guess throwing even more files at the problem should make it accurate enough to pull the C++ percentage below the display threshold for GitHub.

Thanks for doing this @lutoma but I don't think this is a good way forward. Currently the classifier does some pretty weird stuff including weighting the results by the number of samples :open_mouth:. This basically means that your 'improved' results here could very well be because of the vastly increased sample set for C rather than an actual improvement in the performance/fidelity of the classifier.

At this point I would rather see us continue to focus on heuristics as a way forward for this issue and problematic repos (with for example lots of small files) people to make use of the various Linguist overrides available to them.

phillipberndt commented 9 years ago

Currently the classifier does some pretty weird stuff including weighting the results by the number of samples

Wouldn't it be a good idea to remove that weight? It doesn't make much sense for the current sample set either.

arfon commented 9 years ago

Wouldn't it be a good idea to remove that weight? It doesn't make much sense for the current sample set either.

In theory yes but that's potentially a huge change for us and so I don't think it's something we could do quickly.

We're honeslty looking at re-writing the classifier from scratch at this point.

benwaffle commented 9 years ago

It would be great if C++ headers without extensions (like these) could be marked correctly. linguist also affects syntax highlighting, right?

ShabbyX commented 9 years ago

It's interesting to note that those files start with // -*- C++ -*- which is actually a very easy and definitive way of detecting the file type. I'm not familiar with the syntax, but I guess that's for emacs? There is also a syntax for vim which linguist could use to detect the language as stated by the author.

On Wed, Mar 18, 2015 at 6:43 PM, benwaffle notifications@github.com wrote:

It would be great if C++ headers without extensions (like these https://github.com/llvm-mirror/libcxx/tree/master/include) could be marked correctly. linguist also affects syntax highlighting, right?

— Reply to this email directly or view it on GitHub https://github.com/github/linguist/issues/1626#issuecomment-83083058.

arfon commented 9 years ago

It's interesting to note that those files start with // -*- C++ -*- which is actually a very easy and definitive way of detecting the file type.

Yep, and Linguist should recognise these: https://github.com/github/linguist/blob/master/lib/linguist/strategy/modeline.rb ... I'll do some digging to see why it's not working.

arfon commented 9 years ago

@ShabbyX - looks like the Emacs modeline isn't working for C++ - this Pull Request will fix your issue: https://github.com/github/linguist/pull/2233

plasmLC commented 9 years ago

What should i do to suppress detecting my header files as C++ code? https://github.com/plasmLC/libvk

Thanks!

github-linguist / linguist

Fixing C/C++/Objective-C classifications #1626

1036 is matching on a bunch of keywords - are all of these necessary? Are we potentially sacrificing accuracy for completeness?