github-linguist / linguist

Language Savant. If your repository's language is being reported incorrectly, send us a pull request!
MIT License
12.32k stars 4.27k forks source link

Some .pm categorised as Perl6, some as Perl #2149

Closed mintsoft closed 9 years ago

mintsoft commented 9 years ago

Hey,

We have a collection of .pm files, all of which are Perl 5; however about 10% are being classed as Perl 6 incorrectly:

I've had a quick look and I can't see what the particular reason is. A specific example is: Erroneously Perl 6:

Correctly Perl 5:

cc @jagtalon @mrchrisw

mintsoft commented 9 years ago

I've also just noticed it here:

Which is also Perl 5

pchaigno commented 9 years ago

I'm afraid there is not much that can be done with Linguist here. As you can guess, distinguishing Perl 5 and Perl 6 is tricky. The way Linguist does it is by first using some heuristic; if that doesn't work the Bayesian classifier tries to guess the language. But since the two languages are really close, it will often fail.

A few solutions to fix your case:

mintsoft commented 9 years ago

@pchaigno I can certainly see how difficult this is. Just to clarify, if we use the .gitattributes then it wouldn't fix the highlighting in the cases when it's incorrectly classified as 6?

pchaigno commented 9 years ago

Just to clarify, if we use the .gitattributes then it wouldn't fix the highlighting in the cases when it's incorrectly classified as 6?

That's correct. Unfortunately, the highlighting and the search results are currently not affected by the overrides :/

mintsoft commented 9 years ago

Dang, that's a real shame. Do you know if there's any plans to link the overrides in at some point?

pchaigno commented 9 years ago

I think it is considered. @arfon will know for sure.

arfon commented 9 years ago

Do you know if there's any plans to link the overrides in at some point?

Yeah, there's an issue for this here. We'd definitely like to add this functionality - it's just a bunch of work on our end (for GitHub/infrastructure reasons).

arfon commented 9 years ago

... for now, the EMacs and Vim overrides are probably your best bet: https://github.com/github/linguist#using-emacs-and-vim-modelines

mintsoft commented 9 years ago

Cool thanks @arfon We've actually opted for the use strict; for now.

arfon commented 9 years ago

:+1: ok thanks @mintsoft

kraih commented 9 years ago

If Linguist can't distinguish between Perl 5 and Perl 6, why does it default to Perl 6 here? The odds of it being Perl 5 are much much much higher. I think that's the real bug.

arfon commented 9 years ago

If Linguist can't distinguish between Perl 5 and Perl 6, why does it default to Perl 6 here? The odds of it being Perl 5 are much much much higher. I think that's the real bug.

Because the 'likelihood' is determined by the bayesian classifier and not any other custom rules about languages which are more likely purely based on popularity.

As @pchaigno discussed, getting Linguist to do the right thing here is hard so I'd encourage you to take a look at the overrides to address this issue.

jberger commented 9 years ago

Based on the linked code, there seems to only be certain heuristics, but no way to provide a default 'else'?

  ...
  else
    Language["Perl"]
  end
arfon commented 9 years ago

It is possible to provide a default 'else' but we try to avoid writing heuristics that way. Especially as we don't actually know the file extension we're working with when we're in the heuristic.

zoffixznet commented 9 years ago

What about adding popularity of a language, however vaguely it is defined, as part of the heuristics?

Regardless of Linguist's implementation, Perl5 has 150,717 modules (not all of them on GitHub though), while Perl6 has only 330. It's incredibly more likely that an ambiguous file is Perl5 code, not Perl6.

arfon commented 9 years ago

What about adding popularity of a language, however vaguely it is defined, as part of the heuristics?

It's possible we could do something with popularity somewhere in the language classification but the heuristics isn't the right place.

jberger commented 9 years ago

I could even suggest that, in the interim, the lack of a use v6 ought to be enough to disambiguate. I'm fairly sure that that is a Perl 6 best practice and will (or should) be for some time.

smls commented 9 years ago

Perl 6's design docs have some thought in the matter: see point 8 in this bullet point list.

In practice, this means that it should be safe to assume a .pm file is Perl by default, and only treat it as Perl 6 if the first line which is not empty and not a comment starts with one of:

use v6
class
module
unit

PS: unit keyword is not mentioned in the above link, because it was only added recently. But unit class Foo; will likely become the "default" way to start a Perl 6 module file, so don't miss it.

mintsoft commented 9 years ago

If Linguist can't distinguish between Perl 5 and Perl 6, why does it default to Perl 6 here? The odds of it being Perl 5 are much much much higher. I think that's the real bug.

I agree with this, it does seem odd that the weighting appears to be in favour of Perl6

I could even suggest that, in the interim, the lack of a use v6 ought to be enough to disambiguate. I'm fairly sure that that is a Perl 6 best practice and will (or should) be for some time.

This seems sensible to me

zoffixznet commented 9 years ago

I could even suggest that, in the interim, the lack of a use v6 ought to be enough to disambiguate. I'm fairly sure that that is a Perl 6 best practice and will (or should) be for some time.

Just asked about this on #perl6 IRC channel (irc://irc.freenode.net/#perl6) and FROGGS said that it likely won't work in this case, because we're also trying to decide whether the code is Prolog.. Here is the full discussion: http://pastebin.com/5LMub5X0

Also, to add to the discussion, nwc10 had this to suggest:

alternative question - is it viable to help train the basien filter - ie would github be happy with a button for "you're wrong, it's this other language?"
mintsoft commented 9 years ago

because we're also trying to decide whether the code is Prolog.

Surely that's relatively straight forward, if there are $ in it, it's not Prolog?

zoffixznet commented 9 years ago

Would there be any possibility of re-opening this issue so someone with more intimate knowledge of Ruby/Linguist could perhaps see if any of the suggestions above are viable and could be implemented, to make Linguist's guesses more accurate?

mintsoft commented 9 years ago

@zoffixznet I've reopened it for now, I'll leave it open unless the github staff close it

jberger commented 9 years ago

excellent, thanks!

Grinnz commented 9 years ago

This is excellent, thank you, but just wondering, does this also help with .t files? I don't see it referenced in the commits, and these seem to be the most often marked as Perl6 mistakenly.

arfon commented 9 years ago

This is excellent, thank you, but just wondering, does this also help with .t files? I don't see it referenced in the commits, and these seem to be the most often marked as Perl6 mistakenly.

Not currently. We could write a heuristic that is only for .t extensions which uses the same matchers for .pm here. Note that .t isn't unique to Perl/Perl 6 - Turing also uses it so we'd need to make sure to include that also.

What do you think @pchaigno?

zoffixznet commented 9 years ago

@arfon in my experience, it's mostly the Perl 5's .t files that Linguist mistakenly thinks are Perl 6, so it would be very helpful if those had the heuristic evaluation as well.

polamjag commented 9 months ago

FYI: I feel like that problems described here is pretty much resolved by https://github.com/github-linguist/linguist/pull/6264 (ref. https://github.com/github-linguist/linguist/discussions/6263)