Closed mintsoft closed 9 years ago
I've also just noticed it here:
Which is also Perl 5
I'm afraid there is not much that can be done with Linguist here. As you can guess, distinguishing Perl 5 and Perl 6 is tricky. The way Linguist does it is by first using some heuristic; if that doesn't work the Bayesian classifier tries to guess the language. But since the two languages are really close, it will often fail.
A few solutions to fix your case:
use v5;
(use strict;
will also work in your case)*.pm linguist-language=perl
in a .gitattributes
file to override Linguist results (this will only affect the statistics)@pchaigno I can certainly see how difficult this is. Just to clarify, if we use the .gitattributes
then it wouldn't fix the highlighting in the cases when it's incorrectly classified as 6?
Just to clarify, if we use the .gitattributes then it wouldn't fix the highlighting in the cases when it's incorrectly classified as 6?
That's correct. Unfortunately, the highlighting and the search results are currently not affected by the overrides :/
Dang, that's a real shame. Do you know if there's any plans to link the overrides in at some point?
I think it is considered. @arfon will know for sure.
Do you know if there's any plans to link the overrides in at some point?
Yeah, there's an issue for this here. We'd definitely like to add this functionality - it's just a bunch of work on our end (for GitHub/infrastructure reasons).
... for now, the EMacs and Vim overrides are probably your best bet: https://github.com/github/linguist#using-emacs-and-vim-modelines
Cool thanks @arfon We've actually opted for the use strict;
for now.
:+1: ok thanks @mintsoft
If Linguist can't distinguish between Perl 5 and Perl 6, why does it default to Perl 6 here? The odds of it being Perl 5 are much much much higher. I think that's the real bug.
If Linguist can't distinguish between Perl 5 and Perl 6, why does it default to Perl 6 here? The odds of it being Perl 5 are much much much higher. I think that's the real bug.
Because the 'likelihood' is determined by the bayesian classifier and not any other custom rules about languages which are more likely purely based on popularity.
As @pchaigno discussed, getting Linguist to do the right thing here is hard so I'd encourage you to take a look at the overrides to address this issue.
Based on the linked code, there seems to only be certain heuristics, but no way to provide a default 'else'?
...
else
Language["Perl"]
end
It is possible to provide a default 'else' but we try to avoid writing heuristics that way. Especially as we don't actually know the file extension we're working with when we're in the heuristic.
What about adding popularity of a language, however vaguely it is defined, as part of the heuristics?
Regardless of Linguist's implementation, Perl5 has 150,717 modules (not all of them on GitHub though), while Perl6 has only 330. It's incredibly more likely that an ambiguous file is Perl5 code, not Perl6.
What about adding popularity of a language, however vaguely it is defined, as part of the heuristics?
It's possible we could do something with popularity somewhere in the language classification but the heuristics isn't the right place.
I could even suggest that, in the interim, the lack of a use v6
ought to be enough to disambiguate. I'm fairly sure that that is a Perl 6 best practice and will (or should) be for some time.
Perl 6's design docs have some thought in the matter: see point 8 in this bullet point list.
In practice, this means that it should be safe to assume a .pm file is Perl by default, and only treat it as Perl 6 if the first line which is not empty and not a comment starts with one of:
use v6
class
module
unit
PS: unit
keyword is not mentioned in the above link, because it was only added recently. But unit class Foo;
will likely become the "default" way to start a Perl 6 module file, so don't miss it.
If Linguist can't distinguish between Perl 5 and Perl 6, why does it default to Perl 6 here? The odds of it being Perl 5 are much much much higher. I think that's the real bug.
I agree with this, it does seem odd that the weighting appears to be in favour of Perl6
I could even suggest that, in the interim, the lack of a use v6 ought to be enough to disambiguate. I'm fairly sure that that is a Perl 6 best practice and will (or should) be for some time.
This seems sensible to me
I could even suggest that, in the interim, the lack of a use v6 ought to be enough to disambiguate. I'm fairly sure that that is a Perl 6 best practice and will (or should) be for some time.
Just asked about this on #perl6 IRC channel (irc://irc.freenode.net/#perl6) and FROGGS said that it likely won't work in this case, because we're also trying to decide whether the code is Prolog.. Here is the full discussion: http://pastebin.com/5LMub5X0
Also, to add to the discussion, nwc10 had this to suggest:
alternative question - is it viable to help train the basien filter - ie would github be happy with a button for "you're wrong, it's this other language?"
because we're also trying to decide whether the code is Prolog.
Surely that's relatively straight forward, if there are $
in it, it's not Prolog?
Would there be any possibility of re-opening this issue so someone with more intimate knowledge of Ruby/Linguist could perhaps see if any of the suggestions above are viable and could be implemented, to make Linguist's guesses more accurate?
@zoffixznet I've reopened it for now, I'll leave it open unless the github staff close it
excellent, thanks!
This is excellent, thank you, but just wondering, does this also help with .t files? I don't see it referenced in the commits, and these seem to be the most often marked as Perl6 mistakenly.
This is excellent, thank you, but just wondering, does this also help with .t files? I don't see it referenced in the commits, and these seem to be the most often marked as Perl6 mistakenly.
Not currently. We could write a heuristic that is only for .t
extensions which uses the same matchers for .pm
here. Note that .t
isn't unique to Perl
/Perl 6
- Turing
also uses it so we'd need to make sure to include that also.
What do you think @pchaigno?
@arfon in my experience, it's mostly the Perl 5's .t
files that Linguist mistakenly thinks are Perl 6, so it would be very helpful if those had the heuristic evaluation as well.
FYI: I feel like that problems described here is pretty much resolved by https://github.com/github-linguist/linguist/pull/6264 (ref. https://github.com/github-linguist/linguist/discussions/6263)
Hey,
We have a collection of
.pm
files, all of which are Perl 5; however about 10% are being classed as Perl 6 incorrectly:I've had a quick look and I can't see what the particular reason is. A specific example is: Erroneously Perl 6:
Correctly Perl 5:
cc @jagtalon @mrchrisw