github-linguist / linguist

Language Savant. If your repository's language is being reported incorrectly, send us a pull request!
MIT License
12.29k stars 4.25k forks source link

Perl 5 vs Perl 6 mis-detection #4090

Closed zoffixznet closed 6 years ago

zoffixznet commented 6 years ago

Preliminary Steps

Please confirm you have...

Problem Description

Perl 5 and Perl 6 are different languages (despite similar names; like C vs C++). However, the heuristic fails in some cases. This was previously opened in https://github.com/github/linguist/issues/3906 and the cases covered in that issue were resolved.

URL of the affected repository:

https://github.com/zoffixznet/Music-ScaleNote

As far as I can see, the heuristic is failing because the heuristic looks for use strict, but the module uses use strictures instead.

Last modified on:

No idea

Expected language:

Perl 5 (aka "Perl")

Detected language:

Perl 6

Details

Since the previous times we had ambiguity issues, the Perl 6 language has now decided more strongly that its extensions are .pm6 for modules, .p6 for files and .t for scripts.

Thus, I'd argue that the heuristic for .pm and .pl files should prefer Perl 5 much more strongly. Likely just checking for Perl 6 first, with use\s+v6\b only, not the current regex, and marking code as Perl 6 if detected, otherwise, going to Perl 5. (personally, I'd say it shouldn't be disambiguated with Perl 6 at all and just assume those extensions are always Perl 5 or other lang).

That leaves .t disambiguation. I'm less certain of this one, but just thinking that it can be improved if Perl 6 is tested first and checked against use\s+v6\b, along with some testing subs that have a dash in their names done-testing|is-deeply|isa-ok|throws-like. If that doesn't match, then assume the code's Perl 5 or other lang.

If someone submits a PR, is there a way to somehow preview that the changes give right results? Is installing Linguist locally sufficient?

pchaigno commented 6 years ago

As far as I can see, the heuristic is failing because the heuristic looks for use strict, but the module uses use strictures instead.

Could we add use strictures to the Perl 5 heuristic rule then?

We should also be able to use the shebang in the .t file. I'll look into it.

If someone submits a PR, is there a way to somehow preview that the changes give right results? Is installing Linguist locally sufficient?

Yes, it's sufficient to test detection.

Grinnz commented 6 years ago

Could we add use strictures to the Perl 5 heuristic rule then?

This is a losing battle - here's the most comprehensive list I know of for modules that import strict and warnings for you, and this isn't including opinionated ones that only use part of strict: https://metacpan.org/source/PETDANCE/Perl-Critic-1.130/lib/Perl/Critic/Utils/Constants.pm#L84-126

We should also be able to use the shebang in the .t file. I'll look into it.

There is no requirement for perl test files to contain a shebang, since they are normally executed via a makefile or prove. Note #!perl as in that file isn't even a valid shebang, it's just a convention.

pchaigno commented 6 years ago

This is a losing battle - here's the most comprehensive list I know of for modules that import strict and warnings for you, and this isn't including opinionated ones that only use part of strict: https://metacpan.org/source/PETDANCE/Perl-Critic-1.130/lib/Perl/Critic/Utils/Constants.pm#L84-126

Heuristic rules were never meant to catch 100% cases (I guess it's in the name...). I'm only asking here if adding use strictures would help us catch a greater number of Perl 5 files, or is it only used in rare cases? (i.e., is it worth it?)

There is no requirement for perl test files to contain a shebang, since they are normally executed via a makefile or prove.

The Shebang strategy is also a heuristic, so same as above.

Note #!perl as in that file isn't even a valid shebang, it's just a convention.

Sure, but it's a convention we should recognize.

zoffixznet commented 6 years ago

I'm only asking here if adding use strictures would help us catch a greater number of Perl 5 files

It would, but for .pm/.pl files, it's overwhelmingly likely for them to be Perl 5 and not Perl 6 code, which is why I suggested trying to look for Perl 6 code first, and if Perl 6 things aren't present to assume Perl 5. This way, even if the Perl 5 code using one of those dozens modules that provide strict, it'd still be detected correctly. And if Perl 6 code is mis-detected as Perl 5, the author always has the option to simply switch to the more common .pm6/.p6 extensions.

pchaigno commented 6 years ago

As far as I can see (with GitHub.com's search engine), we still have quite a few .pl and .pm Perl 6 files. Of course, these search use the use v6 statement, so we can use our current heuristic rule to detect those files. But the point is that many still use .pl and .pm, and I doubt all of them have use v6 at the top. In any case, we try to avoid these kinds of fallback else in heuristic rules.

For now, I think we can address this by both improving the heuristic rule for Perl 5 and fixing the issue with the shebang. If you have other cases of misclassifications I can look into them, but it's unlikely will be able to detect them all accurately (that's what Linguist overrides are for). If/When the number of .pl and .pm Perl 6 files gets low enough, we'll change the heuristic rule or even remove those extensions for Perl 6.

Grinnz commented 6 years ago

I would still ask why it should mark a file as Perl 6 when it can't disambiguate between them. Perl 5 usage is far more widespread and as noted will be the only use of these extensions going forward, so it's a more sensible default.

pchaigno commented 6 years ago

I would still ask why it should mark a file as Perl 6 when it can't disambiguate between them.

Because the Heuristic strategy is not the last strategy. It's actually the naive Bayesian classifier that runs last and misclassifies your files here.

pchaigno commented 6 years ago

4099 was just merged, so this issue should be fixed with the next release of Linguist. There's usually a new release every couple weeks.