Closed veox closed 5 years ago
Moved from PR https://github.com/github/linguist/pull/3560#issuecomment-293069399.
With the current codebase, it is possible to indicate not-a-language
with a one-line change (different branch to this PR's). See next commit for example use.
However, I'm very unexperienced with Ruby, and this approach may be too hackish. (But I like it. 😄 )
EDIT: Ping @pchaigno as having proposed not-a-language
in PR #2634.
PR #3196 (blocked) linked above tried to work around the current behaviour by allowing a guard check that never fired to do its thing.
🐟 ❤️ 🏷
If there is no positive hit, and there were several languages claiming an extension, the classifier will have an array of several languages, but nothing to match them to, and will use the first item available.
You omitted the Classifier strategy which uses a naive Bayesian classifier trained with our sample files. Although it comes as our last-resort strategy, the Classifier is actually doing most of the classifications here. And, unfortunately, I think it is the real issue with the not-a-language
option. In its current form, the Classifier is not always doing a very good job---which is why the heuristic rules were introduced in the first place---so we don't know if it could compute reliable confidence scores from which we could build a not-a-language
option.
A better approach may be using "content signatures" (one per content type) instead of heuristics for "contentious" extensions (sorry for the pun). This does seem to require a large refactoring. That is, instead of having one heuristic for an extension, with checks for every language that lists that extension, a language could list an extension and a heuristic for detecting its own flavour of file contents.
I'm not sure I understand the difference between content signatures and our current heuristic rules. Are you proposing to have a single heuristic rule per language?
I'm not sure I understand the difference between content signatures and our current heuristic rules. Are you proposing to have a single heuristic rule per language?
Yes, essentially.
Paraphrasing my suggestion in #3589: I think a reasonable (albeit irregular) solution would be to assign priority to our heuristics, so some (more important) heuristics are tested first, before the Bayesian classifier. Others, like a heuristic which tests for an XML doctype, could be run after the Bayesian classifier. This would require a non-trivial refactoring of our current logic (and I feel @pchaigno is better equipped to address this than I), but it would allow us finer control over edge-cases like these ones.
I initially thought of matching based on signature patterns, but realised earlier tonight that's essentially just a rebranded heuristic, only with a different priority level. Which convinced me that perhaps a priority level would be the key here...
This may have been partially-inspired by a similar mechanism used by file-icons
, although its role is arguably different since its entire matching strategy is indexed and ordered by priority-before-alphabetisation.
Others, like a heuristic which tests for an XML doctype, could be run after the Bayesian classifier.
That's part of our not-a-language
issue: there's nothing after the Bayesian classifier. There cannot be anything as long as we're not able to associate confidence scores to the classifier's results.
When we leave the Bayesian classifier, we have n languages and n associated matching scores. To decide that we need to run an extra strategy, we'd have to know when the Bayesian classifier's results are not satisfactory. Hence, the confidence score.
Is there any way we could run this outside of the usual logic? I mean, if the command-line executable returns a blank, that's a dead-end that could signal another phase of checks...
BTW, here are some other ideas for last_resort
heuristics:
Makefile.*
:Makefile.*
. If so, look for basic Make syntax. Many projects have Makefiles suffixed with processor/architecture (e.g., Makefile.arm
, Makefile.intel
, etc).Etc... man, I'm running my mouth at this point, and probably getting too ahead of myself. Take from this whatcha will. :)
This issue has been automatically marked as stale because it has not had activity in a long time. If this issue is still relevant and should remain open, please reply with a short explanation (e.g. "I have checked the code and this issue is still relevant because ___."). Thank you for your contributions.
This issue has been automatically closed because it has not had activity in a long time. Please feel free to reopen it or create a new issue.
This has been extracted (and somewhat expanded) from a comment in PR #3560. See that for context.
Set in question
As commit d852415204baaa94d53ac9fd98d7dbe1c873e230 notes, and as is discussed in PR #2634, there are many files with the
.sol
extension.Known categories:
pragma solidity
(search). 1.2. Older files (written before the introduction ofpragma
) can be rec-d by use ofcontract
keyword (search), but this is incomplete. Different from what was written in PR #2634,contract
keyword is not mandatory, and may not be present for libraries - they will havelibrary
instead.M02*
(search)..sol
is most likely a user convention to present results more easily to each other.Classifier gives false positive if no heuristic gives a positive
Using the current approach with extension-based classifiers and content-based heuristics, and implementing them naïvely as in PR #3560, gives the following results.
(1.1) and (2) are properly detected as (respectively)
Solidity
andEagle
by the simple heuristic included in this PR.The others are not. It is unclear to me whether proposed heuristic would skip classification for files not matching either regex.(3) is erroniously detected as
Eagle
. Seems that the first candidate (listed alphabetically) is always used if:Also note how "no match" is identical to "no answer" - they're both
nil
.Issue description (finally!)
At the root of the issue is the method of propagating answers from heuristics to classifiers.
Heuristics may give a positive hit, indicated by a
Language['Somename']
. Depending on the heuristic's implementation, there may be several positives, or just one if the heuristic is naïve (as here), and skips checks after encountering the first hit.If there is no positive hit, and there were several languages claiming an extension, the classifier will have an array of several languages, but nothing to match them to, and will use the first item available.
Proposed solution (
not-a-language
)A better approach may be using "content signatures" (one per content type) instead of heuristics for "contentious" extensions (sorry for the pun). This does seem to require a large refactoring.
That is, instead of having one heuristic for an extension, with checks for every language that lists that extension, a language could list an extension and a heuristic for detecting its own flavour of file contents.
The requirement to provide a signature should be optional, allowing languages to "claim" entire extensions, if there are no more "contenders", as in the status quo.
If an "extension-namespace collision", such as described in this issue, becomes known, then that extension can be added to a special fallback signature for
not-a-language
, that would match any content (perhaps even "no content"). Languages claiming to have the same extension, but providing no signature, could then be ignored.This would require tests: adding a new signature should not result in erroneous de-classification or re-classification.
not-a-language
should have no extensions associated with it at the beginning.I could provide a demonstration of the issue with a dummy project and a CI job, but that would take a lot more work. Is that necessary, or is the above description sufficient?