Classifier gives false positive if no heuristic gives a positive (and giving negative impossible)

veox commented 7 years ago

This has been extracted (and somewhat expanded) from a comment in PR #3560. See that for context.

Set in question

As commit d852415204baaa94d53ac9fd98d7dbe1c873e230 notes, and as is discussed in PR #2634, there are many files with the .sol extension.

Known categories:

Solidity. 1.1. Modern uses can be recognised by use of pragma solidity (search). 1.2. Older files (written before the introduction of pragma) can be rec-d by use of contract keyword (search), but this is incomplete. Different from what was written in PR #2634, contract keyword is not mandatory, and may not be present for libraries - they will have library instead.
Eagle (electronics CAD) solder mask (or similar), rec-d by ending in M02* (search).
Solution files to UVa Online Judge problems, e.g. this one. The site seems to be a library of computational problems with means for automated solution submission. The files in question mostly contain space-separated numbers. This extension/format is not official: their submission specification page states that no file read/writes are to be performed, so the .sol is most likely a user convention to present results more easily to each other.
Other? Drawing CAD solids? Solar flare measurements? SoLame scripts?..

Classifier gives false positive if no heuristic gives a positive

Using the current approach with extension-based classifiers and content-based heuristics, and implementing them naïvely as in PR #3560, gives the following results.

(1.1) and (2) are properly detected as (respectively) Solidity and Eagle by the simple heuristic included in this PR.

~~The others are not. It is unclear to me whether proposed heuristic would skip classification for files not matching either regex.~~

(3) is erroniously detected as Eagle. Seems that the first candidate (listed alphabetically) is always used if:

the file is not empty;
no heuristic gives a positive hit; and
there is at least one extension-based classifier guess.

Also note how "no match" is identical to "no answer" - they're both nil.

Issue description (finally!)

At the root of the issue is the method of propagating answers from heuristics to classifiers.

Heuristics may give a positive hit, indicated by a Language['Somename']. Depending on the heuristic's implementation, there may be several positives, or just one if the heuristic is naïve (as here), and skips checks after encountering the first hit.

If there is no positive hit, and there were several languages claiming an extension, the classifier will have an array of several languages, but nothing to match them to, and will use the first item available.

Proposed solution (`not-a-language`)

A better approach may be using "content signatures" (one per content type) instead of heuristics for "contentious" extensions (sorry for the pun). This does seem to require a large refactoring.

That is, instead of having one heuristic for an extension, with checks for every language that lists that extension, a language could list an extension and a heuristic for detecting its own flavour of file contents.

The requirement to provide a signature should be optional, allowing languages to "claim" entire extensions, if there are no more "contenders", as in the status quo.

If an "extension-namespace collision", such as described in this issue, becomes known, then that extension can be added to a special fallback signature for not-a-language, that would match any content (perhaps even "no content"). Languages claiming to have the same extension, but providing no signature, could then be ignored.

This would require tests: adding a new signature should not result in erroneous de-classification or re-classification.

not-a-language should have no extensions associated with it at the beginning.

I could provide a demonstration of the issue with a dummy project and a CI job, but that would take a lot more work. Is that necessary, or is the above description sufficient?

veox commented 7 years ago

Moved from PR https://github.com/github/linguist/pull/3560#issuecomment-293069399.

With the current codebase, it is possible to indicate not-a-language with a one-line change (different branch to this PR's). See next commit for example use.

However, I'm very unexperienced with Ruby, and this approach may be too hackish. (But I like it. 😄 )

EDIT: Ping @pchaigno as having proposed not-a-language in PR #2634.

veox commented 7 years ago

PR #3196 (blocked) linked above tried to work around the current behaviour by allowing a guard check that never fired to do its thing.

veox commented 7 years ago

🐟 ❤️ 🏷

pchaigno commented 6 years ago

If there is no positive hit, and there were several languages claiming an extension, the classifier will have an array of several languages, but nothing to match them to, and will use the first item available.

You omitted the Classifier strategy which uses a naive Bayesian classifier trained with our sample files. Although it comes as our last-resort strategy, the Classifier is actually doing most of the classifications here. And, unfortunately, I think it is the real issue with the not-a-language option. In its current form, the Classifier is not always doing a very good job---which is why the heuristic rules were introduced in the first place---so we don't know if it could compute reliable confidence scores from which we could build a not-a-language option.

A better approach may be using "content signatures" (one per content type) instead of heuristics for "contentious" extensions (sorry for the pun). This does seem to require a large refactoring. That is, instead of having one heuristic for an extension, with checks for every language that lists that extension, a language could list an extension and a heuristic for detecting its own flavour of file contents.

I'm not sure I understand the difference between content signatures and our current heuristic rules. Are you proposing to have a single heuristic rule per language?

veox commented 6 years ago

I'm not sure I understand the difference between content signatures and our current heuristic rules. Are you proposing to have a single heuristic rule per language?

Yes, essentially.

Alhadis commented 6 years ago

Paraphrasing my suggestion in #3589: I think a reasonable (albeit irregular) solution would be to assign priority to our heuristics, so some (more important) heuristics are tested first, before the Bayesian classifier. Others, like a heuristic which tests for an XML doctype, could be run after the Bayesian classifier. This would require a non-trivial refactoring of our current logic (and I feel @pchaigno is better equipped to address this than I), but it would allow us finer control over edge-cases like these ones.

I initially thought of matching based on signature patterns, but realised earlier tonight that's essentially just a rebranded heuristic, only with a different priority level. Which convinced me that perhaps a priority level would be the key here...

This may have been partially-inspired by a similar mechanism used by file-icons, although its role is arguably different since its entire matching strategy is indexed and ordered by priority-before-alphabetisation.

pchaigno commented 6 years ago

Others, like a heuristic which tests for an XML doctype, could be run after the Bayesian classifier.

That's part of our not-a-language issue: there's nothing after the Bayesian classifier. There cannot be anything as long as we're not able to associate confidence scores to the classifier's results.

When we leave the Bayesian classifier, we have n languages and n associated matching scores. To decide that we need to run an extra strategy, we'd have to know when the Bayesian classifier's results are not satisfactory. Hence, the confidence score.

Alhadis commented 6 years ago

Is there any way we could run this outside of the usual logic? I mean, if the command-line executable returns a blank, that's a dead-end that could signal another phase of checks...

Alhadis commented 6 years ago

BTW, here are some other ideas for last_resort heuristics:

Makefile.*:
Check filename's prefix to see if it matches Makefile.*. If so, look for basic Make syntax. Many projects have Makefiles suffixed with processor/architecture (e.g., Makefile.arm, Makefile.intel, etc).
APL files without an extension:
If it contains APL symbols, and they're valid APL expressions, it's probably APL.
C files with a strange extension in a repository packed with confirmed C files:
I've seen this numerous times (often while going through harvest results): an ad-hoc extension added to a C-like language that's pretty obviously C, but without a standard file-extension. Often tests or fixtures.

Etc... man, I'm running my mouth at this point, and probably getting too ahead of myself. Take from this whatcha will. :)

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had activity in a long time. If this issue is still relevant and should remain open, please reply with a short explanation (e.g. "I have checked the code and this issue is still relevant because ___."). Thank you for your contributions.

stale[bot] commented 5 years ago

This issue has been automatically closed because it has not had activity in a long time. Please feel free to reopen it or create a new issue.

github-linguist / linguist

Classifier gives false positive if no heuristic gives a positive (and giving negative impossible) #3589

Set in question

Classifier gives false positive if no heuristic gives a positive

Issue description (finally!)

Proposed solution (`not-a-language`)

github-linguist / linguist

Classifier gives false positive if no heuristic gives a positive (and giving negative impossible) #3589

Set in question

Classifier gives false positive if no heuristic gives a positive

Issue description (finally!)

Proposed solution (not-a-language)

Proposed solution (`not-a-language`)