Bayesian classifier sample size is far too small

fingolfin commented 10 years ago

Linguist is using a bayesian classifier to detect languages. However, the sample data provided to it is far too small. According to `wc -l samples// there are just 309505 lines of sample code for about 220 "languages", making for an average of 1400 lines of sample code for each language.

I am surprised that despite this, linguist manages to work reasonably well -- my guess is that this is mostly due to the fact that data is pre-binned based on file extensions (fair enough). But as soon as more than one language uses the same extension, detection rates break down, as the tons of reports on Objective C vs C++ vs C show.

Which is not surprising -- the current approach linguist takes is akin to showing a spam filter half a dozen spam mails, no regular mails, and then expecting it to work well.

So, if linguist wants to keep using a Bayesian filter, it seems to me that the sample size should be enlarged substantially.

An alternative would be to use more hand-crafted heuristics, as has been suggested by many people on many issue reports and pull requests, but that has its drawbacks, too -- on the one hand, heuristics are hard to maintain, and on the other, they are surprisingly hard to get right, too -- most people are pretty bad at doing it. This is indeed something were Bayesian classifiers usually outperform humans. But only if fed with enough sample data...

bkeepers commented 10 years ago

@fingolfin thanks for the feedback! As you point out, the bayesian classifier is only used to resolve conflicts on file extension, so it tends to work pretty well for most cases.

An alternative would be to use more hand-crafted heuristics

We have started doing this in a few cases (see #1522).

This is indeed something were Bayesian classifiers usually outperform humans.

I would be interested in seeing some pull requests that add more samples to see how it effects results and performance.

arfon commented 10 years ago

Yep, thanks for this feedback @fingolfin.

So, if linguist wants to keep using a Bayesian filter, it seems to me that the sample size should be enlarged substantially.

This should presumably not be too difficult for languages such as Objective-C, C++ and C. The slight problem we have from our (GitHub's) side though is that our only way to grab a large corpus of files is to use our search to find them which is based upon the (current) Linguist classifier results. This means that even if we dropped 1000 'C++' files from our search into the classifier samples then we're pretty much guaranteed to have Objective-C and C results in there (thus defeating the object of the exercise).

So I guess the best approach is to manually add some known samples from each language and measure the effect with our benchmarking tasks

fingolfin commented 10 years ago

Yeah, just like with a spam filter, it won't work completely autonmously -- corrective actions by the user are needed. Just like I can tell my spam filter "no, that was actually ham", and then the filter (hopefully / at least in theory) learns from that.

For the initial training, it would indeed make sense to take a set of repositories where the status of files are "known". I am pretty sure this could be done, and I am hopeful that a lot of community members would be eager to help with this.

arfon commented 9 years ago

We're planning on working on this extensively over the coming months https://github.com/github/gsoc#linguist so closing this out. Thanks for your thoughts @fingolfin.

github-linguist / linguist

Bayesian classifier sample size is far too small #1572