github-linguist / linguist

Language Savant. If your repository's language is being reported incorrectly, send us a pull request!
MIT License

NoMethodError when using Classifier on OSX #2697

Closed httnn closed 8 years ago

httnn commented 9 years ago

I'm trying to write a very simple program that uses the bayesian classifier to detect languages by content only (ignoring extensions and mime types)

require 'linguist'

blob = Linguist::Blob.new('file', File.read('test'))
all_languages = Linguist::Language.popular
languages = Linguist::Classifier.call(blob, all_languages)
p languages

running this on OSX 10.11 results in:

/usr/local/lib/ruby/gems/2.2.0/gems/github-linguist-4.7.0/lib/linguist/classifier.rb:131:in `token_probability': undefined method `[]' for nil:NilClass (NoMethodError)
    from /usr/local/lib/ruby/gems/2.2.0/gems/github-linguist-4.7.0/lib/linguist/classifier.rb:120:in `block in tokens_probability'
    from /usr/local/lib/ruby/gems/2.2.0/gems/github-linguist-4.7.0/lib/linguist/classifier.rb:119:in `each'
    from /usr/local/lib/ruby/gems/2.2.0/gems/github-linguist-4.7.0/lib/linguist/classifier.rb:119:in `inject'
    from /usr/local/lib/ruby/gems/2.2.0/gems/github-linguist-4.7.0/lib/linguist/classifier.rb:119:in `tokens_probability'
    from /usr/local/lib/ruby/gems/2.2.0/gems/github-linguist-4.7.0/lib/linguist/classifier.rb:105:in `block in classify'
    from /usr/local/lib/ruby/gems/2.2.0/gems/github-linguist-4.7.0/lib/linguist/classifier.rb:104:in `each'
    from /usr/local/lib/ruby/gems/2.2.0/gems/github-linguist-4.7.0/lib/linguist/classifier.rb:104:in `classify'
    from /usr/local/lib/ruby/gems/2.2.0/gems/github-linguist-4.7.0/lib/linguist/classifier.rb:78:in `classify'
    from /usr/local/lib/ruby/gems/2.2.0/gems/github-linguist-4.7.0/lib/linguist/classifier.rb:20:in `call'
    from test.rb:6:in `<main>'

some other things I tried as the second argument to Classifier.call:

the test file contains some JavaScript

so it appears that only some languages cause this error; is this by design or unintentional?

sonalkr132 commented 9 years ago

Linguist::Language.popular doesn't work because the first language is ActionScript.

Samples.cache['tokens']['ActionScript']

returns nil as no sample file exists for ActionScript.

I think 50 or so entries from languages.yml don't have sample files. The Samples.cache lookup will fail for them as well. Are all languages supposed to have training data?
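To see which languages are affected, a minimal sketch along these lines might help. It assumes the token data is a Hash keyed by language name, as the Samples.cache['tokens']['ActionScript'] lookup above suggests. The helper name languages_without_tokens and the sample data are made up for illustration and are not part of Linguist.

```ruby
# Return the names of languages that have no token data.
# `tokens` is assumed to have the same shape as Samples.cache['tokens']:
# a Hash keyed by language name.
def languages_without_tokens(language_names, tokens)
  language_names.reject { |name| tokens.key?(name) }
end

# Stand-in data, invented for this example (not real Linguist output).
tokens = {
  'JavaScript' => { 'function' => 12, 'var' => 7 },
  'Ruby'       => { 'def' => 9, 'end' => 15 }
}
names = ['ActionScript', 'JavaScript', 'Ruby']

missing = languages_without_tokens(names, tokens)
p missing # prints ["ActionScript"]
```

With the gem loaded, passing Linguist::Language.all.map(&:name) and Linguist::Samples.cache['tokens'] to the same helper should list the languages without samples, assuming the cache has the shape described above.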

httnn commented 9 years ago

good point, I don't see any training data in the repo so probably not. I think the question we should be asking is whether it should crash because some training data is missing.
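For illustration, here is a rough sketch of what a lookup that tolerates missing training data could look like. This is not the code in classifier.rb: the method name safe_token_probability, its parameters, and the fallback value are all assumptions made for this example.

```ruby
# Hypothetical nil-safe token probability (not Linguist's implementation).
#   tokens:          { language => { token => count } }
#   language_tokens: { language => total token count for that language }
#   tokens_total:    total token count across all languages
def safe_token_probability(tokens, language_tokens, tokens_total, token, language)
  # fetch with a default avoids calling [] on nil for unsampled languages
  count = tokens.fetch(language, {}).fetch(token, 0).to_f
  total = language_tokens.fetch(language, 0).to_f

  # Assumed fallback: a small floor probability when nothing is known.
  return 1.0 / tokens_total if count.zero? || total.zero?

  count / total
end

# Stand-in training data, invented for this example.
tokens          = { 'Ruby' => { 'def' => 5 } }
language_tokens = { 'Ruby' => 20 }
tokens_total    = 100

known   = safe_token_probability(tokens, language_tokens, tokens_total, 'def', 'Ruby')
unknown = safe_token_probability(tokens, language_tokens, tokens_total, 'def', 'ActionScript')
p known   # prints 0.25
p unknown # prints 0.01
```

Whether a floor probability is the right behaviour, as opposed to skipping unsampled languages entirely, is a design question this sketch does not settle.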

pchaigno commented 9 years ago

The same error usually happens when using Linguist without generating the samples.json file. Maybe generating this file would help...?

sonalkr132 commented 9 years ago

Not really. samples.json is built by

Linguist::Samples.data

This comment on the method says:

Public: Build Classifier from all samples.

i.e. languages that don't have samples won't have classifier data (and hence there is no entry in samples.json for ActionScript)

pchaigno commented 9 years ago

Sure, but it doesn't say that the file is created. When using bundle exec rake samples, it actually writes it to a file. Can you find this file on your disk?

sonalkr132 commented 9 years ago

@pchaigno I did use the same rake task. And yes, file exists. I just followed it to see how the content of the file was generated :bowtie:

pchaigno commented 9 years ago

@arfon Can you reproduce this issue?

httnn commented 9 years ago

here's a workaround:

require 'linguist'

blob = Linguist::Blob.new('file', File.read(ARGV[0]))
# filter out languages that aren't in Samples.cache['tokens']
languages = Linguist::Language.all.select { |x| Linguist::Samples.cache['tokens'][x.name] != nil }
matched = Linguist::Classifier.call(blob, languages)

arfon commented 8 years ago

@arfon Can you reproduce this issue?

Yeah, I can :-\

Samples.cache['tokens']['ActionScript']

The reason this is happening for languages like ActionScript is that the .as file extension it defines in languages.yml is globally unique (within the Linguist language definitions), so we don't actually need a sample file: we never hit the classifier for .as files. Instead, the third strategy here - Linguist::Strategy::Filename - returns the answer, not the classifier.

We actually have tests to check for samples when the extension isn't unique (as the classifier is then used) but we don't test for the presence of samples for all extensions: https://github.com/github/linguist/blob/master/test/test_samples.rb#L67-L91
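The behaviour described above can be modelled roughly as a chain of strategies, each narrowing the candidate list until one language remains. This is a simplified sketch, not Linguist's actual code: the detect method, the extension table, and the stand-in classifier are all hypothetical.

```ruby
# Simplified model of a detection pipeline: each strategy narrows the
# candidates, and detection stops as soon as exactly one remains.
# Returns [language, name_of_the_strategy_that_decided].
def detect(filename, candidates, strategies)
  strategies.each do |name, strategy|
    narrowed = strategy.call(filename, candidates)
    next if narrowed.empty? # this strategy had nothing to say

    candidates = narrowed
    return [candidates.first, name] if candidates.size == 1
  end
  [candidates.first, nil]
end

# Hypothetical extension table: '.as' is unique, '.h' is shared.
extensions = {
  '.as' => ['ActionScript'],
  '.h'  => ['C', 'C++', 'Objective-C']
}

by_filename = lambda do |filename, candidates|
  extensions.fetch(File.extname(filename), []) & candidates
end

# Stand-in for the classifier: it simply picks the first candidate.
classifier = ->(_filename, candidates) { candidates.first(1) }

strategies    = { filename: by_filename, classifier: classifier }
all_languages = ['ActionScript', 'C', 'C++', 'Objective-C']

p detect('main.as', all_languages, strategies) # prints ["ActionScript", :filename]
p detect('util.h', all_languages, strategies)  # prints ["C", :classifier]
```

In this model a file with a unique extension never reaches the classifier, while input with no usable filename falls through to it with every language still a candidate, including those without samples.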

I'm trying to write a very simple program that uses the bayesian classifier to detect languages by content only (ignoring extensions and mime types)

Yeah, this might cause troubles if you're simply relying upon the classifier (because of the lack of tokens for some languages). Is there any reason you can't use the standard interface, e.g.:

blob = Linguist::Blob.new('file', File.read('test'))
language = Linguist::Language.detect(blob)

httnn commented 8 years ago

Is there any reason you can't use the standard interface

well, I'm not completely sure. my program doesn't have a clue about filenames, the input comes straight from the user without any file information. would it make sense to use the standard interface?

arfon commented 8 years ago

my program doesn't have a clue about filenames, the input comes straight from the user without any file information. would it make sense to use the standard interface?

If you don't know the filename then probably not, sorry. I'm afraid the only 'fix' that would work for you here is to find where samples are missing for different language/extension combinations and add samples to Linguist for all of these.