httnn closed 8 years ago
Linguist::Language.popular doesn't work because the first language is ActionScript. Samples.cache['tokens']['ActionScript'] returns nil as no sample file exists for ActionScript.
I think 50 or so entries from languages.yml don't have sample files. Samples.cache lookups for them will fail as well. Are all languages supposed to have training data?
Good point. I don't see any training data in the repo, so probably not. I think the question we should be asking is: should it crash because some training data is missing?
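To make the failure mode concrete, here is a minimal sketch of a plain lookup versus a guarded one. The `TOKENS` hash is an invented stand-in for `Samples.cache['tokens']` (not real Linguist data), and `token_counts_for` is a hypothetical helper, not part of Linguist.

```ruby
# Invented stand-in for Samples.cache['tokens']: language name => token counts.
# The real cache is built from the sample files; these numbers are made up.
TOKENS = {
  'JavaScript' => { 'function' => 12, 'var' => 7 },
  'Ruby'       => { 'def' => 9, 'end' => 15 }
}.freeze

# Hypothetical helper: return the token counts for a language, or an empty
# hash when the language has no training data, instead of returning nil.
def token_counts_for(tokens, language_name)
  tokens.fetch(language_name, {})
end

# A plain lookup returns nil for a language without samples, and indexing
# into nil raises NoMethodError.
begin
  TOKENS['ActionScript']['function']
rescue NoMethodError => e
  puts "plain lookup failed: #{e.class}"
end

# The guarded lookup treats the language as having no tokens at all.
puts token_counts_for(TOKENS, 'ActionScript').fetch('function', 0)
```

Whether treating a language without samples as "no tokens" is the right behaviour for a classifier is a separate design question; the sketch only shows that the crash itself is avoidable.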
The same error usually happens when using Linguist without generating the samples.json file. Maybe generating this file would help...?
Not really. samples.json is built by Linguist::Samples.data. This comment on the method says:
Public: Build Classifier from all samples.
i.e. languages which don't have samples won't have a classifier (and hence there is no entry in samples.json for ActionScript).
Sure, but it doesn't say that the file is created. When using bundle exec rake samples, it actually writes it to a file. Can you find this file on your disk?
@pchaigno I did use the same rake task. And yes, file exists. I just followed it to see how the content of the file was generated :bowtie:
@arfon Can you reproduce this issue?
here's a workaround:
require 'linguist'

blob = Linguist::Blob.new('file', File.read(ARGV[0]))
# Keep only languages that have an entry in Samples.cache['tokens'];
# languages without sample files have no token data for the classifier.
tokens = Linguist::Samples.cache['tokens']
languages = Linguist::Language.all.select { |lang| tokens.key?(lang.name) }
matched = Linguist::Classifier.call(blob, languages)
@arfon Can you reproduce this issue?
Yeah, I can :-\
Samples.cache['tokens']['ActionScript']
The reason this is happening for languages like ActionScript is that the .as file extension that it defines in languages.yml is globally unique (in the Linguist language definitions), so we don't actually need a sample file as we never hit the classifier for .as files. Instead, the third strategy here, Linguist::Strategy::Filename, returns the answer, not the classifier.
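The short-circuit described above can be sketched in plain Ruby. This is an illustration under stated assumptions, not Linguist's actual code: the extension table, the two stand-in strategies, and the `detect` helper are all hypothetical, and the real strategy list is longer (the filename strategy is described above as the third one).

```ruby
# Hypothetical extension table: '.as' maps to one language, '.h' to several.
EXTENSIONS = {
  '.as' => ['ActionScript'],
  '.h'  => ['C', 'C++', 'Objective-C']
}.freeze

# Stand-in for a filename-based strategy: candidates by file extension.
by_filename = lambda do |name, _data, _candidates|
  EXTENSIONS.fetch(File.extname(name), [])
end

# Stand-in for the classifier: records that it was consulted, then simply
# picks the first remaining candidate (no real classification happens here).
CLASSIFIER_CALLS = []
by_classifier = lambda do |name, _data, candidates|
  CLASSIFIER_CALLS << name
  candidates.first(1)
end

STRATEGIES = [by_filename, by_classifier].freeze

# Run strategies in order; stop as soon as one leaves exactly one candidate.
def detect(name, data, strategies)
  candidates = []
  strategies.each do |strategy|
    result = strategy.call(name, data, candidates)
    return result.first if result.size == 1
    candidates = result unless result.empty?
  end
  nil
end

puts detect('movie.as', '', STRATEGIES)
puts CLASSIFIER_CALLS.empty?
```

Because the extension lookup already yields a single candidate for `movie.as`, the stand-in classifier is never consulted, which is why missing token data goes unnoticed on that path.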
We actually have tests to check for samples when the extension isn't unique (as the classifier is then used) but we don't test for the presence of samples for all extensions: https://github.com/github/linguist/blob/master/test/test_samples.rb#L67-L91
I'm trying to write a very simple program that uses the Bayesian classifier to detect languages by content only (ignoring extensions and mime types)
Yeah, this might cause trouble if you're relying solely on the classifier (because of the lack of tokens for some languages). Is there any reason you can't use the standard interface, e.g.:
blob = Linguist::Blob.new('file', File.read('test'))
language = Linguist::Language.detect(blob)
Is there any reason you can't use the standard interface
well, I'm not completely sure. my program doesn't have a clue about filenames, the input comes straight from the user without any file information. would it make sense to use the standard interface?
my program doesn't have a clue about filenames, the input comes straight from the user without any file information. would it make sense to use the standard interface?
If you don't know the filename then probably not, sorry. I'm afraid the only 'fix' that would work for you here is to find where samples are missing for different language/extension combinations and add samples to Linguist for all of these.
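A first step towards that could be listing which languages have no token data at all. The sketch below is a rough illustration: `languages_missing_samples` is a hypothetical helper and the example data is invented. It does not break the result down by extension.

```ruby
# Hypothetical helper: given all language names and the tokens hash,
# return the names that have no training data, sorted for readability.
def languages_missing_samples(language_names, tokens)
  language_names.reject { |name| tokens.key?(name) }.sort
end

# Invented example data. With Linguist loaded, the calls already used in
# this thread would supply the real inputs:
#   names  = Linguist::Language.all.map(&:name)
#   tokens = Linguist::Samples.cache['tokens']
names  = ['Ruby', 'ActionScript', 'JavaScript', 'ExampleLang']
tokens = { 'Ruby' => {}, 'JavaScript' => {} }

puts languages_missing_samples(names, tokens)
```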
I'm trying to write a very simple program that uses the Bayesian classifier to detect languages by content only (ignoring extensions and mime types)
running this on OSX 10.11 results in:
some other things I tried as the second argument to Classifier.call:
[Classifier::Language["PHP"], Classifier::Language["Swift"]]
Linguist::Language.popular[1..10]
Linguist::Language.popular[1..10000]
Linguist::Language.all[1..100]
Linguist::Language.popular[0..10] (first language is ActionScript)
the test file contains some JavaScript
so it appears like only some Languages cause this error, is this by design or unintentional?