Ch4s3 closed this 7 years ago.
Additionally, the `classifications` method should return a score hash with an `Infinity` value for each category if the `word_hash` is empty, like this:
```ruby
def classifications(text)
  score = {}
  word_hash = Hasher.word_hash(text, @language, @enable_stemmer)
  if word_hash.empty?
    category_keys.each do |category|
      score[category.to_s] = Float::INFINITY
    end
    return score
  end
  # ... remaining code
end
```
I had these changes in a local branch of my codebase, but since you are taking care of it, I will just discard that branch.
So I made these changes, but they break this test:
```ruby
10.times do |a_number|
  assert_equal 'Digit', b.classify(a_number.to_s)
end
```
But I'm not sure it even makes sense to classify a single digit that way.
> So I made these changes, but it breaks this test.
This is happening because of how tokenization is currently implemented: if a token is not longer than 2 characters, it is skipped and not added to the `word_hash`. I think it was a meaningless test, because there was only one category present and the system was returning that category with some score no matter what was passed to `classify`. The categories were not actually being trained with anything.
In the new implementation of the `classifications` method, with empty strings getting an `Infinity` score and a threshold set, the `classify` method would return `nil`.
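A hedged, self-contained sketch of that behavior (the class, the `@threshold` variable, and `threshold_enabled?` are assumptions for illustration, not the gem's actual internals):

```ruby
# Minimal sketch: when every category scores Infinity (empty word_hash) and a
# threshold is enabled, classify returns nil instead of an arbitrary category.
class TinyClassifier
  def initialize(threshold: nil)
    @threshold = threshold
  end

  def threshold_enabled?
    !@threshold.nil?
  end

  def classifications(text)
    # Stand-in scores: Infinity for empty input, fixed log-like scores otherwise.
    return { 'Ham' => Float::INFINITY, 'Spam' => Float::INFINITY } if text.strip.empty?

    { 'Ham' => -1.0, 'Spam' => -5.0 }
  end

  def classify_with_score(text)
    (classifications(text).sort_by { |a| -a[1] })[0]
  end

  def classify(text)
    category, score = classify_with_score(text)
    return nil if threshold_enabled? && (score < @threshold || score == Float::INFINITY)

    category
  end
end

c = TinyClassifier.new(threshold: -10.0)
c.classify('hello world') # => "Ham"
c.classify('')            # => nil (Infinity score is rejected on the threshold path)
```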
```ruby
next unless word.length > 2 && !STOPWORDS[language].include?(word)
```
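For context, a minimal self-contained sketch of this filter (the real implementation lives in the gem's `Hasher`; the stopword table below is illustrative):

```ruby
# Illustrative stopword table; the gem loads real per-language lists.
STOPWORDS = { en: %w[a an the ok] }.freeze

def word_hash(text, language = :en)
  text.downcase.split.each_with_object(Hash.new(0)) do |word, hash|
    # Tokens of 2 characters or fewer (e.g. a single digit) never reach the hash.
    next unless word.length > 2 && !STOPWORDS[language].include?(word)

    hash[word] += 1
  end
end

word_hash('7')           # => {} (single digit is filtered out)
word_hash('seven items') # => {"seven"=>1, "items"=>1}
```

This is why classifying single-digit strings produces an empty `word_hash`.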
In conclusion, we need to change the test so that it trains on numbers above 100; this should make the test pass. Then we need to think about whether we want the length limit when the token is nothing but a number, such as `2`, `20`, or `.5`. This makes it more important to be able to pass a custom tokenizer (#131) for situations like this.
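As a sketch of why a custom tokenizer would help here, a hypothetical digit-friendly tokenizer (the lambda shape is an assumption for illustration, not the gem's actual API):

```ruby
# Hypothetical tokenizer that keeps short numeric tokens like "2", "20", ".5",
# which the default length > 2 filter would drop.
digit_friendly = ->(text) { text.downcase.scan(/[\w.]+/) }

digit_friendly.call('2 cats, .5 liters') # => ["2", "cats", ".5", "liters"]
```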
I changed the test to use words, which I think better reflects the intent, and I tweaked the threshold. In general, though, I agree that it was a bad test; we seem to have a lot of those.
Yeah, let's prioritize #131; it will be pretty useful going forward.
@ibnesayeed any insight into why bayesian_integration_test.rb:35 fails on newer Rubies with this PR? The results of `classification_scores(@memory_classifier)` and `classification_scores(@redis_classifier)` look nearly identical, but the hashes are different. I'm guessing this is a floating-point math issue, but I'm not sure.
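For what it's worth, a small illustration of how floating-point representation can change a digest even when the values look identical when printed rounded (assuming the test digests a joined string of results and scores):

```ruby
require 'digest'

# 0.1 + 0.2 prints as 0.30000000000000004, so the joined strings, and
# therefore the digests, differ even though the values "look" the same.
a = Digest::MD5.hexdigest(['Ham', 0.1 + 0.2].join(','))
b = Digest::MD5.hexdigest(['Ham', 0.3].join(','))
a == b # => false
```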
> I changed the test to use words, which I think better reflects the intent...

You might want to make it something like `100.upto(110) do ...`.
@ibnesayeed thoughts on the failing test?
> @ibnesayeed any insight into why bayesian_integration_test.rb:35 fails on newer rubies with this PR?
I am trying to look into it. If it were failing on all Rubies, I would have thought there might be some edge condition in how numbers are stored in Redis.
It might be an issue in the Redis Ruby gem.
I think the issue might be at another level, because even the expected value changes on each run. That expected value is calculated by generating a hash digest of the concatenated array of results and scores. If the array ordering is somehow changing, or the digest calculation is not a reliable way to test equality, then we might need to make some changes there.
I think I got the culprit, just need to dig deeper and see how it goes.
@ibnesayeed good catch
The issue is related to how the `classify_with_score` method returns its value:

```ruby
def classify_with_score(text)
  (classifications(text).sort_by { |a| -a[1] })[0]
end
```
This returns the first entry with the highest score. However, if more than one category has the same score, how should the conflict be resolved? In this particular test case, there is a record `ham Ok` at line 4499 of the dataset which is being used for testing. Because the string consists only of a stopword, both categories are assigned the `Infinity` value, and the method may return either one as the chosen category.
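A small demonstration of that tie:

```ruby
# With equal Infinity scores, sorting by score gives no preference between
# the categories, so index [0] may be either one; nothing breaks the tie.
scores = { 'Ham' => Float::INFINITY, 'Spam' => Float::INFINITY }
winner = (scores.sort_by { |a| -a[1] })[0]
# winner is ["Ham", Infinity] or ["Spam", Infinity], depending on sort internals.
```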
If the test uses a threshold, the situation will be a little different, as discussed in #123.
However, we should be able to fix this situation without making any changes to the library code, only to the test. I will let you know those changes in a bit.
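One test-only idea (purely illustrative, not necessarily what the actual fix does): compare the full score hashes sorted by key instead of digesting a single winner, so a tie between categories cannot flip the expected value:

```ruby
# Both backends agree on the scores; only the (tied) winner is ambiguous.
memory_scores = { 'Ham' => Float::INFINITY, 'Spam' => Float::INFINITY }
redis_scores  = { 'Spam' => Float::INFINITY, 'Ham' => Float::INFINITY }

# Sorting by key makes the comparison deterministic regardless of ties.
memory_scores.sort == redis_scores.sort # => true
```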
awesome, thanks!
I was in a rush and could not put together a proper PR in time. PR #133 should take care of the error.
It was interesting to find out that the record consisting of nothing but a stopword was placed at line 4499. We use records 1-4000 for training and 4001-5000 for testing, and it happened to be the second-to-last record in the test set. This was coincidental, but a good coincidence that made the test more reliable.
yeah, that was a good catch
Also, using `word_hash.empty?` instead of `word_hash.length == 0` would be more elegant, in my opinion.