don't train or untrain empty word hashes

jekyll / classifier-reborn

A general classifier module to allow Bayesian and other types of classifications. A fork of cardmagic/classifier.

https://jekyll.github.io/classifier-reborn/

GNU Lesser General Public License v2.1

554 stars 110 forks source link

don't train or untrain empty word hashes #132

Closed Ch4s3 closed 7 years ago

ibnesayeed commented 7 years ago

Also, using word_hash.empty? instead of word_hash.length == 0 would be more elegant in my opinion.

ibnesayeed commented 7 years ago

Additionally, the classifications method should return a score hash with Infinity value for each category if the word_hash is empty like this:


def classifications(text)
  score = {}
  word_hash = Hasher.word_hash(text, @language, @enable_stemmer)
  if word_hash.empty
    category_keys.each do |category|
      score[category.to_s] = Float::INFINITY
    end
    return score
  end
  #... remaining code
end

ibnesayeed commented 7 years ago

I had these changes in a local branch on my code base, but since you are taking care of it, I will just discard that.

Ch4s3 commented 7 years ago

So I made these changes, but it breaks this test .

10.times do |a_number|
    assert_equal 'Digit', b.classify(a_number.to_s)
end

But, I'm not sure it even makes sense to classify a single digit that way.

ibnesayeed commented 7 years ago

So I made these changes, but it breaks this test .

This is happening because how the tokenization is implemented currently. If the length of the token is not greater than 2 letters then it is skipped and not added in the word_hash. I think it was a meaningless test because there was only one category present and the system was returning that category with some score no matter what was passed to classify. Those categories were actually not being trained with anything.

In the new implementation of the classification method, with empty strings getting Infinity score, and a threshold is set, the classify method would return nil.

next unless word.length > 2 && !STOPWORDS[language].include?(word)

In conclusion, we need to change the test so that it trains for numbers above 100. This should make the test pass. Then we need to think if we want the limit of length where the token is nothing but a number such as 2, 20, or .5. This makes it more important to be able to pass a custom tokenizer (#131) for situations like this.

Ch4s3 commented 7 years ago

I changed the test to use words, which I think better reflects the intent, and I tweaked the threshold. But in general I agree that it was a bad test, we seem to have a lot of those.

Yeah, let's prioritize #131, it will be pretty useful going forward.

Ch4s3 commented 7 years ago

@ibnesayeed any insight into why bayesian_integration_test.rb:35 fails on newer rubies with this PR?

The results of classification_scores(@memory_classifier) and classification_scores(@redis_classifier) look nearly identical but the hashes are different. I'm guessing this is a floating point math issue, but I'm not sure.

ibnesayeed commented 7 years ago

I changed the test to use words, which I think better reflects the intent...

You might want to make it something like 100.upto(110) do ....

Ch4s3 commented 7 years ago

@ibnesayeed thoughts on the failing test?

ibnesayeed commented 7 years ago

@ibnesayeed any insight into why bayesian_integration_test.rb:35 fails on newer rubies with this PR?

I am trying to look into it. If it was failing for all Rubies, I would have though there might be some edge condition on how numbers are stored in Redis.

Ch4s3 commented 7 years ago

It might be an issue in the Redis Ruby gem

ibnesayeed commented 7 years ago

I think the issue might be on another level. Because even the expected value changes of each run. That expected value is calculated by generating a hash digest of the array of the results and scores concatenated. If the array is somehow changing the order or the digest calculation is not reliable to test the equality then we might need to make some changes there.

ibnesayeed commented 7 years ago

I think I got the culprit, just need to dig deeper and see how it goes.

Ch4s3 commented 7 years ago

@ibnesayeed good catch

ibnesayeed commented 7 years ago

The issue is related to how classify_with_score method returns the value.

def classify_with_score(text)
  (classifications(text).sort_by { |a| -a[1] })[0]
end

This returns the first value with the highest score. However, if more than one categories have the same score, how to resolve the conflict? In this particular test case, there is an instance in the dataset on like 4499, there is a record ham Ok which is being used as test. Due to the only-stopword string, both teh categories are assigned Infinity value, and the method might return one or the other as the chosen one.

ibnesayeed commented 7 years ago

If the test uses a threshold, the situation will be little different as discussed in #123.

ibnesayeed commented 7 years ago

However, should have a fix for this situation without making any changes in the library code, only in the test. I will let you know those changes in a bit.

Ch4s3 commented 7 years ago

awesome, thanks!

ibnesayeed commented 7 years ago

Please cherry pick https://github.com/ibnesayeed/classifier-reborn/pull/1/commits/ed116ea42aeeaa736d6245c483ae91ea25663a99

ibnesayeed commented 7 years ago

I was in rush and could not make proper PR on time. PR #133 should take care of the error.

ibnesayeed commented 7 years ago

It was interesting to find out that the record that was nothing but a stopword was paced at line 4499. We are using records 1-4000 for training and 4001-5000 for testing and it happened to be the second last record on the test set. This was coincidental, but a good one that made the test more reliable.

Ch4s3 commented 7 years ago

yeah, that was a good catch