jekyll / classifier-reborn

A general classifier module to allow Bayesian and other types of classifications. A fork of cardmagic/classifier.
https://jekyll.github.io/classifier-reborn/
GNU Lesser General Public License v2.1

Weird behavior when one category is empty #47

Closed bararchy closed 8 years ago

bararchy commented 8 years ago

So, this is an example

[2] pry(main)> require 'classifier-reborn'
=> true
[3] pry(main)> number_finder = ClassifierReborn::Bayes.new 'a_number', 'not_a_number'
=> #<ClassifierReborn::Bayes:0x00000002d1d330 @categories={:"A number"=>{}, :"Not a number"=>{}}, @category_counts={}, @category_word_count={}, @total_words=0>
[4] pry(main)> number_finder.train_a_number('1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20')
=> ["1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20"]
[5] pry(main)> number_finder
=> #<ClassifierReborn::Bayes:0x00000002d1d330 @categories={:"A number"=>{}, :"Not a number"=>{}}, @category_counts={:"A number"=>1}, @category_word_count={:"A number"=>0}, @total_words=0>
[6] pry(main)> number_finder.train_a_number('2')
=> ["2"]
[7] pry(main)> number_finder.train_a_number('3')
=> ["3"]
[8] pry(main)> number_finder.train_a_number('4')
=> ["4"]
[9] pry(main)> number_finder.train_a_number('5')
=> ["5"]
# More lines...
[11] pry(main)> number_finder.train_a_number('one')
=> ["one"]
[12] pry(main)> number_finder.train_a_number('two')
=> ["two"]
[13] pry(main)> number_finder.classify("5")
=> "A number"
[14] pry(main)> number_finder.classify("15 and more numbers")
=> "A number"
[15] pry(main)> number_finder.classify("numbers")
=> "A number"
[16] pry(main)> number_finder.classify("lol wut ?")
=> "A number"
[17] pry(main)> number_finder.classify("Is this a Bug ? ")
=> "A number"
[18] pry(main)> number_finder.classify("")
=> "A number"

After training not_a_number it works as expected

[21] pry(main)> number_finder.train_not_a_number("?")
=> ["?"]
[22] pry(main)> number_finder.train_not_a_number("!")
=> ["!"]
[23] pry(main)> number_finder.train_not_a_number("a b c d e f g h i j k l m n o p")
=> ["a b c d e f g h i j k l m n o p"]
[24] pry(main)> number_finder.train_not_a_number("those are also not numbers")
=> ["those are also not numbers"]
[25] pry(main)> number_finder.classify("")
=> "A number"
[26] pry(main)> number_finder.classify("Is this a Bug ? ")
=> "Not a number"

I'm new to the idea of classifiers so maybe this is intentional, just looks strange to me.

parkr commented 8 years ago

Very interesting! The fact that the classifier doesn't have a competing classification may be the root cause. When not_a_number is empty, the classifier has only one classification it could possibly pick: null is not an option, and even if the match is some infinitesimal number or 0 itself, it's the only classification we know about, so it matches. A solution would be to require that the match coefficient be > 0, even if infinitesimal, and to return null when the match coefficient is 0. That is a breaking change to the API, though, so maybe we could just return "".
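To make the "only classification it could pick" point concrete, here is a minimal toy sketch. It is not the gem's code: toy_scores, toy_classify and the 0.1 pseudo-count are made up for illustration. It only shows that a pick-the-best-score rule always returns one of the known categories, even for empty or unrelated input.

```ruby
# Toy illustration only; this is NOT classifier-reborn's implementation.
# Every category gets a score and the best-scoring category wins,
# so some category is always returned, never nil.
def toy_scores(training, text)
  words = text.downcase.split
  training.each_with_object({}) do |(category, counts), scores|
    # Avoid dividing by zero for a category that was never trained.
    total = [counts.values.sum, 1].max.to_f
    # Unseen words get a small pseudo-count (0.1 is an arbitrary choice here).
    scores[category] = words.sum(0.0) { |w| Math.log((counts[w] || 0.1) / total) }
  end
end

def toy_classify(training, text)
  toy_scores(training, text).max_by { |_category, score| score }.first
end

training = {
  "A number"     => { "1" => 1, "2" => 2, "3" => 2 },
  "Not a number" => {} # never trained
}

p toy_classify(training, "")          # still one of the two category names
p toy_classify(training, "lol wut ?") # likewise, never nil
```

Which of the two names comes back depends on the scoring details; the point is only that "no match" is not a possible answer under this rule.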

bararchy commented 8 years ago

@parkr Will it be too bad to say something like: "if value is not classified in category 'a', then, even though category 'b' is empty, it belongs there" ?

bararchy commented 8 years ago

Maybe if I give an example of my usage it will be easier to understand my need to leave one category empty.

I'm using this gem to learn HTTP traffic. I set it to "Training Mode" and show it what "normal traffic" looks like. Then I want it to check traffic, and if the classification isn't "normal traffic", it is "suspicious traffic" and I drop the packet.

Right now it only works if I show it what "suspicious traffic" looks like, but this creates kind of a 'black list' situation, and I want more of a 'white list' approach.

Ch4s3 commented 8 years ago

I'm just getting back from vacation; I'll take a look soon.

bararchy commented 8 years ago

@Ch4s3 Hi, did you manage to see what's going on here ?

Ch4s3 commented 8 years ago

Sorry, I got bogged down catching up at work. It's on my radar though.

MadBomber commented 8 years ago

This is akin to a "none of the above" kind of classification: given a set of categories, if the best fit is less than some threshold, then a result indicating "none of the above" or :unknown is returned.
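As a rough sketch of that idea (best_or_unknown is a hypothetical helper, not something the gem provides, and it assumes scores where higher means a better fit):

```ruby
# Hypothetical helper for illustration; not part of classifier-reborn.
# `scores` maps category names to scores where higher means a better fit.
def best_or_unknown(scores, threshold)
  category, score = scores.max_by { |_name, s| s }
  # No categories at all, or the best fit is too weak: "none of the above".
  return :unknown if category.nil? || score < threshold
  category
end

scores = { "Normal" => -12.5, "Suspicious" => -40.0 }
p best_or_unknown(scores, -20.0) # => "Normal"
p best_or_unknown(scores, -5.0)  # => :unknown
```

The threshold value itself has to come from the application; the numbers above are invented.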

bararchy commented 8 years ago

@MadBomber Good point, this is exactly the answer I was looking for from the classifier :) I hope it could be implemented.

MadBomber commented 8 years ago

Take a look at my fork to see if this is what you had in mind.

https://github.com/MadBomber/classifier-reborn/commit/509633435489f61b0fb41313533e9855bfd904c0

I am not 100% sure this is a good solution to what you want to do. I'm thinking that there will be a large number of false positives. You may find yourself spending more time adjusting the threshold value.

I will submit a pull request after you play with it for a while.

Dewayne o-*

MadBomber commented 8 years ago

In your pry session, I think that if you had used #classify_with_score you would have seen that the score was being returned as Float::INFINITY for text that was not classified as 'a_number'.
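For readers wondering how a score can come out as Infinity at all, here is one way it can happen in plain Ruby float arithmetic. This is only an assumption about the mechanism (a zero word total for an untrained category); the thread does not confirm that this is what the gem does internally.

```ruby
# Float arithmetic only; no gem involved.
empty_category_total = 0.0 # assumed: word total of a category that was never trained
pseudo_count = 0.1         # assumed: stand-in count for a word the category has not seen

ratio = pseudo_count / empty_category_total # Float division by zero does not raise
score = Math.log(ratio)

puts ratio                    # Infinity
puts score                    # Infinity
puts score == Float::INFINITY # true
puts(-42.0 < score)           # true: any finite score loses to Infinity
```

If scores are compared with "highest wins", such a value beats every finite score, which would be consistent with an empty category being reported alongside a score of Infinity.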

bararchy commented 8 years ago

@MadBomber I just tried your version, again, only training the "normal activity" category. This is what I do:

ai_overlord = ClassifierReborn::Bayes.new 'normal_activity', 'suspicious_activity', {:enable_threshold => true}
=> #<ClassifierReborn::Bayes:0x0000000210ec38
 @auto_categorize=false,
 @categories={:"Normal activity"=>{}, :"Suspicious activity"=>{}},
 @category_counts={},
 @category_word_count={},
 @enable_threshold=true,
 @language="en",
 @threshold=0.0,
 @total_words=0>

### Training the classifier 
ai_overlord.train_normal_activity("Firefox chrome mozzila GET POST / http 1.1 1.0 1.2 Accept" * 1000)

[35] pry(main)> ai_overlord.classify("GET / POST")
=> nil
[36] pry(main)> ai_overlord.classify_with_score("GET / POST")
=> ["Suspicious activity", Infinity]

## Trying to play around with threshold

[23] pry(main)> ai_overlord.threshold = 0.5
=> 0.5
[24] pry(main)> ai_overlord.classify("GET / ")
=> nil
[25] pry(main)> ai_overlord.threshold = 10.0
=> 10.0
[26] pry(main)> ai_overlord.classify("GET / ")
=> nil
[27] pry(main)> ai_overlord.classify("GET / POST")
=> nil
[28] pry(main)> ai_overlord.threshold = 50.0
=> 50.0
[29] pry(main)> ai_overlord.classify("GET / POST")
=> nil

### Making sure the Threshold is changed inside the class
[37] pry(main)> ai_overlord
=> #<ClassifierReborn::Bayes:0x0000000210ec38
 @auto_categorize=false,
 @categories={:"Normal activity"=>{:firefox=>1, :chrome=>1000, :mozzila=>1000, :get=>1000, :post=>1000, :http=>1000, :acceptfirefox=>999, :accept=>1, :/=>1000, :"."=>3000}, :"Suspicious activity"=>{}},
 @category_counts={:"Normal activity"=>1},
 @category_word_count={:"Normal activity"=>10001},
 @enable_threshold=true,
 @language="en",
 @threshold=50.0,
 @total_words=10001>

It seems that, again, when one category is empty it always classifies to the empty one. Would the threshold feature help in this case?

Thanks :)

MadBomber commented 8 years ago

Given your examples, that is proper behavior. You trained the classifier with only one example - a very long string. You asked it to classify a very short string. It rejected the string showing a score of Infinity which means that there is no matching category.

Try this pattern:

Notice there is only one category: Normal

ai_overlord = ClassifierReborn::Bayes.new( 'Normal', enable_threshold: true )

normal_request = "Firefox chrome mozzila GET POST / http 1.1 1.0 1.2 Accept"

10.times { |x| ai_overlord.train_normal(normal_request) }

Dynamically set the threshold to less than a known sample

ai_overlord.threshold = ai_overlord.classify_with_score(normal_request).last - 0.5

ai_overlord.classify(normal_request)

Now try to classify a counter-example

abnormal_request = "Safari opera webkit GET POST / http 1.1 1.0 1.2 Accept"

ai_overlord.classify( abnormal_request )

o-*

bararchy commented 8 years ago

So, @MadBomber just wanted to ask, is there a way to add a "none of the above" option ? this way I can have a default "none of the above" value, and then I can only train one classification.

MadBomber commented 8 years ago

The feature has been merged with the master but a new version of the gem has not yet been published. You can look at the code to see the details:

https://github.com/jekyll/classifier-reborn/blob/master/lib/classifier-reborn/bayes.rb

Here is the gist:

1) It only works with the classify method. All other methods behave as before. If the result falls below a threshold score, or the score is INFINITY, the result returned will be nil. So to see whether it was "none of the above", just check for result.nil?

2) You can enable the threshold at initialization time with the option 'enable_threshold' set to true. You can also enable or disable threshold processing at any time using the methods enable_threshold and disable_threshold.

3) The default threshold is 0.0; any score below this will return a nil result. HOWEVER, the threshold that you should use is one that makes sense for your application. You can set your own threshold at initialization time with the option 'threshold', which expects a floating point number. You can reset the threshold or get its value using the methods 'threshold=' or just 'threshold'.
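The three points above can be sketched with a small stand-alone class. ThresholdGate is a hypothetical stand-in written for this note, not the gem's Bayes class; it only mirrors the behavior as described here (nil below the threshold or on an infinite score, a default of 0.0, and the enable/disable switches), and takes an already computed category and score instead of text.

```ruby
# Hypothetical stand-in for illustration; NOT the gem's implementation.
class ThresholdGate
  attr_accessor :threshold # mirrors the 'threshold' / 'threshold=' methods

  def initialize(enable_threshold: false, threshold: 0.0)
    @enabled = enable_threshold
    @threshold = threshold
  end

  def enable_threshold
    @enabled = true
  end

  def disable_threshold
    @enabled = false
  end

  # Takes a category and its score and returns the category,
  # or nil for "none of the above".
  def classify(category, score)
    return category unless @enabled
    return nil if score.infinite? || score < @threshold
    category
  end
end

gate = ThresholdGate.new(enable_threshold: true, threshold: -200.0)
p gate.classify("Normal", -150.0)                       # => "Normal"
p gate.classify("Normal", -350.0)                       # => nil
p gate.classify("Suspicious activity", Float::INFINITY) # => nil
gate.disable_threshold
p gate.classify("Normal", -350.0)                       # => "Normal"
```

With this shape, the caller's "none of the above" check is just result.nil?, as in point 1.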

Check out the unit tests:

https://github.com/jekyll/classifier-reborn/blob/master/test/bayes/bayesian_test.rb

The test at line 82 'test_classification_with_threshold_again' is your specific scenario as I understood it.

Let us know if you catch any bad guys using this technique.

Dewayne o-*

bararchy commented 8 years ago

@MadBomber Thanks for the great explanation and the example in the tests.

Right now I used a -200.0 threshold to stop a SQL Injection attack from SQLMAP.

I need to play around with letting the classifier learn more, then test a few attacks. Anyhow, this is fine for my use case, many thanks (also it's stable, I would push a new gem version ;) ). I'll update if necessary, issue closed :)

Ch4s3 commented 8 years ago

@MadBomber and @bararchy I'm going to try to release a new version soon. I basically just need to get some stuff on the readme about the new features.

MadBomber commented 8 years ago

I will add a section to the README on the threshold features. Should have a pull request in by tonight.

o-*