Closed bararchy closed 8 years ago
Very interesting! The fact that the classifier doesn't have a competing classification may be the root cause. When not_a_number is empty, the classifier only has one classification that it could possibly pick. null is not an option, so even if the match is some infinitesimal number or 0 itself, it's the only classification we know about, so it gets matched. A solution would be to require that the match coefficient be > 0, even if infinitesimal, and to return null if the match coefficient is 0. That's a breaking change to the API, though, so maybe we could just return "".
[13] pry(main)> number_finder.classify("5")
=> "A number"
[14] pry(main)> number_finder.classify("15 and more numbers")
=> "A number"
[15] pry(main)> number_finder.classify("numbers")
=> "A number"
[16] pry(main)> number_finder.classify("lol wut ?")
=> "A number"
[17] pry(main)> number_finder.classify("Is this a Bug ? ")
=> "A number"
[18] pry(main)> number_finder.classify("")
=> "A number"
@parkr Would it be too bad to say something like: "if a value is not classified in category 'a', then, even though category 'b' is empty, it belongs there"?
Maybe if I give an example of my usage it will be easier to understand my need to leave one category empty.
I'm using this gem to learn HTTP traffic. I set it to "Training Mode" and show it what "normal traffic" looks like; then I want it to check traffic, and if the classification isn't "normal traffic" then it is "suspicious traffic" and I drop the packet.
Right now it only works if I show it what "suspicious traffic" looks like, but this creates kind of a 'black list' situation, and I want more of a 'white list' approach.
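To make the 'white list' idea concrete, here is a minimal sketch of a default-deny decision: anything that is not positively classified as normal, including no classification at all, gets dropped. The names NORMAL, verdict and StubClassifier are hypothetical and only stand in for a trained classifier; they are not part of the gem.

```ruby
NORMAL = "Normal traffic"

# Hypothetical stand-in for a trained classifier: it recognises a couple of
# HTTP verbs and returns nil for everything else.
class StubClassifier
  def classify(text)
    text.match?(/\b(GET|POST)\b/) ? NORMAL : nil
  end
end

# Default-deny: forward only when the classifier positively says "normal".
def verdict(classifier, packet)
  classifier.classify(packet) == NORMAL ? :forward : :drop
end

classifier = StubClassifier.new
verdict(classifier, "GET / HTTP/1.1")  # :forward
verdict(classifier, "' OR 1=1 --")     # :drop
```

The point of the design is that an unknown or missing classification never needs its own trained category; it simply fails the equality check.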
I'm just getting back from vacation I'll take a look soon
@Ch4s3 Hi, did you manage to see what's going on here ?
Sorry, I got bogged down catching up at work. It's on my radar though.
This is akin to a "none of the above" kind of classification where given a set of categories if the best fit is less than some threshold then a result indicating "none of the above" or :unknown is returned.
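As a rough illustration of that "none of the above" idea, the sketch below picks the best-scoring category and falls back to :unknown when the best score is under a threshold. The helper best_category and the score values are made up for the example; this is not how the gem is implemented.

```ruby
# Hypothetical helper: scores is a Hash of category name => score.
def best_category(scores, threshold)
  category, score = scores.max_by { |_name, value| value }
  # No categories at all, or best fit too weak: none of the above.
  return :unknown if score.nil? || score < threshold

  category
end

# Illustrative scores only.
scores = { "Normal activity" => -12.4, "Suspicious activity" => -48.9 }

best_category(scores, -20.0)  # "Normal activity"
best_category(scores, -5.0)   # :unknown
```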
@MadBomber Good point, this is exactly the answer I was looking for from the classifier :) I hope it could be implemented.
Take a look at my fork to see if this is what you had in mind.
https://github.com/MadBomber/classifier-reborn/commit/509633435489f61b0fb41313533e9855bfd904c0
I am not 100% sure this is a good solution to what you want to do. I'm thinking that there will be a large number of false positives. You may find yourself spending more time adjusting the threshold value.
I will submit a pull request after you play with it for a while.
Dewayne o-*
In your pry session, I think that if you had used #classify_with_score you would have seen that the score was being returned as Float::INFINITY for text that was not classified as 'a_number'.
@MadBomber I just tried your version, again, only training the "normal activity" category. This is what I do:
ai_overlord = ClassifierReborn::Bayes.new 'normal_activity', 'suspicious_activity', {:enable_threshold => true}
=> #<ClassifierReborn::Bayes:0x0000000210ec38
@auto_categorize=false,
@categories={:"Normal activity"=>{}, :"Suspicious activity"=>{}},
@category_counts={},
@category_word_count={},
@enable_threshold=true,
@language="en",
@threshold=0.0,
@total_words=0>
### Training the classifier
ai_overlord.train_normal_activity("Firefox chrome mozzila GET POST / http 1.1 1.0 1.2 Accept" * 1000)
[35] pry(main)> ai_overlord.classify("GET / POST")
=> nil
[36] pry(main)> ai_overlord.classify_with_score("GET / POST")
=> ["Suspicious activity", Infinity]
## Trying to play around with threshold
[23] pry(main)> ai_overlord.threshold = 0.5
=> 0.5
[24] pry(main)> ai_overlord.classify("GET / ")
=> nil
[25] pry(main)> ai_overlord.threshold = 10.0
=> 10.0
[26] pry(main)> ai_overlord.classify("GET / ")
=> nil
[27] pry(main)> ai_overlord.classify("GET / POST")
=> nil
[28] pry(main)> ai_overlord.threshold = 50.0
=> 50.0
[29] pry(main)> ai_overlord.classify("GET / POST")
=> nil
### Making sure the Threshold is changed inside the class
[37] pry(main)> ai_overlord
=> #<ClassifierReborn::Bayes:0x0000000210ec38
@auto_categorize=false,
@categories={:"Normal activity"=>{:firefox=>1, :chrome=>1000, :mozzila=>1000, :get=>1000, :post=>1000, :http=>1000, :acceptfirefox=>999, :accept=>1, :/=>1000, :"."=>3000}, :"Suspicious activity"=>{}},
@category_counts={:"Normal activity"=>1},
@category_word_count={:"Normal activity"=>10001},
@enable_threshold=true,
@language="en",
@threshold=50.0,
@total_words=10001>
It seems that, again, when one category is empty it always classifies to the empty one. Would the threshold feature help in this case?
Thanks :)
Given your examples, that is proper behavior. You trained the classifier with only one example - a very long string. You asked it to classify a very short string. It rejected the string showing a score of Infinity which means that there is no matching category.
Try this pattern:
ai_overlord = ClassifierReborn::Bayes.new( 'Normal', enable_threshold: true )
normal_request = "Firefox chrome mozzila GET POST / http 1.1 1.0 1.2 Accept"
10.times { |x| ai_overlord.train_normal(normal_request) }
ai_overlord.threshold = ai_overlord.classify_with_score(normal_request).last - 0.5
ai_overlord.classify(normal_request)
abnormal_request = "Safari opera webkit GET POST / http 1.1 1.0 1.2 Accept"
ai_overlord.classify( abnormal_request )
o-*
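The calibration step in that pattern (score a known-good request, then set the threshold just below it) can be tried without the gem. The ToyScorer below is an assumption made purely for illustration: a single-category log-probability scorer with add-one smoothing, which is not classifier-reborn's scoring code and will not produce the same numbers.

```ruby
# Toy single-category scorer; NOT classifier-reborn's implementation.
class ToyScorer
  def initialize
    @counts = Hash.new(0)
    @total = 0
  end

  def train(text)
    tokens(text).each do |word|
      @counts[word] += 1
      @total += 1
    end
  end

  # Sum of log-probabilities with add-one smoothing, so words never seen in
  # training pull the score down instead of producing log(0).
  def score(text)
    vocab = @counts.size + 1
    tokens(text).sum do |word|
      Math.log((@counts[word] + 1).to_f / (@total + vocab))
    end
  end

  private

  def tokens(text)
    text.downcase.split
  end
end

scorer = ToyScorer.new
normal_request = "Firefox chrome mozzila GET POST / http 1.1 1.0 1.2 Accept"
10.times { scorer.train(normal_request) }

# Same calibration as in the pattern: just below the score of a known-good request.
threshold = scorer.score(normal_request) - 0.5

abnormal_request = "Safari opera webkit GET POST / http 1.1 1.0 1.2 Accept"

scorer.score(normal_request) >= threshold    # true
scorer.score(abnormal_request) >= threshold  # false
```

The three unseen words (Safari, opera, webkit) each cost far more than the 0.5 margin, which is why the abnormal request falls under the threshold here. With real traffic the margin would need tuning.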
So, @MadBomber just wanted to ask, is there a way to add a "none of the above" option ? this way I can have a default "none of the above" value, and then I can only train one classification.
The feature has been merged with the master but a new version of the gem has not yet been published. You can look at the code to see the details:
https://github.com/jekyll/classifier-reborn/blob/master/lib/classifier-reborn/bayes.rb
Here is the gist:
1) It only works with the classify method. All other methods behave as before. If the result falls below a threshold score or the score is INFINITY, the result returned will be nil. So to see if it was "none of the above", just check for result.nil?
2) You can enable the threshold at initialization time with the option 'enable_threshold' set to true. You can also enable/disable threshold processing at any time using the methods enable_threshold and disable_threshold.
3) The default threshold is 0.0; any score below this will return a nil result. HOWEVER, the threshold that you should use is one that makes sense for your application. You can set your own threshold at initialization time with the option 'threshold', which expects a floating point number. You can reset the threshold or get its value using the methods 'threshold=' or just 'threshold'.
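Restated as code, the three points look roughly like the sketch below. thresholded_result is a hypothetical helper written only to mirror the rules as described; it is not the code in bayes.rb, and the scores are invented.

```ruby
# Hypothetical helper mirroring the rules above; not the gem's code.
def thresholded_result(category, score, threshold: 0.0, enabled: true)
  # With the threshold disabled, classification behaves as before.
  return category unless enabled
  # An infinite score or a score under the threshold means "none of the above".
  return nil if score.infinite? || score < threshold

  category
end

thresholded_result("Normal", -3.2, threshold: -10.0)             # "Normal"
thresholded_result("Normal", -42.0, threshold: -10.0)            # nil
thresholded_result("Suspicious", Float::INFINITY)                # nil
thresholded_result("Suspicious", Float::INFINITY, enabled: false) # "Suspicious"
```

The separate infinite? check matters because a positive Infinity would otherwise pass any `score < threshold` comparison.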
Check out the unit tests:
https://github.com/jekyll/classifier-reborn/blob/master/test/bayes/bayesian_test.rb
The test at line 82 'test_classification_with_threshold_again' is your specific scenario as I understood it.
Let us know if you catch any bad guys using this technique.
Dewayne o-*
@MadBomber Thanks for the great explanation and the example in the tests.
Right now I used a -200.0 threshold to stop a SQL Injection attack from SQLMAP.
I need to play around with letting the classifier learn more, then test a few attacks. Anyhow, this is fine for my use case, many thanks (also it's stable, I would push a new gem version ;) ). I'll update if necessary, issue closed :)
@MadBomber and @bararchy I'm going to try to release a new version soon. I basically just need to get some stuff on the readme about the new features.
I will add a section to the README on the threshold features. Should have a pull request in by tonight.
o-*
So, this is an example
After training not_a_number it works as expected. I'm new to the idea of classifiers so maybe this is intentional, just looks strange to me.