Closed hakehuang closed 5 years ago
The code could fix this specific issue (I haven't checked to be sure), but would break other code. That line was used to filter out words that have 2 or fewer characters...and while I'm not quite sure why it does this filtering, I'm afraid that the LSI might fail horribly when handling very small words. Current automated tests are failing since they are dependent on the current filtering behavior. If we can figure out why the previous programmers caused the "small words" to be filtered, we then can decide whether it is possible to add an exception that will allow us to accept digits.
In any event, I would suggest writing a new automated test for handling edge cases where numerical digits matter, so that we don't accidentally reintroduce the same behavior in the future...while also making sure all previous automated tests pass as well.
The LSI will in fact fail horribly with a NaN/NaN error if you remove this filter.
Can you give me some test examples, @Ch4s3? I have fixed the unit test issues. The 1-byte check is a real use case in my application, and I believe this requirement is universal.
let me take a look tonight
could you better describe your use case @hakehuang?
I want to classify my build log, which usually appears as below:
Error: 0 means there are no errors
Error:
It seems like you could do that more reliably with a regex or simple string match.
yes or no,
sometimes the string goes this way: Errors is: 0 Errors for this is : 3
it is very difficult to use a regex to match the differences.
I'm still not sure LSI is correct. Have you tried the Bayesian classifier? You can set it up not to use stop words. However if I were you, I would just write a simple parser and match on the number.
You could also use scan
foo = "Errors is: 0"
bar = "Errors for this is : 3"
foo_num = foo.scan(/\d/) #=> ["0"]
bar_num = bar.scan(/\d/) #=> ["3"]
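Building on the scan idea, here is a minimal sketch of the "simple parser" approach. The helper name `classify_by_count`, the regex, and the `:pass` / `:error` / `:unknown` labels are all assumptions made for illustration, not anything provided by classifier-reborn.

```ruby
# Hypothetical helper (not part of classifier-reborn): find the first integer
# that follows an "Error"/"Errors" label and map it to a category.
ERROR_COUNT = /errors?\b[^\d\n]*?(\d+)/i

def classify_by_count(line)
  match = ERROR_COUNT.match(line)
  return :unknown unless match          # no error label with a number found
  match[1].to_i.zero? ? :pass : :error  # zero errors counts as a pass
end

classify_by_count("Errors is: 0")           #=> :pass
classify_by_count("Errors for this is : 3") #=> :error
classify_by_count("log message Error: 1")   #=> :error
classify_by_count("build finished")         #=> :unknown
```

This only covers lines where a count follows the label; free-form messages would still need a pattern list or a classifier.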
There are many such patterns; below I list just a few. All of those errors are mixed together in a log, with a lot of human-readable context for debugging purposes. My idea is to have a log parser that can classify all the error types and give me a summary of them all. I tried Bayesian and Naive Bayes, which work, but only LSI can give me a search function.
undefined symbol
undefined reference to
not defined
not define
java.lang.Exception: java.lang.InterruptedException
no definition for
enumeration value is out of
identifier is undefined
defined but not used
not fit in region
invalid operands to binary | (have 'int' and 'void *')
unable to allocate space for sections/blocks with a total estimated minimum size
with offset out of bounds
error loading bundle activator
no such file or directory
cannot be found
passing arg n of makes pointer from integer without a cast
was unable to load
exceeds the maximum allowed for
cannot open source file
cannot find source file
cannot fit into
not allowed
not facet-valid with respect to pattern
can not open
pointless integer comparison, the result is always false
cannot call
cannot be assigned to
cannot call intrinsic function
a function call cannot appear in a constant-expression
too few arguments in function call
was not declared in this scope
may be used uninitialized in this function
interact script return value
first use in this function
clock_config.h(34) : Fatal Error[Pe1696]: cannot open source file "fsl_common.h"
clock_config.h(34) : Fatal Error[Pe1696]: cannot open source file "fsl_common.h"
board.c(60) : Fatal Error[Pe1696]: cannot open source file "fsl_common.h"
clock_config.h(34) : Fatal Error[Pe1696]: cannot open source file "fsl_common.h"
clock_config.h(34) : Fatal Error[Pe1696]: cannot open source file "fsl_common.h"
board.c(60) : Fatal Error[Pe1696]: cannot open source file "fsl_common.h"
MKV58F24.h(326) : Fatal Error[Pe1696]: cannot open source file "MKV58F24.h(326) : Fatal Error[Pe1696]: cannot open source file "FreeRTOS.h(98) : Fatal Error[Pe1696]: cannot open source file "FreeRTOSConfig.h"
fsl_flash.h(68) : Fatal Error[Pe1696]: cannot open source file "fsl_common.h"
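For comparison, a summary of error types like the one described above can also be sketched with plain substring matching. This is not LSI and gives no search function; the `PATTERNS` subset, the `summarize` helper, and the third sample log line are made up for illustration.

```ruby
# Hypothetical pattern list; a real one would carry every message of interest.
PATTERNS = [
  "cannot open source file",
  "undefined reference to",
  "no such file or directory",
  "was not declared in this scope"
].freeze

# Count how many log lines contain each known pattern (case-insensitive).
# A line is attributed to the first pattern it matches.
def summarize(log_lines, patterns = PATTERNS)
  counts = Hash.new(0)
  log_lines.each do |line|
    hit = patterns.find { |pattern| line.downcase.include?(pattern) }
    counts[hit] += 1 if hit
  end
  counts
end

log = [
  'clock_config.h(34) : Fatal Error[Pe1696]: cannot open source file "fsl_common.h"',
  'board.c(60) : Fatal Error[Pe1696]: cannot open source file "fsl_common.h"',
  "main.c:12: undefined reference to `foo'", # invented sample line
  "build step finished"
]

summary = summarize(log)
summary["cannot open source file"] #=> 2
summary["undefined reference to"]  #=> 1
```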
Since I found your use case interesting, I decided to try to replicate the original case, except that it...er...works.
lsi = ClassifierReborn::LSI.new
lsi.add_item 'log message Error: 1', :Error
lsi.add_item 'log message Error: 0', :Pass
lsi.classify 'log message Error: 1'
#=> :Pass
Obviously, it's giving us the wrong answer, and looking at the LSI object suggests that it is due to the program ignoring one-character objects (digits) and not including them in the word_hashes:
=> #<ClassifierReborn::LSI:0x007f7f79980828
@auto_rebuild=true,
@built_at_version=2,
@cache_node_vectors=nil,
@items=
{"log message Error: 1"=>
#<ClassifierReborn::ContentNode:0x007f7f79972200
@categories=[:Error],
@lsi_norm=GSL::Vector
[ 5.774e-01 5.774e-01 5.774e-01 ],
@lsi_vector=GSL::Vector
[ 6.309e-01 6.309e-01 6.309e-01 ],
@raw_norm=GSL::Vector
[ 5.774e-01 5.774e-01 5.774e-01 ],
@raw_vector=GSL::Vector
[ 6.309e-01 6.309e-01 6.309e-01 ],
@word_hash={:log=>1, :messag=>1, :error=>1}>,
"log message Error: 0"=>
#<ClassifierReborn::ContentNode:0x007f7f799713c8
@categories=[:Pass],
@lsi_norm=GSL::Vector
[ 5.774e-01 5.774e-01 5.774e-01 ],
@lsi_vector=GSL::Vector
[ 6.309e-01 6.309e-01 6.309e-01 ],
@raw_norm=GSL::Vector
[ 5.774e-01 5.774e-01 5.774e-01 ],
@raw_vector=GSL::Vector
[ 6.309e-01 6.309e-01 6.309e-01 ],
@word_hash={:log=>1, :messag=>1, :error=>1}>},
@language="en",
@version=2,
@word_list=
#<ClassifierReborn::WordList:0x007f7f799711c0
@location_table={:log=>0, :messag=>1, :error=>2}>>
So there's still that issue to deal with.
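The effect can be reproduced without the gem. The sketch below is a simplified stand-in for the tokenizer (no stemming, no stop words); the `word_hash` helper and its `keep_digits` option are assumptions for illustration and are not the library's actual code.

```ruby
# Simplified stand-in for the tokenizer. Words shorter than 3 characters are
# dropped, mirroring the filter discussed in this thread; keep_digits adds a
# hypothetical exception for purely numeric tokens.
def word_hash(text, keep_digits: false)
  text.downcase.scan(/[a-z0-9]+/).each_with_object(Hash.new(0)) do |word, hash|
    next if word.length < 3 && !(keep_digits && word.match?(/\A\d+\z/))
    hash[word.to_sym] += 1
  end
end

a = 'log message Error: 1'
b = 'log message Error: 0'

word_hash(a) == word_hash(b)                                       #=> true
word_hash(a, keep_digits: true) == word_hash(b, keep_digits: true) #=> false
```

With the digits dropped, both items reduce to the same three words, so nothing is left to tell `:Error` from `:Pass`.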
But we also have another issue at play. It's working fine on my machine while it's crashing on yours. My hypothesis for why it's crashing is based on the specific error message
D:/projects/P_hobbit/AI/log_classifier/lib/classifier-reborn/extensions/vector.rb:58
You are using vector.rb because you do not have the GSL and the GSL Ruby Gem (to interface with the GSL) installed. Basically, if you don't have GSL on your computer, we load up our own (slower) scientific calculation library instead, which includes the file "vector.rb". So there must be a bug within classifier-reborn's vector.rb file that is causing this specific error message to occur. According to the docs though, it is recommended that you install GSL, since it will make LSI "at least 10x" faster, so if you plan on using LSI, I would suggest you set up GSL on your local machine.
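The general "use the native library if present, otherwise fall back" pattern described here can be sketched as follows. This is an illustration of the idea only, not classifier-reborn's actual loading code; `pick_backend` and its return values are hypothetical.

```ruby
# Try to load a native library and report which code path would be used.
# The library name is a parameter only so the sketch can be exercised on a
# machine without GSL installed.
def pick_backend(native_lib = 'gsl')
  require native_lib
  :native
rescue LoadError
  :pure_ruby
end

pick_backend('set')                      #=> :native (ships with Ruby)
pick_backend('no_such_library_for_demo') #=> :pure_ruby
```

Calling `pick_backend` with no argument tells you which path your own machine would take.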
If you plan on not installing GSL, well...Unfortunately, I don't know enough about SVD to feel confident about debugging it. @Ch4s3, do you feel confident?
@tra38, no unfortunately our SVD function was not super well implemented, and is a bit beyond my ability with linear algebra to fix. I intend to replace it with a native ext at some point.
The Bayesian classifier has some other issues for my cases, which I am trying to debug now. I am dropping in some hot fixes of mine. With this fix, the Bayes classifier seems to work fine for my cases.
diff --git a/lib/classifier-reborn/bayes.rb b/lib/classifier-reborn/bayes.rb
index 3d5bbf1..d658856 100644
--- a/lib/classifier-reborn/bayes.rb
+++ b/lib/classifier-reborn/bayes.rb
@@ -126,16 +126,23 @@ module ClassifierReborn
end
return score
end
category_keys.each do |category|
score[category.to_s] = 0
@backend.category_training_count(category) : @backend.total_trainings.to_f
@backend.total_trainings.to_f)
end
score
end
the SVD seems a big challenge for all AI users, do you know any Ruby solutions for this? using a LAPACK backend seems not that good for cloud deployment.
There aren't any good pure Ruby implementations that I'm aware of.
I am also having issues using LSI on small words, with Math::DomainError being raised. My current workaround is to skip training on those words. For background, the corpus I am using is pulled directly from credit card compliance information (e.g. it has dollar amounts, random - characters, etc).
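For reference, both failure modes mentioned in this thread are easy to trigger in plain Ruby. The sketch below does not claim to show what happens inside the library's vector.rb; it only illustrates, under the assumption that a document ends up with an all-zero vector once its short words are filtered out, how a NaN and a Math::DomainError can arise. `magnitude` and `normalize` are hypothetical helpers.

```ruby
# Euclidean length of a vector held as a plain Array of Floats.
def magnitude(vec)
  Math.sqrt(vec.sum { |x| x * x })
end

# Naive normalization: divides by the length without checking for zero.
def normalize(vec)
  len = magnitude(vec)
  vec.map { |x| x / len }
end

normalize([3.0, 4.0]) #=> [0.6, 0.8]
normalize([0.0, 0.0]) # every component is 0.0 / 0.0, i.e. NaN

begin
  Math.sqrt(-1.0e-12) # e.g. a tiny negative value produced by rounding
rescue Math::DomainError => e
  e.class             #=> Math::DomainError
end
```

Guarding against a zero length, or skipping empty documents as described above, avoids both cases in this toy version.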
Just came across this same issue... I know it's not a long term solution, but since I'm just evaluating this project, instead of skipping the small words, I created a hack function to just go around the problem, for now.
def fix_hack(text)
  text.split(" ").map! { |w| w.size < 3 ? w + "" : w }.join(" ")
end
and then, I just wrap every mention of the content during training and classification, e.g.
lsi = ClassifierReborn::LSI.new
lsi.add_item fix_hack("This is a test"), "test"
...
c, s = lsi.classify_with_score fix_hack("It is a test")
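As posted, the hack appends an empty string, which leaves the short words unchanged, so the padding text appears to have been lost. A minimal sketch of the same idea follows, assuming the intent was to pad words shorter than three characters; the `"_"` filler and the `pad_short_words` name are guesses, and whether the library's tokenizer keeps the filler is not verified here.

```ruby
# Pad every word shorter than `min` characters with a filler so that a
# length-based filter no longer drops it. The "_" filler is an assumption.
def pad_short_words(text, min: 3, filler: '_')
  text.split(' ').map { |word| word.ljust(min, filler) }.join(' ')
end

pad_short_words('It is a test') #=> "It_ is_ a__ test"
```

The same wrapper has to be applied at both training and classification time, otherwise the padded and unpadded forms will not match.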
For me, brew install gsl
and adding the GSL dependency:
gem 'classifier-reborn' # lets get machine learning!
gem 'gsl', '~> 2.1', '>= 2.1.0.3'
has solved the sqrt issue and the other NaN issue, I think!
@epugh Have you tried with small words ~3-4 chars in length?
Yep, and with those, I just get a warning message, the code runs.
Here is my test set:
strings = [["This text deals with dogs. Dogs.", :dog],
["This text involves dogs too. Dogs!", :dog],
["LOOKING FOR SPEAKER", :missing],
["Need speaker!", :missing],
["Need speakers!", :missing],
["n/a OSC Retreat.", :missing],
["na", :missing],
["spearks are needed", :missing],
["Matt Datastax.", :present]]
strings.each { |x| classifier.add_item x.first, x.last }
assert_same :missing, classifier.classify("speaker needed")
assert_not_same :missing, classifier.classify("Matt Overstreet Solr Stemmers")
assert_same :present, classifier.classify("Matt Overstreet Solr Stemmers")
So the "na" gives an error, and before I installed gsl, the "n/a" blew up!
Unfortunately that's expected behavior, but not the desired behavior. Our plain Ruby LSI implementation is pretty broken, and I lack the math background necessary to fix it.
I wonder if the best path is to say "You must have GSL installed"? I.e. accept the plain Ruby issues...
@epugh unfortunately we're a dependency of Jekyll, so we want to have a ruby only option to make it more accessible. However, for any sort of prod use beyond that, we strongly endorse GSL.
below is my scripts
trace log
I find the issue can be fixed with below change, please help to review
https://github.com/jekyll/classifier-reborn/pull/154