jekyll / classifier-reborn

A general classifier module to allow Bayesian and other types of classifications. A fork of cardmagic/classifier.
https://jekyll.github.io/classifier-reborn/
GNU Lesser General Public License v2.1
554 stars, 110 forks

the lsi meets `sqrt': Numerical argument is out of domain - "sqrt" (Math::DomainError) #153

Closed hakehuang closed 5 years ago

hakehuang commented 7 years ago

Below is my script:

lsi = ClassifierReborn::LSI.new
lsi.add_item 'log message Error: 1', :Error
lsi.add_item 'log message Error: 0', :Pass

result  = lsi.classify 'log message Error: 1'

trace log


D:/projects/P_hobbit/AI/log_classifier/lib/classifier-reborn/extensions/vector.rb:58:in `sqrt': Numerical argument is out of domain - "sqrt" (Math::DomainError)
    from D:/projects/P_hobbit/AI/log_classifier/lib/classifier-reborn/extensions/vector.rb:58:in `block in SV_decomp'
    from D:/projects/P_hobbit/AI/log_classifier/lib/classifier-reborn/extensions/vector.rb:57:in `times'
    from D:/projects/P_hobbit/AI/log_classifier/lib/classifier-reborn/extensions/vector.rb:57:in `SV_decomp'
    from D:/projects/P_hobbit/AI/log_classifier/lib/classifier-reborn/lsi.rb:311:in `build_reduced_matrix'
    from D:/projects/P_hobbit/AI/log_classifier/lib/classifier-reborn/lsi.rb:143:in `build_index'
    from D:/projects/P_hobbit/AI/log_classifier/lib/classifier-reborn/lsi.rb:77:in `add_item'
    from D:/projects/P_hobbit/AI/log_classifier/pass_fail.rb:34:in `<main>'

I found that the issue can be fixed with the change below; please help review it:

https://github.com/jekyll/classifier-reborn/pull/154
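
For context on the exception itself: Ruby's Math.sqrt raises Math::DomainError whenever its argument is negative, however small. The sketch below only demonstrates that behavior plus one possible guard. It is not the project's code, safe_sqrt and its tolerance are hypothetical, and this thread does not establish what value SV_decomp actually passes to sqrt.

```ruby
# Hypothetical helper (not part of classifier-reborn): clamp tiny negative
# values, such as rounding noise around zero, before taking the square root.
def safe_sqrt(value, tolerance = 1e-12)
  return 0.0 if value < 0 && value.abs <= tolerance
  Math.sqrt(value) # still raises Math::DomainError for clearly negative input
end

begin
  Math.sqrt(-1e-15)
rescue Math::DomainError => e
  puts "raised: #{e.class}"
end

puts safe_sqrt(-1e-15)
puts safe_sqrt(4.0)
```

Clamping hides the symptom only; if the input is negative for a structural reason, the decomposition result would still be suspect.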

tra38 commented 7 years ago

The code could fix this specific issue (I haven't checked to be sure), but would break other code. That line was used to filter out words that have 2 or fewer characters...and while I'm not quite sure why it does this filtering, I'm afraid that the LSI might fail horribly when handling very small words. Current automated tests are failing since they are dependent on the current filtering behavior. If we can figure out why the previous programmers caused the "small words" to be filtered, we then can decide whether it is possible to add an exception that will allow us to accept digits.

In any event, I would suggest writing a new automated test for handling edge cases where numerical digits matter, so that we don't accidentally reintroduce the same behavior in the future...while also making sure all previous automated tests pass as well.

Ch4s3 commented 7 years ago

The LSI will in fact fail horribly with a NaN/NaN error if you remove this filter.

hakehuang commented 7 years ago

Can you give me some test examples, @Ch4s3? I have fixed the unit test issues. Classifying on a single character is a real use case in my application, and I believe this requirement is universal.

Ch4s3 commented 7 years ago

let me take a look tonight

Ch4s3 commented 7 years ago

could you better describe your use case @hakehuang?

hakehuang commented 7 years ago

I want to classify my build log, which usually looks like this: "Error: 0" means there are no errors, and "Error:" followed by any other number means there are errors.

Ch4s3 commented 7 years ago

It seems like you could do that more reliably with a regex or simple string match.

hakehuang commented 7 years ago

Yes and no.

Sometimes the string looks like this: "Errors is: 0" or "Errors for this is : 3".

It is very difficult to use a regex to match the differences.

Ch4s3 commented 7 years ago

I'm still not sure LSI is correct. Have you tried the Bayesian classifier? You can set it up not to use stop words. However if I were you, I would just write a simple parser and match on the number.
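
A minimal sketch of the "simple parser" idea, assuming every line of interest carries its count right after a colon. label_for is a hypothetical helper, and the :Pass / :Error labels simply mirror the ones used earlier in this issue.

```ruby
# Hypothetical helper: take the first integer that follows a colon and map it
# to a label. Returns nil when the line has no "<colon> <number>" part.
def label_for(line)
  count = line[/:\s*(\d+)/, 1]
  return nil if count.nil?
  count.to_i.zero? ? :Pass : :Error
end

p label_for("Errors is: 0")           # => :Pass
p label_for("Errors for this is : 3") # => :Error
p label_for("no count here")          # => nil
```

This only covers lines that follow the count-after-colon shape; free-form compiler messages would need their own rules.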

You could also use scan

foo = "Errors is: 0"
bar = "Errors for this is : 3"
foo_num = foo.scan(/\d/)
bar_num = bar.scan(/\d/)

hakehuang commented 7 years ago

There are many such patterns; I list just a few below. All of these errors are mixed together in one log, along with a lot of human-readable context for debugging. My idea is to have a log parser that can classify all the error types and give me a summary of them. I tried Bayesian and Naive Bayes, which work, but only LSI gives me a search function.

undefined symbol
undefined reference to
not defined
not define
java.lang.Exception: java.lang.InterruptedException
no definition for
enumeration value is out of
identifier is undefined
defined but not used
not fit in region
invalid operands to binary | (have 'int' and 'void *')
unable to allocate space for sections/blocks with a total estimated minimum size
with offset out of bounds
error loading bundle activator
no such file or directory
cannot be found
passing arg n of makes pointer from integer without a cast
was unable to load
exceeds the maximum allowed for
cannot open source file
cannot find source file
cannot fit into
not allowed
not facet-valid with respect to pattern
can not open
pointless integer comparison, the result is always false
cannot call
cannot be assigned to
cannot call intrinsic function
a function call cannot appear in a constant-expression
too few arguments in function call
was not declared in this scope
may be used uninitialized in this function
interact script return value
first use in this function
clock_config.h(34) : Fatal Error[Pe1696]: cannot open source file "fsl_common.h"
clock_config.h(34) : Fatal Error[Pe1696]: cannot open source file "fsl_common.h"
board.c(60) : Fatal Error[Pe1696]: cannot open source file "fsl_common.h"
clock_config.h(34) : Fatal Error[Pe1696]: cannot open source file "fsl_common.h"
clock_config.h(34) : Fatal Error[Pe1696]: cannot open source file "fsl_common.h"
board.c(60) : Fatal Error[Pe1696]: cannot open source file "fsl_common.h"
MKV58F24.h(326) : Fatal Error[Pe1696]: cannot open source file "MKV58F24.h(326) : Fatal Error[Pe1696]: cannot open source file "FreeRTOS.h(98) : Fatal Error[Pe1696]: cannot open source file "FreeRTOSConfig.h"
fsl_flash.h(68) : Fatal Error[Pe1696]: cannot open source file "fsl_common.h"

tra38 commented 7 years ago

Since I found your use case interesting, I decided to try to replicate the original case, except that it...er...works.

lsi = ClassifierReborn::LSI.new
lsi.add_item 'log message Error: 1', :Error
lsi.add_item 'log message Error: 0', :Pass

lsi.classify 'log message Error: 1'
#=> :Pass

Obviously, it's giving us the wrong answer, and looking at the LSI object suggests that it is due to the program ignoring one-character objects (digits) and not including them in the word_hashes:

=> #<ClassifierReborn::LSI:0x007f7f79980828
 @auto_rebuild=true,
 @built_at_version=2,
 @cache_node_vectors=nil,
 @items=
  {"log message Error: 1"=>
    #<ClassifierReborn::ContentNode:0x007f7f79972200
     @categories=[:Error],
     @lsi_norm=GSL::Vector
[ 5.774e-01 5.774e-01 5.774e-01 ],
     @lsi_vector=GSL::Vector
[ 6.309e-01 6.309e-01 6.309e-01 ],
     @raw_norm=GSL::Vector
[ 5.774e-01 5.774e-01 5.774e-01 ],
     @raw_vector=GSL::Vector
[ 6.309e-01 6.309e-01 6.309e-01 ],
     @word_hash={:log=>1, :messag=>1, :error=>1}>,
   "log message Error: 0"=>
    #<ClassifierReborn::ContentNode:0x007f7f799713c8
     @categories=[:Pass],
     @lsi_norm=GSL::Vector
[ 5.774e-01 5.774e-01 5.774e-01 ],
     @lsi_vector=GSL::Vector
[ 6.309e-01 6.309e-01 6.309e-01 ],
     @raw_norm=GSL::Vector
[ 5.774e-01 5.774e-01 5.774e-01 ],
     @raw_vector=GSL::Vector
[ 6.309e-01 6.309e-01 6.309e-01 ],
     @word_hash={:log=>1, :messag=>1, :error=>1}>},
 @language="en",
 @version=2,
 @word_list=
  #<ClassifierReborn::WordList:0x007f7f799711c0
   @location_table={:log=>0, :messag=>1, :error=>2}>>

So there's still that issue to deal with.
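
To illustrate why the two training items end up indistinguishable, here is a toy tokenizer that drops tokens of two characters or fewer. It is an illustrative sketch, not classifier-reborn's implementation: toy_word_hash and min_length are made-up names, and it does no stemming (the real hashes above show :messag rather than :message).

```ruby
# Toy tokenizer (NOT the library's code): lowercase, split on anything that
# is not a letter or digit, drop short tokens, and count what remains.
def toy_word_hash(text, min_length: 3)
  text.downcase
      .split(/[^a-z0-9]+/)
      .select { |token| token.length >= min_length }
      .each_with_object(Hash.new(0)) { |token, counts| counts[token.to_sym] += 1 }
end

a = toy_word_hash('log message Error: 1')
b = toy_word_hash('log message Error: 0')
p a      # the digit is gone from the hash
p a == b # both items reduce to the same hash

# With the length filter relaxed, the digit survives as its own token.
p toy_word_hash('log message Error: 1', min_length: 1)
```

Once the digit is filtered out, nothing in the feature vectors separates :Error from :Pass, which matches the identical vectors in the dump above.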

But we also have another issue at play. It's working fine on my machine while it's crashing on yours. My hypothesis for why it's crashing is based on the specific error message:

D:/projects/P_hobbit/AI/log_classifier/lib/classifier-reborn/extensions/vector.rb:58

You are using vector.rb because you do not have the GSL and the GSL Ruby Gem (to interface with the GSL) installed. Basically, if you don't have GSL on your computer, we load up our own (slower) scientific calculation library instead, which includes the file "vector.rb". So there must be a bug within classifier-reborn's vector.rb file that is causing this specific error message to occur. According to the docs though, it is recommended that you install GSL, since it will make LSI "at least 10x" faster, so if you plan on using LSI, I would suggest you set up GSL on your local machine.
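
One quick way to check which situation a machine is in is to try loading the gem. This is the generic Ruby pattern for an optional dependency; gsl_available? is a hypothetical helper name, not something classifier-reborn exposes.

```ruby
# Hypothetical helper: report whether the `gsl` gem can be loaded here.
def gsl_available?
  require 'gsl'
  true
rescue LoadError
  false
end

if gsl_available?
  puts 'gsl gem found'
else
  puts 'gsl gem not found; per the explanation above, the pure-Ruby vector.rb path would be used'
end
```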

If you plan on not installing GSL, well...Unfortunately, I don't know enough about SVD to feel confident about debugging it. @Ch4s3, do you feel confident?

Ch4s3 commented 7 years ago

@tra38, no, unfortunately. Our SVD function was not super well implemented, and fixing it is a bit beyond my ability with linear algebra. I intend to replace it with a native ext at some point.

hakehuang commented 7 years ago

The Bayesian classifier has some other issues for my cases, which I am trying to debug now. Here are some hot fixes of mine; with them, the Bayes classifier seems to work fine for my cases.

diff --git a/lib/classifier-reborn/bayes.rb b/lib/classifier-reborn/bayes.rb
index 3d5bbf1..d658856 100644
--- a/lib/classifier-reborn/bayes.rb
+++ b/lib/classifier-reborn/bayes.rb
@@ -126,16 +126,23 @@ module ClassifierReborn
 end
 return score
 end

hakehuang commented 7 years ago

SVD seems to be a big challenge for all AI users. Do you know of any Ruby solutions for it? Using a LAPACK backend does not seem like a good fit for cloud deployment.

Ch4s3 commented 7 years ago

There aren't any good pure Ruby implementations that I'm aware of.

mach-kernel commented 7 years ago

I am also having issues using LSI on small words, with Math::DomainError being raised. My current solution is to skip training on those words. For background, the corpus I am using is pulled directly from credit card compliance information (e.g. it has dollar amounts, random - characters, etc).
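
The comment above does not say how the short words are skipped, so the following is only one possible sketch. strip_short_words and its min_length threshold are hypothetical, and the sample strings are invented.

```ruby
# Hypothetical pre-filter: drop whitespace-separated words shorter than
# `min_length` before the text is handed to the classifier.
def strip_short_words(text, min_length: 3)
  text.split.select { |word| word.length >= min_length }.join(' ')
end

samples = ['fee of $ 5 - per item', 'na', 'annual fee waived']

# Items that become empty after filtering are not worth training on.
trainable = samples.map { |s| strip_short_words(s) }.reject(&:empty?)
p trainable # => ["fee per item", "annual fee waived"]
```

The same filter would have to be applied to text at classification time, and it discards exactly the short tokens (such as single digits) that the original report wanted to classify on.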

lessaworld commented 7 years ago

Just came across this same issue... I know it's not a long-term solution, but since I'm just evaluating this project, instead of skipping the small words, I created a hack function to work around the problem for now.

def fix_hack(text)
  # Words shorter than 3 characters get a suffix appended (the suffix is empty as posted).
  text.split(" ").map! { |w| w.size < 3 ? w + "" : w }.join(" ")
end

and then, I just wrap every mention of the content during training and classification. e.g.

lsi = ClassifierReborn::LSI.new
lsi.add_item fix_hack("This is a test"), "test"
...
c, s = lsi.classify_with_score fix_hack("It is a test")

epugh commented 6 years ago

For me, brew install gsl and adding the GSL dependency:

gem 'classifier-reborn'  # lets get machine learning!
gem 'gsl', '~> 2.1', '>= 2.1.0.3'

has solved the sqrt issue and the other NaN issue, I think!

Ch4s3 commented 6 years ago

@epugh Have you tried with small words ~3-4 chars in length?

epugh commented 6 years ago

Yep, and with those I just get a warning message, but the code runs.

Here is my test set:

    strings = [["This text deals with dogs. Dogs.", :dog],
               ["This text involves dogs too. Dogs!", :dog],
               ["LOOKING FOR SPEAKER", :missing],
               ["Need speaker!", :missing],
               ["Need speakers!", :missing],
               ["n/a OSC Retreat.", :missing],
               ["na", :missing],
               ["spearks are needed", :missing],
               ["Matt Datastax.", :present]]
    strings.each { |x| classifier.add_item x.first, x.last }

    assert_same :missing, classifier.classify("speaker needed")
    assert_not_same :missing, classifier.classify("Matt Overstreet Solr Stemmers")
    assert_same :present, classifier.classify("Matt Overstreet Solr Stemmers")

epugh commented 6 years ago

So the "na" gives an error, and previously, before I installed GSL, the "n/a" blew up!

Ch4s3 commented 6 years ago

Unfortunately that's expected behavior, but not the desired behavior. Our plain Ruby LSI implementation is pretty broken, and I lack the math background necessary to fix it.

epugh commented 6 years ago

I wonder if the best path is to say "You must have GSL installed"? I.e., accept the plain Ruby issues...

Ch4s3 commented 6 years ago

@epugh unfortunately we're a dependency of Jekyll, so we want to have a Ruby-only option to make it more accessible. However, for any sort of prod use beyond that, we strongly endorse GSL.