jekyll / classifier-reborn

A general classifier module to allow Bayesian and other types of classifications. A fork of cardmagic/classifier.
https://jekyll.github.io/classifier-reborn/
GNU Lesser General Public License v2.1

LSI is broken af. #69

Open henrebotha opened 8 years ago

henrebotha commented 8 years ago

I'm on Ruby 2.2.4. I'm trying to use LSI. Nothing works, and the error messages SUCK. I've tried both the last release (i.e. the gem version) and the latest commit from GitHub.

lsi = ClassifierReborn::LSI.new
training_data = ["Bcom", "Corporate Administration", "Forensic Auditing"]
category = :accounting
training_data.each do |d|
  begin
    lsi.add_item(d, category)
  rescue StandardError => e
    puts "#{d} misbehaving: #{e.message}"
  end
end

#=> Forensic Auditing misbehaving: comparison of Float with NaN failed

Better yet, if I swap the order of the training data, I get this:

lsi = ClassifierReborn::LSI.new
training_data = ["Corporate Administration", "Forensic Auditing", "Bcom"]
category = :accounting
training_data.each do |d|
  begin
    lsi.add_item(d, category)
  rescue StandardError => e
    puts "#{d} misbehaving: #{e.message}"
  end
end

#=> Forensic Auditing misbehaving: comparison of Float with NaN failed
#=> Bcom misbehaving: comparison of Float with NaN failed

Ch4s3 commented 8 years ago

There are some known issues with LSI. Are you using GNU GSL or the native Ruby version? The native Ruby version relies on a buggy Ruby implementation of a matrix transform (discussed in #30) and throws this type of error for some input. If that's what you're using, switching to GNU GSL will fix this. If you're already using GNU GSL, this will require some digging.

henrebotha commented 8 years ago

You're fast! I am using the native Ruby version. I'll hit up GNU GSL and see what happens.

If I were you I'd mention this in the Readme.

Ch4s3 commented 8 years ago

I happened to be on the issues. Yeah, let me know how GNU GSL works out. I need to rewrite the SVD, but I'm not a great C programmer, so the process has been slow to say the least. If you're trying to train with small inputs, especially ones that use abbreviations, the matrix transform is highly likely to break in the Ruby-only version.

henrebotha commented 8 years ago

While I have you, I'm getting this:

GSL::ERROR::EUNIMPL: Ruby/GSL error code 24, svd of MxN matrix, M<N, is not implemented (file svd.c, line 61), the requested feature is not (yet) implemented
from /Users/leaply/.rbenv/versions/2.2.4/lib/ruby/gems/2.2.0/bundler/gems/classifier-reborn-4e3bb14d6388/lib/classifier-reborn/lsi.rb:292:in `SV_decomp'

Ch4s3 commented 8 years ago

Hmm, this could be related to https://github.com/SciRuby/rb-gsl/issues/21. I'm investigating.

Ch4s3 commented 8 years ago

Which version of GSL did you pull down?

henrebotha commented 8 years ago

1.16 via homebrew

Ch4s3 commented 8 years ago

1.16 might work, let me try to pull down fresh versions later and try locally.

Ch4s3 commented 8 years ago

I haven't gotten anywhere with this. Can anyone else reproduce it?

Ch4s3 commented 8 years ago

@henrebotha can you try with the latest master to see if #77 raises an error on your input?

henrebotha commented 8 years ago

That's gonna take some doing. I'll try when I have access to a Mac.

Ch4s3 commented 7 years ago

@henrebotha have you tried this yet?

Ch4s3 commented 7 years ago

I intend to close this if there's no more action in the next few days.

timcraft commented 7 years ago

@Ch4s3 @henrebotha I'm seeing the same issue with my data and can reproduce with this script:

require 'classifier-reborn'

lsi = ClassifierReborn::LSI.new

# Without gsl this raises NoMethodError
# /classifier-reborn-2.0.4/lib/classifier-reborn/lsi.rb:143:
# in `block in build_index': undefined method `normalize' for nil:NilClass

# With gsl this raises GSL::ERROR::EUNIMPL
# /classifier-reborn-2.0.4/lib/classifier-reborn/lsi.rb:292:in `SV_decomp':
# Ruby/GSL error code 24, svd of MxN matrix, M<N, is not implemented (file svd.c, line 60),
# the requested feature is not (yet) implemented

lsi.add_item 'England', 'xx'
lsi.add_item 'England & Wales', 'xx'
lsi.add_item 'England And Wales', 'xx'

Using GNU GSL, tried upgrading from 2.2.1 to 2.3 and that didn't fix it.

Related to this TODO in lsi.rb?
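
The GSL message says SVD is not implemented for a matrix with fewer rows than columns (M<N). One possible reading, assuming the matrix passed to SV_decomp has one row per distinct term and one column per document (an assumption about lsi.rb, not confirmed in this thread), is that the three items above share too few distinct terms. The sketch below only estimates that shape. `term_document_shape` and `STOP_WORDS` are hypothetical names, and the tokenizer is a crude stand-in for whatever classifier-reborn actually does.

```ruby
# Hypothetical stop list; the real one used by the gem is larger.
STOP_WORDS = %w[and].freeze

# Estimate [rows, cols] of a term-by-document matrix.
# Rows are distinct lowercase words minus stop words, columns are documents.
# This does no stemming, so it only approximates the gem's word list.
def term_document_shape(docs, stop_words = STOP_WORDS)
  terms = docs.flat_map { |doc| doc.downcase.scan(/[a-z]+/) }.uniq - stop_words
  [terms.size, docs.size]
end

docs = ['England', 'England & Wales', 'England And Wales']
rows, cols = term_document_shape(docs)
puts "#{rows} terms x #{cols} documents" # prints: 2 terms x 3 documents
puts 'M < N: this is the shape the GSL error message describes' if rows < cols
```

Under these assumptions the first two items give a 2 x 2 matrix and the third tips it to 2 x 3, which would match the failure appearing only on the third add_item. A common workaround for wide matrices is to decompose the transpose and swap the two orthogonal factors; whether that is appropriate for lsi.rb is not established here.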

mepatterson commented 7 years ago

Any ideas on this? I'm seeing the Ruby/GSL-derived exception in SV_decomp whenever I try to build an index on more than around 2,000 sentences. I have 4,007 sentences I'd like to index. For the first 2,000 or so the classifier works great for my purpose, so I'm really eager to find a way to get this working properly, if possible...

(To be fair, it probably has nothing to do with how many sentences I have and more to do with some sentence beyond the first 2,000 entering the index and causing a problem like the ones seen in other comments above...)

Ch4s3 commented 7 years ago

@mepatterson I'd guess you have some malformed input. Can you throw a begin/rescue around your training and see which doc/line blows it up?

@timcraft I know this sounds stupid, but have you double checked that you're actually using GNU GSL? It may not have loaded correctly.
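
The begin/rescue suggestion above can be sketched as a small helper that feeds documents to a block one at a time and records every one that raises. `failing_items` is a hypothetical name, not part of classifier-reborn, and the demo uses a stand-in block so it runs without the gem.

```ruby
# Yield each item to the given block and collect the ones that raise.
# Returns an array of hashes with the index, the item, and the error text.
def failing_items(items)
  failures = []
  items.each_with_index do |item, index|
    begin
      yield item
    rescue StandardError => e
      failures << { index: index, item: item, error: "#{e.class}: #{e.message}" }
    end
  end
  failures
end

# Stand-in trainer that rejects blank documents. With classifier-reborn the
# block would instead be something like { |doc| lsi.add_item(doc, category) }.
demo = failing_items(['ok', '', 'fine']) do |doc|
  raise ArgumentError, 'empty document' if doc.strip.empty?
end

demo.each { |f| puts "##{f[:index]} #{f[:item].inspect}: #{f[:error]}" }
```

Judging by the outputs earlier in the thread, the error surfaces inside add_item, so the first reported item may simply be the point at which the index as a whole stops building rather than a document that is bad on its own.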

mepatterson commented 7 years ago

Actually, I can confirm @timcraft's repro also. Just those three add_item lines will cause the gsl crash every time on my machine, using the very latest gsl and rb-gsl.


Ch4s3 commented 7 years ago

Ok, I'll try to dig in this weekend.

timcraft commented 7 years ago

@Ch4s3 Yep, it appears to be loaded ok. I added this at the top of the script (matrix code from gsl-2.1.0.2/examples/linalg/SV.rb which uses SV_decomp):

puts "Using GSL/#{GSL::VERSION} RubyGSL/#{GSL::RUBY_GSL_VERSION}"
a = GSL::Matrix[[3, 5, 2], [6, 2, 1], [4, 7, 3]]
u, v, s = a.SV_decomp
p u*GSL::Matrix.diagonal(s)*v.trans

Output is "Using GSL/2.3 RubyGSL/2.1.0.2", followed by the correct matrix.

elisaado commented 7 years ago

Same here.

I have GSL installed but it's not even loaded
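
Having the GSL C library installed is not the same as Ruby being able to load the gsl gem. A minimal sketch for telling the two apart is below. `gsl_status` is a hypothetical helper, and the version constants are the ones used in the script earlier in this thread. It only reports whether `require 'gsl'` succeeds in the current Ruby; it does not inspect which backend classifier-reborn itself picked.

```ruby
# Report whether the gsl gem can be loaded by this Ruby, and which versions.
def gsl_status
  require 'gsl'
  "GSL #{GSL::VERSION} via rb-gsl #{GSL::RUBY_GSL_VERSION}"
rescue LoadError => e
  # The C library may be installed while the gem is missing or failed to build.
  "gsl not loadable: #{e.message}"
end

puts gsl_status
```

Running this with the same Ruby and the same Bundler context as the failing script (for example via `bundle exec`) matters, since a gem installed for a different Ruby will not be found.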

Ch4s3 commented 7 years ago

@elisaado can you post any details?