jekyll / classifier-reborn

A general classifier module to allow Bayesian and other types of classifications. A fork of cardmagic/classifier.
https://jekyll.github.io/classifier-reborn/
GNU Lesser General Public License v2.1

Support Numo Gem for performing SVD #198

Closed mkasberg closed 2 years ago

mkasberg commented 2 years ago

Background: The slow step of LSI is computing the SVD (singular value decomposition) of a matrix. Even with a relatively small collection of documents (say, about 20 blog posts), the pure-Ruby implementation is too slow to be usable (taking hours to complete).
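For readers unfamiliar with the SVD step, here is a minimal sketch of what "singular values" means, using only Ruby's `matrix` library (a bundled gem as of Ruby 3.1). It is not how classifier-reborn, GSL, or Numo compute the decomposition: it relies on the identity that the singular values of A are the square roots of the eigenvalues of AᵀA, and the `singular_values` helper name is hypothetical.

```ruby
require 'matrix'

# Hypothetical helper, for illustration only.
# Returns the singular values of `matrix`, largest first, by taking
# square roots of the eigenvalues of the Gram matrix (A^T * A).
def singular_values(matrix)
  gram = matrix.transpose * matrix
  gram.eigensystem.eigenvalues
      .map { |ev| Math.sqrt([ev, 0.0].max) } # clamp tiny negative round-off to zero
      .sort
      .reverse
end

# A toy 2x2 matrix standing in for a term-document matrix.
term_doc = Matrix[[3.0, 0.0], [0.0, 4.0]]

# For this diagonal matrix the singular values are close to 4.0 and 3.0.
p singular_values(term_doc)
```

This approach is numerically cruder than a dedicated SVD routine and only yields the singular values, not the full decomposition that LSI needs, which is part of why a LAPACK- or GSL-backed implementation is preferable.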

To work around this problem, classifier-reborn allows you to optionally use the gsl gem, which uses the GNU Scientific Library for matrix calculations. Computations with this gem run orders of magnitude faster than the Ruby-only matrix implementation, and they're fast enough that using LSI with Jekyll finishes in a reasonable amount of time (seconds).

Unfortunately, rb-gsl is unmaintained -- there's a commit on main that makes it compatible with Ruby 3, but nobody has released the gem, so the only way to use rb-gsl with Ruby 3 right now is to specify the git hash in your Gemfile. See https://github.com/SciRuby/rb-gsl/issues/67. This will become increasingly problematic because Ruby 2.7 is now in security maintenance and will reach end of life in less than a year.

Notably, rb-gsl depends on the narray gem. narray is deprecated, and its README suggests using Numo::NArray instead.

Changes: In this PR, my goal is to provide an alternative matrix implementation that can perform singular value decomposition quickly and works with Ruby 3. Doing so will make classifier-reborn compatible with Ruby 3 without depending on the unmaintained/unreleased gsl gem. There aren't many gems that provide fast matrix support for Ruby, but Numo seems to be more actively maintained than rb-gsl, and it has a working Ruby 3 implementation that can perform a singular value decomposition, which is exactly what we need. This requires the numo-narray and numo-linalg gems.

My goal is to allow users to (optionally) use classifier-reborn with Numo/LAPACK the same way they'd use it with GSL. That is, the user should install the numo-narray and numo-linalg gems (with their required C libraries), and classifier-reborn will detect and use them if they are found.
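The "detect and use if found" behavior can be sketched roughly as below. This is an illustration of the general pattern, not the code in this PR: the `detect_matrix_backend` helper, the backend symbols, and the order in which backends are tried are all assumptions. Only the require paths (`numo/narray`, `numo/linalg`, `gsl`) come from the gems themselves.

```ruby
# Hypothetical helper, not part of classifier-reborn.
# Tries each optional backend in order and returns the name of the first one
# whose libraries all load; falls back to :ruby (the pure-Ruby matrix code).
def detect_matrix_backend(candidates = { numo: %w[numo/narray numo/linalg], gsl: %w[gsl] })
  candidates.each do |name, libs|
    begin
      libs.each { |lib| require lib }
      return name
    rescue LoadError, StandardError
      # Gem missing, or its native library failed to load: try the next backend.
      next
    end
  end
  :ruby
end

backend = detect_matrix_backend
puts "Matrix backend: #{backend}"
```

Because the optional gems are only required inside a `rescue`, a missing gem is not an error: users who never install Numo or GSL simply get the slower pure-Ruby path.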

mkasberg commented 2 years ago

@mattr- appreciate all the PR reviews you've already done for me! I still have this PR marked as a draft, but I think it's ready for an initial round of feedback if you have time to do a review. Here are a couple things to focus on:

mkasberg commented 2 years ago

@mattr- This is ready for review! No rush :slightly_smiling_face:

I addressed your previous comment, added a little polish, and updated the docs since your last review.

Also, I tested this on a personal Jekyll site (i.e. with gem 'classifier-reborn', path: '~/code/classifier-reborn' in my Gemfile) and verified that using Numo produces the same recommended sites in the Jekyll output as GSL.

mattr- commented 2 years ago

@jekyllbot: merge +minor