Closed mkasberg closed 2 years ago
@mattr- appreciate all the PR reviews you've already done for me! I still have this PR marked as a draft, but I think it's ready for an initial round of feedback if you have time to do a review. Here are a couple things to focus on:
- The existing code uses `$GSL` to indicate whether to use the `gsl` gem. I'm currently taking a similar approach, replacing that `$GSL` boolean with `$SVD`, which can be one of `:ruby`, `:gsl`, or `:numo`. I'm open to suggestions if you'd prefer a different approach.
- … `gsl` gem). So I'm applying the same patterns that existed before to try loading a library and falling back, which you can see near the top of `lsi.rb`.

@mattr- This is ready for review! No rush :slightly_smiling_face:
I addressed your previous comment, added a little polish, and updated the docs since your last review.
Also, I tested this on a personal Jekyll site (i.e. with `gem 'classifier-reborn', path: '~/code/classifier-reborn'` in my Gemfile) and verified that using Numo produces the same recommended sites in the Jekyll output as GSL.
@jekyllbot: merge +minor
Background: The slow step of LSI is computing the SVD (singular value decomposition) of a matrix. Even with a relatively small collection of documents (say, about 20 blog posts), the pure-Ruby implementation is too slow to be usable (taking hours to complete).
To work around this problem, classifier-reborn allows you to optionally use the `gsl` gem to make use of the GNU Scientific Library when performing matrix calculations. Computations with this gem perform orders of magnitude faster than the Ruby-only matrix implementation, and they're fast enough that using LSI with Jekyll finishes in a reasonable amount of time (seconds).

Unfortunately, rb-gsl is unmaintained -- there's a commit on main that makes it compatible with Ruby 3, but nobody has released the gem, so the only way to use rb-gsl with Ruby 3 right now is to specify the git hash in your Gemfile. See https://github.com/SciRuby/rb-gsl/issues/67. This will be increasingly problematic because Ruby 2.7 is now in security maintenance and will become end of life in less than a year.
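For reference, pinning the unreleased gem by commit could look roughly like the Gemfile sketch below. It uses Bundler's `git:` and `ref:` options; `COMMIT_SHA` is a placeholder rather than a real hash, and the repository URL is inferred from the issue link above.

```ruby
# Gemfile (sketch). COMMIT_SHA is a placeholder: substitute the commit on
# rb-gsl's main branch that adds Ruby 3 compatibility.
source 'https://rubygems.org'

gem 'classifier-reborn'
gem 'gsl', git: 'https://github.com/SciRuby/rb-gsl.git', ref: 'COMMIT_SHA'
```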
Notably, `rb-gsl` depends on the `narray` gem. `narray` is deprecated, and the readme suggests using `Numo::NArray` instead.

Changes: In this PR, my goal is to provide an alternative matrix implementation that can perform singular value decomposition quickly and works with Ruby 3. Doing so will make classifier-reborn compatible with Ruby 3 without depending on the unmaintained/unreleased gsl gem. There aren't many gems that provide fast matrix support for Ruby, but Numo seems to be more actively maintained than rb-gsl, and Numo has a working Ruby 3 implementation that can perform a singular value decomposition, which is exactly what we need. This requires numo-narray and numo-linalg.
My goal is to allow users to (optionally) use classifier-reborn with Numo/LAPACK the same way they'd use it with GSL. That is, the user should install the `numo-narray` and `numo-linalg` gems (with their required C libraries), and classifier-reborn will detect and use these if they are found.
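The detect-and-fall-back behavior described above can be sketched roughly as follows. This is an illustration only, not the code in `lsi.rb`: the helper name `detect_svd_backend`, the `SVD_CANDIDATES` table, and the injectable `loader` argument are all hypothetical, and the Numo-then-GSL-then-Ruby order is an assumption. The only things it relies on are that `require` raises `LoadError` when a gem is missing and that the result is one of `:numo`, `:gsl`, or `:ruby`.

```ruby
# Hypothetical sketch of "try loading a library and fall back".
# Candidates are tried in order; the first whose libraries all load wins.
SVD_CANDIDATES = [
  [:numo, ['numo/narray', 'numo/linalg']],
  [:gsl,  ['gsl']]
].freeze

# `loader` defaults to Kernel#require; it is injectable only so the
# fallback logic can be exercised without installing any gems.
def detect_svd_backend(loader = ->(lib) { require lib })
  SVD_CANDIDATES.each do |backend, libs|
    begin
      libs.each { |lib| loader.call(lib) }
      return backend
    rescue LoadError
      next # this backend is unavailable; try the next candidate
    end
  end
  :ruby # no native library found: use the pure-Ruby implementation
end

# Simulate a machine where only the gsl gem is installed.
only_gsl = ->(lib) { raise LoadError, lib unless lib == 'gsl' }
puts detect_svd_backend(only_gsl) # prints "gsl"
```

Calling `detect_svd_backend` with no argument would attempt the real `require` calls, so the result depends on which gems are installed on the machine.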