jekyll / classifier-reborn

A general classifier module to allow Bayesian and other types of classifications. A fork of cardmagic/classifier.
https://jekyll.github.io/classifier-reborn/
GNU Lesser General Public License v2.1

Using classifier-reborn in Jekyll doesn't work? #170

Closed mthmulders closed 6 years ago

mthmulders commented 6 years ago

Not sure if this is the right place to ask... Feel free to redirect my question if it doesn't belong here.

I'm using the following:

I'm trying to use LSI to build "related posts", so that I can enrich a document in Jekyll with references to related documents.

However, the site.related_posts variable in Jekyll is always nil. I investigated this by adding the following snippet in my Liquid template:

<pre><code>{{ site.related_post | inspect }}</code></pre>

And this snippet always renders to

<pre><code>nil</code></pre>

I don't know how to troubleshoot this any further. Any clues / suggestions?

Ch4s3 commented 6 years ago

There are a few possibilities here. There could be a regression in 2.2.0 that we missed, your input could be invalid and Jekyll is swallowing the error, or you could be hitting some odd edge case in the LSI.

Are you using the GSL lib? What does your input look like? Do you have any super short posts, like maybe only a line or two? Does Classifier Reborn 2.1.0 work as expected?
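
To check quickly for very short posts, a sketch along these lines might help. It uses only the Ruby standard library; the helper name `short_posts`, the `_posts` directory, and the 50-word threshold are assumptions for illustration, not anything Jekyll or Classifier Reborn prescribes.

```ruby
# Hypothetical helper: returns [path, word_count] pairs for posts whose
# word count is below min_words. The count includes any YAML front matter,
# so treat it as a rough indicator only.
def short_posts(dir = "_posts", min_words = 50)
  Dir.glob(File.join(dir, "*.{md,markdown}")).sort.filter_map do |path|
    words = File.read(path).split.size
    [path, words] if words < min_words
  end
end

# Print each suspiciously short post with its word count.
short_posts.each { |path, words| puts "#{words}\t#{path}" }
```

Run it from the root of the site; it prints nothing if no post falls under the threshold.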

mthmulders commented 6 years ago

Thanks for the suggestions!

Would there be a way to manually invoke Classifier Reborn on my set of Markdown documents and see if some error occurs?

Ch4s3 commented 6 years ago

Would there be a way to manually invoke Classifier Reborn on my set of Markdown documents and see if some error occurs?

Yes, you can do it from irb. You'll need to require 'classifier-reborn' first.

Then you can set up a new lsi classifier:

require 'classifier-reborn'
lsi = ClassifierReborn::LSI.new

Then read in each of your Markdown files using File.open. After that, feed the Markdown into the classifier as shown in the docs. Something like this:

strings = [["This text deals with dogs. Dogs.", :dog],
           ["This text involves dogs too. Dogs!", :dog],
           ["This text revolves around cats. Cats.", :cat],
           ["This text also involves cats. Cats!", :cat],
           ["This text involves birds. Birds.", :bird]]
strings.each { |x| lsi.add_item x.first, x.last }
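
For the file-reading step, here is a minimal sketch using only the standard library. Jekyll posts normally start with a YAML front matter block between two `---` lines, and this sketch strips it so that only the post body would be fed to the classifier. Whether stripping it matters for the results is an assumption worth testing. The helper names `strip_front_matter` and `read_post_bodies` and the `_posts` default are made up for illustration.

```ruby
# Hypothetical helper: removes a leading YAML front matter block
# (the part between the first two "---" lines) from a post's text.
def strip_front_matter(text)
  text.sub(/\A---[ \t]*\r?\n(?:.*?\r?\n)?---[ \t]*\r?\n/m, "")
end

# Hypothetical helper: maps each post path in dir to its body text.
def read_post_bodies(dir = "_posts")
  Dir.glob(File.join(dir, "*.md")).sort.to_h do |path|
    [path, strip_front_matter(File.read(path))]
  end
end

read_post_bodies.each do |path, body|
  puts "#{path}: #{body.split.size} words"
  # lsi.add_item(body)  # uncomment once the lsi object above exists
end
```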

You'll be trying to use the find_related method. If it blows up, post the error here. If it works, then we should get in touch with our friends over at Jekyll.

mthmulders commented 6 years ago

Thanks again for the suggestions!

I did some experiments with ClassifierReborn::LSI.new, add_item and find_related, while reading raw Markdown files. Since I don't have categories, I skipped the second argument to add_item. It seems to work pretty well, giving me documents that are related to the text I was looking for. So maybe it is indeed something in the Jekyll / Classifier Reborn integration?

For reference, here is the script that I experimented with:

#!/usr/bin/env ruby
require 'classifier-reborn'

lsi = ClassifierReborn::LSI.new

# Posts to index; each file's raw Markdown becomes one LSI item.
paths = [
  "_posts/2013-03-11-ipv6-on-raspbian.md",
  "_posts/2017-02-25-blah-blah-microservices-blah-blah.md",
  "_posts/2017-06-22-jbcnconf-and-voxxedlu.md",
  "_posts/2017-12-30-getting-started-with-zuul.md"
]

paths.each do |path|
  puts "Reading file #{path}"
  # No category is passed, so each post is indexed by its text only.
  lsi.add_item(File.read(path))
end

puts "Finding related stuff"

# Ask for the single item most related to the query text.
related = lsi.find_related("In these days of microservices", 1)

puts "Related text:"
puts related

Ch4s3 commented 6 years ago

Interesting, I was expecting this to reveal an issue. I guess we need to see what's going on over at Jekyll. @parkr @jekyll/administrators I'll file an issue and see if we can get to the bottom of this.

parkr commented 6 years ago

I don’t believe site.related_post is a thing, so that makes sense. Please try site.related_posts instead.

mthmulders commented 6 years ago

Well, now I feel stupid...!

Indeed, site.related_posts contains some related documents. The strange thing, though, is that I can see them in the layout for my post, but not in a separate file that populates a sidebar next to the post content. Will dive into that. Thanks a lot!

Ch4s3 commented 6 years ago

Glad we got this figured out.