jdelStrother closed this issue 5 years ago
Second issue first: definitely sounds like something that should be fixed, probably via an environment variable. I'll look into it soon :)
As for the first: you should only be running the configure task on the Sphinx server - there's no value in having it occur on the client servers. However, if this is only a problem because the index task runs configure automatically, you can use INDEX_ONLY=true. That said, would it make sense to just run all the TS tasks only on your Sphinx server?
you should only be running the configure task on the Sphinx server - there's no value having it occur on the client servers. [....] would it make sense to just run all the TS tasks only on your Sphinx server?
The thing I'm trying to get away from is that we have a big monolithic Rails app with a lot of dependencies (both gems, and compiled libraries like ImageMagick). So right now our Sphinx server needs to have all those irrelevant dependencies installed just so that we can generate a sphinx config file. (Admittedly this approach I'm trying of generating the config file from a Rails server then shipping it over to the sphinx server, is also filled with drawbacks.)
Just pushed some commits to the develop branch which add two boolean settings (which can be turned on per-environment in config/thinking_sphinx.yml): skip_directory_creation and skip_running_check. This should remove the need for your monkey patches, but I'd appreciate confirmation after you've tested it! :)
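For reference, a minimal sketch of how those two settings might be enabled per-environment in config/thinking_sphinx.yml - the choice of the production environment here is illustrative:

```yaml
production:
  # Generate config for a remote searchd; skip local filesystem/daemon checks.
  skip_directory_creation: true
  skip_running_check: true
```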
Yep, both work great, thanks!
Excellent! And that means all the Docker stuff's working well, without your Rails app on the Sphinx server?
Yep, my docker-searchd container seems to be working fine. It's basically just using the macbre/sphinxsearch image with my config file mounted into it.
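A minimal docker-compose sketch of that kind of setup might look like the following - the image tag, port, and the in-container config path are assumptions based on the macbre/sphinxsearch image's conventions, not details from this thread:

```yaml
services:
  searchd:
    image: macbre/sphinxsearch        # tag left unpinned for illustration
    ports:
      - "9306:9306"                   # SphinxQL port Thinking Sphinx connects to
    volumes:
      # Mount the locally generated config into the container (path assumed)
      - ./config/production.sphinx.conf:/opt/sphinx/conf/sphinx.conf:ro
```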
Is there a release scheduled for this feature, @pat? This looks like it could solve some of the issues that have been making me put off de-monolithifying a project I've been working on. Thanks!
There's no release just yet - there's a couple of outstanding issues I want to tackle first - but it's on my radar. With a bit of luck I'll have something out early next week.
Awesome - that sounds great. Thanks heaps!
These settings are now part of the newly released v4.3.0.
Awesome! Thanks, Pat!
@jdelStrother / @xtrasimplicity did anyone experience significant performance issues on index creation?
We're trying the setup mentioned above with @macbre's Sphinx container and the referenced Thinking Sphinx settings skip_running_check & skip_directory_creation, but rake ts:rebuild takes ages.
Does anyone have an idea what the reason could be, or are there any tweaks or other suggestions?
Thank you in advance!
@alexanderadam, We haven't had any major performance issues, but our database is quite small and a few minutes at startup isn't a huge issue for us as searching is only a tiny part of our application's functionality.
You could try increasing the size of /dev/shm from 64MB to something a bit higher, but I'm not sure if that will have any performance benefits for TS.
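If the container is run via Docker Compose, one way to raise that limit is the shm_size option; the 256mb value below is just an example to tune:

```yaml
services:
  searchd:
    image: macbre/sphinxsearch
    shm_size: "256mb"   # Docker's default /dev/shm is 64MB
```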
@alexanderadam Our rebuilds are pretty slow by default, on a database with something like 5 million documents. We've monkeypatched it with an alternative approach:
```ruby
class ThinkingSphinx::RealTime::Populator
  def populate(&block)
    instrument "start_populating"
    # Optional cap on the number of batches, set via the RT_BATCH_LIMIT env var
    limit = ENV["RT_BATCH_LIMIT"]
    count = 0
    scope.find_in_batches(batch_size: batch_size) do |instances|
      break if limit && (count += 1) > limit.to_i
      transcriber.copy(*instances)
      instrument "populated", instances: instances
    end
    instrument "finish_populating"
  end
end
```
When we need to rebuild a sphinx index, we'll run, e.g.:

```shell
bin/rake ts:rebuild INDEX_FILTER=posts_rt_core RT_BATCH_LIMIT=1000
```
just to get sphinx back up-and-running with a few documents, and then incrementally populate it with something like this:
```ruby
Post.find_each do |post|
  ThinkingSphinx::RealTime.callback_for(:post).after_save(post)
end
```
Hi there, I'm currently trying to get thinking-sphinx working with searchd in a docker container, though I think a lot of the same issues apply if you were running searchd on a separate server to your Rails servers. I was hoping to discuss either workarounds that people are using for these cases, or work that we could do on thinking-sphinx to improve that workflow.
There are two main pain points I've been hitting:

1. Config generation seems pretty insistent on calling mkdir_p for various directories, which isn't very useful if you're trying to generate configuration for a remote machine.
2. It seems like we ought to be able to call rake ts:index from a local Rails server and have it populate our realtime indexes on a remote server. However, TS also tries to check that searchd is running (via the pid file) and tries to rotate the index after it's done populating.

In my hacky experimentation I've been working around these with this rake file:
with this docker-compose:
Any thoughts/plans on separating out some of the TS code that only works if you're running Rails & Sphinx side-by-side? Or am I doing it all wrong?
(Previous docker discussions at https://github.com/pat/thinking-sphinx/issues/1010)