pat / thinking-sphinx

Sphinx/Manticore plugin for ActiveRecord/Rails
http://freelancing-gods.com/thinking-sphinx
MIT License

Using TS with a remote sphinx service #1131

Closed: jdelStrother closed this issue 5 years ago

jdelStrother commented 5 years ago

Hi there, I'm currently trying to get thinking-sphinx working with searchd in a Docker container, though I think a lot of the same issues would apply if you were running searchd on a separate server from your Rails servers. I was hoping to discuss either workarounds that people are using for these cases, or work that we could do on thinking-sphinx to improve that workflow.

There are two main pain points I've been hitting: the configure step tries to create Sphinx directories that only exist on the Sphinx server, and the index task insists on checking that searchd is running locally.

In my hacky experimentation I've been working around these with this rake file:

namespace :ts do
  # Mirrors the `interface` helper from ThinkingSphinx's own rake tasks.
  def interface
    @interface ||= ThinkingSphinx::RakeInterface.new
  end

  task docker_configure: :environment do
    config = ThinkingSphinx::Configuration.instance
    # Force the configuration to load the "docker" key out of thinking_sphinx.yml.
    config.framework = ThinkingSphinx::Frameworks::Plain.new.tap do |f|
      f.environment = "docker"
      f.root = Rails.root
    end
    # Sphinx configuration is going to attempt to create some directories that
    # don't exist locally. Ignore them.
    def FileUtils.mkdir_p(dir)
      puts "Ignoring request to mkdir_p('#{dir}')"
    end

    interface.configure
  end

  task docker_index: :environment do
    # Hack to allow running ts:index against a remote Sphinx service.
    ThinkingSphinx::Commander.registry[:running] = proc { puts "fake-sphinx running!" }
    interface.rt.index
  end
end

with this docker-compose:

version: '3'
services:
  sphinx:
    image: macbre/sphinxsearch:3.0.1
    ports:
      - "9306:9306"
    volumes:
      - ./config/docker.sphinx.conf:/opt/sphinx/conf/sphinx.conf
      - ./lib/dict:/opt/sphinx/lib/dict
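For the Rails side to reach that container, the connection settings in config/thinking_sphinx.yml would presumably point at the mapped SphinxQL port, something like this (a sketch; the `docker` key matches the environment forced in the rake task above):

```yaml
docker:
  address: 127.0.0.1
  mysql41: 9306
```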

Any thoughts/plans on separating out some of the TS code that only works if you're running Rails & Sphinx side-by-side? Or am I doing it all wrong?

(Previous docker discussions at https://github.com/pat/thinking-sphinx/issues/1010)

pat commented 5 years ago

Second issue first: definitely sounds like something that should be fixed, probably via an environment variable. I'll look into it soon :)

As for the first: you should only be running the configure task on the Sphinx server - there's no value in having it occur on the client servers. However, if this is only a problem due to the index task running configure automatically, you can use INDEX_ONLY=true. That said, would it make sense to just run all the TS tasks only on your Sphinx server?

jdelStrother commented 5 years ago

you should only be running the configure task on the Sphinx server - there's no value having it occur on the client servers. [....] would it make sense to just run all the TS tasks only on your Sphinx server?

The thing I'm trying to get away from is that we have a big monolithic Rails app with a lot of dependencies (both gems and compiled libraries like ImageMagick). Right now our Sphinx server needs all those irrelevant dependencies installed just so that we can generate a Sphinx config file. (Admittedly, the approach I'm trying, generating the config file on a Rails server and then shipping it over to the Sphinx server, has its own drawbacks.)

pat commented 5 years ago

Just pushed some commits to the develop branch which add two boolean settings (which can be turned on per-environment in config/thinking_sphinx.yml): skip_directory_creation and skip_running_check. This should remove the need for your monkey patches, but I'd appreciate confirmation once you've tested it! :)
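In config/thinking_sphinx.yml that would presumably look something like this (a sketch, using the docker environment from earlier in the thread):

```yaml
docker:
  skip_directory_creation: true
  skip_running_check: true
```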

jdelStrother commented 5 years ago

Yep, both work great, thanks 🎉

pat commented 5 years ago

Excellent! And that means all the Docker stuff's working well, without your Rails app on the Sphinx server?

jdelStrother commented 5 years ago

Yep, my docker-searchd container seems to be working fine. It's basically just using the macbre/sphinxsearch image with my config file mounted into it.

xtrasimplicity commented 5 years ago

Is there a release scheduled for this feature, @pat? This looks like it could solve some of the issues that have been making me put off de-monolithifying a project I've been working on. Thanks!

pat commented 5 years ago

There's no release just yet - there's a couple of outstanding issues I want to tackle first - but it's on my radar. With a bit of luck I'll have something out early next week 🤞

xtrasimplicity commented 5 years ago

Awesome - that sounds great. Thanks heaps!

pat commented 5 years ago

These settings are now part of the newly released v4.3.0 🎉

xtrasimplicity commented 5 years ago

Awesome! Thanks, Pat!

On Sat., 18 May 2019, 12:54 Pat Allan, notifications@github.com wrote:

Closed #1131 https://github.com/pat/thinking-sphinx/issues/1131.


alexanderadam commented 4 years ago

@jdelStrother / @xtrasimplicity, did anyone experience significant performance issues on index creation? We're trying the setup mentioned above with @macbre's Sphinx container and the referenced Thinking Sphinx settings skip_running_check & skip_directory_creation, but rake ts:rebuild does indeed take ages.

Does anyone have an idea what the reason could be, or are there any tweaks or other suggestions?

Thank you in advance!

xtrasimplicity commented 4 years ago

@alexanderadam, we haven't had any major performance issues, but our database is quite small and a few minutes at startup isn't a huge issue for us, as searching is only a tiny part of our application's functionality.

You could try increasing the size of /dev/shm from 64MB to something a bit higher, but I'm not sure if that will have any performance benefits for TS.
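If anyone wants to try that, /dev/shm can be resized per service in docker-compose (a sketch, assuming your Compose version supports the service-level shm_size option; extending the compose file from earlier in the thread):

```yaml
services:
  sphinx:
    image: macbre/sphinxsearch:3.0.1
    shm_size: "256mb"
```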

jdelStrother commented 4 years ago

@alexanderadam Our rebuilds are pretty slow by default, on a database with something like 5 million documents. We've monkeypatched it with an alternative approach:

class ThinkingSphinx::RealTime::Populator
  def populate(&block)
    instrument "start_populating"

    limit = ENV["RT_BATCH_LIMIT"]&.to_i
    count = 0
    scope.find_in_batches(batch_size: batch_size) do |instances|
      break if limit && (count += 1) > limit
      transcriber.copy(*instances)
      instrument "populated", instances: instances
    end

    instrument "finish_populating"
  end
end
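The batch-limit logic in that patch can be sketched in isolation (a standalone illustration, not ThinkingSphinx code; `limited_batches` is a hypothetical helper):

```ruby
# Standalone sketch of the RT_BATCH_LIMIT idea: slice records into
# fixed-size batches, and stop after an optional batch limit.
def limited_batches(records, batch_size:, limit: nil)
  batches = records.each_slice(batch_size).to_a
  limit ? batches.take(limit) : batches
end

# With a limit of 2 batches of 3, only the first six records are kept.
limited_batches((1..10).to_a, batch_size: 3, limit: 2)
# => [[1, 2, 3], [4, 5, 6]]

# Without a limit, everything is kept: four batches in total.
limited_batches((1..10).to_a, batch_size: 3).length
# => 4
```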

When we need to rebuild a Sphinx index, we'll run, e.g.:

bin/rake ts:rebuild INDEX_FILTER=posts_rt_core RT_BATCH_LIMIT=1000

just to get sphinx back up-and-running with a few documents, and then incrementally populate it with something like this:

Post.find_each do |post|
  ThinkingSphinx::RealTime.callback_for(:post).after_save(post)
end