pat / thinking-sphinx

Sphinx/Manticore plugin for ActiveRecord/Rails
http://freelancing-gods.com/thinking-sphinx
MIT License

thinking-sphinx with docker-compose #1010

Closed: raviada closed this issue 8 years ago

raviada commented 8 years ago

Hello Pat, I am trying to set up our development and deployment environments with Docker containers. I am using Docker for Mac, and I have configured the Rails, MySQL, and Sphinx containers to work together via docker-compose. They all come up fine, and Rails talks to MySQL without problems. How do I configure thinking-sphinx to create indexes on the Sphinx container by connecting to the MySQL container? Previously I ran everything on one box; now these containers act like separate machines. Please share your experiences working with Docker; a docker-compose file would be a big help. Thanks in advance.

pat commented 8 years ago

I'm afraid I've not used Docker at all, so I'm not much help. This has been discussed a bit though in issue #975 - so that's probably a good place to start?

gingerlime commented 7 years ago

Hi @raviada - I'm still looking for a proper solution to this for a production / scalable environment. So far I've only been able to run Sphinx inside the same container as our Rails app, and this works. But I don't think it's production-ready: in production you'd ideally have more than one Rails container, so you end up with multiple instances of Sphinx, which can easily go out of sync with each other.

Here's our Dockerfile for development:

# Base image: Ruby 2.3
FROM ruby:2.3

# Build tools plus Qt/WebKit, ImageMagick, and PostgreSQL/MySQL client libraries
RUN apt-get update -qq && apt-get install -y build-essential cmake libpq-dev imagemagick qt5-default libqt5webkit5-dev libmysqlclient-dev libodbc1

ENV RAILS_ENV development

ENV app /app
RUN mkdir $app
WORKDIR $app

# Build Sphinx from source with both PostgreSQL and MySQL support
RUN wget http://sphinxsearch.com/files/sphinx-2.2.9-release.tar.gz
RUN tar -zxvf sphinx-2.2.9-release.tar.gz
RUN cd sphinx-2.2.9-release && ./configure --with-pgsql --with-mysql && make && make install

# Install gems before copying the app, so the bundle layer is cached
ADD Gemfile $app/Gemfile
ADD Gemfile.lock $app/Gemfile.lock
ADD config/database.yml.docker-template $app/config/database.yml
RUN bundle install
ADD . $app

CMD rails s -b 0.0.0.0

@mateusz-useo might have been more successful than I was... I'm pretty sure it's doable, but far from trivial, unfortunately. Sadly, it might be easier to switch to something like Elasticsearch than to wrestle with getting Sphinx/TS working in a dockerized environment.

pat commented 7 years ago

So, my understanding of Docker has increased slightly since this issue was first logged - though it's still minimal. But what I'd recommend is that the Sphinx container has a copy of the Rails app in it, so you can issue the TS rake tasks to it, but it doesn't have a web server running and is only for Sphinx.

Thus, you'll have a single container for Sphinx, and then as many web containers as you like.
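A minimal docker-compose sketch of that topology could look like the following. This is purely illustrative: the service names, build commands, and ports are all assumptions, not taken from this thread.

```yaml
# Hypothetical docker-compose.yml: one Sphinx container (holding a copy of
# the app so the TS rake tasks can run there, but no web server), plus
# web containers that can be scaled independently.
version: "2"
services:
  web:
    build: .
    command: rails s -b 0.0.0.0
    ports:
      - "3000"
  sphinx:
    build: .
    command: sh -c "rake ts:rebuild && tail -f log/development.searchd.log"
    ports:
      - "9306"
  db:
    image: mysql:5.7
```

With a layout like this, something along the lines of `docker-compose up --scale web=3` would give several web containers all talking to the single `sphinx` service.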

gingerlime commented 7 years ago

Thanks @pat

I'm assuming running queries from your Rails web containers to the Sphinx container can be done over the wire using the mysql41 protocol? (That's the same as having a remote Sphinx instance, but I've never done it, so I'm not too sure about the finer details.)

What if you want to reindex e.g. from a background job though? (that's what I ended up doing in #1048 for example).

pat commented 7 years ago

mysql41 is just TCP, so it'll be done over the specified port without any issues 👍

As for re-indexing from a background job, I guess you'll want the background worker and Sphinx in the same container? Or at least, a worker for Sphinx-specific jobs in the Sphinx container (and then other background worker containers perform all non-Sphinx jobs).

gingerlime commented 7 years ago

I'll have to play around with it and see... any pointers on using a remote sphinx instance?

Regarding the re-indexing, that's just one example. The problem with this setup is that you always need to be aware of what invokes the underlying searchd or indexer binaries, and make sure it runs in the right container. It sounds very limiting and awkward, I have to say.

pat commented 7 years ago

A remote Sphinx instance is going to be similar to a remote database - you'll need to consider where the files are located (and have them backed up / re-used between instance boots).

As for reliance on the binaries - essentially it boils down to three key aspects that they're required for:

gingerlime commented 7 years ago

Thanks @pat. I think I understand the what, but not so much the how...

A remote Sphinx instance is going to be similar to a remote database - you'll need to consider where the files are located (and have them backed up / re-used between instance boots).

Which files are you referring to? (yaml files?). And yes, it's similar. But as a client to a remote database, you typically don't need to trigger any re-indexing and all operations are available over-the-wire (apart from rather rare DBA-type maintenance). With Sphinx/TS it's quite common to need to trigger those operations, and when the instance is remote it's suddenly far from trivial...

I wish there was some way to trigger those jobs over the mysql41 API itself, or if TS had some kind of a REST API that can be accessed remotely for those operations.

How would you carry out those operations that require binaries to run with a setup that has a dedicated Sphinx box connected by rails "clients" ?

pat commented 7 years ago

Two different thoughts on this: firstly, to keep as many operations happening via SphinxQL commands over the mysql41 protocol, you could consider switching to real-time indices. Real-time indices can only be created/updated via SphinxQL commands, so it removes all need for the indexer binary. Depending on how you're using Sphinx, of course, there could be other challenges from this switch, but I think it's worth investigating.

As for the files - when I was playing a little with Docker earlier this year, I saw recommendations of using PostgreSQL from within a Docker container, but linking it to my host machine's file system to ensure the database files were persisted across boots. I'm not sure if this is the way to do things normally (beyond development environments), but this is what I was thinking of with my last message.

The files in question for Sphinx would be the configuration file, the index files, logs, and perhaps the binlog files as well (all things that are configured via config/thinking_sphinx.yml). The binlog files are only useful between boots if the daemon crashes, so perhaps they're not so important in this scenario.

gingerlime commented 7 years ago

Thanks again @pat.

Real-time indices look interesting. What's their performance impact, though, in your experience? I'm also a bit hesitant about relying on callbacks (isn't this akin to managing model caching / sweepers? cache invalidation is one of those sticky problems that always bites you when you least expect it).

Just curious - Is it technically possible to trigger those binary calls via SphinxQL over mysql41? or is it entirely impossible and the protocol doesn't support these types of "commands"?

As for files - as far as I can tell, Sphinx can be pretty stateless. Any files / configs, as well as search indexes are generated and then effectively cached. So if you have a running sphinx container, you load it once, and then either keep it running for as long as you need, or replace it with a new container. The new container will have to re-compile the configs and load the index etc, but once running, it's ready. So in that sense it's not really like PG or any other database. You don't lose any real data when you load your sphinx container "from scratch".

pat commented 7 years ago

I use real-time indices in all of my current projects where TS is being used, and don't notice any performance hits for the most part. Yes, the callbacks aren't ideal - I'm on board with your concerns there! - but it removes the need for deltas.

However, the initial indexing (which is done via the ts:generate or ts:regenerate tasks) is certainly slower, because every record is instantiated from within Rails, rather than via SQL queries. With this in mind, it's why I'd actually look at storing the Sphinx files in your container between boots - granted, this isn't such a big problem when developing locally, provided you don't have a huge amount of data.

Even with the callbacks, I'd still recommend having a scheduled cron job running ts:generate daily, to catch any data updates that haven't fired the callbacks.

From a quick scan of the SphinxQL docs, it doesn't look like there's anything in there for invoking indexer or searchd: http://sphinxsearch.com/docs/current.html#sphinxql-reference - so I'm afraid you can't avoid the dependency on the binaries completely (though as mentioned previously, indexer is no longer needed when using real-time indices).

gingerlime commented 7 years ago

That's good to know, @pat. (I especially appreciate you being realistic about the trade-offs here.)

I'm currently deliberating between adding a dedicated Sidekiq process (and queue) inside the Sphinx container, using it as a tool to remotely trigger re-index operations by simply launching an async job, or using real-time indices. Both options have some pros and cons, so we might just flip a coin and hope for the best ;-)

It's not something we're in a rush to implement, but when we do, I'll be sure to keep you posted, and maybe share some of our configs, recipes, etc.

gingerlime commented 7 years ago

Thanks again for being so responsive and open, @pat.

pat commented 7 years ago

Appreciate the feedback, and it's great to know my comments are appreciated :) Any notes from your experiences down either path would be great - good luck with putting it all together!

webgem-jpl commented 6 years ago

I'm curious to know what you ended up doing. I'm working on the same problem, and for now I've picked the Sphinx-plus-delta-worker design using a Delayed Job queue.

gingerlime commented 6 years ago

Hi @webgem-jpl. I didn't post an update since we never released this in production, but we did create a solution that seems to work: run Sphinx in its own container, alongside a dedicated Sidekiq worker (on its own queue) so that any job touching the Sphinx binaries runs inside that container.

Our sphinx Dockerfile was essentially the same as the one I pasted above, but without the CMD rails s -b 0.0.0.0 directive.

Here's the launch script for the sphinx container:

#!/bin/bash

bundle check || bundle install
# this makes sure sphinx is running and listening for queries
bundle exec rake ts:rebuild
# this launches sidekiq on the `sphinx` queue
bundle exec sidekiq -C config/sidekiq.yml -q sphinx

Hope this helps. I think it's a reasonable solution, but obviously has some limitations and probably isn't ideal in docker / unix philosophy terms...
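To make the remote-trigger idea concrete, the job that other containers enqueue onto that sphinx queue might look like this. It's a sketch: the class name is invented, it assumes the sidekiq gem, and the queue name must match the `-q sphinx` flag in the launch script above.

```ruby
# app/jobs/sphinx_reindex_job.rb (hypothetical)
# Enqueued from anywhere (web, other workers), but executed only by the
# Sidekiq process inside the Sphinx container, which is the only process
# listening on the "sphinx" queue -- and the only place the indexer and
# searchd binaries exist.
class SphinxReindexJob
  include Sidekiq::Worker
  sidekiq_options queue: :sphinx

  def perform
    # ts:index shells out to the indexer binary, which only exists here.
    system("bundle exec rake ts:index") or raise "ts:index failed"
  end
end
```

Any container could then call `SphinxReindexJob.perform_async` without having Sphinx installed locally.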

xtrasimplicity commented 6 years ago

@gingerlime Please forgive me if this is a stupid question, but with that particular model/setup, how do you handle inter-container communication?

We have a dokku environment in production at the moment and don't have the time to migrate all of our apps away from Herokuish buildpacks, so I've written a custom sphinx buildpack and have launched two containers via a Procfile as follows:

thinking_sphinx: bundle exec rake ts:index && bundle exec rake ts:restart && bundle exec rake ts:periodically_reindex
web: bundle exec rackup -s puma -p $PORT -E $RACK_ENV

I've then set up a shared volume to store the indices, so that they're accessible from each container. Nine times out of ten this works, but it will sporadically fail with errors stating that Sphinx can't connect to the MySQL server.

In this setup, as with yours, each container has sphinx and thinking-sphinx installed and the thinking_sphinx container starts the Sphinx daemon and periodically re-indexes using a custom rake task (which just re-executes the ts:index rake task, sleeps for 90 seconds and then repeats). I'm using real-time indices.

For reference, I've set the config/thinking-sphinx.yml file to:

<%= Rails.env %>:
  mysql41: <%= ENV['SPHINX_PORT'] %>
  indices_location: <%= ENV['SPHINX_INDICES_LOCATION'] %>
  configuration_file: <%= ENV['SPHINX_CONFIGURATION_FILE_PATH'] %>
  log: <%= ENV['SPHINX_LOG_FILE_PATH'] %>
  query_log: <%= ENV['SPHINX_QUERY_LOG_FILE'] %>
  pid_file: <%= ENV['SPHINX_PID_FILE'] %>

Any insight as to how you set up the web containers to reference/communicate with the Sphinx container would be greatly appreciated.

Thanks!

gingerlime commented 6 years ago

Hi @xtrasimplicity, this seems rather specific, and I'm mostly guessing here...

The main thing to check is that you publish the Sphinx container via Docker and give it a name (e.g. sphinx), and then make sure you're accessing it via this name from the other containers.

xtrasimplicity commented 6 years ago

Thanks for the prompt response - it's much appreciated. I ended up upgrading to Sphinx v3.0.1, which solved the MySQL connection errors.

The plan is to only have one Sphinx container running at any time, with the web containers connecting to the single Sphinx container. Whilst each web container would also have Sphinx installed, as I'm using herokuish buildpacks, my intention would be that the local versions aren't used.

I've set the address attribute in my thinking-sphinx YAML file to the container's name, but get a FATAL: no AF_INET address found for: thinking_sphinx error on deployment. I suspect my issues are caused by a lack of understanding of inter-container communication in dokku (or most likely, docker), so I'll read up a bit more before I continue.

For the moment, I've got it running with a single web instance without any issues, so I'll move on to other things and come back to this next week.

Thanks once again for your suggestions! :)

gingerlime commented 6 years ago

just off the top of my head, some suggestions:

dikey94 commented 6 years ago

I'd like to refresh this topic. What I ended up with is installing the Sphinx binary inside the running container, but I don't know how to share the configuration. My example is quite simple; there is only one instance of the Rails application.

I'd appreciate any help.

ncri commented 5 years ago

If anyone is interested, I got this setup working too. Running sphinx in a separate docker container, which includes the app code and a slightly different thinking_sphinx config. The main app container containing the sphinx client needs to have the connection options set to address the sphinx container, e.g.:

  connection_options:
    host: "sphinx"
    port: "9306"

It is important that the thinking_sphinx config inside the container does not have these options. So when I build the Sphinx image, I overwrite the config file with a custom one made for the container, omitting connection_options.

Let me know if you have questions on how to set it all up.

dikey94 commented 5 years ago

@ncri I'm interested in your docker-compose.yml file and both Dockerfiles.

Thanks :]

ncri commented 5 years ago

These are the relevant parts of the docker-compose.yml (I omitted volumes and dependencies; also, at the moment there is no volume for the Sphinx indexes, so they simply sit in the sphinx container):

  app:
    build:
      context: .
      dockerfile: Dockerfile.dev
    command: sh start_server.sh
  sphinx:
    build:
      context: .
      dockerfile: DockerfileSphinx.dev
    command: sh start_sphinx.sh

Dockerfile.dev:

FROM starefossen/ruby-node:2-4

RUN apt-get update -qq && \
    apt-get install -y nano build-essential libpq-dev && \
    npm cache clean -f && \
    npm install -g n && \
    n stable && \
    gem install bundler

WORKDIR /usr/src/app

COPY Gemfile Gemfile.lock ./
COPY components ./components
RUN bundle install

EXPOSE 3000

COPY . .

DockerfileSphinx.dev (code partly copied from: https://github.com/macbre/docker-sphinxsearch/blob/master/Dockerfile):

FROM starefossen/ruby-node:2-4
ENV SPHINX_VERSION 3.0.3-facc3fb

RUN apt-get update -qq && apt-get install -y \
        mysql-client unixodbc libpq5 wget

RUN apt-get install -y nano build-essential libpq-dev && \
    npm cache clean -f && \
    npm install -g n && \
    n stable && \
    gem install bundler

# set timezone
# @see http://unix.stackexchange.com/a/76711
RUN cp /usr/share/zoneinfo/CET /etc/localtime && dpkg-reconfigure --frontend noninteractive tzdata

# set up and expose directories
RUN mkdir -pv /opt/sphinx/log /opt/sphinx/index

# http://sphinxsearch.com/files/sphinx-3.0.3-facc3fb-linux-amd64.tar.gz
RUN wget http://sphinxsearch.com/files/sphinx-${SPHINX_VERSION}-linux-amd64.tar.gz -O /tmp/sphinxsearch.tar.gz
RUN cd /opt/sphinx && tar -xf /tmp/sphinxsearch.tar.gz
RUN rm /tmp/sphinxsearch.tar.gz

# point to sphinx binaries
ENV PATH "${PATH}:/opt/sphinx/sphinx-3.0.3/bin"
RUN indexer -v

WORKDIR /usr/src/app

COPY Gemfile Gemfile.lock ./
COPY components ./components
RUN bundle install

COPY . .

COPY ./config/thinking_sphinx_search_container.yml ./config/thinking_sphinx.yml

EXPOSE 9306

start_sphinx.sh:

rake ts:start

tail -f log/development.searchd.query.log -f log/development.searchd.log

piclez commented 4 years ago

Hi, I'm looking to finish my Sphinx setup with Rails and Docker, but I'm unable to find a complete working project. Would anyone mind sharing a working repo or gists? Thanks.

xtrasimplicity commented 4 years ago

All of mine are closed source, but I'll try to create an MCVE when I get a moment.


xtrasimplicity commented 4 years ago

Hi, I'm looking to finish my Sphinx setup with Rails and Docker, but I'm unable to find a complete working project. Would anyone mind sharing a working repo or gists? Thanks.

Hi @piclez, here's a gist: https://gist.github.com/xtrasimplicity/662d7bc33d6875bbd0a454110a289496

Note: I use real-time indices. I've also stripped this from a closed-source app, so there may be a few small things missing. Feel free to ping me if you have any issues. :)

We've been using this in production for about a year and it's been great!

jerome313 commented 4 years ago

Hi @xtrasimplicity, I have used your setup, but whenever I run docker-compose up, the Sphinx container exits because it cannot connect to the MySQL server, with this error:

Mysql2::Error::ConnectionError: Can't connect to local MySQL server through socket '/run/mysqld/mysqld.sock' (2)

Please let me know if you have an idea about the cause for this.

xtrasimplicity commented 4 years ago

@jerome313, make sure that your Sphinx container is configured to connect to your DB container over TCP (using the container's hostname) for MySQL. That error means it is trying to connect to a local socket, and fails because there is no MySQL server running inside the Sphinx container bound to a socket at that path.
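In docker-compose terms that usually means setting an explicit host in config/database.yml; here's a sketch, where the service name `db`, database name, and credentials are all assumptions:

```yaml
# Inside the Sphinx container: connect to the MySQL *service* over TCP,
# not to a local socket that doesn't exist there.
development:
  adapter: mysql2
  host: db        # the MySQL service name in docker-compose.yml
  port: 3306
  database: myapp_development
  username: app
  password: <%= ENV["MYSQL_PASSWORD"] %>
```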

I've updated my example Gist with a slightly newer approach.