pat / thinking-sphinx

Sphinx/Manticore plugin for ActiveRecord/Rails
http://freelancing-gods.com/thinking-sphinx
MIT License
1.63k stars 468 forks source link

Better support for sharding in lifecycle callbacks #1166

Closed lunaru closed 4 years ago

lunaru commented 4 years ago

@pat I wanted to create this issue to start this discussion to make sure we've exhausted all of the documented solutions.

Right now, we have a model with enough records that a single index is not practical, so instead, we have something like this:

N.times do |i|
  ThinkingSphinx::Index.define :document, name: "document_#{i}", :with => :active_record, :delta => ThinkingSphinx::Deltas::SidekiqDelta do

    where "MOD(...) = #{i}"
  end
end

However, this as the problem that when a record is modified, it ends up creating N delta jobs, one for each index. If N is large enough, this becomes a problem.

It looks like the offending code is here: https://github.com/pat/thinking-sphinx/blob/1adcf8159fdbcb96a70fb01f5d8a15c2070090e6/lib/thinking_sphinx/active_record/callbacks/delta_callbacks.rb#L16

So the question is: Is there a better way to do sharding that avoids this landmine? Or conversely, if this is the best way to shard, then would a PR to help solve this problem be welcome?

pat commented 4 years ago

Hi @lunaru

This is a great point, and you're right that the existing solutions aren't ideal. I was thinking about whether a custom IndexSet class could help, or a subclass of ThinkingSphinx::Deltas::SidekiqDelta, but neither is going to provide the ability to filter indices based on the current model instance, just the model.

If there are ideas for PRs, please do share, because I'd love your experience (especially given I'm not using a sharding approach for any of my apps). I'm currently wondering if changing IndexSet filtering to accept an instance (much like it accepts a model class, see line 46 of that delta callback file), and then that would allow picking and choosing specific shards based on that instance. Other ideas are welcome too!

Also: I guess another way is to use real-time indices rather than deltas, and then you've got a touch more control over the callbacks in the model, and can invoke them a little more specifically. But that is a significant change if you're comfortable with SQL-backed indices.

lunaru commented 4 years ago

@pat thanks for sharing your thoughts. The idea that we have to solve this would be to allow the model to optionally implement ts_index_names.

Each of the Callback classes would then check instance.respond_to? :ts_index_names and if it does, it will select the indices against those names to narrow the indices that get update/delete/etc calls.

If you think this approach is agreeable, we'll take a stab at the PR ASAP.

pat commented 4 years ago

Hi @lunaru - sorry for the delay in getting back to you, but I've just merged some changes into the develop branch that should help solve this issue: I've reworked IndexSet so it accepts specific instances, and that's what the callbacks pass through. Which means that something like the following should help:

class IndexSet < ThinkingSphinx::IndexSet
  private

  def indices
    return super if instances.empty?

    super.select do |index|
      # this block below will return all core and delta indices for the
      # instance's model, filtered by the shard id for the instance.
      instances.any? { |instance| index.name[/_#{instance.shard_id}_$/] }
    end
  end
end

# This line would go in an initialiser:
ThinkingSphinx::Configuration.instance.index_set_class = IndexSet

The shard_id method in the example could be replaced with your MOD logic, and thus returns the integer of the index an instance is tied to. If you'd like to give this a shot and confirm if it works for you, that'd be great! Then I can look into getting a new gem release published.

lunaru commented 4 years ago

@pat awesome, we’ll give this a try this week (hopefully Monday) and let you know. I take it that your refactor is already live on the master branch and we’d just need to pin our gemfile to master to use it?

pat commented 4 years ago

master is for released gem versions, so it's only in develop for now, but that's considered stable so there shouldn't be any issues using that branch as the source.

lunaru commented 4 years ago

@pat Good news: It looks like this works brilliantly. The only difference with the index_set_class call had to be done differently because it looks like the configuration was getting wiped upon initialization. Here's how we ended up doing it:

class ShardedIndexSet < ThinkingSphinx::IndexSet
  private

  def indices
    return super if instances.empty?

    super.select do |index|
      # this block below will return all core and delta indices for the
      # instance's model, filtered by the shard id for the instance.
      instances.any? { |instance| !instance.respond_to?(:shard_id) || instance.shard_id.blank? || index.name[/_#{instance.shard_id}_/] }
    end
  end
end

# this needs to be set after initializers for the thinking-sphinx gem are run
Rails.application.config.to_prepare do
  ThinkingSphinx::Configuration.instance.index_set_class = ShardedIndexSet
end

Specifically Rails.application.config.to_prepare runs late enough to let us override the default index set class.

lunaru commented 4 years ago

I'll go ahead and close this issue and await the next release version

pat commented 4 years ago

Thanks for the confirmation, and the note about the initialiser. I'll try to get the new gem release out soon!

pat commented 4 years ago

I know I've already noted this in #1173, but wanted also to mention that the newly released v5.0 includes the change discussed here with instances being available in custom IndexSet subclasses :)