allow callbacks as background jobs

akostadinov commented 2 years ago

Current project I'm working on tries to do index work as background jobs. Where index worker method is like this:

  def perform(model)
    callback = ThinkingSphinx::RealTime.callback_for(model.class.name.underscore)
    callback&.after_commit model
  end

This mostly works but not for models using STI (single-table inheritance). For models where index is defined on the base_class, then one has to use model.class.base_class and for models where index is on descendant, then #base_class shouldn't be used. So it is an error prone and ugly business.

Would be useful to be able to add callbacks like:

ThinkingSphinx::Callbacks.append(self, {behaviours: [:real_time], background: true})

Or some other way to allow that easily.

It also applies to ts:index. It would be useful to queue indexing in background job batches. Presently we use the above indexation worker and can't benefit from the new clean-up functionality.

== thinking-sphinx 5.4 on rails 5 (in process of upgrading).

/related to #1038/

akostadinov commented 2 years ago

FYI this is how I implemented it in the project:

class SphinxIndexationWorker < ApplicationJob
  def perform(model, id)
    indices_for_model(model).each do |index|
      instance = index.scope.find_by(model.primary_key => id)

      if instance
        reindex(index, instance)
      else
        delete_from_index(index, id)
      end
    end
  end

  private

  def indices_for_model(model)
    ThinkingSphinx::Configuration.instance.index_set_class.new(classes: [model])
  end

  def reindex(index, instance)
    # some indices are defined on model#base_class (*Plan) some on model itself (CMS::Page)
    callback = ThinkingSphinx::RealTime.callback_for(index.model.name.underscore)
    callback&.after_commit instance
  end

  def delete_from_index(index, *ids)
    # This is how deletion is performed by ThinkingSphinx::ActiveRecord::Callbacks::DeleteCallbacks#delete_from_sphinx
    ThinkingSphinx::Deletion.perform index, ids
  end
end

But it is good if it allows customization for different models. Because for accounts I have a little optimization to also soft delete dependent accounts:

    indices_for_model(model).each do |index|
      if account && index.scope.find_by(pk => id)
        reindex(index, account)
      elsif account && !account.master?
        delete_from_index(index, id, *account.buyers.pluck(:id))
      else
        delete_from_index(index, id)
      end
    end

pat commented 2 years ago

I have considered allowing real-time updates/deletions to occur via background jobs… but I've held off because I'm not sure if it's much of a performance benefit?

If it happens immediately:

Potentially run SQL to pull in associated data for the Sphinx index schema.
Send the update statement to Sphinx.

Compared to it happening via a background job:

Store the job details in a data-store (e.g. Redis)
Load the instance from the database again.
Potentially run SQL to pull in associated data for the Sphinx index schema.
Send the update statement to Sphinx.

I guess it's very much an app-specific consideration. There's also the challenge of providing appropriate integration with ActiveJob vs background job gems directly - for example, my understanding is that Sidekiq's performance is better when used directly rather than via ActiveJob.

This mostly works but not for models using STI (single-table inheritance). For models where index is defined on the base_class, then one has to use model.class.base_class and for models where index is on descendant, then #base_class shouldn't be used. So it is an error prone and ugly business.

This does sound painful - and I'm a little surprised, because I would hope that using the index_set_class with the classes option should navigate the hierarchy of inherited classes correctly. If you're finding that's not the case, then that sounds like a bug to me. Can you provide an example of where it's not working?

pat commented 2 years ago

And just to clarify: I'm not against background jobs for running callbacks in theory, but right now it feels like it involves adding too much code to Thinking Sphinx for what feels like an edge-case. I wonder if it could be done as a separate gem instead of adding to TS directly?

akostadinov commented 2 years ago

@pat, thank you for looking into it. I understand that it is hard to implement this in a way to fit everybody.

I don't know about ActiveJob vs direct Sidekiq performance penalties. In my current project we use both and handle a huge amount of jobs. I hope that later this year we will have better monitoring and data to see if there is any practical difference. In general I'm always in favor of using a standard instead of some implementation.

I think it is practical though, to first make it easy to implement a custom worker and then a common solution may more naturally occur.

Getting indices:

ThinkingSphinx::Configuration.instance.index_set_class.new(classes: [model])

(Re)index:

    callback = ThinkingSphinx::RealTime.callback_for(index.model.name.underscore)
    callback&.after_commit instance

Delete from index:

ThinkingSphinx::Deletion.perform index, ids

The above is messing up with library internals. If one can do

ThinkingSphinx::RealTime.index(instance)
ThinkingSphinx::Deletion.perform(model, id)

Or some equivalent easy way, that would be a huge improvement IMHO and allow indexing operations anywhere in object's lifecycle.

My issue with STI is most likely because I'm using library internal details to get the callbacks on the fly (as written above) and that's different from calling the actually registered callbacks on a model.

While on it, another benefit from doing indexing in the background is that this not only reduces request latency (Redis is supposedly fast) but also decouples requests from indexing so Sphinx errors would not fail it. Although Sphinx seems to be very stable.

pat commented 2 years ago

This is a great suggestion. I've just added the following to the develop branch if you want to give it a shot:

processor = ThinkingSphinx::Processor.new(instance)

# to delete from all relevant indices:
processor.delete

# to add/update in all relevant real-time indices:
processor.upsert

While on it, another benefit from doing indexing in the background is that this not only reduces request latency (Redis is supposedly fast) but also decouples requests from indexing so Sphinx errors would not fail it. Although Sphinx seems to be very stable.

That's a good point about decoupling Sphinx errors - though yes, hopefully Sphinx is pretty stable with receiving updates! 😅 But perhaps this new processor class helps with your background job approach?

akostadinov commented 2 years ago

In the background job I already don't have an instance. I only have an id. So it will not help for the delete case. Any suggestions?

pat commented 2 years ago

Belatedly coming back to this, sorry for the delay! I've pushed a commit today to the develop branch that changes the possible arguments for ThinkingSphinx::Processor:

processor = ThinkingSphinx::Processor.new(instance: instance)
# or
processor = ThinkingSphinx::Processor.new(model: Model, id: 123)

Hopefully this covers what's needed for you with the delete case?

akostadinov commented 2 years ago

I looked at the commit and it looks great. Thank you! I have some very high priority things to do at the moment but I hope to be able to try out this within 2 weeks.

pat commented 1 year ago

Half a year later, but finally pushed a new gem release that includes these changes (v5.5.0). Do let me know if you give it a shot and whether there's further evolutions that would be helpful!

In the meantime I'm going to close this issue, but happy to re-open later if required.

softbrada commented 1 year ago

is there a chance something similar to be implemented for plain indices also ? so they can be reindexed in parallel?

akostadinov commented 1 year ago

This fell through the cracks, sorry about that. I just now checked this better. It still not fully useful to upsert or delete depending on index scope :grimacing:

Will you be open if I later submit a new method Processor#update that will check whether instance is part of the index scope and then decide to upsert or delete it from that index?

Presently I see that model.find is used which doesn't account for the scope.

pat commented 1 year ago

No worries about the delay @akostadinov - clearly I'm not being super prompt at responding to open-source things at the moment (life is extra busy this year).

I'm not entirely against a PR, I guess my concern is: is the behaviour of deleting if outside of a scope tied to your application, rather than a more general approach? (i.e. where deletions would only happen if the record is no longer in the database)

pat commented 1 year ago

@softbrada very sorry for the slow reply - perhaps you've found a different approach, but I just wanted to note that plain (SQL-backed) indices don't really handle parallelisation, because SQL-backed indices require rewriting files. If one index/file is being rewritten by one process, no other processes can/should touch it (and both Sphinx and Thinking Sphinx have some safeguards around this).

akostadinov commented 6 months ago

@pat , please consider this pull request #1258

I think it is not application specific. If one specified an index scope, then most expected would be for that scope to filter which objects should be indexed and which - not. I found another issue that sounds similar or the same request - #1253 .

This obviously comes at an additional database roundtrip cost but since the new method is optional to use, then I think it should be fair.

The existing use cases should remain the same.

pat / thinking-sphinx

allow callbacks as background jobs #1215