openSUSE / open-build-service

Build and distribute Linux packages from sources in an automatic, consistent and reproducible way #obs
https://openbuildservice.org
GNU General Public License v2.0

clockwork queuing delayed jobs #3341

Open hennevogel opened 7 years ago

hennevogel commented 7 years ago

The clockwork setup assumes that the time between scheduling a delayed job and the execution of the job is smaller than the interval. For instance:

every(30.seconds, 'send notifications') do
  SendEventEmails.new.delay(queue: 'mailers').perform
end

doesn't mean that mails are sent every 30 seconds. It means that a new SendEventEmails job is added to the queue every 30 seconds.

This means that if the queue has other jobs that take more than 30 seconds, or if the execution of SendEventEmails itself takes longer than 30 seconds, you end up with 2 SendEventEmails jobs in the queue.

This is naturally often the case, which means that, for instance, on build.o.o we have an ever-growing queue of SendEventEmails jobs.

mdeniz commented 7 years ago

Then we first need to check if there is already a job pending for this queue and, in that case, skip the creation of another one....
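
A minimal sketch of that guard in clock.rb, assuming the delayed_job_active_record backend and that a pending SendEventEmails job can be identified by its serialized handler column (that matching condition is an assumption):

every(30.seconds, 'send notifications') do
  # Hypothetical guard: only enqueue when no SendEventEmails job is already
  # waiting, unlocked and unfailed, in the 'mailers' queue
  pending = Delayed::Job.where(queue: 'mailers', locked_at: nil, failed_at: nil)
                        .where('handler LIKE ?', '%SendEventEmails%')
                        .exists?

  SendEventEmails.new.delay(queue: 'mailers').perform unless pending
end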

hennevogel commented 7 years ago

Look at #3342, we don't really use queues. And this would be a solution to the ever-growing queue. But it would also mean a completely different behavior for SendEventEmails. Right now there is something like a constant job for it: as soon as one SendEventEmails finishes, the next one runs (because there is always one in the queue)....
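
That constant-job behavior could also be made explicit, instead of relying on clockwork to top up the queue: a job class re-enqueues itself via delayed_job's documented success hook. A sketch with a hypothetical SendEventEmailsJob class, not what OBS currently does:

class SendEventEmailsJob
  def perform
    # send mails for all currently pending events
  end

  # delayed_job hook, called after a run finished successfully:
  # enqueue the next run so there is always exactly one job waiting
  def success(job)
    Delayed::Job.enqueue(SendEventEmailsJob.new, queue: 'mailers')
  end
end

# kick off the chain once
Delayed::Job.enqueue(SendEventEmailsJob.new, queue: 'mailers')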

mdeniz commented 7 years ago

I know we were not using it and we are now going to start using it... my point was more about whether we want to have just one pending job in the queue instead of a ton of them... per job kind/queue. In this case it maybe makes sense to only have one pending job.

evanrolfe commented 7 years ago

Or we can do something like this in the SendEventEmails job, so that a subsequent job will not try to send emails for an event that is currently being processed by the previous job:

def perform
  # Load the records before the state change; otherwise the lazy relation
  # would re-run the query after update_all and return an empty set
  events = Event::Base.where(state: :unprocessed).order(:created_at).limit(1000).to_a
  Event::Base.where(id: events.map(&:id)).update_all(state: :processing)

  events.each do |event|
    # Send emails, create notifications, create log entries etc.
    event.update_attributes(state: :processed)
  end
end

hennevogel commented 7 years ago

That would be the best solution I guess. It would also make it possible to run more than one DJ worker for the mailers queue.

That also sounds like a strategy for all the other jobs.

mdeniz commented 7 years ago

If we are going to kind of "lock" records for each job, we also need to take care of failures, at least to not leave them in the 'processing' state forever. I see two possible cases:

- Could we use a timeout for that 'processing' state? That way we would retry events stuck in 'processing' after a while.

- We could also store the id of the worker / job in a field, to know who is responsible for processing that event... I don't know if this could help in any way.
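
For the first case, a sketch of a recovery sweep in clock.rb; the ten-minute interval and one-hour threshold are arbitrary, and it assumes the marking step also sets updated_at (update_all on its own does not touch timestamps):

every(10.minutes, 'recover stuck events') do
  # Events stuck in 'processing' for over an hour are assumed to belong
  # to a crashed or killed job; put them back so the next run retries them
  Event::Base.where(state: :processing)
             .where('updated_at < ?', 1.hour.ago)
             .update_all(state: :unprocessed, updated_at: Time.now)
end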

evanrolfe commented 7 years ago

> Could we use a timeout for that 'processing' state? That way we would retry events stuck in 'processing' after a while.

DelayedJob is built to handle timeouts for jobs which take too long, retries etc. I think we should use its functionality rather than re-creating it in the Events.
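
For reference, those limits are plain delayed_job worker configuration; the values here are only examples:

# e.g. in config/initializers/delayed_job_config.rb
Delayed::Worker.max_attempts = 3         # attempts before a job is marked as failed
Delayed::Worker.max_run_time = 5.minutes # a job running longer than this is stopped and retried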

We could assign event ids to each job for processing, i.e. in send_event_emails.rb:

def perform(event_ids)
  events = Event::Base.find(event_ids)

  events.each do |event|
    # Send emails, create notifications, create log entries etc.
    event.update_attributes(state: :processed)
  end
end

And then in clock.rb:

every(30.seconds, 'send notifications') do
  event_ids = Event::Base.where(state: :unprocessed).order(:created_at).limit(1000).pluck(:id)
  # Mark exactly the plucked events; re-running the relation in update_all
  # could match a different set of rows by now
  Event::Base.where(id: event_ids).update_all(state: :processing)

  SendEventEmails.new.delay(queue: 'mailers').perform(event_ids)
end

That way it's DelayedJob's responsibility to take care of stuck or failed jobs and things like that.

mdeniz commented 7 years ago

Mmmm, but then we will need a way of recovering those failed jobs. I mean, a job is retried maybe 3 times (depending on the config), but after that those ids are never retried because the job has failed.

I'd rather have jobs that just process the pending events... no matter what ids they have. That way we have to take care of failed events, not failed jobs... jobs will never fail.

We will end up with single events that, maybe because of a wrong email address, are impossible to process for a subscriber. But at least we will not have a batch of 1000 events that remains unprocessed, because the first event in the batch is impossible to process, until an admin goes there to see what happened in that job and fixes it.

In terms of fixing things, if we have single events failing then it is easier to fix what is causing the problem, instead of having to deal with all the possible problems in 1000 events before anything gets processed.
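
A sketch of that per-event error handling inside perform, assuming a hypothetical 'failed' state for events; the job itself never raises, so from DelayedJob's point of view it never fails:

def perform
  Event::Base.where(state: :unprocessed).order(:created_at).limit(1000).each do |event|
    begin
      # send emails, create notifications, create log entries etc.
      event.update_attributes(state: :processed)
    rescue StandardError => e
      # only this event is marked as failed, the rest of the batch keeps going
      event.update_attributes(state: :failed)
      Rails.logger.error("Processing event #{event.id} failed: #{e.message}")
    end
  end
end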

evanrolfe commented 7 years ago

> I'd rather have jobs that just process the pending events... no matter what ids they have. That way we have to take care of failed events, not failed jobs... jobs will never fail.

LGTM

hennevogel commented 7 years ago

There is also https://github.com/collectiveidea/delayed_job#hooks, but failure handling should not be about batches, I agree.
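
Those hooks could also hand claimed events back when a job permanently fails. A sketch for a custom job class enqueued with Delayed::Job.enqueue; the hook name is from the delayed_job README, while the event handling is an assumption and presumes only one such job runs at a time:

class SendEventEmailsJob
  def perform
    # process claimed events
  end

  # delayed_job calls this once a job has permanently failed (after
  # max_attempts); release the events this job had claimed
  def failure(job)
    Event::Base.where(state: :processing).update_all(state: :unprocessed)
  end
end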