hotkit / django-async

A simple asynchronous execution Django application with proper database transaction management
http://www.kirit.com/Django%20Async
Boost Software License 1.0

Ability to run multiple workers without assigning jobs based on id partitioning #19

Open jainpawan opened 8 years ago

jainpawan commented 8 years ago

Hi, the current implementation, with its which (worker) and outof (total workers) settings, picks jobs by partitioning the id space. If for whatever reason one of the workers is stuck processing a job, all current and future jobs assigned to that worker will stall. Is there any plan (or are there any ideas) to make this more fault tolerant, i.e. so that a worker can pick any job that is scheduled to run now, irrespective of the job's id? This would also make adding and removing workers much easier.
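To make the failure mode concrete, here is a minimal sketch of modulo-based id partitioning. It is an illustration only: the function names are hypothetical, and the exact mapping from job id to worker (id % outof == which % outof) is an assumption rather than a quote from the project's code.

```python
def partition(job_ids, which, outof):
    """Ids handled by worker `which` of `outof` under modulo partitioning.

    Assumption: a job belongs to the worker whose number matches id % outof.
    """
    return [j for j in job_ids if j % outof == which % outof]


def runnable(job_ids, outof, stuck):
    """Ids that still make progress while the workers in `stuck` are blocked."""
    done = []
    for which in range(1, outof + 1):
        if which not in stuck:
            done.extend(partition(job_ids, which, outof))
    return sorted(done)


if __name__ == "__main__":
    ids = range(1, 7)
    print(partition(ids, 1, 3))     # jobs owned by worker 1 of 3
    print(runnable(ids, 3, {2}))    # jobs that still run if worker 2 hangs
```

With three workers and worker 2 stuck, every job whose id falls in worker 2's partition waits indefinitely, even though the other two workers are idle.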

KayEss commented 8 years ago

This is indeed a weakness of the current implementation. We do have some solutions in mind, but they will likely complicate the infrastructure needed to execute the queues.

There are a few things we do that help with this. Keep jobs small -- if there is a lot of processing, use a group to manage the jobs, and progress can then be tracked across the group. Jobs that fail should just throw an exception and let the worker retry them later, i.e. don't have the job itself do any retries.
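The "raise and let the worker retry" advice might look roughly like the sketch below. Nothing here is django-async's API: TransientError, make_flaky_job and drain are hypothetical names, and the in-memory queue stands in for the real job table.

```python
from collections import deque


class TransientError(Exception):
    """Hypothetical error type for a failure that is worth retrying."""


def make_flaky_job(failures):
    """Return a job that raises `failures` times and then succeeds."""
    state = {"calls": 0}

    def job():
        state["calls"] += 1
        if state["calls"] <= failures:
            # No retry loop inside the job: fail fast and let the worker decide.
            raise TransientError("not ready yet")
        return state["calls"]

    return job


def drain(queue, max_attempts=5):
    """Hypothetical worker loop: run each job once, requeue it if it raises."""
    results = []
    while queue:
        job, attempts = queue.popleft()
        try:
            results.append(job())
        except Exception:
            if attempts + 1 < max_attempts:
                # Put the job at the back so other jobs run in the meantime.
                queue.append((job, attempts + 1))
    return results


if __name__ == "__main__":
    print(drain(deque([(make_flaky_job(2), 0)])))
```

Because the job never sleeps or loops on its own, a failing job holds the worker only for a single short attempt, which limits how long the rest of that worker's partition is blocked.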

If you are using multiple workers it's really important that you think through the implications of different isolation levels -- I'd strongly recommend using SERIALIZABLE isolation. If your system won't run cleanly with that turned on then it may imply that you're actually getting some data corruption at lower isolation levels.
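For reference, this is one way to turn on SERIALIZABLE in a Django settings file, assuming the PostgreSQL backend with psycopg2 as the driver. The database name is a placeholder, and the backend path may differ on older Django versions.

```python
# settings.py fragment (sketch). Assumes PostgreSQL with the psycopg2 driver.
import psycopg2.extensions

DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "example_db",  # placeholder database name
        "OPTIONS": {
            # Run every transaction on this connection at SERIALIZABLE.
            "isolation_level": psycopg2.extensions.ISOLATION_LEVEL_SERIALIZABLE,
        },
    },
}
```

At this level PostgreSQL may abort a transaction with a serialization failure when it conflicts with a concurrent one, so the code running the transaction has to be prepared to retry it.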

What we're looking at in the longer term is to have a process that makes use of the Postgres LISTEN/NOTIFY system to see new jobs and changes to jobs, and then use that process to launch individual jobs or batches of jobs. We run many microservices, so this would allow us to reduce latency, increase parallel execution of jobs, and do it with fewer workers overall.
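A rough sketch of what the listening side of such a process could look like with psycopg2 follows. This is not something django-async ships: the channel name async_jobs, the function names and the handle callback are all assumptions, and conn is expected to be an already open psycopg2 connection.

```python
import select


def drain_notifications(conn):
    """Collect pending (channel, payload) pairs from a psycopg2 connection."""
    conn.poll()  # read any pending input and fill conn.notifies
    events = []
    while conn.notifies:
        note = conn.notifies.pop(0)
        events.append((note.channel, note.payload))
    return events


def wait_for_jobs(conn, handle, channel="async_jobs", timeout=5.0, max_wakeups=None):
    """LISTEN on `channel` and call handle(channel, payload) for each NOTIFY.

    `channel` is a hypothetical name and must be a trusted identifier,
    since it is placed directly into the SQL statement.
    """
    conn.autocommit = True  # run LISTEN outside an open transaction
    with conn.cursor() as cur:
        cur.execute("LISTEN " + channel + ";")
    wakeups = 0
    while max_wakeups is None or wakeups < max_wakeups:
        # Block until the connection's socket is readable or the timeout expires.
        ready, _, _ = select.select([conn], [], [], timeout)
        wakeups += 1
        if ready:
            for event_channel, payload in drain_notifications(conn):
                handle(event_channel, payload)
```

The producing side would issue a NOTIFY on the same channel (for example from the code that schedules a job, or from a trigger on the job table), with the job id as the payload. The handler can then launch that job immediately instead of waiting for the next poll of the queue.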

KayEss commented 7 years ago

For other projects we've been developing a tool that would allow this, the wright-exec-helper. It multiplexes jobs using a fairly simple mechanism through printing and reading to/from stdout/stdin. The downside to it, from this project's perspective at least, is that it is native code.

The protocol is pretty simple though and it should be possible to implement something that performs the same function (albeit a bit more slowly) in Python.
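As an illustration of the general idea only, here is a small Python sketch of handing jobs to child processes over stdin and reading results back from stdout. It does not implement the wright-exec-helper protocol, whose details are not given here: the one-line-per-job format, the upper-casing "work" and all function names are made up for the example.

```python
import subprocess
import sys

# Hypothetical child: reads one job per line on stdin, prints one result line.
CHILD = r"""
import sys
for line in sys.stdin:
    print(line.strip().upper(), flush=True)
"""


def start_workers(count):
    """Launch `count` child interpreters with piped stdin and stdout."""
    return [
        subprocess.Popen(
            [sys.executable, "-c", CHILD],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            text=True,
        )
        for _ in range(count)
    ]


def dispatch(workers, jobs):
    """Send each job to the next worker round-robin and read one reply line."""
    results = []
    for index, job in enumerate(jobs):
        worker = workers[index % len(workers)]
        worker.stdin.write(job + "\n")
        worker.stdin.flush()
        results.append(worker.stdout.readline().strip())
    return results


def stop_workers(workers):
    """Close the pipes so each child sees EOF, then wait for it to exit."""
    for worker in workers:
        worker.stdin.close()
        worker.wait()
        worker.stdout.close()


if __name__ == "__main__":
    pool = start_workers(2)
    try:
        print(dispatch(pool, ["a", "b", "c"]))
    finally:
        stop_workers(pool)
```

This version waits for each reply before sending the next job, so it shows the pipe mechanics rather than real parallelism. An actual multiplexer would watch all the children's stdout at once and hand the next job to whichever child becomes free.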