Closed — itsjohncs closed this issue 10 years ago
A SO answer from the lead developer of Celery describes which backend is most appropriate for which scenarios (he definitely prefers RabbitMQ): http://stackoverflow.com/a/9176046/1989056.
I know RabbitMQ is a bit difficult to set up, so I would prefer not to use it by default, given that Galah's demands on its queue are not very high. Celery surely provides all of the features we want with just Redis (simply using the database also sounds great).
The pertinent part of the Celery documentation is at http://docs.celeryproject.org/en/latest/getting-started/first-steps-with-celery.html#application.
It looks a lot like Sisyphus. It's not sufficient to just replace Sisyphus though, since we haven't really had any problems with him and he wouldn't benefit massively from this. This needs to be able to support the build servers as well.
It supports storing the states of tasks which was a major thing I wanted to figure out how to do so that it could be exposed to the user: http://docs.celeryproject.org/en/latest/getting-started/first-steps-with-celery.html#keeping-results. This would be a lot of error-prone code if I implement it myself.
Celery can handle the sheep's consumer tasks definitely, I am now trying to figure out where the producer comes into this. I'd like to come up with a rough sketch of how the system would look with Celery at the helm.
Celery abstracts away the messaging altogether; the way we would, say, start a test request is by creating a file like:

```python
from celery import Celery

# Assumes Redis running locally as the broker (database 0).
app = Celery('tasks', broker='redis://localhost:6379/0')

@app.task
def run_test(assignment, files, etc):
    # Imported inside the task so only the build workers need this package.
    import buildserver.bla
    buildserver.bla.run_test(assignment, files, etc)
```
Then within a web worker we'd have:

```python
def upload_assignment():
    # .delay() queues the task with the broker rather than running it here.
    run_test.delay(this_assignment, their_files, etc)
    return "It's going!"
```
My worry is that this level of abstraction might be too high. Could I have different workers each with different environments and would they be able to only take tasks they can actually handle? Celery looks like it would be excellent for the types of tasks that Sisyphus deals with, but Sisyphus's code is so simple I don't need some fancy framework to make him work well.
> Could I have different workers each with different environments and would they be able to only take tasks they can actually handle?
Yes. See http://docs.celeryproject.org/en/latest/userguide/routing.html. Specifically, since my routing needs will (eventually) be complex, we can use a custom router. This routing also solves another problem I was worried about which is that I didn't want Sisyphus's tasks to hold up the testing server tasks for obvious reasons.
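Custom routing would presumably look something like this 3.x-era shape: a router object whose `route_for_task` Celery consults for each task name, returning routing options or `None` to fall through. The task and queue names here are made up for illustration, not Galah's real ones:

```python
# Hypothetical custom router keeping build-server tasks and Sisyphus's
# maintenance tasks on separate queues, so neither can hold up the other.
class GalahRouter:
    def route_for_task(self, task, args=None, kwargs=None):
        if task.startswith("buildserver."):
            # Test requests go to dedicated build workers.
            return {"queue": "build_requests"}
        if task.startswith("sisyphus."):
            return {"queue": "maintenance"}
        return None  # let Celery's default routing handle everything else
```

A router instance like this would then be registered in the routes setting so only workers consuming the matching queue pick up those tasks.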
My next question is how exactly can I see when a particular Celery task will get executed (or rather, how many tasks are in front of it) so I can expose this information to the user?
Celery provides a tool that would be very useful to system administrators and would allow straightforward introspection on the health of the server: http://docs.celeryproject.org/en/latest/userguide/monitoring.html. Lots of bonus points for that.
Celery looks like it uses a fairly sane deprecation strategy: http://docs.celeryproject.org/en/latest/reference/celery.app.task.html#celery.app.task.Task.accept_magic_kwargs. More bonus points.
Celery still gets lots of attention from its maintainer and there are many commits every day. Most of the issues that come through get closed in a timely fashion, and pull requests are dealt with as well. The project looks healthy, though the frequency of changes worries me because it makes me question its stability. The mailing list also looked healthy.
I'd say that if I want to use Celery for its features I won't be disappointed by the support I get.
> My next question is how exactly can I see when a particular Celery task will get executed (or rather, how many tasks are in front of it) so I can expose this information to the user?
You can access information on a particular task through its unique ID, which I can store alongside its submission, but it doesn't seem that we can expose how many tasks are in front of it, which is unfortunate. We can expose the state, however, which will tell the user whether it's been started or has somehow been lost; that's probably sufficient.
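Since queue position can't be exposed, the user-facing view boils down to translating the state of the stored task ID into a message. A minimal sketch (the state names are Celery's built-ins; the messages and function name are invented):

```python
# Map Celery's built-in task states (looked up later via the task id we
# store alongside the submission) to messages a student might see.
USER_FACING_STATUS = {
    "PENDING": "Your submission is waiting in the queue.",
    "STARTED": "Your submission is being tested right now.",
    "RETRY":   "Testing hit a snag and will be retried automatically.",
    "SUCCESS": "Testing finished; results are available.",
    "FAILURE": "Testing failed; you may want to resubmit.",
}

def describe_submission(state):
    # Unknown or lost states fall back to a safe default rather than erroring.
    return USER_FACING_STATUS.get(state, "Status unknown; try resubmitting.")
```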
My next question is how Celery would handle the fact that the sheep can only execute build requests when virtual machines are available?
> ... how Celery would handle the fact that the sheep can only execute build requests when virtual machines are available?
It looks like the best way to do this would be to block until the resource is available and then execute the task. Combined with proper routing to ensure that a particular worker's only job is to consume test requests in this fashion, this should provide good behavior. I could manually update the state of the task to STARTED when the resource is available, and I believe Celery will be able to handle the situation where the worker crashes in the middle of processing a request, though I'll have to look into that further (there will be a way; I'm just not sure whether it happens automatically). This seems entirely acceptable and should work out just fine.
Alternatively I can use the retry mechanism, but my worry there is that users would lose their place in the queue and the logic might get a little weird. It would be hard to predict when tasks would get executed.
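The block-until-available approach can be modeled with a semaphore standing in for the pool of virtual machines; in the real Celery task the acquire would be followed by updating the task's state to STARTED so the user sees the request leave the queue. All names here are illustrative:

```python
import threading

# A semaphore models the pool of VMs; a worker's consume loop blocks here
# until some other request finishes and releases a machine.
vm_pool = threading.Semaphore(2)  # pretend we have two VMs

def run_test_when_vm_free(test_request, results):
    vm_pool.acquire()  # blocks until a VM is free
    try:
        # ... this is where we'd mark the task STARTED and hand the
        # request to the VM; here we just record that it ran ...
        results.append(test_request)
    finally:
        vm_pool.release()  # always return the VM to the pool
```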
Celery looks like a good choice, but the golden question now is: will it still be a good choice a year from now? Things to consider are:
I am now embarking on a search for complaints about Celery and alternatives that have sprung up. I may make a reddit post to get lots of zealot-type answers from users of other tools. I need differing perspectives here.
The frequency of API changes is very bad given my long-term plans for Galah. One redeeming point is that the maintainer deprecates things properly, so your code doesn't actually break for a long time; the sane deprecation strategy makes this almost OK, but it still doesn't sit well with me and I'm still very worried. This might be a deal killer.
The complexity of the codebase is bad for obvious reasons. The presentation here is amusing and touches on this. Given the quality of support that has been provided to others, though, this isn't necessarily terrible. Projects unfortunately gain complexity over time, and you rarely see a stable project without quite a bit of complexity in it. In fact, complexity can often be attributed to the number of bug fixes and edge cases it's capable of handling, imho.
There are some alternatives listed at http://seeknuance.com/2012/08/14/alternatives-to-using-celery/. Gearman looks like the most promising of them. A big problem I have looking at Gearman is it implements its own queueing solution rather than using an established backend. This could be fine if Gearman has a solid pedigree though.
I will now be investigating Gearman and keeping an eye out for criticism of Celery.
One of the presentations linked above mentions another alternative called RQ. I have disqualified it already but want to document why.
Looking at the development history, it's clear development is slow: huge month-long gaps. This would be OK if the current version weren't 0.3.13. There are also many open issues in the tracker that should be getting attention but clearly aren't. This is a blocker for me; I won't take on any dependency we can't expect excellent support from, given Galah's needs.
> Stability of the project and its dependencies
Celery has been around for a pretty long time and is the de-facto solution for async task/job queues. It's used in production by some pretty big names - Instagram in particular, and is sponsored by Rackspace. It's pretty safe to say that it's not going to go anywhere soon.
Regardless, I've used Celery in projects which are running in production now w/ RabbitMQ and IronMQ transports and can vouch for its stability & ease of use. API does tend to change frequently, and a few neat features have been added from 3.0.x to 3.1.x. I've yet to have existing code break in a major way from bumping up versions though.
> Debugging potential for system administrators inspecting the system
If you are using RabbitMQ as a transport, https://github.com/mher/flower is amazing to see what's going on with your workers.
Thank you @jhgg for the perspective! I think Celery is probably going to be the way to go if I want to use a library like this. I'm now investigating whether it would be better to simply interact with Redis directly rather than using any kind of library, which was my original plan before I ran across Celery. Galah needs stability more than it needs nice features.
Flower looks pretty awesome and I'll definitely be asking the system administrators at UCR what transport they would prefer given the tooling available for them.
So I guess what I want to do now is list the features of Celery that I'd like to use, and figure out the cost of developing those features myself in terms of code complexity and time.
If you're looking for a more feature-slim version using Redis exclusively as a transport, check out https://github.com/nvie/rq.
Thanks for mentioning RQ. I saw that but certain things about it worried me. I don't think RQ would provide a lot of benefit over using Redis directly either as far as features go.
Ah, whoops. Kind of scrolled over that. The codebase for RQ is much less magic than celery, which is layers on layers of piled abstractions (kind of needed though, given the multitudes of transports they support). Either way, I've been using Celery for a few years now and it hasn't gone wrong on me yet.
> Ah, whoops. Kind of scrolled over that.
Haha, no problem, there's a lot of comments in this thread, and nearly all of them are just me talking to myself :flushed:.
I definitely like the strong user base that Celery has. I'm actually surprised there aren't more contributors given that. The mountain of abstractions is very worrisome though. I imagine I could only figure out how Celery is doing things by inspecting whatever transport it's using directly. This will be fine if I want some more advanced features from it, but I imagine that using Redis directly could be extremely straightforward. Doing some good ole' fashioned pencil-and-paper diagrams right now to figure out how using Redis directly would look.
I have described, in significant detail, what features I want from the transport in b57a59bae3276dc20ddcd019b3cd4db1a3ff4c53. I speak specifically of Redis in that document but I'm still not positive one way or the other. The specific, potentially hard-to-implement features I want are:
My thoughts on implementing each of these myself on top of Redis: retries could be tracked with a `retry_count` field on requests and tasks. It would all require significant code to be written though, specifically a lot of careful task-handling code (written as Redis scripts) so tasks aren't dropped on the floor.
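The "not dropped on the floor" handling corresponds to Redis's reliable-queue pattern: instead of popping a task outright (where a worker crash loses it), the task is atomically moved to a per-worker processing list (RPOPLPUSH, or a Redis script) and only deleted once the work is done. A pure-Python model of that idea, with invented names, just to gauge the complexity:

```python
from collections import deque

class ReliableQueue:
    """In-memory model of Redis's reliable-queue (RPOPLPUSH) pattern."""

    def __init__(self):
        self.pending = deque()
        self.processing = {}  # worker id -> the task it is working on

    def push(self, task):
        self.pending.appendleft(task)

    def claim(self, worker_id):
        # In Redis this pop-and-store is a single atomic RPOPLPUSH command.
        task = self.pending.pop() if self.pending else None
        if task is not None:
            self.processing[worker_id] = task
        return task

    def ack(self, worker_id):
        # Task finished successfully; drop it from the processing list.
        self.processing.pop(worker_id, None)

    def requeue_dead(self, worker_id):
        # A reaper puts a crashed worker's task back at the front of the queue.
        task = self.processing.pop(worker_id, None)
        if task is not None:
            self.pending.append(task)
```

The logic itself is small; the real cost would be writing and testing the Redis scripts that make `claim` and `requeue_dead` atomic.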
So it's clear that using Celery will make initial development faster. But will it give me more troubles down the road and at some point will I be trying to move away from it just like how I'm moving away from ZeroMQ now (which was a heavy abstraction atop sockets, just like this is a heavy abstraction atop Redis)?
If Redis crashes (or, more likely, the machine it's on disappears), how will Celery respond?
> If Redis crashes (or more likely the machine it's on disappears), how will Celery respond?
This doesn't seem to be well documented which is very worrisome.
A Mozilla dev mentions that their tasks are idempotent (can be run multiple times without anything bad happening) which is something I probably want to make sure is true no matter what.
At the end-user interface, I need to assume tasks can fail or get lost anyway, so if Celery loses a task to some horrible failure, the end user should still be presented with acceptable options (probably retrying the task).
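One cheap way to get the idempotence the Mozilla dev describes is to key results by submission ID so a re-run of the same request is a no-op. A sketch with invented names (the `run_tests` callable stands in for the real test runner):

```python
# Results keyed by submission id: running the same test request twice
# (after a retry, or a lost-and-resent message) records nothing new.
results_store = {}

def run_test_idempotently(submission_id, run_tests):
    if submission_id in results_store:
        return results_store[submission_id]  # already done; re-run is a no-op
    outcome = run_tests(submission_id)
    results_store[submission_id] = outcome
    return outcome
```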
This question hits on my still-very-large concern that Celery's abstractions will make it hard to know exactly what's going on under the hood. If I use Redis directly there will be very few mysteries in Galah, and the hit to the amount of code won't be massive, just moderate. Therefore the number of additional bugs from misimplementing features that Celery would have provided bug-free is probably also just moderate...
I don't think Celery's support of multiple queuing backends will end up being all that useful. It would be nice to use a single database rather than Redis, but Celery's support for database backends is experimental anyway, so I probably wouldn't want to use that code. Something I've been neglecting to consider is simply using MongoDB for my queuing needs. It's capable of doing everything Redis can do and more (PostgreSQL is the same way). This would cut out the dependency altogether. The only downside would be performance, and that's not really a big deal given that our needs are simple.
Using our database as a queuing solution looks like a solid idea. It won't be any more complicated than Redis (it is in fact more powerful than Redis in most ways and is only missing the publisher/subscriber feature) as far as I can tell, and system administrators won't have to deal with yet another system. This could be really nice.
Because of the flexibility of the model we can of course offer Redis as an option in the future if it is desired without changing anything (or at least without changing very much) outside of the core.
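The crux of queuing on the database is an atomic claim: "find an unclaimed task and mark it mine" must happen as one operation (MongoDB's findAndModify does this) so two workers can never grab the same task. A pure-Python model of that operation, with a plain list standing in for the collection and all names invented:

```python
# Each task is a document; a worker claims work by atomically flipping an
# unclaimed document to claimed. In MongoDB the find and the update below
# would be a single findAndModify operation; here they are only simulated.
tasks = []

def claim_next_task(worker_id):
    for doc in tasks:
        if doc["claimed_by"] is None:  # atomic in the database, simulated here
            doc["claimed_by"] = worker_id
            return doc
    return None  # nothing left to claim
```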
An article describing how someone replaced RabbitMQ with MongoDB: https://blog.serverdensity.com/replacing-rabbitmq-with-mongodb/
Victor Hill (sysadmin at UCR) didn't have a particular preference between Celery and Redis, but he did express dislike towards using a database for queuing.
Victor also mentioned that he'd prefer to use his own tool rather than flower or another Celery monitor to visualize the data. Celery's command line tool would still be useful, but it's not like that's all that hard to implement ourselves.
I think I'd like to use Redis directly rather than use Celery. My needs are not very complicated, and the most powerful thing I'd get from Celery is the rate limiting, but it's not like I can't implement that adequately myself. This will require more preliminary work, but I believe we will end up with fewer bugs in the future, and the bugs we do have will (I expect) be easier to debug because there's less abstraction.
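The rate limiting I'd be rebuilding can be as simple as a token bucket: tokens refill at a fixed rate up to a cap, and a task may start only if a token is available. A minimal sketch (the class and its parameters are my own, not anything from Celery):

```python
import time

class TokenBucket:
    """Tokens refill at `rate` per second up to `capacity`; one token per task."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.clock = clock  # injectable for testing
        self.last = clock()

    def try_acquire(self):
        # Top up tokens earned since the last call, then try to spend one.
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A worker loop would call `try_acquire()` before starting a test request and requeue (or sleep) when it returns False.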
This was a hard choice and I reserve the right to back out of it if, when we start implementing things in Redis, it's clear that things are getting too complicated. Closing this issue for now though.
Celery would supposedly be capable of abstracting away the queuing backend while providing a lot of features that we would probably mess up if we tried to do them ourselves. Celery supports RabbitMQ and Redis (along with many databases but in a limited capacity).