pricingassistant / mrq

Mr. Queue - A distributed worker task queue in Python using Redis & gevent
MIT License

Why mongo? #99

DataGreed closed this issue 9 years ago

DataGreed commented 9 years ago

Not really an issue, I'm just wondering: why does mrq use MongoDB?

For example, Celery and python-rq use a single broker (Redis or RabbitMQ) with no additional database.

I believe there is a pretty good reason for using two storages, so I wanted to ask about it :)

Thanks in advance.

sylvinus commented 9 years ago

Hi @DataGreed,

This is a legitimate question that we will cover properly in some slides about mrq's design for the 1.0 launch.

Mongo is best suited to storing and indexing documents, so that's where we store the metadata of the tasks. Redis works great as a shared data structure server, and its queue and sorted set structures are a good fit for the semantics we need for a task queue. MongoDB would do a poor job at that, and Redis would do a poor job at storing document metadata (which is what RQ does, and it makes its dashboard much less useful).

Some message brokers like RabbitMQ do both the queueing and the metadata storage, but they offer very poor visibility into and control over what's queued (see Celery's nightmarish implementation of cancels & retries).
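To make the "sorted sets" remark concrete, here is a minimal in-memory sketch of sorted-set semantics: unique members, each with a numeric score, retrievable in score order. Treating the score as a "run at" timestamp for delayed tasks is an assumption made for illustration; the thread does not say how MRQ uses sorted sets. The class `SortedSetStandIn` and its methods are hypothetical names, not a Redis or MRQ API.

```python
class SortedSetStandIn:
    """Hypothetical in-memory stand-in for a sorted set (member -> score)."""

    def __init__(self):
        self.scores = {}

    def add(self, member, score):
        # Members are unique: adding an existing member just updates its score.
        self.scores[member] = score

    def pop_due(self, max_score):
        # Remove and return every member whose score is <= max_score,
        # ordered by score (lowest first).
        due = sorted((s, m) for m, s in self.scores.items() if s <= max_score)
        for _, member in due:
            del self.scores[member]
        return [member for _, member in due]


# Assumed usage: scores are "run at" times in seconds.
delayed = SortedSetStandIn()
delayed.add("task-a", 30)
delayed.add("task-b", 10)
delayed.add("task-c", 50)
delayed.add("task-a", 5)   # rescheduling task-a replaces its old score

ready = delayed.pop_due(20)
```

A plain list could not express "everything due by now, in order, without duplicates" this cheaply, which is one reason a structure like this suits scheduling.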

DataGreed commented 8 years ago

@sylvinus wow, that's an interesting approach, thanks for the answer!

By the way, did you consider using Redis hashes for the tasks, or is the problem that you need more complex queries to get a smooth dashboard experience?

And one more question: is mrq stable enough for production use under high load? I've read some frightening articles about MongoDB locking under heavy load.

sylvinus commented 8 years ago

@DataGreed, RQ does store task metadata in Redis hashes, but they are not indexed, so they are not useful for a dashboard that has to manage thousands or millions of tasks.
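The indexing point can be illustrated with a small in-memory sketch. It compares a store keyed only by task id, where a "show me all failed tasks" query has to scan everything, with the same store plus a secondary index on status. Both classes are hypothetical stand-ins written for this example; they are not RQ, Redis, MongoDB, or MRQ code.

```python
class UnindexedStore:
    """Stand-in for per-task records that are reachable only by task id."""

    def __init__(self):
        self.tasks = {}

    def save(self, task_id, fields):
        self.tasks[task_id] = dict(fields)

    def find_by_status(self, status):
        # No secondary index: every stored task has to be inspected.
        scanned = 0
        matches = []
        for task_id, fields in self.tasks.items():
            scanned += 1
            if fields["status"] == status:
                matches.append(task_id)
        return matches, scanned


class IndexedStore(UnindexedStore):
    """Same records, plus a secondary index from status to task ids."""

    def __init__(self):
        super().__init__()
        self.by_status = {}

    def save(self, task_id, fields):
        old = self.tasks.get(task_id)
        if old is not None:
            # Keep the index consistent when a task changes status.
            self.by_status[old["status"]].discard(task_id)
        super().save(task_id, fields)
        self.by_status.setdefault(fields["status"], set()).add(task_id)

    def find_by_status(self, status):
        # Only the matching ids are touched.
        matches = sorted(self.by_status.get(status, ()))
        return matches, len(matches)


plain, indexed = UnindexedStore(), IndexedStore()
for i in range(1000):
    fields = {"status": "failed" if i % 100 == 0 else "success"}
    plain.save(i, fields)
    indexed.save(i, fields)

plain_matches, plain_scanned = plain.find_by_status("failed")
indexed_matches, indexed_touched = indexed.find_by_status("failed")
```

Both queries return the same ten task ids, but the unindexed store inspects all 1000 records to find them while the indexed one touches only the ten matches. That gap grows with the number of stored tasks.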

MRQ is now stable for high workloads (we use it to run several million tasks a day). MongoDB does scale very well if you understand its specifics and there are countless articles on the web on that :)

DataGreed commented 8 years ago

Thanks! Will try

DataGreed commented 8 years ago

The dashboard is really brilliant!

DataGreed commented 8 years ago

@sylvinus may I ask some more questions about architecture and scaling? :) I was wondering how often mrq uses Mongo. As I understand it so far, mrq uses Redis as pub/sub to notify workers about new tasks, which they get from Redis-based queues. Whenever a task is created, it hits Mongo. It hits Mongo again when the worker starts executing the task, and a third time when the task succeeds or fails, to update the execution time and the result or stack trace.

So Mongo is hit at least 3 times per task for inserts/updates? By the way, can I delete completed tasks that are older than 1 day (for example) without breaking anything?

sylvinus commented 8 years ago

Thanks @DataGreed :)

When no failure happens on a regular queue, there are indeed 3 interactions with MongoDB: insert the task in status "queued", update its status to "started", update its status to "success". MongoDB can handle hundreds of writes / second even on modest configs so that should get you pretty far even without sharding.
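The three interactions can be sketched with in-memory stand-ins that count writes. This is only an illustration of the sequence described above, not MRQ's code: `CountingCollection`, `enqueue`, and `work_one` are hypothetical names, the `deque` stands in for a Redis queue, and the idea that the queue carries only the task id is an assumption. The failure path is deliberately left out.

```python
from collections import deque


class CountingCollection:
    """In-memory stand-in for a MongoDB collection that counts its writes."""

    def __init__(self):
        self.docs = {}
        self.writes = 0

    def insert(self, task_id, doc):
        self.docs[task_id] = dict(doc)
        self.writes += 1

    def update(self, task_id, changes):
        self.docs[task_id].update(changes)
        self.writes += 1


def enqueue(collection, queue, task_id, params):
    # Write 1: insert the task document in status "queued".
    collection.insert(task_id, {"params": params, "status": "queued"})
    # Assumption: the queue only needs to carry the task id.
    queue.append(task_id)


def work_one(collection, queue, func):
    task_id = queue.popleft()
    # Write 2: mark the task as started.
    collection.update(task_id, {"status": "started"})
    result = func(**collection.docs[task_id]["params"])
    # Write 3: mark the task as successful and store its result.
    collection.update(task_id, {"status": "success", "result": result})
    return task_id


jobs = CountingCollection()
queue = deque()
enqueue(jobs, queue, "t1", {"a": 2, "b": 3})
work_one(jobs, queue, lambda a, b: a + b)
```

After one task runs to completion, `jobs.writes` is 3 and the stored document has status "success", which matches the count given above for the no-failure case.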

We have recently added a mode where tasks that don't fail don't use MongoDB at all. Useful when you don't care about tracking which ones are started (see https://github.com/pricingassistant/mrq/blob/master/tests/test_raw.py#L283), and then your only bottleneck is Redis which has crazy fast writes.

Tasks have a result_ttl attribute (with a default setting in the config https://github.com/pricingassistant/mrq/blob/master/mrq/config.py#L155), it's the number of hours after which successful jobs are cleaned from Mongo.
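As a rough illustration of what a result TTL means for stored tasks, the sketch below removes successful tasks whose finish time is older than the TTL and leaves everything else alone. How MRQ actually performs this cleanup is not shown in the thread; `purge_expired` and the `finished_at` field are hypothetical, and the TTL is passed as a `timedelta` so the unit stays explicit.

```python
from datetime import datetime, timedelta


def purge_expired(tasks, now, result_ttl):
    """Delete successful tasks that finished more than result_ttl ago.

    tasks maps task_id -> {"status": str, "finished_at": datetime}.
    Returns the list of removed task ids.
    """
    expired = [
        task_id
        for task_id, doc in tasks.items()
        if doc["status"] == "success" and now - doc["finished_at"] > result_ttl
    ]
    for task_id in expired:
        del tasks[task_id]
    return expired


now = datetime(2016, 1, 10, 12, 0)
tasks = {
    "old_ok": {"status": "success", "finished_at": now - timedelta(days=2)},
    "new_ok": {"status": "success", "finished_at": now - timedelta(hours=1)},
    "old_failed": {"status": "failed", "finished_at": now - timedelta(days=2)},
}

removed = purge_expired(tasks, now, timedelta(days=1))
```

With a one-day TTL only `old_ok` is removed: the recent success is still within its TTL, and the failed task is kept regardless of age so it stays visible for inspection.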

DataGreed commented 8 years ago

Wow, @sylvinus thanks for the answer!

"We have recently added a mode where tasks that don't fail don't use MongoDB at all."

That sounds even better!

JokerQyou commented 8 years ago

Maybe it's a little off-topic, but could I use mrq without installing MongoDB at all? I'm currently looking for a lightweight task queue system that only handles network operations (sending HTTP requests). All the tasks are one-time operations; any failed operation will be retried immediately several times and then thrown away. The test server has very limited resources, so I'm asking whether I could use Redis only. Thanks.

sylvinus commented 8 years ago

Hi @JokerQyou.

We use MongoDB as a data store for the task metadata and for making the tasks visible and filterable in the dashboard, so it's required for MRQ to work properly, sorry!