seedofjoy / darq

Async task manager with Celery-like features. Fork of arq.
http://darq.readthedocs.io
MIT License

Queue stuck #423

Open cyberbudy opened 2 years ago

cyberbudy commented 2 years ago

I have a problem from time to time where the queue gets stuck on some message without any progress, until most of the messages are invalidated by timeout. Restarting, or adding/removing clients manually, doesn't help. The strange thing is that CPU/RAM usage stays the same. Maybe there is some kind of limitation when there are too many records in the queue? Could anyone at least help me figure out how to debug the problem? Messages are not being processed and the logger has no new records either, only entries like these from time to time:

redis_version=6.2.4 mem_usage=3.23G clients_connected=536 db_keys=6685280
recording health: Mar-31 10:04:29 j_complete=755 j_failed=1 j_retried=0 j_ongoing=0 queued=511

P.S. thanks for a great project. Have not seen anything better for asyncio

seedofjoy commented 2 years ago

Hello! Can you please describe how many instances are running and which command you use to start the workers? It would also be great if you could share your Darq args & kwargs (app = Darq(...)).

Maybe there is some kind of limitation when too many records in queue

We use Darq in a big project with a lot of workers and thousands of completed tasks per day, so I don't think that's the problem.

You say the queue is stuck. By default a worker processes 10 tasks concurrently (the max_jobs param). But if, for example, one of your tasks blocks the event loop, it will block the entire worker; in that case even the timeout will not fire. But there are no "ongoing" tasks in your log, so the problem is probably something else. Are you sure the queued tasks are not "deferred" (waiting for a specified time to start)? You add tasks to the queue with .delay(), not .apply_async(defer_by=...), right?
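To illustrate the event-loop blocking scenario described above, here is a minimal stdlib-only sketch (no darq involved): a coroutine that calls time.sleep() freezes the whole loop for its duration, so ten such tasks run strictly one after another, while tasks that await asyncio.sleep() overlap.

```python
import asyncio
import time

async def good_task():
    # Yields to the event loop; other tasks keep running meanwhile.
    await asyncio.sleep(0.1)

async def bad_task():
    # time.sleep() never yields: the whole worker process is frozen
    # for these 0.1s, and timeouts cannot fire either.
    time.sleep(0.1)

async def main():
    start = time.monotonic()
    # Ten "good" tasks overlap and finish in roughly 0.1s total...
    await asyncio.gather(*(good_task() for _ in range(10)))
    concurrent = time.monotonic() - start

    start = time.monotonic()
    # ...while ten "bad" tasks run back to back: roughly 1s total.
    await asyncio.gather(*(bad_task() for _ in range(10)))
    serial = time.monotonic() - start
    return concurrent, serial

concurrent, serial = asyncio.run(main())
print(f"concurrent={concurrent:.2f}s serial={serial:.2f}s")
```

This is why a single CPU-bound or synchronously-blocking task can stall an entire worker regardless of max_jobs.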

Also, there is a small chance of a bug in aioredis with Redis 6.x, because aioredis 1.3.x was not tested against Redis 6.x.

cyberbudy commented 2 years ago

I'm running 3 darq instances with 5 to 10 thousand small tasks per day (IO-bound communication). I have 3 tasks.

This is my darq setup

import darq

# `settings`, `startup` and `shutdown` come from the application itself
darq = darq.Darq(
    redis_settings=darq.RedisSettings(
        host=settings.REDIS_HOST,
        port=settings.REDIS_PORT,
        database=settings.REDIS_DB,
    ),
    on_startup=startup,
    on_shutdown=shutdown,
    keep_result=0,
    max_jobs=100,
    queue_name='queue',
    job_timeout=3600,
)

# Task example
@darq.task(queue='queue')
async def send_message_task(message_id: str):
    pass

# And is called always as
await send_message_task.delay(message['id'])

Yes, I thought it might be a blocking issue, but I'd expect at least a log record about a new message after a restart. Also, because the redis and darq instances use the same amount of resources when "stuck", it's probably not that.

cyberbudy commented 2 years ago

I see aioredis released a new 2.0 version with a backwards-incompatible API. Are there any plans to upgrade to the new version?

cyberbudy commented 2 years ago

https://github.com/samuelcolvin/arq/pull/258

The newest version of arq supports aioredis 2.0 by using redis-py :)

seedofjoy commented 2 years ago

I'm sorry to be late with the reply.

Yes, Darq will get support for the new redis-py. Until recently, in my personal opinion, aioredis 2.x was not yet production-ready.

Speaking about your issue: are you still facing the problem? It seems your tasks are not blocking the loop, because the health check is working and "ongoing" tasks = 0.

After your worker gets stuck, can you check some darq keys in Redis? For example, are there any arq:in-progress:* keys?

Also try setting queue_read_limit using the formula "number of workers" * "max_jobs" (at least). In your case, I think, you can try queue_read_limit=300. This can improve overall performance.
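The sizing rule above can be written out as plain arithmetic (the helper name is only for illustration and not part of darq's API):

```python
def suggested_queue_read_limit(num_workers: int, max_jobs: int) -> int:
    """Lower bound for queue_read_limit: a single queue read should be
    able to fill every worker's max_jobs slots at once."""
    return num_workers * max_jobs

# 3 darq instances x max_jobs=100, as in this thread:
print(suggested_queue_read_limit(3, 100))  # 300
```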

cyberbudy commented 2 years ago

Glad to hear that. It's great to have such a project. In about a year of using darq I've encountered this situation twice. Thanks for the advice, I'll check it next time.

cyberbudy commented 2 weeks ago

@seedofjoy I've found the problem. When there are a lot of tasks in arq:in-progress:*, new workers cannot start.
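One way to confirm this kind of pile-up is to count the arq:in-progress:* keys. A minimal sketch of the matching logic, using a plain dict to stand in for the Redis keyspace (against a live instance you would instead iterate something like redis-py's scan_iter with a match pattern, to avoid blocking the server with KEYS):

```python
import fnmatch

# A dict standing in for the Redis keyspace (illustrative data only).
keyspace = {
    "arq:in-progress:1f2e3d": b"1",
    "arq:in-progress:4a5b6c": b"1",
    "arq:result:someid": b"...",
    "arq:queue": b"...",
}

in_progress = [k for k in keyspace
               if fnmatch.fnmatch(k, "arq:in-progress:*")]
print(len(in_progress))  # 2
```

If that count keeps growing while j_ongoing stays at 0 in the health log, the in-progress keys are stale and are what prevents new workers from picking up work.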