python-arq / arq

Fast job queuing and RPC in python with asyncio and redis.
https://arq-docs.helpmanual.io/
MIT License

Enqueued jobs report "not found" until a worker sees them #342

Open theunkn0wn1 opened 2 years ago

theunkn0wn1 commented 2 years ago

In my application, the arq workers are not always online (e.g. down for maintenance, network issues).

When a job is enqueued to an arq task queue, it yields a task ID. If no workers are observing that task queue, asking arq for the status of the job produces the value "not found". This status is wrong, since the job has been enqueued. The moment a worker reads the job from the queue, the job ID resolves into the "queued" state as expected.

Expected behavior: enqueued jobs should be in either the "deferred" or "queued" state and read as such.

Actual behavior: jobs only have a status after an arq worker for the queue sees the enqueued job.

samuelcolvin commented 2 years ago

That sounds weird, redis already knows about the job obviously. PR welcome to fix this.

If you can't work out how to fix this, could you create a minimal example to demonstrate the problem?

theunkn0wn1 commented 2 years ago

Managed to reproduce in isolation. Here is a gist with everything needed short of a redis server.

https://gist.github.com/theunkn0wn1/a237cc816ec15a5a053bab11780c0bb4

Steps to reproduce:

  1. Run arq_workspace.launcher_client and record the job token:

    /home/orion/.cache/pypoetry/virtualenvs/arq-workspace-Flug7Sf2-py3.10/bin/python -m arq_workspace.launcher_client 
    connecting to redis...
    spawning job...
    job spawned. your ID is 'dc17598c7b5e43a2a34d5b50ec9dbee2'
    your job token is my_agent:dc17598c7b5e43a2a34d5b50ec9dbee2

  2. Run arq_workspace.status_client and plug in the job token:

    /home/orion/.cache/pypoetry/virtualenvs/arq-workspace-Flug7Sf2-py3.10/bin/python -m arq_workspace.status_client 
    Enter job id: my_agent:dc17598c7b5e43a2a34d5b50ec9dbee2
    connecting to redis...
    host='my_agent'; jid:
    spawning job...
    job not complete: not_found, sleeping before continuing...
  3. Observe that the status client reports "not found" for a valid job ID and queue, created in step 1.

  4. Launch arq_workspace.agent and note that the agent picks up the item. Also note that the status client now reports the task as completed:

    /home/orion/.cache/pypoetry/virtualenvs/arq-workspace-Flug7Sf2-py3.10/bin/python -m arq_workspace.status_client 
    /home/orion/.cache/pypoetry/virtualenvs/arq-workspace-Flug7Sf2-py3.10/bin/python -m arq_workspace.agent 
    Starting worker...
    task_custom_add(x=4, y=6)
    ...
    job not complete: not_found, sleeping before continuing...
    JobStatus.complete

theunkn0wn1 commented 1 year ago

As an update to this, arq's task status reporting capability is entirely unreliable.

I have now observed it reporting that jobs don't exist even though they are actively executing and earlier requests for the same job ID reported them as running. Something in arq's task status reporting is horribly buggy.

I will need to implement my own thing to work around this bug.

euri10 commented 1 year ago

Just my 2 cents on this: have you tried the same gist without a custom queue name, @theunkn0wn1? I've never dug deeply into it for lack of time, but every time I tried custom queue names I ran into issues (see https://github.com/samuelcolvin/arq/issues/348 for instance), and reverting to the defaults proved more reliable.
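
To illustrate how a queue-name mismatch could produce exactly the symptom in the reproduction (not_found until a worker runs the job, then complete), here is a simplified in-memory model. It is not arq's code and nothing in this thread confirms it as the cause. It assumes that the queued/deferred check looks the job ID up under one specific queue name, while the in-progress and result checks are keyed by job ID alone. `FakeRedis`, `enqueue`, `status` and `worker_run` are hypothetical names made up for this sketch; "arq:queue" is used as the default queue name.

```python
import time
from enum import Enum


class JobStatus(str, Enum):
    deferred = "deferred"
    queued = "queued"
    in_progress = "in_progress"
    complete = "complete"
    not_found = "not_found"


class FakeRedis:
    """Hypothetical in-memory stand-in for the keys involved (not a real client)."""

    def __init__(self):
        self.queues = {}          # queue name -> {job_id: run-at timestamp in ms}
        self.in_progress = set()  # job IDs a worker is currently executing
        self.results = set()      # job IDs that have a stored result


def now_ms():
    return int(time.time() * 1000)


def enqueue(store, job_id, queue_name, defer_ms=0):
    # The job is stored under the queue it was enqueued on.
    store.queues.setdefault(queue_name, {})[job_id] = now_ms() + defer_ms


def status(store, job_id, queue_name="arq:queue"):
    # Assumption: result and in-progress markers are keyed by job ID only...
    if job_id in store.results:
        return JobStatus.complete
    if job_id in store.in_progress:
        return JobStatus.in_progress
    # ...but the queued/deferred lookup depends on the queue name.
    score = store.queues.get(queue_name, {}).get(job_id)
    if score is None:
        return JobStatus.not_found
    return JobStatus.deferred if score > now_ms() else JobStatus.queued


def worker_run(store, job_id, queue_name):
    # A worker on the right queue picks the job up and stores a result.
    store.in_progress.add(job_id)
    del store.queues[queue_name][job_id]
    store.in_progress.discard(job_id)
    store.results.add(job_id)


store = FakeRedis()
enqueue(store, "dc17", "my_agent")

before_default = status(store, "dc17")             # looks in "arq:queue": not_found
before_custom = status(store, "dc17", "my_agent")  # looks in "my_agent": queued

worker_run(store, "dc17", "my_agent")
after_default = status(store, "dc17")              # result exists: complete
```

If this is what is happening, the status client would need to pass the same custom queue name when it builds its job handle. In arq that would be the `_queue_name` argument of `arq.jobs.Job`, but whether the gist omits it is not something this thread shows.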