Unreliable delayed jobs when using with bull

zlace0x commented 2 years ago

While using Upstash with bull to process delayed jobs, we hit a bug where job.data goes missing. It happens randomly, for 75% of all jobs. We were unable to find a root cause or a matching issue on the bull side. The timestamps and data sections in Taskforce also take a while to show up in the UI.

Relevant code:

 await chargeQueue.process(async (job) => {
    const { contract_address, work_payload } = job.data;
    if (!contract_address || !work_payload) {
      // Re-fetch the job so the error shows what is actually stored for it
      const jobData = await chargeQueue.getJob(job.id);
      throw new Error(
        "contract_address or work_payload cannot be null!" +
          JSON.stringify(jobData?.data)
      );
    }
  });

Versions: Bull 4.1.1 & Bull 4.2.0

Switching to a local Redis or Redis Labs fixes this without any code changes.

Possible causes:

1) bull job data not propagating in time?

2) upstash vs redis set timestamp keys stored differently?

mdogan commented 2 years ago

Hey @zlace0x, thanks for the report. I'm able to reproduce it. It's related to tiered storage at Upstash and illegal(?) usage of EVALSHA in the bull library.

First, a bit of background. Upstash uses tiered storage (Multi Tier Storage), which keeps only hot entries in memory. After some time, idle keys are evicted from memory. In this case, the delayed job keys are the ones being evicted.

The bull library uses EVAL / EVALSHA to insert, get, and update multiple keys and data structures. Normally one should explicitly pass the specific keys used in the Lua script to the EVAL / EVALSHA commands. See the following statement from https://redis.io/commands/eval:

All Redis commands must be analyzed before execution to determine which keys the command will operate on. In order for this to be true for EVAL, keys must be passed explicitly. This is useful in many ways, but especially to make sure Redis Cluster can forward your request to the appropriate cluster node. Note this rule is not enforced in order to provide the user with opportunities to abuse the Redis single instance configuration, at the cost of writing scripts not compatible with Redis Cluster.
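To make the quoted rule concrete, here is a minimal sketch contrasting a call that declares its key with one that builds the key name inside Lua. It assumes an ioredis-style client whose `eval(script, numKeys, ...keysAndArgs)` method mirrors the Redis EVAL command. The function names and the `prefix`/`jobId` arguments are made up for illustration and are not taken from bull.

```javascript
// Touches only the key passed via KEYS[]: the server can tell which key
// the script will use before running it.
const DECLARED_SCRIPT = `
return redis.call("HSET", KEYS[1], ARGV[1], ARGV[2])
`;

// Builds the key name from arguments: the server cannot know which key
// will be touched until the script is already executing.
const UNDECLARED_SCRIPT = `
local key = ARGV[1] .. ARGV[2]
return redis.call("HSET", key, ARGV[3], ARGV[4])
`;

// Hypothetical helper: numKeys is 1 and the key is passed explicitly.
function hsetDeclared(client, key, field, value) {
  return client.eval(DECLARED_SCRIPT, 1, key, field, value);
}

// Hypothetical helper: numKeys is 0, so everything is a plain argument.
function hsetUndeclared(client, prefix, jobId, field, value) {
  return client.eval(UNDECLARED_SCRIPT, 0, prefix, jobId, field, value);
}
```

Both calls behave the same on a standalone OSS Redis. Only the first one tells the server up front which key is involved.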

With OSS Redis, EVAL works in standalone (non-clustered) mode even when keys are not passed. For Upstash this is an illegal case: whether the database is replicated (multizone or global) or standalone, Upstash always uses tiered storage and loads keys into memory according to the keys given to the command. When a script run by EVAL accesses a key that was not passed explicitly, the current Upstash implementation cannot load that key if it is in cold storage.

So, in your case, bull tries to update a HASH belonging to a job from a Lua script without passing the job's key. bull does this because it fetches the related keys dynamically within the same script. See this script: https://github.com/OptimalBits/bull/blob/master/lib/commands/updateDelaySet-6.lua. It queries the jobs and then calls HSET .. for each jobId.
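The failure mode described above can be illustrated with a toy model. This is only a sketch under loud assumptions: the `TieredStore` class, its methods, the key names, and the sample job data are all hypothetical, and it is not how Upstash or bull is implemented. It only shows why a key discovered inside a script is missed when it sits in cold storage.

```javascript
// Toy model only: NOT the real Upstash implementation.
class TieredStore {
  constructor() {
    this.hot = new Map();  // entries currently in memory
    this.cold = new Map(); // entries evicted to slower storage
  }

  set(key, value) {
    this.hot.set(key, value);
  }

  // Simulate an idle key being evicted from memory.
  evict(key) {
    if (this.hot.has(key)) {
      this.cold.set(key, this.hot.get(key));
      this.hot.delete(key);
    }
  }

  // Run a "script": only the declared keys are loaded from cold storage first.
  runScript(declaredKeys, script) {
    for (const key of declaredKeys) {
      if (this.cold.has(key)) {
        this.hot.set(key, this.cold.get(key));
        this.cold.delete(key);
      }
    }
    // The script can only see what is in memory.
    return script((key) => this.hot.get(key));
  }
}

const store = new TieredStore();
store.set("queue:delayed", ["job:1"]);
store.set("job:1", { contract_address: "0xabc", work_payload: "demo" });
store.evict("job:1"); // the delayed job's hash went idle

// Only the delayed set is declared; the job key is discovered inside the script.
const dynamic = store.runScript(["queue:delayed"], (get) =>
  get("queue:delayed").map((jobKey) => get(jobKey))
);

// Declaring the job key up front lets the store load it before the script runs.
const declared = store.runScript(["queue:delayed", "job:1"], (get) =>
  get("queue:delayed").map((jobKey) => get(jobKey))
);

console.log(dynamic);  // the job data is missing here
console.log(declared); // the job data is present here
```

In this model the first run returns no data for the job because its key was never declared. That matches the symptom in the report, where job.data arrives empty.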

I'll discuss with the team how we can solve this, or at least provide a workaround.

manast commented 2 years ago

@mdogan just chiming in as the Bull/BullMQ maintainer. I would like to help fix this issue. Maybe we can find a workaround together, combining improvements in Upstash's Redis implementation with changes in the libraries I maintain. Feel free to contact me privately if you need to: manast@taskforce.sh.

ctrlaltdylan commented 2 years ago

I would like to use Upstash with Bull as well. I'm able to queue jobs, but processing seems to be an issue.

chronark commented 1 year ago

Closing in favor of #18

Unreliable delayed jobs when using with bull #17