taskforcesh / bullmq

BullMQ - Message Queue and Batch processing for NodeJS and Python based on Redis
https://bullmq.io
MIT License

Parent job does not execute when child job fails #800

Open m4nuC opened 3 years ago

m4nuC commented 3 years ago

I have a FlowProducer that runs a parent job after a set of child jobs have been executed. There are about 10k children being run concurrently. I've set the option removeOnFail to true on both the children and the parent, but it seems that if a child fails, the execution of the parent just hangs.

Could this be a bug or is it a configuration issue on my side?

Here is what the code that creates the job looks like:

import { FlowProducer } from 'bullmq';

// redisClient, firstTokenID, contractAddress and PROJECT_INIT_QUEUE are defined elsewhere in the project
export const initializeCollectionStats = async () => {
  const flowProducer = new FlowProducer({connection: redisClient, sharedConnection: true});
  const totalSupply = 10000
  const tokenJobs = [];
  // Build one child job per token
  for (let i = firstTokenID; i <= totalSupply; i++) {
    tokenJobs.push({
      name: 'child',
      data: {
        hello: 'world'
      },
      queueName: "queueName",
      opts: {
        removeOnFail: true,
        attempts: 2,
        timeout: 3000
      }
    })
  }

  const initJob = await flowProducer.add({
    name: 'initProject',
    data: {
      contractAddress
    },
    queueName: PROJECT_INIT_QUEUE,
    children: tokenJobs,
    opts: {
      attempts: 1
    }
  })

  return initJob
}

and the worker that runs the child jobs looks like this:

import { Worker } from 'bullmq';

const worker = new Worker(QUEUE_NAME, async (job) => {
    const { hello } = job.data
    try {
        const metadata = await doSomethingAsync(hello)
        return "done"
    } catch (e) {
        // Note: catching here and returning means the job still completes successfully
        logger.error(e)
        return null
    }
}, { connection: redisClient, concurrency: 300, sharedConnection: true});

manast commented 3 years ago

Currently, by design, the parent job will not be processed until all child jobs have been completed.

m4nuC commented 3 years ago

@manast Gotcha, thanks. So then the only way would be to catch errors and manually mark child jobs as completed, or is there an option for this? I assumed that removeOnFail was doing something like this.

m4nuC commented 3 years ago

@manast to clarify what I am trying to achieve is:

manast commented 3 years ago

Regarding the two last points:

  • If the max amount of retries is reached on the children, then move the job to failed
  • Once all child jobs are either failed or completed, then run the parent.

This would imply new functionality, namely that a parent job should be "processed" OR "failed" depending on different criteria. For example, it could be the case that we want the parent to fail as soon as one child fails, OR, as you wrote, complete if all children either fail or complete, etc.

Slind14 commented 2 years ago

This affects us as well. Is there a possibility to see a configurable outcome in the near future?

manast commented 2 years ago

@Slind14 what is your use case?

Slind14 commented 2 years ago

> @Slind14 what is your use case?

We would like to keep a history of recently failed jobs within Redis. Otherwise removeOnFail would work, I guess.

bilalshaikh42 commented 2 years ago

We have the same issue. We have one top-level job with multiple sub-components. If any of the components fail, we would like the parent job to be marked as failed so that we can log the errors into a database based on the status of the top-level job.

manast commented 2 years ago

@bilalshaikh42 the problem here is that, for example, you could have retries set up on the children, so you probably do not want to fail the parent at least until all the retries have been exhausted. The other common case is that a child job fails, but then you fix the reason for the failure and manually retry the job; if the job finally completes, we should complete the parent.

manast commented 2 years ago

@roggervalf you could link to this issue regarding the new feature to support failing parents.

bilalshaikh42 commented 2 years ago

> @bilalshaikh42 the problem here is that, for example, you could have retries set up on the children, so you probably do not want to fail the parent at least until all the retries have been exhausted. The other common case is that a child job fails, but then you fix the reason for the failure and manually retry the job; if the job finally completes, we should complete the parent.

Sure, in this case I would expect all the retries to be exhausted before the child job or the parent job is marked as failed. Similarly, even though one child job fails, I expect the other child jobs would still be attempted/pass as long as they do not also depend on that child job.

I had not considered the case you mention about manual intervention into a failed job to get the parent job to fail. To support that, I guess there would need to be some status for the parent other than failed, such as "failed-children", just as there is "waiting" and "waiting-children". This way, if a child job is retried, you could re-evaluate or rerun the parent.

In the way things are set up now, given just the parent job, is there a way to know whether any child jobs have failed without iterating through them all? The status of the parent job will just be stuck on "waiting-children", so our code was initially waiting forever instead of realizing that a child job had failed.

manast commented 2 years ago

@bilalshaikh42 we are trying to iron out different edge cases, so the case when child jobs fail will be improved. But we need to be very careful as any new change usually implies lots of new edge cases.

bilalshaikh42 commented 2 years ago

@manast yup, makes sense! Would be happy to help with feedback and/or testing out ideas

In the meantime, do you have any recommendations on the best way to check for a parent job whose dependencies have failed?

manast commented 2 years ago

@roggervalf do we have anything new to update here after the last updates to flows?

roggervalf commented 2 years ago

This is in my pending list; this week I can work on this feature 👀

C0rellana commented 2 years ago

Hi, is there any way to solve this problem? I have the same problem: if the child job fails or stops, the parent job does not run.

arslnb commented 1 year ago

+1 on this, with some notes:

huksley commented 1 year ago

Hi, I stumbled upon this issue as well. I have failing child jobs, and when all execution attempts on the child jobs have been made (and failed), execution should return to the parent job.

Perhaps add a getChildrenStates() method or similar to read the states of all child jobs.

With the current design, one has to write child workers so bulletproof that they never fail, which I don't think is feasible. Right now the parent is stuck, with the queue returning jobs with getWaitingChildren() > 0.

Leobaillard commented 1 year ago

Hi, this is affecting us as well. I agree with past comments; I think it's unrealistic to expect that child jobs should never fail. Is this issue still being worked on? Thanks!

roggervalf commented 1 year ago

hi @Leobaillard, we currently have the failParentOnFailure option (https://docs.bullmq.io/guide/flows/fail-parent), could it address your case?
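
For reference, the option is set per child when building the flow, roughly like this (a sketch only; the queue names, data and redisClient connection below are placeholders):

import { FlowProducer } from 'bullmq';

const flowProducer = new FlowProducer({ connection: redisClient });

await flowProducer.add({
  name: 'parent',
  queueName: 'parent-queue',
  children: [
    {
      name: 'child',
      data: { hello: 'world' },
      queueName: 'child-queue',
      opts: {
        attempts: 2,
        // If this child exhausts its attempts and fails,
        // the parent job is moved to the failed state as well.
        failParentOnFailure: true,
      },
    },
  ],
});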

Leobaillard commented 1 year ago

> hi @Leobaillard, we currently have the failParentOnFailure option (https://docs.bullmq.io/guide/flows/fail-parent), could it address your case?

Hi! Thanks for your quick answer!

It does, in part.

There are still use cases where it would be nice to allow child jobs to fail when they are not critical to the success (or partial success) of the parent. Users would then be expected to handle this "partial" state in their business logic. This is useful to keep a history of failed child jobs while retaining the possibility of having the parent job executed.

Some sort of job report can then be generated by the app, listing the successful and failed child tasks.

theDanielJLewis commented 1 year ago

I'm in the same boat as everyone else.

It seems odd that the only two options right now for when a child fails are:

  1. The parent does nothing and is never even added to the queue.
  2. Use failParentOnFailure and the parent will always fail when a child fails, even if there were successful children to process.

I wish we had, either as the default or a different option, something like continueParentOnFailure that would allow children to fail but still add the parent to the queue and let the parent process separately.

Adding this option to children would allow us to let some children actually fail the parent, while letting others not affect the parent.

But doing nothing—not even appearing in the queue—just seems strange and not an obvious result.

roggervalf commented 1 year ago

hey @theDanielJLewis, we have a PR for it: https://github.com/taskforcesh/bullmq/pull/1953. You can also take a look at that one; @manast and I are evaluating this new feature.

theDanielJLewis commented 1 year ago

Awesome, @roggervalf! Only a week old, so I can have hope it will be merged soon. Thanks!

Nick-Lucas commented 9 months ago

This is also causing my project a problem now. I thought I'd worked out how to run a tree of jobs and fetch the states afterwards to display what passed and failed, but having to detach failed children with removeDependencyOnFailure to finish the parent kind of defeats the point by hiding the errors rather than surfacing them.
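
For reference, the flag in question is set per child, roughly like this (a sketch with placeholder names; flowProducer as in the snippets above). Once the child fails its final attempt it is detached from the parent's dependencies so the parent can proceed, which is exactly why the failure is no longer visible from the parent:

// flowProducer: a FlowProducer instance as in the earlier snippets
await flowProducer.add({
  name: 'parent',
  queueName: 'parent-queue',
  children: [
    {
      name: 'child',
      queueName: 'child-queue',
      opts: {
        attempts: 3,
        // On final failure, detach this child from the parent's dependencies
        // so the parent is not blocked, but the error is also not surfaced to it.
        removeDependencyOnFailure: true,
      },
    },
  ],
});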

roggervalf commented 9 months ago

I think this issue is related to https://github.com/taskforcesh/bullmq/issues/2092

Nick-Lucas commented 9 months ago

I assume there's an underlying architectural problem, because it's not entirely clear why we need to disconnect a child and then find it again as per #2092, instead of the parent having a flag to continue on child failures, and us using some API to check the children manually and decide what state the parent should end up in. We're going to end up doing that last leg regardless, but it's a bit convoluted.

Anyway, the workaround for now seems to be:

  • Wrapping child jobs in try-catch-rethrow and, on the final retry, making the job pass but with some userland failed status as the return value
  • In the parent, getting all the return values and deciding how to handle any failures

Arguably that's actually quite a good way to go about it regardless, as it's more obvious than an incantation of several flags, and "completed" doesn't have to mean "succeeded".

rosslavery commented 9 months ago

> I assume there's an underlying architectural problem, because it's not entirely clear why we need to disconnect a child and then find it again as per #2092, instead of the parent having a flag to continue on child failures, and us using some API to check the children manually and decide what state the parent should end up in. We're going to end up doing that last leg regardless, but it's a bit convoluted.
>
> Anyway, the workaround for now seems to be:
>
>   • Wrapping child jobs in try-catch-rethrow and, on the final retry, making the job pass but with some userland failed status as the return value
>   • In the parent, getting all the return values and deciding how to handle any failures
>
> Arguably that's actually quite a good way to go about it regardless, as it's more obvious than an incantation of several flags, and "completed" doesn't have to mean "succeeded".

The workaround you suggest is precisely what we do.

Then we have some helper utils that check for the special "succeeded but actually failed" return value, and allow the parent to continue or fail based on things like the percentage of child jobs that truly succeeded or failed.
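
As a rough sketch of that pattern (doWork, the queue names and redisClient are placeholders; getChildrenValues is the BullMQ method for reading the children's return values from the parent):

import { Worker } from 'bullmq';

// Child worker: on the last attempt, swallow the error and report it in the return value.
const childWorker = new Worker('child-queue', async (job) => {
  try {
    const result = await doWork(job.data);
    return { ok: true, result };
  } catch (err) {
    // NOTE: whether attemptsMade already counts the current attempt differs
    // between BullMQ versions, so verify this check against the version in use.
    const maxAttempts = job.opts.attempts ?? 1;
    if (job.attemptsMade < maxAttempts - 1) {
      throw err; // let BullMQ retry
    }
    return { ok: false, error: String(err) }; // "completed", but flagged as failed in userland
  }
}, { connection: redisClient });

// Parent worker: inspect the children's return values and decide the parent's fate.
const parentWorker = new Worker('parent-queue', async (job) => {
  const childValues = await job.getChildrenValues();
  const failed = Object.values(childValues).filter((v) => v && v.ok === false);
  if (failed.length > 0) {
    // e.g. fail the parent, or record a partial-success report instead
    throw new Error(`${failed.length} child job(s) failed`);
  }
  return 'all children succeeded';
}, { connection: redisClient });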

sick-sw commented 1 month ago

Ran into the same issue here. It feels like the parent should have control over whether or not it should fail, based on its children. This is a little more intuitive. If we funneled all failures/completions etc. to the parent to process, it would really make things easier, and the parent processor would actually be hit (and could leave it in waiting-children depending on your action).

I ended up returning failed data from a try/catch in the child processor and it works fine (forcing a completion), although it's a little misleading, since failed children don't show up in the queue. ignoreDependencyOnFailure didn't appear to do anything when I tested it.

It would be nice to see this feature allowing processing of the parent no matter the failures/completions etc., where the parent would decide its fate and take action where needed.

roggervalf commented 1 month ago

hi @sick-sw, it sounds like ignoreDependencyOnFailure is in fact what you are looking for: https://docs.bullmq.io/guide/flows/ignore-dependency. In order to get the children's failures, you need to use the getFailedChildrenValues method if you want to do some logic with them.
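
For completeness, a rough sketch of that combination (queue names, data and redisClient are placeholders): the children opt in with ignoreDependencyOnFailure so their failures do not block the parent, and the parent processor reads the failures back with getFailedChildrenValues:

import { FlowProducer, Worker } from 'bullmq';

const flowProducer = new FlowProducer({ connection: redisClient });

// Children that fail will be set aside instead of blocking the parent.
await flowProducer.add({
  name: 'parent',
  queueName: 'parent-queue',
  children: [
    {
      name: 'child',
      data: { hello: 'world' },
      queueName: 'child-queue',
      opts: { attempts: 2, ignoreDependencyOnFailure: true },
    },
  ],
});

// The parent runs even if some children failed and can inspect those failures.
const parentWorker = new Worker('parent-queue', async (job) => {
  const failures = await job.getFailedChildrenValues(); // { [childJobKey]: failedReason }
  if (Object.keys(failures).length > 0) {
    // Decide here whether to fail, partially succeed, generate a report, etc.
    console.warn('Some children failed:', failures);
  }
  return 'done';
}, { connection: redisClient });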