Open m4nuC opened 3 years ago
Currently, by design, the parent job will not be processed until all child jobs have completed.
@manast Gotcha, thanks. So the only way would be to catch errors and manually mark child jobs as completed, or is there an option for this? I assumed that `removeOnFail` was doing something like this.
@manast to clarify, what I am trying to achieve is:
Regarding the two last points:
- If the maximum number of retries is reached on a child, then move the job to failed.
- Once all child jobs are either failed or completed, then run the parent.
This would imply new functionality, namely that a parent job should be "processed" OR "failed" depending on different criteria. For example, it could be the case that we want the parent to fail as soon as one child fails, OR, as you wrote, complete once all children have either failed or completed, etc.
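As a purely illustrative sketch (none of these names are BullMQ API), the two policies described above could look like:

```typescript
// Illustrative only: a userland policy for deciding a parent's outcome
// from its children's states. Not part of the BullMQ API.
type ChildState = "completed" | "failed" | "pending";
type ParentOutcome = "process" | "fail" | "wait";

function decideParentOutcome(
  children: ChildState[],
  policy: "fail-fast" | "wait-all"
): ParentOutcome {
  const anyFailed = children.some((s) => s === "failed");
  const anyPending = children.some((s) => s === "pending");
  // fail-fast: fail the parent as soon as any child has failed.
  if (policy === "fail-fast" && anyFailed) return "fail";
  // Otherwise wait until every child has settled (completed or failed).
  if (anyPending) return "wait";
  // wait-all: run the parent once all children have settled,
  // regardless of individual failures.
  return "process";
}
```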
This affects us as well. Is there a possibility to see a configurable outcome in the near future?
@Slind14 what is your use case?
We would like to keep a history of recently failed jobs within Redis. Otherwise `removeOnFail` would work, I guess.
We have the same issue. We have one top-level job with multiple sub-components. If any of the components fail, we would like the parent job to be marked as failed, so that we can log the errors into a database based on the status of the top-level job.
@bilalshaikh42 the problem here is that, for example, you could have retries set up on the children, so you probably do not want to fail the parent at least until all the retries have been exhausted. The other common case is that a child job fails, but then you fix the reason for the failure and manually retry the job; if the job finally completes, we should complete the parent.
@roggervalf you could link to this issue regarding the new feature to support failing parents.
Sure, in this case I would expect all the retries to be exhausted before the child job or the parent job is marked as failed. Similarly, even if one child job fails, I would expect the other child jobs to still be attempted (and possibly pass), as long as they do not also depend on that child job.
I had not considered the case you mention, where manual intervention into a failed child job should still affect the parent's outcome. To support that, I guess there would need to be some status for the parent other than failed, such as "failed-children", just as there are "waiting" and "waiting-children". That way, if a child job is retried, you could re-evaluate or rerun the parent.
With the way things are set up now, given just the parent job, is there a way to know when the child jobs have failed without iterating through them all? The status of the parent job will just be stuck on "waiting-children", so our code was initially waiting forever instead of realizing that a child job had failed.
@bilalshaikh42 we are trying to iron out different edge cases, so the case when child jobs fail will be improved. But we need to be very careful as any new change usually implies lots of new edge cases.
@manast yup, makes sense! Would be happy to help with feedback and/or testing out ideas
In the meantime, do you have any recommendations on the best way to check for a parent job whose dependencies have failed?
@roggervalf do we have anything new to update here after the last updates to flows?
This is in my pending list; this week I can work on this feature 👀
Hi, is there any way to solve this problem? I have the same problem: if the child job fails or stops, the parent job does not run.
+1 on this, with some notes:
- Since the parent reads its children's return values (`job.getChildrenValues()`), it will never run if one of the child jobs fails.
- With `failParentOnFailure` set to `true`, the parent job is failed, but you cannot pick up the error with a `Worker.on()` listener, as there is no error "thrown" in our use case (the child and parent have different workers).
- Adding `removeOnFail` to the child job doesn't work either.

Hi, I stumbled upon this issue as well. I have failing child jobs, and when all execution attempts on a child job have been made (and failed), execution should return to the parent job.
Perhaps adding a `getChildrenStates()` method, or similar, to read the states of all children jobs would help.
With the current design, one would have to write child workers that are bulletproof and never fail; I don't think that is feasible.
For now, the parent is stuck, and the queue keeps returning jobs with `getWaitingChildren()` > 0.
Hi, this is affecting us as well. I agree with the previous comments; I think it's unrealistic to expect that child jobs will never fail. Is this issue still being worked on? Thanks!
hi @Leobaillard, we currently have the `failParentOnFailure` option (https://docs.bullmq.io/guide/flows/fail-parent), could it address your case?
Hi! Thanks for your quick answer!
It does, in part.
There are still use cases where it would be nice to allow child jobs to fail if they are not critical to the success (or partial success) of the parent. Users would then be expected to handle this "partial" state in their business logic. This is useful for keeping a failed child job history while retaining the possibility of having the parent job executed.
Some sort of job report can then be generated by the app, listing the successful and failed child tasks.
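A report of that kind could be produced by userland code along these lines (the `ChildResult` shape and `buildJobReport` name are assumptions, not BullMQ types):

```typescript
// Illustrative sketch: summarize child outcomes into an app-level report.
interface ChildResult {
  name: string;
  ok: boolean;
  error?: string;
}

function buildJobReport(results: ChildResult[]) {
  const succeeded = results.filter((r) => r.ok).map((r) => r.name);
  const failed = results.filter((r) => !r.ok).map((r) => r.name);
  // "partial" means the parent ran, but some non-critical children failed.
  const status =
    failed.length === 0 ? "complete" :
    succeeded.length === 0 ? "failed" :
    "partial";
  return { succeeded, failed, status };
}
```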
I'm in the same boat as everyone else.
It seems odd that the only two options right now for when a child fails are:
- `failParentOnFailure`, and the parent will always fail when a child fails, even if there were successful children to process.

I wish we had, either as the default or a different option, something like `continueParentOnFailure` that would allow children to fail but still add the parent to the queue and let the parent process separately.
Adding this option to children would allow us to let some children actually fail the parent, while letting others not affect the parent.
But doing nothing—not even appearing in the queue—just seems strange and not an obvious result.
hey @theDanielJLewis, we have a pr for it https://github.com/taskforcesh/bullmq/pull/1953 You also can take a look on that one, @manast and I are evaluating this new feature
Awesome, @roggervalf! Only a week old, so I can have hope it will be merged soon. Thanks!
This is also causing my project a problem now. I thought I'd worked out how to run a tree of jobs and fetch the states afterwards to display what passed and failed, but having to detach failed children with `removeDependencyOnFailure` to finish the parent kind of defeats the point, hiding the errors rather than surfacing them.
I think this issue is related to https://github.com/taskforcesh/bullmq/issues/2092
I assume there's an underlying architectural problem, because it's not entirely clear why we need to disconnect a child and then find it again as per #2092, instead of the parent having a flag to continue on child failures and we use some API to check children manually and decide what state the Parent should end up in. We're going to end up doing that last leg regardless but it's a bit convoluted.
Anyway, the workaround for now seems to be:
- Wrapping child jobs in try-catch-rethrow and on the final retry making the job pass but with some userland failed status as the return value
- In the parent getting all the return values and deciding how to handle any failures
Arguably that's actually quite a good way to go about it regardless as it's more obvious than an incantation of several flags, and "completed" doesn't have to mean "succeeded"
The workaround you suggest is precisely what we do.
Then we have some helper utils to check for the special "succeeded but actually failed" return value, and allow the parent to continue or fail based on things like the percentage of child jobs that truly succeeded or failed.
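For reference, that pattern could be sketched roughly as follows (userland helpers with illustrative names; none of this is BullMQ API):

```typescript
// Userland workaround sketch: on the final attempt, the child completes
// with a sentinel "failed" value instead of throwing, so the parent runs.
type ChildReturn<T> =
  | { ok: true; value: T }
  | { ok: false; error: string };

async function runChildSafely<T>(
  work: () => Promise<T>,
  isFinalAttempt: boolean
): Promise<ChildReturn<T>> {
  try {
    return { ok: true, value: await work() };
  } catch (err) {
    // Rethrow so the queue's normal retry/backoff still applies.
    if (!isFinalAttempt) throw err;
    // Final attempt: "complete" the job but record the real failure.
    return { ok: false, error: err instanceof Error ? err.message : String(err) };
  }
}

// In the parent: fail only if too few children truly succeeded.
function parentShouldFail(
  results: ChildReturn<unknown>[],
  minSuccessRatio = 0.5
): boolean {
  if (results.length === 0) return false;
  const succeeded = results.filter((r) => r.ok).length;
  return succeeded / results.length < minSuccessRatio;
}
```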
Ran into the same issue here. It feels like the parent should have control over whether or not it fails, based on its children; that would be a little more intuitive. If we funneled all failures/completions etc. to the parent to process, it would really make things easier, and the processor would actually run (and could stay in waiting-children depending on your action).
I ended up returning failure data from a try/catch in the child processor, and it works fine (forcing a completion), although it is a little misleading, since failed children never show up in the queue. `ignoreDependencyOnFailure` didn't appear to do anything after testing.
It would be nice to see this feature allow processing of the parent no matter the failures/completions etc., where the parent would decide its own fate and take action where needed.
hi @sick-sw, it sounds like `ignoreDependencyOnFailure` is in fact what you are looking for: https://docs.bullmq.io/guide/flows/ignore-dependency. In order to get the children's failures, you need to use the `getFailedChildrenValues` method if you want to apply some logic to them.
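Roughly, a parent processor using that method might look like the sketch below; `getChildrenValues` and `getFailedChildrenValues` are the BullMQ `Job` methods mentioned in this thread, while the surrounding aggregation is illustrative (the usage example substitutes a plain stub for a real job):

```typescript
// Sketch of a parent processor that inspects both completed and failed
// children. Only the two getter methods are assumed from BullMQ's Job;
// everything else is illustrative userland code.
interface JobLike {
  getChildrenValues(): Promise<Record<string, unknown>>;      // jobKey -> value
  getFailedChildrenValues(): Promise<Record<string, string>>; // jobKey -> reason
}

async function parentProcessor(job: JobLike) {
  const completed = await job.getChildrenValues();
  const failed = await job.getFailedChildrenValues();
  return {
    completedCount: Object.keys(completed).length,
    failedCount: Object.keys(failed).length,
    failures: failed,
  };
}

// Usage with a stand-in stub (a real worker would receive a BullMQ Job):
const stubJob: JobLike = {
  getChildrenValues: async () => ({ "child:1": "ok" }),
  getFailedChildrenValues: async () => ({ "child:2": "boom" }),
};
```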
I have a FlowProducer that runs a parent job after a set of child jobs has been executed. There are about 10k children being run concurrently. I've set the option `removeOnFail` to `true` on both the children and the parent, but it seems that if a child fails, the execution of the parent just hangs. Could this be a bug, or is it a configuration issue on my side?
Here is what the code that creates the jobs looks like:

and the worker that runs the child jobs looks like this:
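The snippets themselves were not captured in this thread. A minimal reconstruction of what such a setup might look like (job names, queue names, and the child count are assumptions, not the reporter's actual code):

```typescript
// Hypothetical reconstruction of the reported setup, not the actual code.
// The tree follows the shape FlowProducer.add() expects: one parent with
// many children, removeOnFail enabled on both levels.
const flowTree = {
  name: "parent-job",          // assumed job name
  queueName: "parent-queue",   // assumed queue name
  opts: { removeOnFail: true },
  children: Array.from({ length: 3 }, (_, i) => ({
    name: `child-${i}`,
    queueName: "child-queue",  // assumed queue name
    data: { index: i },
    opts: { removeOnFail: true },
  })),
};
// With bullmq, this tree would be submitted via:
//   const flow = new FlowProducer({ connection });
//   await flow.add(flowTree);
```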