taskforcesh / bullmq

BullMQ - Message Queue and Batch processing for NodeJS and Python based on Redis
https://bullmq.io
MIT License

[Bug]: FlowProducer's Parent Job Fail Event Not Triggered When Child Job Fails with failParent Option #2918

Open miridih-jujang opened 2 days ago

miridih-jujang commented 2 days ago

Version

BullMQ 5.28.0

Platform

NodeJS v22.8.0

What happened?

When using FlowProducer with child jobs that have the failParentOnFailure: true option set, if a child job fails, the parent job correctly transitions to the failed state, but the parent worker's 'failed' event is not emitted. This makes event handling and monitoring inconsistent: we can catch the child job's failure event but not the parent's.

Project Impact

Our project currently uses BullMQ Pro to implement a complex job processing system. This bug causes several critical issues:

Inability to Track Failures: Cannot accurately track the failure of entire job flows due to missing parent job failure events. Critical failure situations may be missed in the monitoring system.

Incomplete Error Handling: Unable to automate appropriate follow-up actions for parent job failures. Recovery mechanisms for failed jobs may not function properly.

Business Logic Impact: Unable to manage the state of entire job groups, affecting interconnected business processes. Cannot provide accurate job status updates to users.

Reduced System Stability: Accurate detection and handling of failed jobs becomes impossible, reducing system stability. Automatic recovery mechanisms in failure scenarios do not work properly.

This issue prevents us from achieving our project's core requirements of reliable job processing and monitoring. It's particularly concerning that such a fundamental feature isn't working properly despite using the Pro version, directly impacting our project's overall quality and reliability.

Need for Resolution

Resolving this bug is essential for the successful completion of our project. While workarounds are possible, they increase system complexity and make maintenance more difficult. As Pro version users, we expect this core functionality to work correctly and request a prompt resolution.

How to reproduce.

import { FlowProducerPro, QueuePro, WorkerPro } from '@taskforcesh/bullmq-pro';

async function runTest() {
  // Create queues
  const parentQueue = new QueuePro('parent');
  const childQueue = new QueuePro('child');

  // Create workers
  const parentWorker = new WorkerPro(
    'parent',
    async (job) => {
      console.log('Parent job processing:', job.id);
    },
    { connection: { host: 'localhost', port: 6379 } },
  );

  const childWorker = new WorkerPro(
    'child',
    async (job) => {
      console.log('Child job processing:', job.id);
      throw new Error('Child job failed');
    },
    { connection: { host: 'localhost', port: 6379 } },
  );

  // The 'failed' listener may receive an undefined job, hence the optional chaining
  parentWorker.on('failed', (job, err) => {
    console.log('Parent job failed event:', job?.id, err.message);
  });

  childWorker.on('failed', (job, err) => {
    console.log('Child job failed event:', job?.id, err.message);
  });

  // Create flow
  const flow = new FlowProducerPro({
    connection: { host: 'localhost', port: 6379 },
  });

  try {
    const flowJob = await flow.add({
      name: 'parent-job',
      queueName: 'parent',
      data: { foo: 'bar' },
      children: [
        {
          name: 'child-job',
          data: { bar: 'baz' },
          queueName: 'child',
          opts: {
            failParentOnFailure: true,
          },
        },
      ],
    });

    console.log('Flow created:', flowJob.job.id);

    // Wait for some time to see the events
    await new Promise((resolve) => setTimeout(resolve, 5000));
  } catch (error) {
    console.error('Error:', error);
  } finally {
    // Cleanup
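    // Close the flow producer too, so its Redis connection does not keep the process alive
    await Promise.all([
      parentWorker.close(),
      childWorker.close(),
      parentQueue.close(),
      childQueue.close(),
      flow.close(),
    ]);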
  }
}

runTest().catch(console.error);

Relevant log output

Child job processing: c0e9f776-8752-4dcd-a852-c69450d383db
Child job failed event: c0e9f776-8752-4dcd-a852-c69450d383db Child job failed

roggervalf commented 13 hours ago

Hi @miridih-jujang, this is expected. The parent job is moved to failed while its children are being processed by workers in the 'child' queue, so the 'parent' worker has no way to receive that event: a worker emits 'failed' only for an active job that this particular instance is processing. Worker events are triggered for jobs that are being processed, not for other jobs such as parents. For this case you can use our QueueEventsPro class and listen for its 'failed' event: https://docs.bullmq.io/guide/events https://api.docs.bullmq.io/interfaces/v5.QueueEventsListener.html#failed
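
For anyone landing here, a minimal sketch of the suggestion above. It assumes the queue-level 'failed' listener receives an object with jobId and failedReason plus an event id, as described in the linked QueueEventsListener docs. The helper collectFailures and the FakeFailedSource stand-in are hypothetical names used only so the sketch runs without Redis; the commented lines show where a real QueueEventsPro instance on the 'parent' queue would plug in.

```typescript
// Payload assumed for queue-level 'failed' events (per the QueueEventsListener docs).
interface FailedEvent {
  jobId: string;
  failedReason: string;
  prev?: string;
}

type FailedListener = (args: FailedEvent, id: string) => void;

// Minimal structural type: anything exposing on('failed', ...).
// A QueueEvents / QueueEventsPro instance should satisfy it.
interface FailedEventSource {
  on(event: 'failed', listener: FailedListener): unknown;
}

// Hypothetical helper: records every failure reported for the queue,
// no matter which worker (or which child job) moved the job to failed.
function collectFailures(source: FailedEventSource): FailedEvent[] {
  const failures: FailedEvent[] = [];
  source.on('failed', (args) => {
    failures.push(args);
    console.log('Queue-level failed event:', args.jobId, args.failedReason);
  });
  return failures;
}

// With a real Redis connection this would be (not executed here):
//   import { QueueEventsPro } from '@taskforcesh/bullmq-pro';
//   const parentEvents = new QueueEventsPro('parent', {
//     connection: { host: 'localhost', port: 6379 },
//   });
//   collectFailures(parentEvents);

// Hypothetical stand-in for the event source so the sketch is runnable offline.
class FakeFailedSource implements FailedEventSource {
  private listeners: FailedListener[] = [];

  on(_event: 'failed', listener: FailedListener): this {
    this.listeners.push(listener);
    return this;
  }

  emitFailed(args: FailedEvent, id: string): void {
    for (const listener of this.listeners) listener(args, id);
  }
}

const fakeEvents = new FakeFailedSource();
const seen = collectFailures(fakeEvents);

// Simulate the parent job being reported as failed (made-up id and reason).
fakeEvents.emitFailed({ jobId: 'parent-1', failedReason: 'simulated failure' }, '0-1');
```

Unlike a worker's 'failed' event, the queue-level listener receives only the job id and failure reason, so fetching the job itself (for example with Job.fromId) is a separate step if the full job data is needed.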