rosslavery commented 1 year ago

Hi there,

We recently updated our bullmq-pro dependency from 5.0.1 -> 5.1.4 / 5.1.5. After this update, we have jobs that were completing reliably that are now encountering errors.

Example log output from our workers -- note the introduction of errors on December 13th:

Similarly, output from this morning's run:

The only change made on December 12th (and since then) has been the update to the bullmq-pro version:

The jobs are actually finished -- the work expected to be done is completed, but they are retried a number of times and end up in the "failed" bucket instead of "finished".

The problematic job is the last step in a flow structured as such:

  const adwordsFlow: FlowJobPro = {
    name: 'ADWORDS_METRICS_DAILY_SAVE',
    queueName: 'ADWORDS_METRICS_DAILY_SAVE',
    opts: { removeOnComplete: true },
    children: adwordsAccountIds.map(accountId => ({
      name: 'ADWORDS_METRICS_DAILY_PROCESS',
      queueName: 'ADWORDS_METRICS_DAILY_PROCESS',
      opts: { removeOnComplete: true },
      data: {
        payload: { accountId, dailyDateRange }
      }
    }))
  };

rosslavery commented 1 year ago

Digging a bit deeper, our ADWORDS_METRICS_DAILY_SAVE job relies on job.getChildrenValues which received a refactor during the version upgrade I referred to.

I believe this refactor may be noticeably slower which is causing the jobs to stall. I am preparing some benchmark data now to confirm my theory.

manast commented 1 year ago

Thanks for the reporting. I do not think it depends on that refactoring though 🤔, what do you think @roggervalf ?

rosslavery commented 1 year ago

I do know in the past for myself, .reduce-ing over large datasets with object-spread is less performant than other approaches.

Still working on getting benchmark data to confirm, but reverting to the old version has caused our job completion time to go back to it's original time (70+ seconds back to ~20).

manast commented 1 year ago

How many children do you have in these jobs?

rosslavery commented 1 year ago

In the above production example, ~9000 children.

rosslavery commented 1 year ago

Related Material:

https://prateeksurana.me/blog/why-using-object-spread-with-reduce-bad-idea/ https://www.richsnapp.com/article/2019/06-09-reduce-spread-anti-pattern

Benchmarks

Rough benchmarks ran locally (so more powerful CPU than production), and with a smaller dataset (500 children):

Obviously 195ms it not problematic, but magnified to a large number of children, it becomes a problem.

5.0.1 - 500 children - getChildrenValues: 64.146ms
5.1.5 - 500 children - getChildrenValues: 195.146ms

Pertinent code being benchmarked:

  console.time('getChildrenValues');
  const allResults: Record<string, AdwordsMetricsDataChunk> = await job.getChildrenValues();
  console.timeEnd('getChildrenValues');

manast commented 1 year ago

Ok. We can optimize this with a for loop, which is usually much much faster.

rosslavery commented 1 year ago

For now I downgraded to 5.1.2 which is the last version that depends on bullmq 3.2.4 before the refact.

Thanks @manast @roggervalf !

rosslavery commented 1 year ago

I see pro@5.1.6 went out yesterday, but it still seems to be pointing to bullmq@3.4.1 instead of 3.4.2 with the 2x perf improvements? 🤔

roggervalf commented 1 year ago

hi @rosslavery , the release pipeline was failing yesterday, now it's working, please try version 5.1.7

taskforcesh / bullmq-pro-support

Updating from 5.0.1 -> 5.1.4/5.1.5 introduced major performance regression with `job.getChildrenValues` #33

Related Material:

Benchmarks