Closed rosslavery closed 1 year ago
Digging a bit deeper, our ADWORDS_METRICS_DAILY_SAVE
job relies on job.getChildrenValues
which received a refactor during the version upgrade I referred to.
I believe this refactor may be noticeably slower which is causing the jobs to stall. I am preparing some benchmark data now to confirm my theory.
Thanks for the reporting. I do not think it depends on that refactoring though 🤔, what do you think @roggervalf ?
I do know in the past for myself, .reduce
-ing over large datasets with object-spread is less performant than other approaches.
Still working on getting benchmark data to confirm, but reverting to the old version has caused our job completion time to go back to it's original time (70+ seconds back to ~20).
How many children do you have in these jobs?
In the above production example, ~9000 children.
https://prateeksurana.me/blog/why-using-object-spread-with-reduce-bad-idea/ https://www.richsnapp.com/article/2019/06-09-reduce-spread-anti-pattern
Rough benchmarks ran locally (so more powerful CPU than production), and with a smaller dataset (500 children):
Obviously 195ms it not problematic, but magnified to a large number of children, it becomes a problem.
5.0.1 - 500 children - getChildrenValues: 64.146ms
5.1.5 - 500 children - getChildrenValues: 195.146ms
Pertinent code being benchmarked:
console.time('getChildrenValues');
const allResults: Record<string, AdwordsMetricsDataChunk> = await job.getChildrenValues();
console.timeEnd('getChildrenValues');
Ok. We can optimize this with a for loop, which is usually much much faster.
For now I downgraded to 5.1.2 which is the last version that depends on bullmq 3.2.4 before the refact.
Thanks @manast @roggervalf !
I see pro@5.1.6 went out yesterday, but it still seems to be pointing to bullmq@3.4.1 instead of 3.4.2 with the 2x perf improvements? 🤔
hi @rosslavery , the release pipeline was failing yesterday, now it's working, please try version 5.1.7
Hi there,
We recently updated our bullmq-pro dependency from 5.0.1 -> 5.1.4 / 5.1.5. After this update, we have jobs that were completing reliably that are now encountering errors.
Example log output from our workers -- note the introduction of errors on December 13th:
Similarly, output from this morning's run:
The only change made on December 12th (and since then) has been the update to the bullmq-pro version:
The jobs are actually finished -- the work expected to be done is completed, but they are retried a number of times and end up in the "failed" bucket instead of "finished".
The problematic job is the last step in a flow structured as such: