This PR adds jitter to the backoff sleep of the processor fetch loop to avoid worker synchronization, and thus improve worker efficiency and job latency.
Context
We run dozens of workers, with an execution rate of up to 5K jobs/sec.
We started to suspect that the job workers were significantly inefficient.
Changing TaskCheckInterval from 1s (the default) to 100ms reduced latency from ~500ms to 40ms on a high-traffic worker pool, and to 10ms on a low-traffic worker pool.
Adding jitter then reduced latency further, to 2ms on both the high-traffic and low-traffic worker pools.
Synchronization between workers is a well-known problem in distributed systems, and adding jitter helped avoid it.
Changes
Add jitter to the backoff sleep of the processor fetch loop so that workers do not synchronize.
Impact of adding jitter:
Queue size: (graph not included)
Latency: (graph not included)