Open martinslota opened 4 months ago
I have now created a separate repository that (hopefully) makes it easy to reproduce the bug.
We have been using the fix in this branch in production for roughly the last 3 months, and it has considerably reduced the error rates we are seeing when shutting down Bull queue clients.
Motivation and Background
This is an attempt to fix errors occurring when a `connect()` call is made shortly after a `disconnect()`, which is something that the Bull library does when pausing a queue.

Here's a relatively minimal way to reproduce an error:
Running that script in a loop against the `main` branch of `ioredis` quickly results in this output:

My debugging led me to believe that the existing node cleanup logic in the `ConnectionPool` class leads to race conditions: upon `disconnect()`, the `this.connectionPool.reset()` call will remove nodes from the pool without cleaning up the event listener, which may subsequently issue more than one `drain` event. Depending on timing, one of the extra `drain` events may fire after `connect()` and change the status to `close`, interfering with the connection attempt and leading to the error above.

Changes
- […] `ConnectionPool` class and remove them from the nodes whenever they are removed from the pool.
- […] `-node`/`drain` regardless of whether nodes disconnected or were removed through a `reset()` call.
- […] `reset()`, add nodes before removing old ones to avoid unwanted `drain` events.
- […] `this` point to the connection pool instance.

[…] `main` is seemingly different from the error shown above, but it still seems related to the disconnection logic and still gets fixed by the changes in this PR.
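
To make the intent of these changes easier to follow, here is a minimal, self-contained sketch of the idea. It is not the actual `ioredis` code: `FakeNode`, `SketchPool`, and the node keys are hypothetical stand-ins, and only the `-node`/`drain` event names and the `reset()` ordering are taken from the description above. The sketch keeps a reference to each node's `end` listener so it can be detached on removal, emits `-node`/`drain` from a single removal path, and has `reset()` add new nodes before removing old ones.

```typescript
import { EventEmitter } from "events";

// Hypothetical stand-in for a Redis connection; NOT the ioredis class.
class FakeNode extends EventEmitter {
  end(): void {
    this.emit("end");
  }
}

// Simplified pool illustrating the approach; names are assumptions.
class SketchPool extends EventEmitter {
  private nodes = new Map<string, { node: FakeNode; onEnd: () => void }>();

  get size(): number {
    return this.nodes.size;
  }

  add(key: string): FakeNode {
    const existing = this.nodes.get(key);
    if (existing) return existing.node;
    const node = new FakeNode();
    // Arrow function: `this` is the pool. The reference is stored so the
    // listener can be detached when the node leaves the pool.
    const onEnd = () => this.remove(key);
    node.once("end", onEnd);
    this.nodes.set(key, { node, onEnd });
    this.emit("+node", node, key);
    return node;
  }

  // Single removal path, used both when a node ends and when reset() drops it.
  remove(key: string): void {
    const entry = this.nodes.get(key);
    if (!entry) return;
    entry.node.removeListener("end", entry.onEnd);
    this.nodes.delete(key);
    this.emit("-node", entry.node, key);
    if (this.nodes.size === 0) this.emit("drain");
  }

  reset(keys: string[]): void {
    const wanted = new Set(keys);
    // Add first, so the pool is never transiently empty during a swap.
    for (const key of keys) this.add(key);
    for (const key of Array.from(this.nodes.keys())) {
      if (!wanted.has(key)) this.remove(key);
    }
  }
}

const pool = new SketchPool();
let drains = 0;
pool.on("drain", () => {
  drains += 1;
});

pool.reset(["a:1", "b:2"]);
const oldA = pool.add("a:1"); // returns the node already in the pool
pool.reset(["c:3"]); // c is added before a and b are removed: drains is still 0
pool.reset([]); // the last node is removed: drains becomes 1
oldA.end(); // stale node ends later; its listener was detached, so drains stays 1
```

In this sketch, a node that was already removed can no longer trigger pool events when it ends later, which is the kind of late, extra `drain` that the description above identifies as interfering with a subsequent `connect()`.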