The current implementation has no way to properly monitor failure situations. This MR fixes it.
Motivation and context
There are multiple ways of failures and it's broken in many ways:
If a channel is closed by the server, the issued commands waiting for certain frames just stall. Their promises never get resolved
Auth errors result in a stalled connect() method. It never resolves
If a connection is closed unexpectedly, e.g. by a server restart, network outage, it takes very long until this can be detected and there is no proper monitoring. There's not way to be informed about a connection loss. It has to be manually probed
If a channel is closed by the server, there's no way to monitor it. There's not even a way to properly manually probe it
All these things are important to build a resilient and fault tolerant system with this library.
How has this been tested?
Manually tested different failure cases with a cluster of 3 nodes:
Killing all nodes
Killing a subset of nodes
Tested connect with wrong credentials
Types of changes
What types of changes does your code introduce? Put an x in all the boxes that apply:
[x] Bug fix (non-breaking change which fixes an issue)
[x] New feature (non-breaking change which adds functionality)
[ ] Breaking change (fix or feature that would cause existing functionality to change)
Depends on how you look at it. It's both a new feature (monitoring connecitons) and fixing a lot of issues with the current error handling.
Checklist:
Go over all the following points, and put an x in all the boxes that apply.
Description
The current implementation has no way to properly monitor failure situations. This MR fixes it.
Motivation and context
There are multiple ways of failures and it's broken in many ways:
connect()
method. It never resolvesAll these things are important to build a resilient and fault tolerant system with this library.
How has this been tested?
Manually tested different failure cases with a cluster of 3 nodes:
Types of changes
What types of changes does your code introduce? Put an
x
in all the boxes that apply:Depends on how you look at it. It's both a new feature (monitoring connecitons) and fixing a lot of issues with the current error handling.
Checklist:
Go over all the following points, and put an
x
in all the boxes that apply.If you're unsure about any of these, don't hesitate to ask. We're here to help!
=> Regarding tests: I have not written system tests for it. It's quite hard to write automated tests for failing infrastructure. What's your opinion?