The VHS is losing its connections to a large number of nodes (only 40-50 node connections remained in February, with a low of 9-15 in January). These connections fail after being established, and the VHS does not try to reconnect until the next hourly bulk connection attempt (the bulk connection also leads to slow responses and high resource consumption). The most frequent close codes seen so far are:
1008: Policy error: client is too slow. (Most frequent)
1006: Abnormal Closure: The connection was closed abruptly without a proper handshake or a clean closure.
1005: No Status Received: The close frame carried no status code, so no further details about the closure are available.
This PR allows the VHS to reconnect to a previously established connection after it fails, instead of waiting up to an hour for the next bulk connection attempt. This should also help fix the problem in https://github.com/ripple/validator-history-service/pull/203.
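The reconnect-on-failure behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function names, the exponential backoff, the delay values, and the retry budget are all assumptions made for the example. The only details taken from the description are the three close codes.

```typescript
// Hypothetical sketch: none of these names come from the VHS codebase.

// Close codes listed in this PR description.
const CLOSE_CODES: Record<number, string> = {
  1005: 'No Status Received',
  1006: 'Abnormal Closure',
  1008: 'Policy error: client is too slow',
}

const BASE_DELAY_MS = 1000 // assumed initial delay
const MAX_DELAY_MS = 60000 // assumed cap, far below the hourly bulk attempt
const MAX_ATTEMPTS = 5 // assumed retry budget per node

// Exponential backoff: 1s, 2s, 4s, ... capped at MAX_DELAY_MS.
function reconnectDelayMs(attempt: number): number {
  return Math.min(BASE_DELAY_MS * 2 ** attempt, MAX_DELAY_MS)
}

// Returns the delay before the next reconnect attempt, or null to stop
// retrying and leave the node to the next bulk connection attempt.
function planReconnect(closeCode: number, attempt: number): number | null {
  if (closeCode === 1000) return null // normal closure: assumed not retried
  if (attempt >= MAX_ATTEMPTS) return null
  return reconnectDelayMs(attempt)
}

// Human-readable label for logging a close event.
function describeClose(code: number): string {
  return CLOSE_CODES[code] ?? 'Unknown'
}
```

A close handler on the WebSocket could call `planReconnect(code, attempt)` and, when the result is not null, schedule a new connection to the same node with `setTimeout`. Capping the retries keeps a persistently failing node from consuming resources between bulk attempts.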
Type of Change
[x] Bug fix (non-breaking change which fixes an issue)
[ ] New feature (non-breaking change which adds functionality)
[ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
[ ] Refactor (non-breaking change that only restructures code)
[ ] Tests (You added tests for code that already exists, or your new feature included in this PR)
[ ] Documentation Updates
[ ] Release
Test Plan
Test on staging:
Monitor the number of failed-connection messages before and after this change.
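One way to compare the two periods is to tally close events per close code. The sketch below assumes a hypothetical `CloseEvent` shape; the actual VHS log format may differ, and the events would need to be parsed from the staging logs first.

```typescript
// Hypothetical shape of a logged close event; not taken from the VHS codebase.
interface CloseEvent {
  url: string
  code: number
}

// Counts close events per close code, e.g. how many 1008 closures occurred.
function countByCode(events: CloseEvent[]): Map<number, number> {
  const counts = new Map<number, number>()
  for (const event of events) {
    counts.set(event.code, (counts.get(event.code) ?? 0) + 1)
  }
  return counts
}
```

Running this over equal-length time windows before and after deployment gives a per-code comparison rather than a single total.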