Unmaintained repository. Development moved to: https://github.com/nordic-institute/X-Road-development
As a Product Owner I want the 'fastest wins' load balancing principle to be improved so that the connection to a service in the subsystem is established faster #58
Problem
The 'fastest wins' load balancing principle does not work smoothly in certain conditions. For example, if a firewall blocks the network traffic sent by a Security Server and does not reply to requests with a 'connection refused' or 'destination unreachable' error message, it takes a considerably long time (~2 minutes) before the timeout expires and the Security Server starts trying the alternative connection (i.e. before the 'fastest wins' load balancing recognizes the situation).
Currently, after the connection to the primary Security Server host times out, the secondary Security Server host receives the connection request and is able to respond. In contrast, if the proxy process on the primary Security Server host is shut down, the host replies immediately with a 'connection refused' error message, which speeds up re-connection to the secondary Security Server host.
Because of this, the 'fastest wins' load balancing principle needs more than a minor change. In practice this could mean replacing it with a 'round robin' or 'random' principle (or some other principle), so that the host is actually rotated or chosen at random, or there is rotation logic determining which host is responsible for replying. If the 'round robin' or 'random' principle selects a host that does not respond, the request is sent to another Security Server host that can serve the requester.
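The fail-over behaviour described above can be illustrated with a minimal sketch. This is not X-Road code; the class and method names (RoundRobinSelector, selectHost, tryConnect) are hypothetical, and the sketch only shows the idea of rotating the starting host and skipping hosts that refuse or time out:

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Predicate;

// Hypothetical sketch of round-robin host selection with fail-over.
public class RoundRobinSelector {
    private final List<String> hosts;
    private final AtomicInteger next = new AtomicInteger(0);

    public RoundRobinSelector(List<String> hosts) {
        this.hosts = hosts;
    }

    /**
     * Tries each host at most once, starting from a rotating index so that
     * successive requests prefer different hosts. A host that fails the
     * connection attempt is skipped (fail-over to the next candidate).
     */
    public String selectHost(Predicate<String> tryConnect) {
        int start = Math.floorMod(next.getAndIncrement(), hosts.size());
        for (int i = 0; i < hosts.size(); i++) {
            String host = hosts.get((start + i) % hosts.size());
            if (tryConnect.test(host)) {
                return host;
            }
        }
        return null; // no host reachable
    }
}
```

A 'random' variant would simply pick the starting index at random instead of incrementing a counter; the fail-over loop stays the same.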
Updated 22.11.2017:
Based on a pre-study, the following enhancements were found useful and can improve the performance of the 'fastest wins' functionality with relatively little effort:
1. The connection timeout to the "previously fastest" host could be lower than the initial connection timeout in order to reduce the fail-over time (simple).
2. The connection initiation order could be randomized to avoid always preferring the same server (simple).
3. The selection expiration should be decoupled from the TLS session cache expiration (medium; the selection time also needs to be cached).
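Enhancements 1 and 2 can be sketched together as a planning step that orders the connection attempts. This is illustrative only: the names (FastestWinsTuning, planAttempts, ConnectionAttempt) and the timeout values are assumptions, not X-Road configuration:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of enhancements 1-2: randomized initiation order,
// with a reduced timeout for the cached "previously fastest" host.
public class FastestWinsTuning {
    static final int INITIAL_TIMEOUT_MS = 30_000; // assumed first-contact timeout
    static final int CACHED_TIMEOUT_MS = 5_000;   // shorter timeout for the cached host

    record ConnectionAttempt(String host, int timeoutMs) {}

    /**
     * Shuffles the candidate hosts so no host is systematically preferred,
     * then moves the previously fastest host (if any) to the front with a
     * shorter timeout, so a dead cached host fails over quickly.
     */
    static List<ConnectionAttempt> planAttempts(List<String> hosts, String previousFastest) {
        List<String> shuffled = new ArrayList<>(hosts);
        Collections.shuffle(shuffled); // enhancement 2: randomized order
        List<ConnectionAttempt> plan = new ArrayList<>();
        if (previousFastest != null && shuffled.remove(previousFastest)) {
            plan.add(new ConnectionAttempt(previousFastest, CACHED_TIMEOUT_MS)); // enhancement 1
        }
        for (String host : shuffled) {
            plan.add(new ConnectionAttempt(host, INITIAL_TIMEOUT_MS));
        }
        return plan;
    }
}
```

Enhancement 3 would additionally store the timestamp of the selection alongside `previousFastest`, so it can expire independently of the TLS session cache.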
When designing and implementing this issue, it should be considered that the TLS sessions of all provider Security Server connections could be cached to speed up the TLS handshake.
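With the JDK's standard TLS API, client-side session caching can be configured on the SSLContext so that repeat connections to the same provider can resume a session with an abbreviated handshake. A minimal sketch, assuming default key/trust managers and illustrative cache size and timeout values:

```java
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLSessionContext;

// Sketch of client-side TLS session caching with the JDK's SSLContext.
// The cache size and timeout values are illustrative, not X-Road settings.
public class TlsSessionCacheConfig {
    public static SSLContext configuredContext() throws Exception {
        SSLContext ctx = SSLContext.getInstance("TLSv1.2");
        ctx.init(null, null, null); // default key/trust managers for the sketch
        SSLSessionContext sessions = ctx.getClientSessionContext();
        sessions.setSessionCacheSize(100); // cache sessions for up to 100 peers
        sessions.setSessionTimeout(600);   // keep cached sessions for 10 minutes
        return ctx;
    }
}
```

Note that enhancement 3 above matters here: if the host selection expires together with these cached sessions, tuning one forces the other, which is why decoupling them was proposed.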
Acceptance criteria
The pros and cons of the 'round robin' and 'random' principles are investigated.
It is determined how to change the 'fastest wins' principle so that the connection to the service provider is established faster.
The 'fastest wins' principle is improved by implementing steps 1-3 of the pre-study (see above).
The changes are tested and measured in a test environment to see how they improve the functionality compared to the old principle.
The performance of both principles is measured (average times, long-term behaviour, stress tests, etc.) to verify the improvement.
Affected components: -
Affected documentation: https://github.com/ria-ee/X-Road/blob/develop/doc/Architecture/arc-ss_x-road_security_server_architecture.md
Estimated delivery: Q1 / 2018
External references: https://jira.csc.fi/browse/PVAYLADEV-448, https://jira.csc.fi/browse/PVAYLADEV-1033
Note! See also the TLS session issue.