vrk-kpa / xroad-joint-development

Unmaintained repository. Development moved to: https://github.com/nordic-institute/X-Road-development

As a Product Owner I want that the 'fastest wins' load balancing principle is improved so that the connection to a service in the subsystem will be established faster #58

Closed hanhaka closed 6 years ago

hanhaka commented 8 years ago

Affected components: -
Affected documentation: https://github.com/ria-ee/X-Road/blob/develop/doc/Architecture/arc-ss_x-road_security_server_architecture.md
Estimated delivery: Q1 / 2018
External reference: https://jira.csc.fi/browse/PVAYLADEV-448, https://jira.csc.fi/browse/PVAYLADEV-1033

Problem: The 'fastest wins' load balancing principle does not work smoothly in certain conditions. For example, if a firewall blocks the network traffic sent by a Security Server and does not reply to requests with a 'connection refused' or 'destination unreachable' error message, it takes a considerably long time (~2 minutes) before the connection attempt times out and the Security Server starts trying the alternative connection. Only at that point does the 'fastest wins' load balancing notice the situation.

Currently, the secondary Security Server host receives the connection request and is able to respond only after the connection to the primary Security Server host has timed out. In contrast, if the proxy process on the primary Security Server host is shut down, the host replies right away with a 'connection refused' error message, which speeds up the reconnection to the secondary Security Server host.
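To make the difference between the two failure modes concrete, here is a minimal sketch in Python (not the X-Road proxy code, which is Java). The helper name `try_connect` and its outcome labels are hypothetical. It only shows that a refused connection fails immediately, while a silently dropped one fails only once the configured connect timeout has elapsed.

```python
import socket
import time


def try_connect(host, port, timeout_s, connect=socket.create_connection):
    """Attempt one TCP connection and report how it ended.

    Returns (outcome, elapsed_seconds), where outcome is one of
    "connected", "refused", "timeout" or "unreachable".
    `connect` is injectable so the behaviour can be tried without a network.
    """
    start = time.monotonic()
    try:
        sock = connect((host, port), timeout=timeout_s)
    except ConnectionRefusedError:
        # The peer answered with a reset: the failure is known right away.
        return "refused", time.monotonic() - start
    except socket.timeout:
        # Nothing answered (e.g. a firewall dropped the packets):
        # the failure is known only after timeout_s has passed.
        return "timeout", time.monotonic() - start
    except OSError:
        # Other network errors, e.g. 'destination unreachable'.
        return "unreachable", time.monotonic() - start
    sock.close()
    return "connected", time.monotonic() - start
```

With a silently dropping firewall, the elapsed time is roughly the connect timeout, so a shorter timeout directly shortens the fail-over delay described above.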

Because of this, the 'fastest wins' load balancing functionality needs a more substantial change. In practice, the 'fastest wins' principle could be replaced by a 'round robin' or 'random' principle (or some other principle), so that the serving host actually changes between requests, either in rotation or at random. If the 'round robin' or 'random' principle selects a provider host that does not respond, the request is sent to another Security Server host that can serve the requester.
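The round robin variant with fail-over could look roughly like the sketch below. It is written in Python for brevity and is only an illustration of the idea, not a proposal for the actual proxy code. `RoundRobinSelector`, `send_with_failover` and the host names in the usage note are hypothetical.

```python
class RoundRobinSelector:
    """Rotates the starting host on every request."""

    def __init__(self, hosts):
        self._hosts = list(hosts)
        if not self._hosts:
            raise ValueError("at least one host is required")
        self._next = 0

    def candidates(self):
        """Return all hosts, starting from the next one in the rotation."""
        start = self._next
        self._next = (self._next + 1) % len(self._hosts)
        return self._hosts[start:] + self._hosts[:start]


def send_with_failover(selector, send):
    """Try the hosts in rotation order and return (host, response)
    from the first host that answers.

    `send` is any callable that takes a host and raises OSError
    (e.g. a timeout or 'connection refused') when the host cannot serve.
    """
    last_error = None
    for host in selector.candidates():
        try:
            return host, send(host)
        except OSError as exc:
            last_error = exc  # remember the failure and try the next host
    raise ConnectionError("no provider host responded") from last_error
```

For example, with hosts `["ss1", "ss2"]` where `ss1` never replies, the first request tries `ss1`, fails over to `ss2`, and the second request starts directly from `ss2`. Note that rotation alone does not remove the long timeout when the first candidate is silently blocked. It only spreads the delay across requests.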

Updated 22.11.2017: Based on a pre-study, the following enhancements were found to be useful and to improve the performance of the 'fastest wins' functionality quite easily:

  1. The connection timeout to a "previously fastest" host could be lower than the initial connection timeout in order to reduce the fail-over time (simple)
  2. The connection initiation order could be randomized to avoid preferring the same server (simple)
  3. The selection expiration should be decoupled from the TLS session cache expiration (medium, the selection time also needs to be cached)
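The three enhancements above can be sketched together as follows. This is a simplified Python illustration, not the implementation delivered in X-Road: it builds a sequential list of connection attempts and does not reproduce how the proxy actually initiates connections. The timeout and expiry values, and the names `FastestHostCache` and `connection_plan`, are assumptions made for the example.

```python
import random
import time

INITIAL_TIMEOUT_S = 30.0   # assumed timeout for hosts with no history
CACHED_TIMEOUT_S = 1.0     # assumed shorter timeout for the previously fastest host
SELECTION_TTL_S = 300.0    # assumed selection lifetime, independent of TLS sessions


class FastestHostCache:
    """Remembers the previously fastest host together with its selection time."""

    def __init__(self, ttl_s=SELECTION_TTL_S, clock=time.monotonic):
        self._ttl_s = ttl_s
        self._clock = clock
        self._host = None
        self._selected_at = None

    def remember(self, host):
        self._host = host
        self._selected_at = self._clock()

    def current(self):
        """Return the cached host, or None if nothing is cached or it expired."""
        if self._host is None:
            return None
        if self._clock() - self._selected_at >= self._ttl_s:
            self._host = None  # enhancement 3: the selection expires on its own clock
            return None
        return self._host


def connection_plan(hosts, cache, rng=random):
    """Return the ordered list of (host, connect_timeout) attempts."""
    preferred = cache.current()
    others = [h for h in hosts if h != preferred]
    rng.shuffle(others)  # enhancement 2: do not always prefer the same server
    plan = []
    if preferred in hosts:
        # enhancement 1: fail over quickly if the cached host stopped answering
        plan.append((preferred, CACHED_TIMEOUT_S))
    plan.extend((h, INITIAL_TIMEOUT_S) for h in others)
    return plan
```

The injectable `clock` and `rng` parameters are there only so that the expiry and the shuffling can be exercised deterministically.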

Note! See also TLS session issue

While designing and implementing this issue, consider that the TLS sessions of all provider Security Server connections could be cached to speed up the TLS handshake.

Acceptance criteria

hanhaka commented 6 years ago

Fixed in 6.17.0. For more information, see https://github.com/ria-ee/X-Road/pull/78