I noticed this issue while upgrading from 4.0.8 to 6.0.2, and it looks like it's a consequence of commit 2270056aa8d2a24c195606d55c955e38f4d50870.
The way RoundRobinServerSet picks a server is balanced when all servers are available, but can end up with significant bias when a server fails and is added to the blacklist.
Suppose we have a server set with entries:
ldap01.zone1.example.com
ldap02.zone1.example.com
ldap03.zone2.example.com
ldap04.zone2.example.com
4 servers, split across 2 availability zones.
When all 4 are available, RoundRobinServerSet will loop through them evenly and distribute the load.
But if the network connection to Zone 1 drops, then the first 2 servers become unavailable, and the response from getConnection will be (heavily) biased towards the first entry after the failed servers (in our example, ldap03).
This is because the nextSlot field advances by one on every request regardless of availability, and the process by which RoundRobinServerSet looks for an alternative server always walks forward to the next available server. Consequently, in our example:
The first request to getConnection() will select slot 0 and try to connect to ldap01. That fails, so it tries slot 1 / ldap02, which also fails. It tries slot 2 / ldap03 and that works, so it returns a connection to ldap03.
The next request will select slot 1, which fails, so it also moves on to slot 2 / ldap03.
The next request will select slot 2 / ldap03, which works.
The next request selects slot 3 / ldap04, which works.
Then we're back to slot 0 again.
The result is that 75% of connections use ldap03 and only 25% use ldap04.
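To make the 75/25 split concrete, here is a minimal simulation of the selection loop as described above. It is a simplified model, not the SDK's actual code: the class name, the `countSelections` helper and the `boolean[]` standing in for "can we connect to this address?" are all made up for illustration, and the blacklist is ignored.

```java
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicLong;

// Simplified model of the selection behaviour described in this issue.
// All names here are hypothetical; this is not the SDK's implementation.
class RoundRobinBiasDemo {

    // available[i] == true means a connection to address i would succeed.
    static int[] countSelections(boolean[] available, int requests) {
        int[] hits = new int[available.length];
        AtomicLong nextSlot = new AtomicLong(0);
        for (int r = 0; r < requests; r++) {
            // nextSlot advances by one per request, ignoring availability.
            int initialSlot = (int) (nextSlot.getAndIncrement() % available.length);
            // Walk forward from the initial slot to the first reachable server.
            for (int i = 0; i < available.length; i++) {
                int slot = (initialSlot + i) % available.length;
                if (available[slot]) {
                    hits[slot]++;
                    break;
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // ldap01 and ldap02 (Zone 1) are down, ldap03 and ldap04 are up.
        boolean[] zone1Down = {false, false, true, true};
        // Prints [0, 0, 300, 100]
        System.out.println(Arrays.toString(countSelections(zone1Down, 400)));
    }
}
```

Three of every four starting slots (0, 1 and 2) end up on ldap03, which is where the 75% comes from.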
This isn't simple to fix, given the desire for high concurrency (per 2270056aa8d2a24c195606d55c955e38f4d50870) but one option would be to implement the following algorithm:
If a connection succeeds on a slot where slotNumber != initialSlotNumber,
and nextSlot currently lies between startSlot and slotNumber (using a ring calculation), set nextSlot to (slotNumber+1) % addresses.length
That should succeed in skipping a run of failed servers without a significant impact on concurrency.
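A rough sketch of what I mean, using the same simplified model as the rest of this issue rather than the real class. Everything here is an assumption for illustration: the class and method names, the `boolean[]` availability stand-in, and the choice of `AtomicInteger.updateAndGet` to apply the adjustment without a lock. It also treats the starting slot of the request as the start of the ring segment.

```java
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the proposed adjustment; not the SDK's actual code.
class SkippingRoundRobinSketch {
    private final boolean[] available; // stand-in for "can we connect to this address?"
    private final AtomicInteger nextSlot = new AtomicInteger(0);

    SkippingRoundRobinSketch(boolean[] available) {
        this.available = available.clone();
    }

    // True if cur lies on the ring segment running forward from start to end (inclusive).
    static boolean inRing(int cur, int start, int end, int n) {
        return Math.floorMod(cur - start, n) <= Math.floorMod(end - start, n);
    }

    // Returns the index of the selected server, or -1 if none is reachable.
    int select() {
        final int n = available.length;
        final int initialSlot = nextSlot.getAndUpdate(s -> (s + 1) % n);
        for (int i = 0; i < n; i++) {
            final int slot = (initialSlot + i) % n;
            if (!available[slot]) {
                continue;
            }
            if (slot != initialSlot) {
                // We skipped a run of failed servers. If nextSlot still points
                // into that run, move it just past the slot that worked.
                final int target = (slot + 1) % n;
                nextSlot.updateAndGet(cur -> inRing(cur, initialSlot, slot, n) ? target : cur);
            }
            return slot;
        }
        return -1;
    }

    static int[] countSelections(boolean[] available, int requests) {
        SkippingRoundRobinSketch set = new SkippingRoundRobinSketch(available);
        int[] hits = new int[available.length];
        for (int r = 0; r < requests; r++) {
            int slot = set.select();
            if (slot >= 0) {
                hits[slot]++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Same scenario as above: Zone 1 down. Called sequentially, prints [0, 0, 200, 200]
        System.out.println(Arrays.toString(
                countSelections(new boolean[] {false, false, true, true}, 400)));
    }
}
```

In this model the extra work only happens on requests that had to skip at least one failed server, and it is a single atomic update, so the common all-servers-up path is unchanged. Under real concurrency the split would not be exactly even, but it should stay far closer to even than 75/25.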
I'd be happy to pull together a patch, but I take it you don't want that.