pharo-nosql / voyage

Voyage is an object persistence abstraction layer for Pharo.
MIT License

Review error scenario of full network outage (no replica can be reached) #116

Open zecke opened 5 years ago

zecke commented 5 years ago

Several aspects seem to need review and better understanding. This might be made obsolete by implementing MongoDB's SDAM (Server Discovery and Monitoring) specification.

  1. VOMongoRepository class >> #defaultReplicaSetLookupTimeout vs. Socket standardTimeout: when trying to find a new primary, VOMongoReplicationUrlResolver>>#lookupReplicaSetStatusIfFound: may only get to try a single candidate because of the large socket timeout (Mongo>>#open uses the default socket timeout of 45s). Maybe it makes sense to pick from knownMongoUrls atRandom to avoid always getting stuck on the first entry!

  2. Do not (attempt to) populate the session pool when there is no primary: if >>#basicReset fails, primaryMongoUrl should be left unassigned rather than set to the first URL.

  3. Do not execute VOMongoReplicationUrlResolver>>#reset concurrently. Each VOExecutionStrategy calls >>#reset on failure and the last one wins; these calls should be serialized. The observable effect is that after a change of primary it takes some time for Voyage to "warm up" again to the previous transaction speed, which shouldn't be the case.
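A minimal sketch of point 1, assuming the resolver keeps its candidate servers in `knownMongoUrls` (`#shuffled` and `ConnectionTimedOut` are standard Pharo; `statusFromUrl:` is a hypothetical helper standing in for the per-URL probe):

```smalltalk
"Sketch only: try the candidates in random order instead of blocking
 ~45s on the first entry, moving on when a connection times out.
 How a shorter timeout reaches Mongo>>#open is left as an assumption."
lookupReplicaSetStatusIfFound: aBlock
	knownMongoUrls shuffled do: [ :url |
		[ ^ aBlock value: (self statusFromUrl: url) ]
			on: ConnectionTimedOut
			do: [ :e | "unreachable; try the next candidate" ] ].
	^ nil
```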
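Point 3 could be addressed by serializing >>#reset behind a Mutex with a generation counter, so strategies that lost the race skip the redundant reset (a sketch; `resetMutex` and `generation` are assumed instance variables, `Mutex>>#critical:` is standard Pharo):

```smalltalk
"Serialize concurrent resets: only the first caller of a given
 generation actually rebuilds the pool; later callers that observed
 the same failure return without resetting again."
resetIfStillGeneration: aNumber
	resetMutex critical: [
		generation = aNumber ifTrue: [
			self basicReset.
			generation := generation + 1 ] ]
```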

zecke commented 5 years ago

Another issue is that some of the pooled Mongo instances are not closed, which leads to >>#populate raising an exception because external semaphores can no longer be registered.
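The leaked external semaphores suggest the flushed connections are dropped without being closed. A sketch of a flush that closes each pooled Mongo instance before discarding the pool (the `pool` variable and its shape are assumptions):

```smalltalk
"Close every pooled connection before discarding the pool, so the
 underlying sockets release their external semaphores instead of
 leaking until the instances are garbage collected."
flush
	pool do: [ :mongo |
		[ mongo close ]
			on: Error
			do: [ :e | "connection already dead; ignore" ] ].
	pool := OrderedCollection new
```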

zecke commented 5 years ago

With a session pool of at most 10 entries I end up with 101 open Mongo connections during a stepdown from the primary to the new primary.

This seems to be due to a lack of concurrency control in VOMongoRepositoryResolver. When a query fails, >>#resetPool or just >>#reset is invoked; in the latter the existing pool is flushed and a new one is created. There are three race conditions.

  1. >>#flush and >>#populate are run concurrently. New connections are added to a discarded pool after it has been flushed. This can be mitigated by setting the desired pool size to 0 before calling flush; that does seem to fix the issue I see.

  2. With multiple calls to >>#reset at the same time we can populate/discard pools that we have not seen yet. There should be some locking around the pool...
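The first mitigation above can be sketched as follows (the `poolSize:`, `defaultPoolSize`, and `populate` selectors are assumed names for the resolver's pool-management methods):

```smalltalk
"Prevent late >>#populate calls from refilling a pool that is being
 discarded: drop the target size to 0 first, then flush, then restore
 the size and rebuild against the new primary."
resetPool
	self poolSize: 0.                  "populate now adds nothing"
	self flush.                        "close and discard existing connections"
	self poolSize: self defaultPoolSize.
	self populate
```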