Implement Server Discovery And Monitoring Spec (SDAM)

In short, this issue proposes creating something like MongoClient that provides features to better support high availability with replica set setups. Below is an introduction to what SDAM is and what this MongoClient should do.

SDAM

The Server Discovery And Monitoring Spec (SDAM), presented as a story in a blog post by Jesse Davis (2015), suggests a common behavior for all MongoDB drivers. Davis expanded on the topic some months later in a 40-minute talk.

In the spec, the Mongo client is instantiated with a seed list, which is the initial list of server addresses. The client "pings" the seed addresses to discover server data (the "topology" of servers and the replica set setup). Each "ping" is an 'ismaster' command, which is very cheap for servers; from the replies, the client eventually discovers more servers. Davis' definition:

The seed list is the stepping-off point for the driver's journey of discovery. As long as one seed is actually an available replica set member, the driver will discover the whole set and stay connected to it indefinitely, as described below. Even if every member of the set is replaced with a new host, like the Ship of Theseus, it is still the same replica set and the driver remains connected to it.
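To make the discovery step concrete, here is a minimal sketch in Python (not Pharo, and not mongotalk code). It assumes that each 'ismaster' reply carries a `hosts` list and an `ismaster` flag, and it replaces the network with a hypothetical in-memory table (`FAKE_REPLIES`, `fake_ismaster`). The type names `RSPrimary`, `RSSecondary`, and `Unknown` follow the SDAM spec; everything else is invented for illustration.

```python
from collections import deque

# Hypothetical in-memory "network": address -> simplified ismaster-style reply.
# Only "a:27017" is used as a seed; the other members are discovered from replies.
FAKE_REPLIES = {
    "a:27017": {"ismaster": False, "secondary": True, "setName": "rs0",
                "hosts": ["a:27017", "b:27017", "c:27017"]},
    "b:27017": {"ismaster": True, "secondary": False, "setName": "rs0",
                "hosts": ["a:27017", "b:27017", "c:27017"]},
    "c:27017": {"ismaster": False, "secondary": True, "setName": "rs0",
                "hosts": ["a:27017", "b:27017", "c:27017"]},
}


def fake_ismaster(address):
    """Stand-in for sending 'ismaster' to a server; None means unreachable."""
    return FAKE_REPLIES.get(address)


def discover(seeds, ping=fake_ismaster):
    """Return {address: server_type} for every server found from the seeds."""
    topology = {}
    pending = deque(seeds)
    while pending:
        address = pending.popleft()
        if address in topology:
            continue  # already pinged
        reply = ping(address)
        if reply is None:
            topology[address] = "Unknown"
            continue
        topology[address] = "RSPrimary" if reply["ismaster"] else "RSSecondary"
        # Every host named in the reply becomes a candidate to ping next.
        pending.extend(h for h in reply["hosts"] if h not in topology)
    return topology
```

Calling `discover(["a:27017"])` finds all three members even though only one was in the seed list. A real implementation would also have to handle set name mismatches, removed hosts, and the other server types, which this sketch ignores.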

The spec then considers three kinds of implementation: single-threaded (Perl), multi-threaded (Python, Java, Ruby, C#), and hybrid (C), which mixes the two approaches.

Davis describes 3 main states for clients:

Initial state: Ping each seed address once to get the first topology. Move to Steady state once all answers have arrived. (Note: it is not clear to me how this differs from Crisis state.)
Steady state: Ping servers every 10 seconds to update their data and keep track of latency. When there is a failure, the exception is raised to the application and the client moves to Crisis state.
Crisis state: Enqueue incoming commands and ping all known servers every 0.5 seconds until a primary is found. Then return to Steady state.
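The three states can be read as a small state machine. The following Python sketch is only an illustration under that reading: the class `Monitor` and its method names are hypothetical, and the only values taken from the description above are the 10 second and 0.5 second intervals and the transitions between states.

```python
from enum import Enum


class State(Enum):
    INITIAL = "initial"
    STEADY = "steady"
    CRISIS = "crisis"


# Ping intervals from the description above, in seconds.
STEADY_INTERVAL = 10.0
CRISIS_INTERVAL = 0.5


class Monitor:
    """Hypothetical state holder; it performs no network I/O itself."""

    def __init__(self, seeds):
        self.state = State.INITIAL
        self.pending_seeds = set(seeds)  # seeds that have not answered yet
        self.queued = []                 # commands held back during Crisis

    def on_seed_reply(self, address):
        self.pending_seeds.discard(address)
        if self.state is State.INITIAL and not self.pending_seeds:
            self.state = State.STEADY

    def on_failure(self):
        # The caller is expected to raise the error to the application too.
        if self.state is State.STEADY:
            self.state = State.CRISIS

    def on_primary_found(self):
        """Leave Crisis and hand back the commands that were waiting."""
        if self.state is not State.CRISIS:
            return []
        self.state = State.STEADY
        ready, self.queued = self.queued, []
        return ready

    def submit(self, command):
        """Return True if the command may run now, False if it was enqueued."""
        if self.state is State.CRISIS:
            self.queued.append(command)
            return False
        return True

    def ping_interval(self):
        """Seconds between pings; None in Initial state (each seed pinged once)."""
        if self.state is State.CRISIS:
            return CRISIS_INTERVAL
        if self.state is State.STEADY:
            return STEADY_INTERVAL
        return None
```

Keeping the transitions separate from the I/O is one possible design choice: the same logic could then be driven by a background process per server or by a single-threaded loop.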

Two remarks are important for applications that use a client:

Non-blocking client construction: The client must return immediately when created instead of waiting for connections to open. Connections may be opened either right after client creation in background threads, or "on demand" when the first database operation is performed.
Error handling: In the talk (minute 29), the speaker recommends that applications retry once after a connection failure and notify the user only if that retry also fails. Because the retry is enqueued in the Crisis state and executed after recovery, a second retry is very unlikely to succeed where the first did not. The talk gives a duplicate key handling example at minute 30.
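The "retry once" advice could look roughly like this on the application side. This is a Python sketch with hypothetical names: `ConnectionFailure` stands in for whatever connection error the driver raises, and `retry_once` is not part of any existing API.

```python
class ConnectionFailure(Exception):
    """Hypothetical stand-in for the driver's connection error."""


def retry_once(operation):
    """Run operation(); on a connection failure, try exactly one more time.

    The second attempt is assumed to wait inside the driver (Crisis state)
    until a primary is found, so if it fails too the error is propagated.
    """
    try:
        return operation()
    except ConnectionFailure:
        return operation()
```

Note that blindly retrying is only safe for idempotent operations; an insert that is retried may hit a duplicate key error if the first attempt actually reached the server, which is the case the talk's example covers.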

Server Selection

When a client in steady state receives a read command with primaryPreferred as the read preference and the primary is not available, it may have several candidate servers on which to execute the command. The Server Selection specification proposes a server selection algorithm that delivers on three goals: being predictable, resilient, and low-latency. Golden's blog post also describes this selection in detail.

Users will be able to control how long server selection is allowed to take with the serverSelectionTimeoutMS configuration variable and control the size of the acceptable latency window with the localThresholdMS configuration variable.
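As a rough illustration of how the two settings interact, here is a simplified Python sketch of selection: filter by read preference, keep only servers whose round-trip time is within localThresholdMS of the fastest candidate, pick one at random, and give up after serverSelectionTimeoutMS. The server dictionaries, the `rtt_ms` field, `get_servers`, and `retry_delay_s` are assumptions made for this example; the defaults of 15 ms and 30000 ms are the ones I understand the spec to use. Tag sets and staleness are left out.

```python
import random
import time

LOCAL_THRESHOLD_MS = 15               # assumed default latency window
SERVER_SELECTION_TIMEOUT_MS = 30_000  # assumed default selection timeout


def suitable_servers(servers, read_preference):
    """Filter servers by read preference mode (tag sets are ignored here)."""
    primaries = [s for s in servers if s["type"] == "RSPrimary"]
    secondaries = [s for s in servers if s["type"] == "RSSecondary"]
    if read_preference == "primary":
        return primaries
    if read_preference == "primaryPreferred":
        return primaries or secondaries
    if read_preference == "secondary":
        return secondaries
    if read_preference == "secondaryPreferred":
        return secondaries or primaries
    if read_preference == "nearest":
        return primaries + secondaries
    raise ValueError(f"unknown read preference: {read_preference}")


def within_latency_window(candidates, local_threshold_ms=LOCAL_THRESHOLD_MS):
    """Keep candidates whose RTT is within the threshold of the fastest one."""
    if not candidates:
        return []
    fastest = min(s["rtt_ms"] for s in candidates)
    return [s for s in candidates if s["rtt_ms"] <= fastest + local_threshold_ms]


def select_server(get_servers, read_preference,
                  timeout_ms=SERVER_SELECTION_TIMEOUT_MS,
                  local_threshold_ms=LOCAL_THRESHOLD_MS,
                  retry_delay_s=0.01, rng=random):
    """Pick a random server from the latency window, or raise TimeoutError."""
    deadline = time.monotonic() + timeout_ms / 1000.0
    while True:
        candidates = suitable_servers(get_servers(), read_preference)
        window = within_latency_window(candidates, local_threshold_ms)
        if window:
            return rng.choice(window)
        if time.monotonic() >= deadline:
            raise TimeoutError("no suitable server found before the timeout")
        time.sleep(retry_delay_s)  # a real client would wait for a topology update
```

With secondaries at 5 ms, 12 ms, and 40 ms and no primary, a primaryPreferred read would go to one of the first two, since 40 ms falls outside the 5 + 15 ms window.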
