DRIVERS-2170 Server info on retryable errors must reflect the originating server

jamis commented 10 months ago

Some drivers (e.g. Ruby) attach server information to certain errors for diagnostic purposes. This PR updates the language in the retryable reads and writes specifications to require that these drivers ensure that the server information that is associated with an exception reflects the server that actually originated the exception, even in the presence of one more more retries.

Please complete the following before merging:

[x] Update changelog.
[x] Make sure there are generated JSON files from the YAML test files.
[ ] Test changes in at least one language driver.
[ ] Test these changes against all server versions and topologies (including standalone, replica set, sharded clusters, and serverless).

NOTE: Tests for this situation (server information associated with exceptions during a retry) have proven challenging to write robustly. TL;DR -- I don't have a good solution for it, and currently cannot claim to have reliably tested this, though the code in the Ruby driver has been present since mid-2020.

I've found the most success with subscribing to command events, waiting for (e.g.) a 'find' event, and then forcing the primary to step down (per https://www.mongodb.com/docs/manual/tutorial/force-member-to-be-primary/) so that the retry hits a different server. I also capture logging output in the test, so I can then compare a message logged for the first exception (with the first server) with the message of the exception raised when the retry fails, and confirm that the servers differ. This mostly works, but (1) it only works for replica sets, and (2) it seems to fail non-deterministically, apparently due to some race condition associated with the setting of a failpoint.

kkloberdanz commented 10 months ago

Hi Jamis, is there a succinct snippet from the Ruby driver that implements this behavior that you can share for this PR? Would a test for this behavior be feasible? I'm much more inclined to having testable behavior in a specification.

jamis commented 10 months ago

Hi Jamis, is there a succinct snippet from the Ruby driver that implements this behavior that you can share for this PR? Would a test for this behavior be feasible? I'm much more inclined to having testable behavior in a specification.

Hey Kyle, thanks for taking a look at the PR. The following example from the Ruby driver is taken from code that works with the underlying network socket, and replaces network errors with a driver-specific exception class, and annotates them with (among other things) the server address:

rescue Errno::ETIMEDOUT => e
  raise Error::SocketTimeoutError, "#{e.class}: #{e} (for #{human_address})"
rescue IOError, SystemCallError => e
  raise Error::SocketError, "#{e.class}: #{e} (for #{human_address})"
rescue OpenSSL::SSL::SSLError => e
  raise Error::SocketError, "#{e.class}: #{e} (for #{human_address})"

I share your reluctance to include a spec update without tests, but:

This change only applies to the small subset of drivers that are already annotating errors with server information, and
These errors may originate from a variety of different sources in the code--from network errors, to server errors--and what is more, this spec change only applies to those errors that arise as a result of a second error during a retry, where the retry was performed on a different server than the first attempt. This specific sequence of events has proven particularly challenging to reproduce in a test for even a single exception, let alone the full set of exceptions that this behavior might apply to.

Ultimately, I'm not sure that adding mandated tests to the spec for this behavior can or should be done in a way that excludes the drivers for which this does not apply, and manages to cover the potential errors that this relates to. I'm open to discussing it, though, if you have thoughts on how this ought to be done. (This is my first rodeo with writing a spec change, so I'm sure there's plenty I could be doing better, here!)

mongodb / specifications

DRIVERS-2170 Server info on retryable errors must reflect the originating server #1480