mongodb / specifications

Specifications related to MongoDB
http://specifications.readthedocs.io/en/latest
Other
389 stars 242 forks source link

DRIVERS-2707 Relax the CMAP documentation requirement for durations in events #1455

Closed stIncMale closed 1 year ago

stIncMale commented 1 year ago

DRIVERS-2707

@baileympearson, @matthewdale I added you as reviewers because you approved the PR where I introduced the wording that this PR changes.

Please complete the following before merging:

stIncMale commented 1 year ago

There were conversations in Slack related to this PR, but no consensus was achieved. @baileympearson, we need to move forward here somehow.

I am fine with either excluding from duration the time to synchronously deliver ConnectionCheckOutStartedEvent/ConnectionCreatedEvent, which is the current behavior, or including it. But I recognize that duration and waitQueueTimeoutMS are related: if we don't start counting time towards them at the same instant, we expose ourselves to behaviors that are surprising to users. Following are two examples with large synchronous delivery time spans to make the surprising behaviors apparent.

1.a. EXCLUDE the time to synchronously deliver ConnectionCheckOutStartedEvent from duration. Start counting time towards waitQueueTimeoutMS BEFORE emitting ConnectionCheckOutStartedEvent.

  1. Start counting time towards waitQueueTimeoutMS, which is 5 ms.
  2. Emit and synchronously deliver ConnectionCheckOutStartedEvent, which takes 100 ms.
  3. Start counting time towards ConnectionCheckOutFailedEvent.duration.
  4. Fail the checkout because of the timeout (100 ms > 5 ms).
  5. Measure duration since step 3, which is 1 ms. Emit ConnectionCheckOutFailedEvent with duration 1 ms.
  6. User observes a checkout failed due to waitQueueTimeoutMS (5 ms), but ConnectionCheckOutFailedEvent.duration (1ms) tells a user that there were plenty of time left to wait for an available connection (5 ms - 1 ms = 4 ms).

2.a. INCLUDE the time to synchronously deliver ConnectionCheckOutStartedEvent to duration. Start counting time towards waitQueueTimeoutMS AFTER emitting ConnectionCheckOutStartedEvent.

  1. Start counting time towards ConnectionCheckedOutEvent.duration.
  2. Emit and synchronously deliver ConnectionCheckOutStartedEvent, which takes 100 ms.
  3. Start counting time towards waitQueueTimeoutMS, which is 5 ms.
  4. Discover that there is an available connection.
  5. Measure duration since step 3, which is 1 ms. Emit ConnectionCheckedOutEvent with duration 100 ms + 1 ms = 101 ms.
  6. User observes a checkout succeeded with duration 101 ms despite waitQueueTimeoutMS being 5ms, and no new connections having been established. From this observation the user makes a conclusion that a driver did not respect waitQueueTimeoutMS.

While it is true that a driver does not provide hard real-time guarantees, the driver is still supposed to do its best to try and meet the specified timeouts. In these examples the driver failed to do its best, because there are ways to implement the functionality such that the user is spared from observing confusing data in executions as specified above.

The below options neither have the issues as in the examples above, nor do they have any new issues compared to those examples. Obviously, a driver still does not provide hard real-time guarantees, and a noticeable delay may happen at any point in execution, not only in user code when we synchronously deliver events. But a driver still must do its best, and the options below are better than the options chosen in the examples above:

1.b. EXCLUDE the time to synchronously deliver ConnectionCheckOutStartedEvent from duration. Start counting time towards waitQueueTimeoutMS AFTER emitting ConnectionCheckOutStartedEvent (this is the current state of this PR).

2.b. INCLUDE the time to synchronously deliver ConnectionCheckOutStartedEvent to duration. Start counting time towards waitQueueTimeoutMS BEFORE emitting ConnectionCheckOutStartedEvent.

Both of these options allow a driver to start counting time towards duration and waitQueueTimeoutMS at the same instant, and the spec will require this behavior if we choose either option 1.b or 2.b.

baileympearson commented 1 year ago

@stIncMale

As a side note - I don't think it's correct that duration and waitQueueTimeoutMS are related and that they must be aligned. It's already the case that no matter what we do with event delivery, waitQueueTimeoutMS and duration can have very different values. waitQueueTimeoutMS should not govern connection establishment, and consequently in drivers that establish connections synchronously when performing connection checkout ConnectionCheckoutSucceeded.duration could be significantly greater than waitQueueTimeoutMS. This is irrespective of synchronous / asynchronous event processing.


To answer your question from slack (which is in reference to my suggestion that duration be the difference between construction times of ConnectionCheckoutStartedEvents and ConnectionCheckoutSucceeded/failedEvents):

The above means that the duration of synchronous delivery of ConnectionCheckoutStarted is counted towards ConnectionCheckoutSucceeded.duration (option 1). Could you elaborate how the above avoids the discussion around waitQueueTimeoutMS?

I interpret the intention behind DRIVERS-2469 as: users have requested a way to determine how long connection checkout takes so they can optimize their application's consumption of driver connection pools. In drivers where events are emitted synchronously, this can impact pool performance. So, any synchronous event process definitely should be included in the new measurement.

Standardizing and mandating how drivers handle synchronous emissions of events is bad because:

The simplest way to measure that duration for all drivers is to take the difference between the when drivers construct ConnectionCheckoutStartedEvent and when drivers construct ConnectionCheckoutSucceededEvent.

This doesn't clarify in the spec whether or not the durations of synchronous event emissions should be included or not. However, it does

This also closely matches my intuition of "matching" corresponding checkout started and checkout succeeded events and subtracting timestamps, which I think is how many users might view the new "duration" property as well. And specifically for Node, we actually have a timestamp on every pool event, so a user might actually be confused as to why the duration property is significantly different from ConnectionCheckoutSucceeded - ConnectionCheckoutStarted.


Given the options outlined in proceeding comment, I strongly lean towards option 1.b. If we do commit to behavior here, we should test it (probably with an optional prose test for drivers that emit events synchronously).

1.b. EXCLUDE the time to synchronously deliver ConnectionCheckOutStartedEvent from duration. Start counting time towards waitQueueTimeoutMS AFTER emitting ConnectionCheckOutStartedEvent (this is the current state of this PR).