datastax 2.17.1: Deadlock in BasicsTests.Integration_Cassandra_FutureCallback* tests

k0machi commented 1 year ago

Regression in 4a0e4f78 of datastax cpp-driver, supposedly fixes following issue: CPP-987.

HOWEVER, it introduces a deadlock (at least in this test case, but it seems critical):

The test:

static void on_future_callback_connect_close(CassFuture* future, void* data) {
  bool* is_success = static_cast<bool*>(data);
  *is_success = cass_future_error_code(future) == CASS_OK;
}

/**
 * Verify a future callback is called when connecting a session.
 *
 * @expected_result The flag is set correctly inside the connect future callback.
 */
CASSANDRA_INTEGRATION_TEST_F(BasicsTests, FutureCallbackConnect) {
  CHECK_FAILURE;

  Session session;
  Future future = default_cluster().connect_async(session);

  bool is_success = false;
  cass_future_set_callback(future.get(), on_future_callback_connect_close, &is_success);

  future.wait();

  EXPECT_TRUE(is_success);
}

Is very simple, create a callback on a session when it connects. Now, the commit changes the following file:

diff --git a/src/future.cpp b/src/future.cpp
index 1d4ad534..9b882add 100644
--- a/src/future.cpp
+++ b/src/future.cpp
@@ -189,7 +189,6 @@ bool Future::set_callback(Future::Callback callback, void* data) {
 }

 void Future::internal_set(ScopedMutex& lock) {
-  is_set_ = true;
   if (callback_) {
     Callback callback = callback_;
     void* data = data_;
@@ -197,6 +196,11 @@ void Future::internal_set(ScopedMutex& lock) {
     callback(CassFuture::to(this), data);
     lock.lock();
   }
+
+  // CPP-987 Set this after the callbacks run to avoid unexpected exits
+  // from wait ops due to spurious wakeups
+  is_set_ = true;
+
   // Broadcast after we've run the callback so that threads waiting
   // on this future see the side effects of the callback.
   uv_cond_broadcast(&cond_);

Now, here's the deadlock:

The future is created and assigned a callback.
The future is waited by the test case.
- The wait is achieved by a call to cass_future_error_code which in turn calls Future::error which then calls Future::internal_wait that waits on is_set_
The future completes and Future::internal_set is called, which then proceeds to call the callback.
The callback calls cass_future_error_code, completing the deadlock as it now waits forever on is_set_.
With this commit, is_set_ is supposed to be set AFTER callbacks are processed, yet the callback can call public (I am assuming) API to check the status of the future, causing a deadlock to occur.

k0machi commented 1 year ago

Reading upstream issue list:

Bret McGuire 
added a Comment
24 October 2023 at 18:11

Thanks for the follow-up @Carl Colena .  I’ll readily admit I’ve been going back and forth on exactly the question you raise, but after thinking about it a fair amount I’m inclined to leave the change in despite the deadlock possibility.  My rationale is that the behaviour you highlight (no enforcement of callbacks being guaranteed to complete before we return from a cass_future_wait()) is significant enough to warrant any risk because:

There’s pretty clearly an implicit assumption that those callbacks will have completed before we return.  There’s no explicit statement of such a guarantee so I suppose you could argue this is more implied than explicit but it seems like an entirely reasonable assumption for users of the API to make.

The error case you cite can occur with normal usage of the futures API if a callback happens to take too long (or system conditions otherwise trigger a spurious wakeup).  You don’t have to do something we recommend against (such as calling cass_future_wait() in a callback) to trigger the error case; it could potentially apply to anyone anywhere.

On the other side: the vastly increased chance/near certainty of deadlock for callers of cass_future_wait() in a callback is certainly a non-trivial issue.  In the future I’d like to find a way to mitigate this as well, but it’s worth noting that we have an [explicit warning against exactly this use case](https://docs.datastax.com/en/developer/cpp-driver/2.17/api/struct.CassFuture/#function-cass_future_wait) in the docs (as you mention).  I’m hard-pressed to say we should mitigate this concern by keeping the behaviours described above.

I’d also argue that a prohibition against calling some kind of wait() function in a callback isn’t a unique constraint of the C/C++ driver.  Similar systems often don’t expose the future object directly (thus avoiding the issue entirely) but in cases where they do it’s broadly known that a wait() op in a callback is a very bad idea (for reasons similar to what we’re talking about here).  So our advice to not use wait() should seem very familiar to users of async APIs.

It would be nice if we could come up with an alternate approach which might address both issues, but the issue described in this ticket is significant enough that I don’t want to leave it out of the next release (which is coming soon).  I’m happy to revisit the approach we’ve taken here… honestly, there’s possibly an argument for changing the futures API to not even pass the future itself around but just to return values/errors instead, but that’s more speculative.  And I’ll definitely include a warning around this change in behaviour when I send out the announcement of the next release.  But I’m worried enough about the callback problem that I don’t want to keep this fix out.

I guess we'll ignore the tests in question and see if it breaks anything else...

fruch commented 1 year ago

Reading upstream issue list:

Bret McGuire 
added a Comment
24 October 2023 at 18:11

Thanks for the follow-up @Carl Colena .  I’ll readily admit I’ve been going back and forth on exactly the question you raise, but after thinking about it a fair amount I’m inclined to leave the change in despite the deadlock possibility.  My rationale is that the behaviour you highlight (no enforcement of callbacks being guaranteed to complete before we return from a cass_future_wait()) is significant enough to warrant any risk because:

There’s pretty clearly an implicit assumption that those callbacks will have completed before we return.  There’s no explicit statement of such a guarantee so I suppose you could argue this is more implied than explicit but it seems like an entirely reasonable assumption for users of the API to make.

The error case you cite can occur with normal usage of the futures API if a callback happens to take too long (or system conditions otherwise trigger a spurious wakeup).  You don’t have to do something we recommend against (such as calling cass_future_wait() in a callback) to trigger the error case; it could potentially apply to anyone anywhere.

On the other side: the vastly increased chance/near certainty of deadlock for callers of cass_future_wait() in a callback is certainly a non-trivial issue.  In the future I’d like to find a way to mitigate this as well, but it’s worth noting that we have an [explicit warning against exactly this use case](https://docs.datastax.com/en/developer/cpp-driver/2.17/api/struct.CassFuture/#function-cass_future_wait) in the docs (as you mention).  I’m hard-pressed to say we should mitigate this concern by keeping the behaviours described above.

I’d also argue that a prohibition against calling some kind of wait() function in a callback isn’t a unique constraint of the C/C++ driver.  Similar systems often don’t expose the future object directly (thus avoiding the issue entirely) but in cases where they do it’s broadly known that a wait() op in a callback is a very bad idea (for reasons similar to what we’re talking about here).  So our advice to not use wait() should seem very familiar to users of async APIs.

It would be nice if we could come up with an alternate approach which might address both issues, but the issue described in this ticket is significant enough that I don’t want to leave it out of the next release (which is coming soon).  I’m happy to revisit the approach we’ve taken here… honestly, there’s possibly an argument for changing the futures API to not even pass the future itself around but just to return values/errors instead, but that’s more speculative.  And I’ll definitely include a warning around this change in behaviour when I send out the announcement of the next release.  But I’m worried enough about the callback problem that I don’t want to keep this fix out.

I guess we'll ignore the tests in question and see if it breaks anything else...

Please give a link to the source you are quoting from, if it's an issue, we should comment there with our finding that is 100% failing with scylla

k0machi commented 1 year ago

Please give a link to the source you are quoting from, if it's an issue, we should comment there with our finding that is 100% failing with scylla

It's from the issue I've linked - CPP-987. They seem to know that there's a deadlock possibility, but I am not sure if they find this behaviour expected. Will add my comment.

scylladb / scylla-cpp-driver-matrix

datastax 2.17.1: Deadlock in BasicsTests.Integration_Cassandra_FutureCallback* tests #16