Open k0machi opened 1 year ago
Reading upstream issue list:
Bret McGuire
added a Comment
24 October 2023 at 18:11
Thanks for the follow-up @Carl Colena . I’ll readily admit I’ve been going back and forth on exactly the question you raise, but after thinking about it a fair amount I’m inclined to leave the change in despite the deadlock possibility. My rationale is that the behaviour you highlight (no enforcement of callbacks being guaranteed to complete before we return from a cass_future_wait()) is significant enough to warrant any risk because:
There’s pretty clearly an implicit assumption that those callbacks will have completed before we return. There’s no explicit statement of such a guarantee so I suppose you could argue this is more implied than explicit but it seems like an entirely reasonable assumption for users of the API to make.
The error case you cite can occur with normal usage of the futures API if a callback happens to take too long (or system conditions otherwise trigger a spurious wakeup). You don’t have to do something we recommend against (such as calling cass_future_wait() in a callback) to trigger the error case; it could potentially apply to anyone anywhere.
On the other side: the vastly increased chance/near certainty of deadlock for callers of cass_future_wait() in a callback is certainly a non-trivial issue. In the future I’d like to find a way to mitigate this as well, but it’s worth noting that we have an [explicit warning against exactly this use case](https://docs.datastax.com/en/developer/cpp-driver/2.17/api/struct.CassFuture/#function-cass_future_wait) in the docs (as you mention). I’m hard-pressed to say we should mitigate this concern by keeping the behaviours described above.
I’d also argue that a prohibition against calling some kind of wait() function in a callback isn’t a unique constraint of the C/C++ driver. Similar systems often don’t expose the future object directly (thus avoiding the issue entirely) but in cases where they do it’s broadly known that a wait() op in a callback is a very bad idea (for reasons similar to what we’re talking about here). So our advice to not use wait() should seem very familiar to users of async APIs.
It would be nice if we could come up with an alternate approach which might address both issues, but the issue described in this ticket is significant enough that I don’t want to leave it out of the next release (which is coming soon). I’m happy to revisit the approach we’ve taken here… honestly, there’s possibly an argument for changing the futures API to not even pass the future itself around but just to return values/errors instead, but that’s more speculative. And I’ll definitely include a warning around this change in behaviour when I send out the announcement of the next release. But I’m worried enough about the callback problem that I don’t want to keep this fix out.
I guess we'll ignore the tests in question and see if it breaks anything else...
Reading upstream issue list:
Bret McGuire added a Comment 24 October 2023 at 18:11 Thanks for the follow-up @Carl Colena . I’ll readily admit I’ve been going back and forth on exactly the question you raise, but after thinking about it a fair amount I’m inclined to leave the change in despite the deadlock possibility. My rationale is that the behaviour you highlight (no enforcement of callbacks being guaranteed to complete before we return from a cass_future_wait()) is significant enough to warrant any risk because: There’s pretty clearly an implicit assumption that those callbacks will have completed before we return. There’s no explicit statement of such a guarantee so I suppose you could argue this is more implied than explicit but it seems like an entirely reasonable assumption for users of the API to make. The error case you cite can occur with normal usage of the futures API if a callback happens to take too long (or system conditions otherwise trigger a spurious wakeup). You don’t have to do something we recommend against (such as calling cass_future_wait() in a callback) to trigger the error case; it could potentially apply to anyone anywhere. On the other side: the vastly increased chance/near certainty of deadlock for callers of cass_future_wait() in a callback is certainly a non-trivial issue. In the future I’d like to find a way to mitigate this as well, but it’s worth noting that we have an [explicit warning against exactly this use case](https://docs.datastax.com/en/developer/cpp-driver/2.17/api/struct.CassFuture/#function-cass_future_wait) in the docs (as you mention). I’m hard-pressed to say we should mitigate this concern by keeping the behaviours described above. I’d also argue that a prohibition against calling some kind of wait() function in a callback isn’t a unique constraint of the C/C++ driver. Similar systems often don’t expose the future object directly (thus avoiding the issue entirely) but in cases where they do it’s broadly known that a wait() op in a callback is a very bad idea (for reasons similar to what we’re talking about here). So our advice to not use wait() should seem very familiar to users of async APIs. It would be nice if we could come up with an alternate approach which might address both issues, but the issue described in this ticket is significant enough that I don’t want to leave it out of the next release (which is coming soon). I’m happy to revisit the approach we’ve taken here… honestly, there’s possibly an argument for changing the futures API to not even pass the future itself around but just to return values/errors instead, but that’s more speculative. And I’ll definitely include a warning around this change in behaviour when I send out the announcement of the next release. But I’m worried enough about the callback problem that I don’t want to keep this fix out.
I guess we'll ignore the tests in question and see if it breaks anything else...
Please give a link to the source you are quoting from, if it's an issue, we should comment there with our finding that is 100% failing with scylla
Please give a link to the source you are quoting from, if it's an issue, we should comment there with our finding that is 100% failing with scylla
It's from the issue I've linked - CPP-987. They seem to know that there's a deadlock possibility, but I am not sure if they find this behaviour expected. Will add my comment.
Regression in
4a0e4f78
of datastax cpp-driver, supposedly fixes following issue: CPP-987.HOWEVER, it introduces a deadlock (at least in this test case, but it seems critical):
The test:
Is very simple, create a callback on a session when it connects. Now, the commit changes the following file:
Now, here's the deadlock:
cass_future_error_code
which in turn callsFuture::error
which then callsFuture::internal_wait
that waits onis_set_
Future::internal_set
is called, which then proceeds to call the callback.cass_future_error_code
, completing the deadlock as it now waits forever onis_set_
.is_set_
is supposed to be set AFTER callbacks are processed, yet the callback can call public (I am assuming) API to check the status of the future, causing a deadlock to occur.