Open coryan opened 4 years ago
Carlos and I just chatted about this a bit offline. I'm going to try to summarize what I think we want/need here, and this will help expose my own misunderstandings.
We need to make sure our retry policies all properly set timeouts for both sync and async APIs. This should work for all services, whether using gRPC or REST. As described in the comment above, Bigtable uses its own RPCRetryPolicy (not the common one), and Bigtable's policy does set timeouts for sync RPCs, but not async. Ideally, this would be handled by the common retry policy and Bigtable would not need to provide its own.
It's difficult to unit test with any retry policy because they sleep. It would be nice to provide a hook where tests can mock or fake out the sleep part so that unit tests run faster and are less flaky. This should work for both sync and async retries.
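To make the testing idea concrete, here is a minimal sketch of a retry loop that takes its sleep function as a parameter, so tests can substitute a fake that records delays instead of blocking. The names (`Sleeper`, `RetryWithBackoff`, `RealSleeper`) are hypothetical, not part of google-cloud-cpp:

```cpp
// Sketch only: inject the sleep function so unit tests can fake it out.
#include <chrono>
#include <functional>
#include <thread>

using Sleeper = std::function<void(std::chrono::milliseconds)>;

// Production code sleeps for real.
inline Sleeper RealSleeper() {
  return [](std::chrono::milliseconds d) { std::this_thread::sleep_for(d); };
}

// Runs `attempt` up to `max_attempts` times, doubling the backoff delay
// between attempts, and reporting each requested delay through `sleep`.
inline bool RetryWithBackoff(std::function<bool()> attempt, int max_attempts,
                             std::chrono::milliseconds initial_delay,
                             Sleeper sleep) {
  auto delay = initial_delay;
  for (int i = 0; i != max_attempts; ++i) {
    if (attempt()) return true;
    if (i + 1 == max_attempts) break;  // no sleep after the last attempt
    sleep(delay);
    delay *= 2;
  }
  return false;
}
```

In a test, the fake sleeper just records the requested delays, so the test runs instantly and can also verify the backoff schedule.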
It may not be possible to make the above changes without breaking the retry policy API, which is exposed publicly to users (spanner example). While we want to avoid API breakages, it may be necessary in this case.
A change to how retries work and the design of our retries needs a design doc.
How urgent is this? It's not work-on-the-weekends urgent, but it's pretty important. We currently have shipping products that don't have proper timeouts. We're about to ship Pub/Sub, which will have the same issues (because it's using the common retry policies), and it would be nice to not break Pub/Sub's API right after it GAs (let's try to fix this before then). So I think the urgency is that someone should own this bug and it should probably be their top priority until it's finished.
nit: the retry policies do not sleep today, the retry loop does, and the sleep time comes from the backoff policy. Your bigger point about this making retry loops hard to test stands.
@devbww any update on this?
any update on this?
No. Indeed, I am only starting to work on it directly today.
Turns out #5959 might be directly related to this.
FWIW, after #6014 the deadline is being set for asynchronous operations in the Cloud Bigtable client library, and the retry loops can handle any retry policy that has a Setup(grpc::ClientContext&) member function.
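For illustration, a minimal sketch of what such a Setup() member does. FakeClientContext stands in for grpc::ClientContext so the example has no gRPC dependency, and TimeoutPolicy is a hypothetical name (the real Bigtable policies live in google::cloud::bigtable):

```cpp
// Sketch with stand-in types: the retry loop calls policy.Setup(context)
// before each attempt, and the policy stamps a per-attempt deadline.
#include <chrono>

struct FakeClientContext {  // stands in for grpc::ClientContext
  std::chrono::system_clock::time_point deadline =
      std::chrono::system_clock::time_point::max();
  void set_deadline(std::chrono::system_clock::time_point tp) { deadline = tp; }
};

class TimeoutPolicy {
 public:
  explicit TimeoutPolicy(std::chrono::milliseconds rpc_timeout)
      : rpc_timeout_(rpc_timeout) {}

  // Only tighten the deadline; never loosen one the caller already set.
  void Setup(FakeClientContext& context) const {
    auto candidate = std::chrono::system_clock::now() + rpc_timeout_;
    if (candidate < context.deadline) context.set_deadline(candidate);
  }

 private:
  std::chrono::milliseconds rpc_timeout_;
};
```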
Note that the general problem of how to rationalize our retry and backoff policies remains open.
(internal only) Brainstorm doc
> We need to make sure our retry policies all properly set timeouts for both sync and async APIs.

One of the ideas in the above brainstorming document is not to use the retry policy for RPC deadlines. The amount of time you're willing to wait to start an RPC (the retry time limit) is distinct from the amount of time you're happy to let it run (the deadline).
Rather, a functor option can be added where the user can set the grpc::ClientContext (deadline + ...) directly.
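One possible shape for that functor option, sketched with stand-in names (SetupContextOption, FakeClientContext, ApplySetup are all hypothetical): the option's value is a callable that receives the context, so the user can set the deadline and anything else directly.

```cpp
// Sketch: a "functor option" whose value is applied to each new context.
#include <chrono>
#include <functional>

struct FakeClientContext {  // stands in for grpc::ClientContext
  std::chrono::system_clock::time_point deadline =
      std::chrono::system_clock::time_point::max();
};

struct SetupContextOption {
  using Type = std::function<void(FakeClientContext&)>;
};

// The retry loop would invoke the user's functor, if set, on each attempt.
inline void ApplySetup(SetupContextOption::Type const& setup,
                       FakeClientContext& context) {
  if (setup) setup(context);
}
```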
Still need this. @dbolduc has some thoughts here too.
As @devbww mentioned, we have decided to separate the policies from the grpc::ClientContext configuration. We now want to capture all functionality listed here via Options (which may be per-call / per-client / per-connection). At the moment, not all libraries can do this.
We are happy with the common options for retry/backoff/polling policies as they are. In #7529 we handled the discrepancy between the Bigtable policies and the common policies by using the generic g::c::internal::GrpcSetupOption to act on the context (under the hood).
I think we will likely want to release that option in our public API. I think we will definitely want to offer a

```cpp
// lives in google_cloud_cpp_common. Can apply to REST and gRPC.
struct DeadlineOption {
  using Type = std::chrono::milliseconds;
};
```
... and a

```cpp
// lives in google_cloud_cpp_common. Can apply to REST and gRPC.
struct MetadataOption {
  using Type = std::map<std::string, std::string>;
};
```
... for convenience. These options would take effect here for gRPC, and idk where for REST.
We still want this, but not scheduled yet.
Let's consider this for 2023/Q4.
Included in the 2024 roadmap document.
We still want to do this in order to increase feature parity with other Cloud SDK libraries.
The google::cloud::bigtable::RPCRetryPolicy classes have special member functions to set up a timeout:

https://github.com/googleapis/google-cloud-cpp/blob/81adbdabca2da794875fd567591fddc2cf0c8394/google/cloud/bigtable/rpc_retry_policy.h#L98-L101

https://github.com/googleapis/google-cloud-cpp/blob/81adbdabca2da794875fd567591fddc2cf0c8394/google/cloud/bigtable/rpc_retry_policy.cc#L53-L57
This is used in the synchronous retry loop, e.g.:
https://github.com/googleapis/google-cloud-cpp/blob/81adbdabca2da794875fd567591fddc2cf0c8394/google/cloud/bigtable/table.cc#L90-L95
But not in the asynchronous retry loop:
https://github.com/googleapis/google-cloud-cpp/blob/81adbdabca2da794875fd567591fddc2cf0c8394/google/cloud/internal/async_retry_unary_rpc.h#L145-L149
That asynchronous retry loop is used in Spanner too, and the RetryPolicy for Spanner does not have a Setup() member function. Furthermore, the same retry policy class is used in Storage, where we wouldn't want to introduce it (because that would create an unwanted dependency on gRPC).

We need a cleaner way to set up per-RPC timeouts across all services, with good support for REST, and we need to use that in both the asynchronous and synchronous loops.