rigetti / qcs-sdk-rust

Rust SDK for Rigetti Quantum Cloud Services (QCS)
https://docs.rs/qcs

Improve handling of gRPC failures #471

Open erichulburd opened 5 months ago

erichulburd commented 5 months ago

Over the past few months, I've seen a variety of gRPC status failures that should be retryable on the client side. The most recent example:

QpuApiError                               Traceback (most recent call last)
...
    212 """Execute a job and return the shots."""
    213 job_id = submit(
    214     program=executable.program,
    215     patch_values=patch,
   (...)
    218     execution_options=self.execution_options,
    219 )
--> 220 return retrieve_results(
    221     job_id=job_id,
    222     quantum_processor_id=self.device_name,
    223     client=self.qcs_client,
    224     execution_options=self.execution_options,
    225 )

QpuApiError: Call failed during gRPC request: status: Unavailable, message: "error trying to connect: Unsuccessful reply: TtlExpired", details: [], metadata: MetadataMap { headers: {} }

As the QCS SDK is currently structured, it's difficult to diagnose and handle errors of this nature in Python. I'd advocate considering the following:

  1. Supporting retry configuration on all gRPC calls: translation, execution (i.e. submit), and result retrieval (retrieve_results). This should support retrying based on gRPC status code, as well as a configurable backoff strategy (linear, exponential, max retries, etc.).
  2. Surfacing gRPC exceptions to Python in a structured way. At a minimum, this should include the status code. A request ID and timing data would also be nice.
  3. Configurable gRPC logging. The gRPC C API uses environment variables in a well-structured and documented way: https://github.com/grpc/grpc/blob/15850972ddba9c1262a9d51341da03bc607bd934/doc/environment_variables.md
  4. A persistent handle to the gRPC channel. The way the client is currently structured, each call to translate, execute, and retrieve results instantiates a new channel (see for instance https://github.com/rigetti/qcs-sdk-rust/blob/e73f83d37ccbf966666a67d1fecefda7b19229e6/crates/lib/src/qpu/api.rs#L292 and then https://github.com/rigetti/qcs-sdk-rust/blob/main/crates/lib/src/qpu/api.rs#L525). This both adds latency and makes connections more failure-prone, which is contrary to the design of gRPC, where channels are meant to be long-lived and reused. If necessary, this should be achievable with some once_cell utilities: https://docs.rs/once_cell/latest/once_cell/sync/struct.Lazy.html.
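
To make items 1 and 2 more concrete, here is a minimal Python sketch of what status-code-based retry with exponential backoff could look like from the caller's side. Everything in it is hypothetical and not part of the QCS SDK: `GrpcCallError` stands in for a structured exception that exposes the status code as an attribute, and `RetryPolicy` and `call_with_retry` are illustrative names. Status codes are plain strings here (e.g. "Unavailable", as in the error above) to keep the example dependency-free.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class GrpcCallError(Exception):
    """Hypothetical structured error: the gRPC status code is data, not just text."""

    def __init__(self, status_code: str, message: str, request_id: Optional[str] = None):
        super().__init__(f"status: {status_code}, message: {message!r}")
        self.status_code = status_code
        self.message = message
        self.request_id = request_id


@dataclass(frozen=True)
class RetryPolicy:
    """Hypothetical retry configuration (not an existing QCS SDK type)."""

    retryable_codes: frozenset = frozenset({"Unavailable"})
    max_attempts: int = 4        # total attempts, including the first one
    initial_delay: float = 0.5   # seconds to wait before the first retry
    multiplier: float = 2.0      # 2.0 doubles the delay; 1.0 keeps it constant
    max_delay: float = 10.0      # upper bound on any single delay

    def delay_before_retry(self, failed_attempts: int) -> float:
        """Delay to wait after `failed_attempts` consecutive failures."""
        delay = self.initial_delay * self.multiplier ** (failed_attempts - 1)
        return min(delay, self.max_delay)


def call_with_retry(
    call: Callable[[], T],
    policy: RetryPolicy,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, retrying only on status codes the policy marks as retryable."""
    attempt = 0
    while True:
        attempt += 1
        try:
            return call()
        except GrpcCallError as err:
            retryable = err.status_code in policy.retryable_codes
            if not retryable or attempt >= policy.max_attempts:
                raise  # non-retryable code, or attempts exhausted
            sleep(policy.delay_before_retry(attempt))
```

With something along these lines, a caller could wrap result retrieval in a zero-argument function and pass it as `call`, so that an Unavailable failure like the one above is retried while other status codes are raised immediately. The `sleep` parameter is injectable only to make the backoff schedule easy to test.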

If these options present inordinate technical challenges, I wonder whether an alternative approach would be to interface with existing Python gRPC tooling, that is, to expose functions that convert Python-based gRPC message objects to QCS SDK structs.