temporalio / sdk-dotnet

Temporal .NET SDK
MIT License
385 stars 31 forks source link

[Feature Request] Include more details from tonic from failed client_rpc_call invocations #337

Open robcao opened 2 months ago

robcao commented 2 months ago

Is your feature request related to a problem? Please describe.

Failures from client_rpc_call invocations currently only contain the message from the failure.

The stack trace from a failure to check health for instance, looks like this:

Temporalio.Exceptions.RpcException: transport error
   at Temporalio.Bridge.Client.CallAsync[T](RpcService service, String rpc, IMessage req, MessageParser`1 resp, Boolean retry, IEnumerable`1 metadata, Nullable`1 timeout, Nullable`1 cancellationToken)
   at Temporalio.Client.TemporalConnection.CheckHealthAsync(RpcService service, RpcOptions options)

In comparison, the stack trace from a failure to connect the client is more informative, and looks like this:

System.InvalidOperationException: Connection failed: Server connection error: tonic::transport::Error(Transport, ConnectError(ConnectError("tcp connect error", Os { code: 10061, kind: ConnectionRefused, message: "No connection could be made because the target machine actively refused it." })))
   at Temporalio.Bridge.Client.ConnectAsync(Runtime runtime, TemporalConnectionOptions options)
   at Temporalio.Client.TemporalConnection.GetBridgeClientAsync()

Describe the solution you'd like

I would like instances of RpcException raised from the client_rpc_callback method to contain more information about the failure, comparative to what is currently raised from failures in the client_connect method.

For example, instead of transport error, perhaps the exception could contain more information, such as

Error: Status { code: Unknown, message: "transport error", source: Some(tonic::transport::Error(Transport, hyper::Error(Io, Kind(BrokenPipe)))) }

Having this kind of additional context available is very helpful for debugging problems. Doing a cursory search shows that transport error can have multiple different root causes.

Having the gRPC status code surfaced in the error message is helpful too.

Additional context

https://github.com/temporalio/sdk-dotnet/blob/main/src/Temporalio/Bridge/src/client.rs#L258

It looks like here, the failureDetails is always being sent back to the .NET side as null. I am not sure if this is intentional or not (maybe the Rust side is always sending null), or maybe failureDetails here references the Temporal API FailureDetails proto, in which case it wouldn't be applicable at all.

I'm not certain if this will require work to be on in the core sdk first.

cretz commented 1 month ago

Sorry it took me so long to see this. This is due to the default error string of Tonic in cases where it's not a status. There is a difference between connection errors and RPC errors in this case.

perhaps the exception could contain more information, such as

Error: Status { code: Unknown, message: "transport error", source: Some(tonic::transport::Error(Transport, hyper::Error(Io, Kind(BrokenPipe)))) }

But it's not a Status with a code and a source. We will investigate and see if there's more we can extract from the non-status error beyond it's default fmt::Display.

It looks like here, the failureDetails is always being sent back to the .NET side as null

This is very specifically for gRPC "details" which is additional protobuf information, it's not for general purpose use