hyperium / tonic

A native gRPC client & server implementation with async/await support.
https://docs.rs/tonic
MIT License
9.76k stars 997 forks source link

Client with a `balance_channel` keeps returning errors after servers have started. #1147

Open fmassot opened 1 year ago

fmassot commented 1 year ago

Bug Report

Version

tonic-balance-channel-with-lazy-connection v0.1.0 (/Users/francois.massot@contentsquare-ext.com/Documents/quickwit/repos/tonic-balance-channel-with-lazy-connection)
└── tonic v0.8.2 (https://github.com/hyperium/tonic#33e22bbc)
└── tonic-build v0.8.2 (https://github.com/hyperium/tonic#33e22bbc)

Platform

Darwin MacBook-Pro-de-Francois.local 21.6.0 Darwin Kernel Version 21.6.0: Mon Aug 22 20:19:52 PDT 2022; root:xnu-8020.140.49~2/RELEASE_ARM64_T6000 arm64

Crates

tonic

Description

When using a client with a balance_channel with N endpoints, I observed the following:

I wrote a test here: https://github.com/fmassot/tonic-balance-channel-with-lazy-connection/blob/main/src/main.rs#L25

When digging into the code, the issue comes from the way errors are handled with lazy connections: https://github.com/hyperium/tonic/blob/master/tonic/src/transport/service/reconnect.rs#L63-L155

Note that if there is one endpoint, there is no problem. This comes from the fact that, when servers are not started, the poll_ready will be executed on all Reconnect... and we will end up with N Reconnect instances with errors. Only one will answer and return the error. The other N-1 will keep the error and return it on the next call.

Solution

I can push a PR to fix that but I want to have your feedback before.

LucioFranco commented 1 year ago

Sorry for the delay on this! Yeah, I don't have any extra insight than what you have here so if there is a fix that doesn't break other things I am happy to take a look at that!