uber / tchannel

network multiplexing and framing protocol for RPC
MIT License
1.15k stars 129 forks source link

Peer selection should consider liveness #1342

Open anson627 opened 8 years ago

anson627 commented 8 years ago

Today our peer selection algorithm only takes account of freshness (identified, ready, fresh, unconnected) and load (random weighted sum of # connections and # requests), which works well in the happy case - all peers are alive.

But whenever a remote peer dies (e.g server tchannel drain or close), the corresponding peer of client tchannel has zero connections and requests, and has been already identified, and therefore have higher score and more likely be selected for new request or request retry.

Ideally the dead peer should be removed from tchannel.

@Raynos @ShanniLi @jcorbin

Raynos commented 8 years ago

We have talked about penalizing a Peer for error frames by incrementing its pending count.

Raynos commented 8 years ago

cc @kriskowal

ShanniLi commented 8 years ago

If a peer has no connection, it would not be considered identified. It should have lower score compared to identified peers.

On the other hand, liveness also means connection qualities, e.g., error frame count. In general, it is easy to track error frames. However, the recovering part is a little harder, i.e., when everything goes on well, how to reset the error frame count to zero.

anson627 commented 8 years ago

You are right, looking at the code, if there is no connections, getTier returns UNCONNECTED, the score is [0.1, 0.4) range, so it won't be selected if there are still identified peers.

Yeah, I would image some thing like moving window to keep tracking of # of error frames, it fades away along with time. Like what you did in rate limiter.

Torwegia commented 8 years ago

This makes iterating on a new project really rough.

Raynos commented 8 years ago

@Torwegia please explain more.