grpc / grpc-go

The Go language implementation of gRPC. HTTP/2 based RPC
https://grpc.io
Apache License 2.0
20.99k stars 4.36k forks source link

Update round_robin load balancer to meet gRPC requirements #2580

Open dfawley opened 5 years ago

dfawley commented 5 years ago

@markdroth @ejona86

We actually need to define the requirements before beginning work here.

Are these requirements?

  1. Start picking at the top of the list for the first RPC.
    • Note that we will always pick the first channel we successfully connect to, and not block until address #0 is connected. So this will always be racy and unreliable even if we say it must be implemented.
  2. Maintain picker state when subchannel states change.
    • I.e. the new picker should resume picking at the next (valid) subchannel from what the old picker chose.
  3. Maintain state as much as possible when the list of addresses from the resolver changes.
    • It's unclear to me how exactly this would work. It would be nice to detail it more fully if it is required behavior.
markdroth commented 5 years ago
  1. I think we should start from index 0, and perform the usual RR algorithm from there. That usual algorithm is basically "while subchannel[i] is not READY, ++i; return subchannel[i]". In other words, we aren't required to start with the first subchannel in the list if it's not the first one to become READY, but we should prefer it if it is. In particular, I think we should require that once all subchannels are in state READY, we iterate over them in the order in which the addresses were given to RR by the resolver.

  2. Yes, I think we should absolutely maintain picker state when subchannel states change, for the reasons I described in #2579. We do not want a single flapping task to cause RR to devolve to pick_random.

  3. I do not think this is a requirement. I think it's fine to reset the index to 0 when we get a new list of addresses from the resolver.

ejona86 commented 5 years ago

For (1), I think the answer is "why do we care?" It literally only matters for the very first RPC of a channel and there will always be a "first transport" to become ready, so the initial index is guaranteed not to matter.

For (3), if we always start with 0 it seems it is baaaad for a flapping task causing the name resolver to add/remove the entry.

dfawley commented 5 years ago

One more:

  1. Maintain the same order provided by the name resolver.

How important is this, given that the first RPC will go to the first ready subchannel, then eventually we will alternate between the first+second ready subchannels, etc. ?

For pick-first, the order here is obviously very important, but for RR it actually doesn't seem to matter to me at all.

markdroth commented 5 years ago

For (3), if we always start with 0 it seems it is baaaad for a flapping task causing the name resolver to add/remove the entry.

That's a good point. In a dynamically scheduled environment, a flapping server will cause the resolver to return a new address list. Actually, this will probably be true even for DNS, since we always re-resolve when one of the subchannels gets disconnected.

Unfortunately, while I think it's clear that it makes sense to retain the index when a task's state changes, I don't think we can easily do that when we get a new list of addresses from the resolver, since the addresses may not be in the same order or even be the same size. And if we can't do that when we get a new resolver list, then it doesn't make much sense to do it when individual subchannels' states change, since the two cases are likely to be triggered by the same event.

Is there anything better we can do here? If not, maybe randomization is the least-bad answer. :(

ejona86 commented 5 years ago

I do admit that subchannel state flapping is a more common event, because it happens when any subchannel is closed. And that may happen for normal reasons (although... the number of reasons may be low, because MAX_CONNECTION_AGE and MAX_CONNECTION_IDLE don't make sense in rr), independent of backends coming up and down. So "not every subchannel state change causes a name resolver address list change".

So a flapping task may impact both, but we may still want the "don't randomize on subchannel state changes" for other cases.

The most obvious case of "subchannel state changes" is during startup. If you have 100 subchannels to rr over, when the 50th connection is made... I still can't prove that random is wrong. Also, startup already has the problem that we'll bombard the first subchannel that comes up, so any nuanced discussions of "fairness" are probably not worth our time. (I'm also not sure how I feel about maintaining the index when we may be iterating over all 100 items in the list every single RPC since there is only one connection ready; that's not that big of a deal in some ways, the same as random...)

I think I'm in the same position as before: during subchannel state changes maintaining the previous index is probably nice, but it's unclear whether it would actually matter to anyone.

markdroth commented 5 years ago

MAX_CONNECTION_AGE and MAX_CONNECTION_IDLE don't make sense in rr

I don't think that's true. In fact, you're the one who has advocated for using this approach to allow round_robin clients to discover new backends; see discussion in grpc/grpc#12295. :)

The most obvious case of "subchannel state changes" is during startup. If you have 100 subchannels to rr over, when the 50th connection is made... I still can't prove that random is wrong.

That's a great example. I think it's pretty clear that that's the wrong thing to do. Each time we get a new connection, we should start distributing load evenly across those connections. Otherwise, round_robin devolves to pick_random.

Also, startup already has the problem that we'll bombard the first subchannel that comes up, so any nuanced discussions of "fairness" are probably not worth our time.

I think there's a difference between saying "we won't try to be fair at all" and "we'll do the best we can based on the set of subchannels that are currently connected". The latter seems like a much more reasonable stance.

I think I'm in the same position as before: during subchannel state changes maintaining the previous index is probably nice, but it's unclear whether it would actually matter to anyone.

While it may be that many people will not notice, I would be very surprised if this never causes a problem for anyone.

ejona86 commented 5 years ago

In fact, you're the one who has advocated for using this approach to allow round_robin clients to discover new backends

You're right :-). Not MAX_CONNECTION_IDLE though. But yes, it would be normal for MAX_CONNECTION_AGE.

Each time we get a new connection, we should start distributing load evenly across those connections. Otherwise, round_robin devolves to pick_random.

I don't know what "distribute load evenly" in that situation means. I feel like you have a definition in mind that is circular. We know the load won't be even initially. It takes time to even out.

I think there's a difference between saying "we won't try to be fair at all" and "we'll do the best we can based on the set of subchannels that are currently connected". The latter seems like a much more reasonable stance.

That's an overly a black and white view of the situation. Always using '0' would be the "we won't try to be fair at all". There's a spectrum here, and your proposed solution isn't fair in an absolute sense either as the new backends would have fewer RPCs.

I think there's a difference between "the solution works reasonably well" and "there is a more 'correct' solution." The only reason to care about more 'correct' solutions is because they provide reasonable benefit to users for a reasonable cost.

While it may be that many people will not notice, I would be very surprised if this never causes a problem for anyone.

If it causes one user a problem once for one second, then I don't have enough time to care. I think what's not being communicated is how severe the problem would be and for whom.

dfawley commented 2 years ago

One more:

  1. Maintain the same order provided by the name resolver. How important is this

One related thing that does matter is that if an address is deduplicated (e.g. the address list is A, B, B, C), then the picker should return B twice as often as either A or C.