dfawley opened 5 years ago
I think we should start from index 0, and perform the usual RR algorithm from there. That usual algorithm is basically "while subchannel[i] is not READY, ++i; return subchannel[i]". In other words, we aren't required to start with the first subchannel in the list if it's not the first one to become READY, but we should prefer it if it is. In particular, I think we should require that once all subchannels are in state READY, we iterate over them in the order in which the addresses were given to RR by the resolver.
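The "usual RR algorithm" above can be sketched as a small Go function. All type and function names here are illustrative, not gRPC's actual balancer API: starting from a retained index, skip subchannels that are not READY, return the first READY one, and advance the index past it.

```go
package main

import "fmt"

// connectivityState mimics gRPC's connectivity states; only READY matters
// for picking. These names are illustrative, not gRPC's actual API.
type connectivityState int

const (
	idle connectivityState = iota
	connecting
	ready
)

type subchannel struct {
	addr  string
	state connectivityState
}

// pick implements the loop described above: starting from *next, advance
// until a READY subchannel is found, return its address, and move the
// index past it so the next pick continues the rotation. Once every
// subchannel is READY, this walks them in resolver order.
func pick(scs []subchannel, next *int) (string, bool) {
	for i := 0; i < len(scs); i++ {
		idx := (*next + i) % len(scs)
		if scs[idx].state == ready {
			*next = (idx + 1) % len(scs)
			return scs[idx].addr, true
		}
	}
	return "", false // no subchannel is READY
}

func main() {
	scs := []subchannel{{"A", ready}, {"B", connecting}, {"C", ready}}
	next := 0
	for i := 0; i < 4; i++ {
		addr, _ := pick(scs, &next)
		fmt.Print(addr) // B is skipped while it is still CONNECTING
	}
	fmt.Println()
}
```

Note that because the index is retained between picks, a subchannel flapping in and out of READY does not reset the rotation.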
Yes, I think we should absolutely maintain picker state when subchannel states change, for the reasons I described in #2579. We do not want a single flapping task to cause RR to devolve to pick_random.
I do not think this is a requirement. I think it's fine to reset the index to 0 when we get a new list of addresses from the resolver.
For (1), I think the answer is "why do we care?" It literally only matters for the very first RPC of a channel, and there will always be a "first transport" to become ready, so the initial index is guaranteed not to matter.
For (3), if we always start with 0, it seems baaaad for a flapping task that causes the name resolver to add/remove its entry.
One more:
How important is this, given that the first RPC will go to the first ready subchannel, then eventually we will alternate between the first+second ready subchannels, etc.?
For pick-first, the order here is obviously very important, but for RR it actually doesn't seem to matter to me at all.
> For (3), if we always start with 0, it seems baaaad for a flapping task that causes the name resolver to add/remove its entry.
That's a good point. In a dynamically scheduled environment, a flapping server will cause the resolver to return a new address list. Actually, this will probably be true even for DNS, since we always re-resolve when one of the subchannels gets disconnected.
Unfortunately, while I think it's clear that it makes sense to retain the index when a task's state changes, I don't think we can easily do that when we get a new list of addresses from the resolver, since the addresses may not be in the same order or even be the same size. And if we can't do that when we get a new resolver list, then it doesn't make much sense to do it when individual subchannels' states change, since the two cases are likely to be triggered by the same event.
Is there anything better we can do here? If not, maybe randomization is the least-bad answer. :(
I do admit that subchannel state flapping is a more common event, because it happens when any subchannel is closed. And that may happen for normal reasons (although... the number of reasons may be low, because MAX_CONNECTION_AGE and MAX_CONNECTION_IDLE don't make sense in rr), independent of backends coming up and down. So "not every subchannel state change causes a name resolver address list change".
So a flapping task may impact both, but we may still want the "don't randomize on subchannel state changes" for other cases.
The most obvious case of "subchannel state changes" is during startup. If you have 100 subchannels to rr over, when the 50th connection is made... I still can't prove that random is wrong. Also, startup already has the problem that we'll bombard the first subchannel that comes up, so any nuanced discussions of "fairness" are probably not worth our time. (I'm also not sure how I feel about maintaining the index when we may be iterating over all 100 items in the list every single RPC since there is only one connection ready; that's not that big of a deal in some ways, the same as random...)
I think I'm in the same position as before: during subchannel state changes maintaining the previous index is probably nice, but it's unclear whether it would actually matter to anyone.
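The "iterating over all 100 items on every RPC" concern can be addressed by rebuilding the picker on each subchannel state change from a snapshot of only the READY addresses, carrying the previous index forward so the rotation isn't reset. This is a sketch under those assumptions; the types and names are hypothetical, not gRPC's actual API:

```go
package main

import "fmt"

// addrState pairs an address with whether its subchannel is READY.
// Hypothetical types for illustration only.
type addrState struct {
	addr  string
	ready bool
}

// picker holds a snapshot of only the READY addresses, in resolver order.
type picker struct {
	ready []string
	next  int
}

// buildPicker would be called on every subchannel state change. Because
// only READY addresses are copied in, each pick below is O(1) rather than
// a scan over all subchannels. prevNext carries the previous rotation
// index forward (mod the new size) instead of resetting to 0.
func buildPicker(scs []addrState, prevNext int) *picker {
	p := &picker{}
	for _, sc := range scs {
		if sc.ready {
			p.ready = append(p.ready, sc.addr)
		}
	}
	if len(p.ready) > 0 {
		p.next = prevNext % len(p.ready)
	}
	return p
}

func (p *picker) pick() (string, bool) {
	if len(p.ready) == 0 {
		return "", false
	}
	a := p.ready[p.next]
	p.next = (p.next + 1) % len(p.ready)
	return a, true
}

func main() {
	// Only 1 of 3 subchannels READY: every pick returns it without scanning.
	p := buildPicker([]addrState{{"A", false}, {"B", true}, {"C", false}}, 0)
	a, _ := p.pick()
	fmt.Println(a)
}
```

Carrying `prevNext` forward only approximates retaining state, since the READY set may have shifted, which is the same difficulty noted above for new resolver lists.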
> MAX_CONNECTION_AGE and MAX_CONNECTION_IDLE don't make sense in rr
I don't think that's true. In fact, you're the one who has advocated for using this approach to allow round_robin clients to discover new backends; see discussion in grpc/grpc#12295. :)
> The most obvious case of "subchannel state changes" is during startup. If you have 100 subchannels to rr over, when the 50th connection is made... I still can't prove that random is wrong.
That's a great example. I think it's pretty clear that that's the wrong thing to do. Each time we get a new connection, we should start distributing load evenly across those connections. Otherwise, round_robin devolves to pick_random.
> Also, startup already has the problem that we'll bombard the first subchannel that comes up, so any nuanced discussions of "fairness" are probably not worth our time.
I think there's a difference between saying "we won't try to be fair at all" and "we'll do the best we can based on the set of subchannels that are currently connected". The latter seems like a much more reasonable stance.
> I think I'm in the same position as before: during subchannel state changes maintaining the previous index is probably nice, but it's unclear whether it would actually matter to anyone.
While it may be that many people will not notice, I would be very surprised if this never causes a problem for anyone.
> In fact, you're the one who has advocated for using this approach to allow round_robin clients to discover new backends
You're right :-). Not MAX_CONNECTION_IDLE though. But yes, it would be normal for MAX_CONNECTION_AGE.
> Each time we get a new connection, we should start distributing load evenly across those connections. Otherwise, round_robin devolves to pick_random.
I don't know what "distribute load evenly" in that situation means. I feel like you have a definition in mind that is circular. We know the load won't be even initially. It takes time to even out.
> I think there's a difference between saying "we won't try to be fair at all" and "we'll do the best we can based on the set of subchannels that are currently connected". The latter seems like a much more reasonable stance.
That's an overly black-and-white view of the situation. Always using 0 would be the "we won't try to be fair at all". There's a spectrum here, and your proposed solution isn't fair in an absolute sense either, as the new backends would have fewer RPCs.
I think there's a difference between "the solution works reasonably well" and "there is a more 'correct' solution." The only reason to care about more 'correct' solutions is because they provide reasonable benefit to users for a reasonable cost.
> While it may be that many people will not notice, I would be very surprised if this never causes a problem for anyone.
If it causes one user a problem once for one second, then I don't have enough time to care. I think what's not being communicated is how severe the problem would be and for whom.
> One more:
> - Maintain the same order provided by the name resolver. How important is this
One related thing that does matter is that if an address is deduplicated (e.g. the address list is A, B, B, C), then the picker should return B twice as often as either A or C.
@markdroth @ejona86
We actually need to define the requirements before beginning work here.
Are these requirements?