Closed: alecthomas closed this issue 4 years ago
@mightyguava
I'm not sure I understand which part of this you're saying is an ld-relay issue. We don't have any control over your container system's load-balancing behavior (it's not clear to me what you're using; Kubernetes?). I may not be correctly understanding your point about the rolling restart, but it seems to me that if pods were restarted one at a time, you would at worst end up with the connections distributed across all but one; I can't see how they would get bunched up into a smaller subset unless you were doing something like restarting half of the pods at once.
I guess it would be possible to implement some kind of hard time limit on connections, though certainly not anything as short as 1 second, and unless it was randomized in some way I would expect it to cause a similar musical-chairs problem, since a bunch of connections would all get restarted at once.
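For what it's worth, here is a rough Go sketch (not ld-relay's actual code) of the kind of randomized hard limit being discussed: each streaming connection is force-closed after a base lifetime plus random jitter, so the forced reconnects are staggered rather than synchronized. The handler path, port, and durations are made up for illustration.

```go
package main

import (
	"context"
	"fmt"
	"math/rand"
	"net/http"
	"time"
)

// withMaxLifetime cancels each request's context after base plus a random
// jitter, which ends a long-lived streaming response at a staggered time so
// that forced reconnects don't all happen at once.
func withMaxLifetime(base, jitter time.Duration, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		lifetime := base + time.Duration(rand.Int63n(int64(jitter)))
		ctx, cancel := context.WithTimeout(r.Context(), lifetime)
		defer cancel()
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}

func main() {
	stream := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "text/event-stream")
		flusher, ok := w.(http.Flusher)
		if !ok {
			http.Error(w, "streaming unsupported", http.StatusInternalServerError)
			return
		}
		ticker := time.NewTicker(10 * time.Second)
		defer ticker.Stop()
		for {
			select {
			case <-r.Context().Done():
				// Max lifetime reached (or client went away); the client is
				// expected to reconnect and be re-balanced.
				return
			case t := <-ticker.C:
				fmt.Fprintf(w, "data: heartbeat %s\n\n", t.Format(time.RFC3339))
				flusher.Flush()
			}
		}
	})

	// Drop each stream somewhere between 30 and 40 minutes in.
	http.Handle("/stream", withMaxLifetime(30*time.Minute, 10*time.Minute, stream))
	http.ListenAndServe(":8030", nil)
}
```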
The container system is Kubernetes, but we do have an AWS ELB load balancing TCP traffic in front of ld-relay. Maybe ld-relay isn't doing anything wrong. But the load is definitely uneven as you can see from the graph.
There's a general issue for us here: out of the box, putting an established load balancer like an ELB in front of ld-relay does not end up distributing traffic evenly.
Currently, in steady state, with 10 replicas, the busiest one has 144 conns (according to ld_relay_connections) and uses 214MB, while the most idle one has 24 conns and uses 70MB. That's about 1.2MB per connection; I assume that's correlated with the size of our flagset data. We'll be significantly growing both the number and complexity of our flags and the number of services using them in the future. This isn't going to be scalable for very long.
We are looking for a solution here, whether that be the relay somehow automatically redistributing connections, or implementing timeouts so that connections can be shed. Another potential idea here is for the relay to have some awareness of how much memory it has (which can be inferred from cgroup limits), and limit the number of connections it accepts. We were seeing that, when the busiest replicas ran out of memory, we went into cascading failure as the newly shifted connections kept overloading the next busiest replica.
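To illustrate the cgroup idea, here is a sketch of the proposal, not an ld-relay feature. The cgroup v1 path, the ~1.2MB/connection figure from above, the 50% headroom factor, and the port are all assumptions for illustration.

```go
package main

import (
	"net"
	"net/http"
	"os"
	"strconv"
	"strings"

	"golang.org/x/net/netutil"
)

// Rough per-connection cost, based on the ~1.2MB/connection observed above.
const bytesPerConn = 1_200_000

// cgroupMemoryLimit reads the cgroup v1 memory limit for this container;
// it returns 0 if the file is missing or unreadable (e.g. on cgroup v2 hosts).
func cgroupMemoryLimit() int64 {
	raw, err := os.ReadFile("/sys/fs/cgroup/memory/memory.limit_in_bytes")
	if err != nil {
		return 0
	}
	n, err := strconv.ParseInt(strings.TrimSpace(string(raw)), 10, 64)
	if err != nil {
		return 0
	}
	return n
}

func main() {
	maxConns := 500 // fallback when no cgroup limit is visible
	if limit := cgroupMemoryLimit(); limit > 0 {
		// Leave half the limit as headroom for the shared flag store and runtime.
		maxConns = int(limit / 2 / bytesPerConn)
	}

	http.HandleFunc("/status", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})

	ln, err := net.Listen("tcp", ":8030")
	if err != nil {
		panic(err)
	}
	// LimitListener stops accepting once maxConns connections are open, so the
	// overflow backs up in the kernel/load balancer instead of exhausting this
	// replica's memory and triggering a cascading failure.
	if err := http.Serve(netutil.LimitListener(ln, maxConns), nil); err != nil {
		panic(err)
	}
}
```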
> the busiest one has 144 conns (according to ld_relay_connections) and uses 214MB, while the most idle one has 24 conns and uses 70MB. That's about 1.2MB per connection. I assume that's correlated with the size of our flagset data
Thanks for mentioning this; it sounds like there may be a separate issue, because it's not supposed to behave that way. That is, ld-relay is implemented to share a single flag data store across all connections, not to copy the data set for every connection (it does have to temporarily copy the data to build the initial stream event when a connection is made, but that copy is not retained).
So we'll have to take a closer look at what is being allocated. This is the first report we've had of memory use varying so greatly between instances.
Also, pardon my momentary confusion about your rolling restart scenario. Indeed, it makes sense that if you started with an equal distribution and then restarted instances one at a time and each one's connections got redistributed evenly among the rest, you would end up with an unequal distribution heavily weighted toward the instances that got restarted earlier. It took me a minute to see that though, because for whatever reason we do not (as far as I know) have a similar issue when we do rolling restarts of the stream service. But that probably just means I'm less familiar with our back-end architecture than I am with the SDKs and Relay.
Sorry that this issue has gone a long time without an update. We are getting ready to release Relay Proxy 6.0.0 fairly soon, and it includes the ability to set a maximum lifetime for stream connections from SDKs to Relay, so that Relay will automatically disconnect a client after that amount of time, forcing it to reconnect and be redistributed by the load balancer.
To the related comment about Relay using an unexpectedly large amount of memory per connection, unfortunately we have not been able to reproduce this and haven't found anything in the code that would account for it.
The 6.0.0 release is now available. It supports a new configuration option, maxClientConnectionTime in the [Main] section of the config file (or MAX_CLIENT_CONNECTION_TIME if using environment variables), providing the behavior I mentioned in the previous comment.
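For example, a minimal sketch (the 30m value is just an illustration; please check the Relay Proxy docs for the exact duration syntax accepted):

```
# ld-relay.conf
[Main]
# Drop and force re-establishment of each SDK stream connection after at
# most this long, so the load balancer can redistribute it.
maxClientConnectionTime = 30m
```

Or, if configuring via environment variables:

```
MAX_CLIENT_CONNECTION_TIME=30m
```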
@alecthomas @mightyguava I'll close the issue now, but please feel free to reopen it if you have questions or problems regarding this feature.
Thanks for the update! We’ll try it out soon
Describe the bug
We experienced an ld-relay outage today that we believe was due to very unevenly distributed connections across ld-relay pods. This chart illustrates the problem quite clearly:
The distribution is quite large, ranging from 257 connections down to 51. There are around 1800 total connections.
In addition to the uneven steady state distribution, you can see the large spike in connections during a rolling restart. As pods terminate, their connections are moved to the remaining live pods, which keep them and do not redistribute them to other pods.
To reproduce
Run 10+ instances of ld-relay with many inbound connections.
Expected behavior
Connections to be evenly distributed.
We've experienced similar behaviour with long-running HTTP/2 connections, and our solution was to actively terminate the connections after some time period (e.g. 500ms, 1s). That was for a very high-QPS service though, so it might defeat the purpose of LD using SSE.
Relay version
Language version, developer tools: go1.13.5