EtienneBruines opened this issue 1 year ago
Hi @EtienneBruines! If we look at the log message you're seeing:

```
nomad.rpc: rejecting client for exceeding maximum RPC connections: remote_addr=172.16.1.101:42584 limit=100
```

we can see that `rpc_max_conns_per_client` is set to the default of 100. This should usually be more than enough, because:

> Nomad clients multiplex many RPC calls over a single TCP connection, except for streaming endpoints such as log streaming which require their own connection when routed through servers.

I would only expect you to hit the limit of 100 if you're either having all the clients reach the servers through a load balancer/proxy (which we would not recommend, and which would probably impact many clients at once), or if you're running a lot of streaming endpoints like `nomad alloc exec`, `nomad alloc logs`, or `nomad alloc fs`. If you're running a lot of streaming endpoints, you'll want to increase the limit.

If you're not running your clients through a load balancer and not running a lot of streaming endpoints, then it's possible we've got a bug somewhere in ensuring all requests are correctly multiplexed on the same TCP connection. My money would be on the `template` block, because of how we've implemented `consul-template` as a library. The `nomad.rpc.service_registration.read/list` or `nomad.rpc.variables.read/list` rate metrics might give a clue as to whether that's the case -- see if there are a lot of those for a given client IP.
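
If it helps to watch those over time rather than in a one-off dump, here's a minimal, purely illustrative sketch (not something shipped with Nomad; the agent address is an assumption) that polls the agent's `/v1/metrics` endpoint and prints any RPC-related counters and samples:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"strings"
	"time"
)

// metric mirrors the fields we care about in the go-metrics summary that
// /v1/metrics returns.
type metric struct {
	Name  string `json:"Name"`
	Count int    `json:"Count"`
}

type metricsSummary struct {
	Counters []metric `json:"Counters"`
	Samples  []metric `json:"Samples"`
}

func main() {
	const metricsURL = "http://127.0.0.1:4646/v1/metrics" // assumed local agent address

	for {
		resp, err := http.Get(metricsURL)
		if err != nil {
			fmt.Println("error fetching metrics:", err)
			time.Sleep(10 * time.Second)
			continue
		}

		var summary metricsSummary
		err = json.NewDecoder(resp.Body).Decode(&summary)
		resp.Body.Close()
		if err != nil {
			fmt.Println("error decoding metrics:", err)
			time.Sleep(10 * time.Second)
			continue
		}

		// Print every RPC-related metric seen in the latest interval.
		fmt.Println("---", time.Now().Format(time.RFC3339))
		for _, m := range append(summary.Counters, summary.Samples...) {
			if strings.Contains(m.Name, "rpc") {
				fmt.Printf("%-60s count=%d\n", m.Name, m.Count)
			}
		}
		time.Sleep(10 * time.Second)
	}
}
```
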
Thank you @tgross for your reply!

> If you're not running your clients through a load balancer and not running a lot of streaming endpoints

We are not using a load balancer (all direct connections to the IPs) and are not running any streaming endpoints.

> My money would be on the `template` block, because of how we've implemented `consul-template` as a library

We do like using the `template` block, so that might be the culprit.

> The `nomad.rpc.service_registration.read/list` or `nomad.rpc.variables.read/list` rate metrics might give a clue as to whether that's the case -- see if there are a lot of those for a given client IP.

I was not able to find any of these metrics in the `nomad operator metrics` output. We are using Consul and not Nomad's internal service registration.

```
# nomad operator metrics -pretty | grep rpc | grep Name [14:30:40]
"Name": "nomad.nomad.rpc.accept_conn",
"Name": "nomad.nomad.rpc.acl.read",
"Name": "nomad.nomad.rpc.alloc.list",
"Name": "nomad.nomad.rpc.node.list",
"Name": "nomad.nomad.rpc.query",
"Name": "nomad.nomad.rpc.request",
"Name": "nomad.nomad.rpc.status.read",
"Name": "nomad.raft.net.rpcDecode",
"Name": "nomad.raft.net.rpcDecode",
"Name": "nomad.raft.net.rpcEnqueue",
"Name": "nomad.raft.net.rpcEnqueue",
"Name": "nomad.raft.net.rpcRespond",
"Name": "nomad.raft.net.rpcRespond",
"Name": "nomad.raft.rpc.appendEntries",
"Name": "nomad.raft.rpc.processHeartbeat",

> We do like using the `template` block, so that might be the culprit. ... I was not able to find any of these metrics in the `nomad operator metrics` output. We are using Consul and not Nomad's internal service registration.

Ah, you'd only see `template` creating RPCs to Nomad if you were using Nomad's native services or Nomad Variables. For Consul, the API requests go directly to Consul without talking to Nomad.

But I'm realizing I misunderstood the initial problem: you're seeing limits getting hit with server-to-server communications as well. And looking at the logs in detail, it doesn't look like the problem is the limiter malfunctioning -- these are indeed separate remote addresses. So the RPC client is opening new TCP connections between the servers without disconnecting the old ones. Servers have similar requirements:

> A server needs at least 2 TCP connections (1 Raft, 1 RPC) per peer server locally and in any federated region. Servers also need a TCP connection per routed streaming endpoint concurrently in use.

But as with the clients, the server streaming endpoints only come from operator use like `nomad alloc logs` or the Event Stream API. Do you have any kind of infrastructure metrics you can look at around TCP connection state leading up to the time of the rate limit? If the servers were leaking connections, I'd expect to see them increase right before (or worse, gradually increase leading up to) the event where the number of connections exceeds the limit.
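
If there's no existing metric for that, one low-tech option is to sample the connection count yourself. Below is a minimal, illustrative sketch (Linux-only, and assuming the default RPC port 4647) that counts ESTABLISHED connections involving that port by parsing `/proc/net/tcp` and `/proc/net/tcp6` once a minute; graphing its output leading up to an incident should show whether the count creeps up:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
	"time"
)

// countEstablished counts sockets in a /proc/net/tcp-style file that are in
// state 01 (ESTABLISHED) and have the given hex-encoded port on either end.
func countEstablished(path, hexPort string) int {
	f, err := os.Open(path)
	if err != nil {
		return 0 // e.g. no IPv6 table; just skip the file
	}
	defer f.Close()

	count := 0
	scanner := bufio.NewScanner(f)
	scanner.Scan() // skip the header line
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		if len(fields) < 4 || fields[3] != "01" { // 01 == ESTABLISHED
			continue
		}
		local, remote := fields[1], fields[2]
		if strings.HasSuffix(local, ":"+hexPort) || strings.HasSuffix(remote, ":"+hexPort) {
			count++
		}
	}
	return count
}

func main() {
	const rpcPortHex = "1227" // Nomad's default RPC port 4647, in hex
	for {
		total := countEstablished("/proc/net/tcp", rpcPortHex) +
			countEstablished("/proc/net/tcp6", rpcPortHex)
		fmt.Printf("%s established connections on :4647 = %d\n",
			time.Now().Format(time.RFC3339), total)
		time.Sleep(time.Minute)
	}
}
```
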

> A server needs at least 2 TCP connections (1 Raft, 1 RPC) per peer server locally and in any federated region. Servers also need a TCP connection per routed streaming endpoint concurrently in use.

It does seem like there are a few more connections going on than that (note that the output below is not from when things went haywire, but from the current state right now):
```
# From the receiving end: netstat -natp
tcp 0 0 172.18.1.102:47346 172.16.1.101:4647 ESTABLISHED 3419458/nomad
tcp 0 0 172.18.1.102:51474 172.16.1.101:4647 ESTABLISHED 3419458/nomad
tcp 0 0 172.18.1.102:59724 172.16.1.101:4647 ESTABLISHED 3419458/nomad
tcp 0 0 172.18.1.102:54524 172.16.1.101:4647 ESTABLISHED 3419458/nomad
tcp 0 0 172.18.1.102:54448 172.16.1.101:4647 ESTABLISHED 3419458/nomad
tcp 0 0 172.18.1.102:37494 172.16.1.101:4647 ESTABLISHED 3419458/nomad
tcp 0 0 172.18.1.102:36148 172.16.1.101:4647 ESTABLISHED 3419458/nomad
tcp 0 0 172.18.1.102:54366 172.16.1.101:4647 ESTABLISHED 3419458/nomad
tcp 0 0 172.18.1.102:60724 172.16.1.101:4647 ESTABLISHED 3419458/nomad
tcp 0 0 172.18.1.102:43054 172.16.1.101:4647 ESTABLISHED 3419458/nomad
tcp 0 0 172.18.1.102:53778 172.16.1.101:4647 ESTABLISHED 3419458/nomad
tcp 0 0 172.18.1.102:53740 172.16.1.101:4647 ESTABLISHED 3419458/nomad
tcp 0 0 172.18.1.102:46842 172.16.1.101:4647 ESTABLISHED 3419458/nomad
tcp 0 0 172.18.1.102:55594 172.16.1.101:4647 ESTABLISHED 3419458/nomad
tcp 0 0 172.18.1.102:37366 172.16.1.101:4647 ESTABLISHED 3419458/nomad
tcp 0 0 172.18.1.102:36144 172.18.1.102:4647 ESTABLISHED 3419458/nomad
tcp 0 0 172.18.1.102:51686 172.16.1.101:4647 ESTABLISHED 3419458/nomad
tcp 0 0 172.18.1.102:39760 172.19.1.103:4647 ESTABLISHED 3419458/nomad
tcp 0 0 172.18.1.102:43716 172.16.1.101:4647 ESTABLISHED 3419458/nomad
tcp 0 0 172.18.1.102:47002 172.16.1.101:4647 ESTABLISHED 3419458/nomad
tcp 0 0 172.18.1.102:60428 172.16.1.101:4647 ESTABLISHED 3419458/nomad
tcp 0 0 172.18.1.102:34068 172.16.1.101:4647 ESTABLISHED 3419458/nomad
tcp 0 0 172.18.1.102:58742 172.16.1.101:4647 ESTABLISHED 3419458/nomad
tcp 0 0 172.18.1.102:46896 172.16.1.101:4647 ESTABLISHED 3419458/nomad
tcp 0 0 172.18.1.102:56106 172.16.1.101:4647 ESTABLISHED 3419458/nomad
tcp 0 0 172.18.1.102:56662 172.16.1.101:4647 ESTABLISHED 3419458/nomad
tcp 0 0 172.18.1.102:41012 172.19.1.103:4647 ESTABLISHED 3419458/nomad
tcp 0 0 172.18.1.102:42860 172.16.1.101:4647 ESTABLISHED 3419458/nomad
tcp 0 0 172.18.1.102:34496 172.16.1.101:4647 ESTABLISHED 3419458/nomad
tcp 0 0 172.18.1.102:51148 172.16.1.101:4647 ESTABLISHED 3419458/nomad
tcp 0 0 172.18.1.102:52446 172.16.1.101:4647 ESTABLISHED 3419458/nomad
tcp 0 0 172.18.1.102:41040 172.19.1.103:4647 ESTABLISHED 3419458/nomad
tcp 0 0 172.18.1.102:40878 172.16.1.101:4647 ESTABLISHED 3419458/nomad
tcp 0 0 172.18.1.102:51776 172.16.1.101:4647 ESTABLISHED 3419458/nomad
tcp 0 0 172.18.1.102:38002 172.16.1.101:4647 ESTABLISHED 3419458/nomad
tcp 0 0 172.18.1.102:34330 172.16.1.101:4647 ESTABLISHED 3419458/nomad
tcp 0 0 172.18.1.102:41028 172.19.1.103:4647 ESTABLISHED 3419458/nomad
tcp 0 0 172.18.1.102:45764 172.18.1.102:4647 ESTABLISHED 3419458/nomad
tcp 0 0 172.18.1.102:59404 172.16.1.101:4647 ESTABLISHED 3419458/nomad
tcp 0 0 172.18.1.102:41284 172.16.1.101:4647 ESTABLISHED 3419458/nomad
tcp 0 0 172.18.1.102:37578 172.16.1.101:4647 ESTABLISHED 3419458/nomad
tcp 0 0 172.18.1.102:49608 172.16.1.101:4647 ESTABLISHED 3419458/nomad
tcp 25 0 172.18.1.102:46550 172.16.1.101:4647 CLOSE_WAIT 3419458/nomad
tcp 0 0 172.18.1.102:45278 172.16.1.101:4647 ESTABLISHED 3419458/nomad
tcp 0 0 172.18.1.102:46040 172.16.1.101:4647 ESTABLISHED 3419458/nomad
tcp 0 0 172.18.1.102:60652 172.16.1.101:4647 ESTABLISHED 3419458/nomad
tcp 0 0 172.18.1.102:39420 172.16.1.101:4647 ESTABLISHED 3419458/nomad
tcp 1 0 172.18.1.102:46552 172.16.1.101:4647 CLOSE_WAIT 3419458/nomad
tcp 0 0 172.18.1.102:59942 172.16.1.101:4647 ESTABLISHED 3419458/nomad
tcp 0 0 172.18.1.102:49622 172.16.1.101:4647 ESTABLISHED 3419458/nomad
tcp 0 0 172.18.1.102:38490 172.16.1.101:4647 ESTABLISHED 3419458/nomad
tcp 0 0 172.18.1.102:33176 172.16.1.101:4647 ESTABLISHED 3419458/nomad
tcp 0 0 172.18.1.102:51328 172.16.1.101:4647 ESTABLISHED 3419458/nomad
tcp 0 0 172.18.1.102:49136 172.16.1.101:4647 ESTABLISHED 3419458/nomad
tcp 0 0 172.18.1.102:59006 172.16.1.101:4647 ESTABLISHED 3419458/nomad
tcp 0 0 172.18.1.102:49030 172.16.1.101:4647 ESTABLISHED 3419458/nomad
tcp 0 0 172.18.1.102:53172 172.16.1.101:4647 ESTABLISHED 3419458/nomad
tcp 0 0 172.18.1.102:38746 172.16.1.101:4647 ESTABLISHED 3419458/nomad
```

```
# From the sending end, netstat -natp
tcp6 0 0 172.16.1.101:4647 172.18.1.102:54448 ESTABLISHED 81399/nomad
tcp6 0 0 172.16.1.101:4647 172.19.1.103:59042 ESTABLISHED 81399/nomad
tcp6 0 0 172.16.1.101:4647 172.16.1.102:43046 ESTABLISHED 81399/nomad
tcp6 0 0 172.16.1.101:4647 172.18.1.102:60428 ESTABLISHED 81399/nomad
tcp6 0 0 172.16.1.101:4647 172.18.1.102:37494 ESTABLISHED 81399/nomad
tcp6 0 0 172.16.1.101:4647 172.18.1.102:38746 ESTABLISHED 81399/nomad
tcp6 0 0 172.16.1.101:4647 172.19.1.103:51454 ESTABLISHED 81399/nomad
tcp6 0 0 172.16.1.101:4647 172.18.1.102:42860 ESTABLISHED 81399/nomad
tcp6 0 0 172.16.1.101:4647 172.16.1.101:41664 ESTABLISHED 81399/nomad
tcp6 0 0 172.16.1.101:4647 172.18.1.102:60724 ESTABLISHED 81399/nomad
tcp6 0 0 172.16.1.101:4647 172.19.1.103:49376 ESTABLISHED 81399/nomad
tcp6 0 0 172.16.1.101:4647 172.18.1.102:53740 ESTABLISHED 81399/nomad
tcp6 0 0 172.16.1.101:4647 172.16.1.102:46434 ESTABLISHED 81399/nomad
tcp6 0 0 172.16.1.101:4647 172.16.1.102:46254 ESTABLISHED 81399/nomad
tcp6 0 0 172.16.1.101:4647 172.18.1.102:46040 ESTABLISHED 81399/nomad
tcp6 0 0 172.16.1.101:4647 172.19.1.103:43114 ESTABLISHED 81399/nomad
tcp6 0 0 172.16.1.101:4647 172.19.1.103:35864 ESTABLISHED 81399/nomad
tcp6 0 0 172.16.1.101:4647 172.18.1.102:43716 ESTABLISHED 81399/nomad
tcp6 0 0 172.16.1.101:4647 172.16.1.102:37796 ESTABLISHED 81399/nomad
tcp6 0 0 172.16.1.101:4647 172.18.1.102:59942 ESTABLISHED 81399/nomad
tcp6 0 0 172.16.1.101:4647 172.18.1.102:60652 ESTABLISHED 81399/nomad
tcp6 0 0 172.16.1.101:4647 172.16.1.102:44920 ESTABLISHED 81399/nomad
tcp6 0 0 172.16.1.101:4647 172.18.1.102:54366 ESTABLISHED 81399/nomad
tcp6 0 0 172.16.1.101:4647 172.18.1.102:49608 ESTABLISHED 81399/nomad
tcp6 0 0 172.16.1.101:4647 172.18.1.102:36148 ESTABLISHED 81399/nomad
tcp6 0 0 172.16.1.101:4647 172.18.1.102:34496 ESTABLISHED 81399/nomad
tcp6 0 0 172.16.1.101:4647 172.19.1.103:36992 ESTABLISHED 81399/nomad
tcp6 0 0 172.16.1.101:4647 172.18.1.102:53778 ESTABLISHED 81399/nomad
tcp6 0 0 172.16.1.101:4647 172.18.1.102:39420 ESTABLISHED 81399/nomad
tcp6 0 0 172.16.1.101:4647 172.18.1.102:59724 ESTABLISHED 81399/nomad
tcp6 0 0 172.16.1.101:4647 172.19.1.103:56602 ESTABLISHED 81399/nomad
tcp6 0 0 172.16.1.101:4647 172.18.1.102:46896 ESTABLISHED 81399/nomad
tcp6 0 0 172.16.1.101:4647 172.19.1.103:44656 ESTABLISHED 81399/nomad
tcp6 0 0 172.16.1.101:4647 172.18.1.102:40878 ESTABLISHED 81399/nomad
tcp6 0 0 172.16.1.101:4647 172.18.1.102:38490 ESTABLISHED 81399/nomad
tcp6 0 0 172.16.1.101:4647 172.19.1.103:50958 ESTABLISHED 81399/nomad
tcp6 0 0 172.16.1.101:4647 172.18.1.102:43506 ESTABLISHED 81399/nomad
tcp6 0 0 172.16.1.101:4647 172.19.1.103:52994 ESTABLISHED 81399/nomad
tcp6 0 0 172.16.1.101:4647 172.16.1.102:34398 ESTABLISHED 81399/nomad
tcp6 0 0 172.16.1.101:4647 172.19.1.103:52816 ESTABLISHED 81399/nomad
tcp6 0 0 172.16.1.101:4647 172.19.1.103:38432 ESTABLISHED 81399/nomad
tcp6 0 0 172.16.1.101:4647 172.18.1.102:52446 ESTABLISHED 81399/nomad
tcp6 0 0 172.16.1.101:4647 172.19.1.103:58280 ESTABLISHED 81399/nomad
tcp6 0 0 172.16.1.101:4647 172.16.1.102:44562 ESTABLISHED 81399/nomad
tcp6 0 0 172.16.1.101:4647 172.18.1.102:43502 ESTABLISHED 81399/nomad
tcp6 0 0 172.16.1.101:4647 172.16.1.102:40524 ESTABLISHED 81399/nomad
tcp6 0 0 172.16.1.101:4647 172.18.1.102:47002 ESTABLISHED 81399/nomad
tcp6 0 0 172.16.1.101:4647 172.18.1.102:53172 ESTABLISHED 81399/nomad
tcp6 0 0 172.16.1.101:4647 172.19.1.103:54260 ESTABLISHED 81399/nomad
tcp6 0 0 172.16.1.101:4647 172.18.1.102:34068 ESTABLISHED 81399/nomad
tcp6 0 0 172.16.1.101:4647 172.18.1.102:37366 ESTABLISHED 81399/nomad
tcp6 0 0 172.16.1.101:4647 172.18.1.102:46842 ESTABLISHED 81399/nomad
tcp6 0 0 172.16.1.101:4647 172.18.1.102:49136 ESTABLISHED 81399/nomad
tcp6 0 0 172.16.1.101:4647 172.18.1.102:51686 ESTABLISHED 81399/nomad
tcp6 0 0 172.16.1.101:4647 172.19.1.103:53048 ESTABLISHED 81399/nomad
tcp6 0 0 172.16.1.101:4647 172.18.1.102:51328 ESTABLISHED 81399/nomad
tcp6 0 0 172.16.1.101:4647 172.19.1.103:55660 ESTABLISHED 81399/nomad
tcp6 0 0 172.16.1.101:4647 172.18.1.102:43054 ESTABLISHED 81399/nomad
tcp6 0 0 172.16.1.101:4647 172.18.1.101:54818 ESTABLISHED 81399/nomad
tcp6 0 0 172.16.1.101:4647 172.18.1.102:38002 ESTABLISHED 81399/nomad
tcp6 0 0 172.16.1.101:4647 172.18.1.102:55594 ESTABLISHED 81399/nomad
tcp6 0 0 172.16.1.101:4647 172.16.1.102:36114 ESTABLISHED 81399/nomad
tcp6 0 0 172.16.1.101:4647 172.18.1.102:36530 ESTABLISHED 81399/nomad
tcp6 0 0 172.16.1.101:4647 172.18.1.102:51474 ESTABLISHED 81399/nomad
tcp6 0 0 172.16.1.101:4647 172.16.1.102:53928 ESTABLISHED 81399/nomad
tcp6 0 0 172.16.1.101:4647 172.18.1.102:56106 ESTABLISHED 81399/nomad
tcp6 0 0 172.16.1.101:4647 172.16.1.102:58964 ESTABLISHED 81399/nomad
tcp6 0 0 172.16.1.101:4647 172.19.1.103:41252 ESTABLISHED 81399/nomad
tcp6 0 0 172.16.1.101:4647 172.18.1.102:41284 ESTABLISHED 81399/nomad
tcp6 0 0 172.16.1.101:4647 172.16.1.102:44364 ESTABLISHED 81399/nomad
tcp6 0 0 172.16.1.101:4647 172.18.1.102:51148 ESTABLISHED 81399/nomad
tcp6 0 0 172.16.1.101:4647 172.18.1.102:49030 ESTABLISHED 81399/nomad
tcp6 0 0 172.16.1.101:4647 172.18.1.102:56662 ESTABLISHED 81399/nomad
tcp6 0 0 172.16.1.101:4647 172.18.1.102:45278 ESTABLISHED 81399/nomad
tcp6 0 0 172.16.1.101:4647 172.19.1.103:33778 ESTABLISHED 81399/nomad
tcp6 0 0 172.16.1.101:4647 172.18.1.102:49622 ESTABLISHED 81399/nomad
tcp6 0 0 172.16.1.101:4647 172.18.1.102:59404 ESTABLISHED 81399/nomad
tcp6 0 0 172.16.1.101:4647 172.18.1.102:33176 ESTABLISHED 81399/nomad
tcp6 0 0 172.16.1.101:4647 172.18.1.102:37578 ESTABLISHED 81399/nomad
tcp6 0 0 172.16.1.101:4647 172.18.1.102:47464 ESTABLISHED 81399/nomad
tcp6 0 0 172.16.1.101:4647 172.16.1.102:60076 ESTABLISHED 81399/nomad
tcp6 0 0 172.16.1.101:4647 172.18.1.102:51776 ESTABLISHED 81399/nomad
tcp6 0 0 172.16.1.101:4647 172.19.1.103:60856 ESTABLISHED 81399/nomad
tcp6 0 0 172.16.1.101:4647 172.19.1.103:56654 ESTABLISHED 81399/nomad
tcp6 0 0 172.16.1.101:4647 172.18.1.102:47346 ESTABLISHED 81399/nomad
tcp6 0 0 172.16.1.101:4647 172.18.1.102:34330 ESTABLISHED 81399/nomad
tcp6 0 0 172.16.1.101:4647 172.18.1.102:54524 ESTABLISHED 81399/nomad
tcp6 0 0 172.16.1.101:4647 172.18.1.102:59006 ESTABLISHED 81399/nomad
tcp6 0 0 172.16.1.101:4647 172.16.1.102:38596 ESTABLISHED 81399/nomad
tcp6 0 0 172.16.1.101:4647 172.19.1.103:36948 ESTABLISHED 81399/nomad
# Process ID 81399 seems to be:
81399 /usr/bin/nomad agent -config /etc/nomad.d
```
We're talking about roughly 13 Nomad jobs in total at the moment, with no `nomad alloc logs` in use. But... our monitoring does call the API at `"/v1/client/fs/logs/%s"`. That is the HTTP client (using the official Go library), though, and it will be connecting to `:4646` instead of `:4647`.

It does seem to be initiated from the same Nomad process.

> But... our monitoring does call the API at `"/v1/client/fs/logs/%s"`. That is the HTTP client (using the official Go library), though, and it will be connecting to `:4646` instead of `:4647`.

Ah ha! When you send an HTTP API request to any Nomad agent and it can't be served by that specific agent, it gets mapped to RPC calls and bounced to the appropriate node in the cluster. So, for example, suppose you've got the following topology:
```mermaid
flowchart TD
    HTTPClient("HTTP client")
    subgraph Servers
        L("Leader")
        A("Follower A")
        B("Follower B")
        L <--> A
        A <--> B
        B <--> L
    end
    HTTPClient --> B
    ClientA("Client A")
    ClientA <--> A
```
Suppose the HTTP client sends the `/v1/client/fs/logs/%s` API call for an allocation that's running on Client A, and suppose it hits Follower B. That server doesn't have a connection to Client A, so it has to find a server that does. Follower B opens a streaming RPC to Follower A, which in turn opens a streaming RPC to Client A.
Because streaming RPCs unfortunately can't be multiplexed on the same TCP connection (today -- this would be nice to do at some point), those are all independent connections, and that's likely why you're hitting RPC limits between servers. And you'll potentially be hitting them in an uneven pattern, depending on the distribution of allocations among clients, which specific servers those clients happen to be heartbeating to, and which server the HTTP client is sending its request to.
So for your team, where you're monitoring allocation logs, there are a couple of options you might want to consider, such as increasing the `rpc_max_conns_per_client` limit.
Increasing the limit makes sense - that's an easy fix indeed.

But... shouldn't these get garbage collected as well? We're talking about 5-13 HTTP calls per minute (all directed towards the leader), all of which are closed within that minute. That shouldn't saturate the 100-connection RPC limit at all, right?

> But... shouldn't these get garbage collected as well? We're talking about 5-13 HTTP calls per minute (all directed towards the leader), all of which are closed within that minute. That shouldn't saturate the 100-connection RPC limit at all, right?

Yeah, they really should be. I spent a bit of time this morning trying to find where we might be missing a call to close the connection, without much luck. My suspicion is that the RPC client side (i.e. the leader opening the connection to a follower to connect the stream, in your case) is what's holding the connection open. I'll add some instrumentation to a build here and see if I can reproduce the behavior you're seeing.
@EtienneBruines I spent some time trying to reproduce what you're seeing and wasn't able to. You said there's monitoring calling the alloc logs endpoint. Is there any chance that monitoring tooling isn't closing the channel that gets passed to `Alloc.Logs`?

Alternatively, it's entirely possible I'm barking up the wrong tree with the streaming RPCs. I'll try to see if I'm missing something there as well.
`follow` is set to `false`, so it should exit on its own once read. I am reading the returned `frames` channel until it closes - and no errors are being returned. And as soon as the `frames` channel is closed, the `queryClientNode` is closed as well (not by me, but that happens in the Nomad API code). The process making the logs call also exits within that minute, and I'm guessing Go (or the Linux kernel?) would close any open TCP connections when the process exits.
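
For illustration, the call pattern looks roughly like this with the official Go client (`github.com/hashicorp/nomad/api`). This is a minimal sketch rather than our actual monitoring code, and the allocation ID and task name are placeholders:

```go
package main

import (
	"fmt"
	"os"

	"github.com/hashicorp/nomad/api"
)

func main() {
	// Placeholders: in practice the allocation ID and task name come from
	// the monitoring configuration.
	allocID, task := os.Args[1], os.Args[2]

	// DefaultConfig talks to the HTTP API (port 4646, or NOMAD_ADDR if set).
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		panic(err)
	}

	alloc, _, err := client.Allocations().Info(allocID, nil)
	if err != nil {
		panic(err)
	}

	// This is the cancel channel that gets passed to Alloc.Logs: closing it
	// tells the API client to tear down the stream early. With follow=false
	// the stream ends on its own, but it gets closed on exit anyway.
	cancel := make(chan struct{})
	defer close(cancel)

	frames, errCh := client.AllocFS().Logs(alloc, false, task, "stdout", "start", 0, cancel, nil)
	for {
		select {
		case frame, ok := <-frames:
			if !ok {
				return // frames channel closed: the log has been fully read
			}
			fmt.Print(string(frame.Data))
		case err := <-errCh:
			fmt.Fprintln(os.Stderr, "log stream error:", err)
			return
		}
	}
}
```
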

> `follow` is set to `false`, so it should exit on its own once read. I am reading the returned `frames` channel until it closes - and no errors are being returned. And as soon as the `frames` channel is closed, the `queryClientNode` is closed as well (not by me, but that happens in the Nomad API code).

Ok, yeah that all looks ok to me.

> The process making the logs call also exits within that minute, and I'm guessing Go (or the Linux kernel?) would close any open TCP connections when the process exits.

Thanks for that clarification; I was thinking of a persistent process here. So yeah, the TCP client in Go will send FIN/FIN-ACK when the process exits. The connection will get moved to `TIME_WAIT` (usually for 60s), and you'll see those connections waiting around on the HTTP port 4646. That means that, regardless of what your client code is doing, we're getting left with those extraneous connections on port 4647. The whole "streaming RPCs are different" angle could definitely be a red herring.
One more question: on `172.18.1.102` I see logs that I would normally only ever expect to see on the client, because they come from places like `client/rpc.go#L100`. Are you running mixed server/client nodes as servers?

> One more question: on `172.18.1.102` I see logs that I would normally only ever expect to see on the client, because they come from places like `client/rpc.go#L100`. Are you running mixed server/client nodes as servers?

Yes.
Ok, that's going to complicate things quite a bit. We don't really encourage that kind of topology because then you have to carve out a lot of resources on the clients for the server, and that's non-trivial to do with CPU resources in particular.
But to the point of this issue, it also has some surprising behaviors in terms of the RPC system. The chunks of code that run as the client and as the server are basically totally independent, so a client might have its connected server on another host entirely! So, for example, you could send the `alloc logs` HTTP request to the leader; it could then look for the server that's connected to that client, open the streaming RPC to that server, and that server then opens a streaming RPC to the node with the client, which is another server. There's a minor optimization if the `alloc logs` request happens to hit the server/client running the alloc in question, but otherwise all the traffic gets multiplied unnecessarily.
That being said, that still doesn't mean we should see >100 connections between servers. It just makes it a lot harder overall to debug why. I'll see if I can reproduce given that info. Thanks!
**Nomad version**

```
Nomad v1.5.5 BuildDate 2023-05-05T12:50:14Z Revision 3d63bc62b35cbe3f79cdd245d50b61f130ee1a79
```

The problem already occurred before, at v1.5.3.
**Operating system and Environment details**

Ubuntu 22.04.02 LTS
**Issue**

Nomad servers refusing each other's connections.
**Reproduction steps**

Unknown. It has happened for the 4th time in 2 months now.
**Expected Result**

Servers and clients communicate normally, without hitting the RPC connection limit.
**Actual Result**

Heartbeats failing due to the maximum RPC connections being exceeded. This causes a cluster-wide outage, because nodes are unable to communicate reliably.
**Job file (if appropriate)**

Not applicable.
Nomad Server logs (if appropriate)
On the receiving end (this is
172.18.1.102
):On the sending end (this is
172.16.1.101
):Nomad Client logs (if appropriate)
The client is apparently also unhappy.
Note that there's a delay of a few seconds between these log messages, unlike the messages at the server end.
This is `172.16.1.102`.
**Current workaround**

We now have a Unix cron job that restarts `nomad` on every server every day, to prevent these kinds of issues.