hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

rejecting client for exceeding maximum RPC connections #17180

Open · EtienneBruines opened this issue 1 year ago

EtienneBruines commented 1 year ago

Nomad version

Nomad v1.5.5 BuildDate 2023-05-05T12:50:14Z Revision 3d63bc62b35cbe3f79cdd245d50b61f130ee1a79

The problem had already occurred before, on v1.5.3.

Operating system and Environment details

Ubuntu 22.04.2 LTS

Issue

Nomad servers are refusing each other's connections.

Reproduction steps

Unknown.

It has happened for the 4th time in 2 months now.

Expected Result

Servers communicate with each other and with clients reliably; heartbeats succeed and the RPC connection limit is not exceeded.

Actual Result

Heartbeats fail because the maximum number of RPC connections is exceeded. This causes a cluster-wide outage, because nodes are unable to communicate reliably.

Job file (if appropriate)

Not applicable.

Nomad Server logs (if appropriate)

On the receiving end (this is 172.18.1.102):

    2023-05-15T05:41:55.438Z [ERROR] client.rpc: error performing RPC to server: error="rpc error: rpc error: failed to get conn: read tcp 172.18.1.102:52644->172.16.1.101:4647: read: connection reset by peer" rpc=Node.UpdateAlloc server=172.18.1.102:4647
    2023-05-15T05:41:55.438Z [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: rpc error: failed to get conn: read tcp 172.18.1.102:52644->172.16.1.101:4647: read: connection reset by peer" rpc=Node.UpdateAlloc server=172.18.1.102:4647
    2023-05-15T05:41:55.438Z [ERROR] client: error updating allocations: error="rpc error: rpc error: failed to get conn: read tcp 172.18.1.102:52644->172.16.1.101:4647: read: connection reset by peer"
    2023-05-15T05:41:55.893Z [ERROR] nomad.rpc: rejecting client for exceeding maximum RPC connections: remote_addr=172.16.1.101:48746 limit=100
    2023-05-15T05:41:56.545Z [ERROR] nomad.rpc: rejecting client for exceeding maximum RPC connections: remote_addr=172.16.1.101:48756 limit=100
    2023-05-15T05:41:56.991Z [ERROR] nomad.rpc: rejecting client for exceeding maximum RPC connections: remote_addr=172.16.1.101:48758 limit=100
    2023-05-15T05:41:57.182Z [ERROR] nomad.rpc: rejecting client for exceeding maximum RPC connections: remote_addr=172.16.1.101:48760 limit=100
    2023-05-15T05:41:57.870Z [ERROR] nomad.rpc: rejecting client for exceeding maximum RPC connections: remote_addr=172.16.1.101:48766 limit=100
    2023-05-15T05:41:58.574Z [ERROR] nomad.rpc: rejecting client for exceeding maximum RPC connections: remote_addr=172.16.1.101:42552 limit=100
    2023-05-15T05:41:58.991Z [ERROR] nomad.rpc: rejecting client for exceeding maximum RPC connections: remote_addr=172.16.1.101:42566 limit=100
    2023-05-15T05:41:59.221Z [ERROR] nomad.rpc: rejecting client for exceeding maximum RPC connections: remote_addr=172.16.1.101:42568 limit=100
    2023-05-15T05:41:59.877Z [ERROR] nomad.rpc: rejecting client for exceeding maximum RPC connections: remote_addr=172.16.1.101:42584 limit=100
    2023-05-15T05:42:00.520Z [ERROR] nomad.rpc: rejecting client for exceeding maximum RPC connections: remote_addr=172.16.1.101:42598 limit=100
    2023-05-15T05:42:00.991Z [ERROR] nomad.rpc: rejecting client for exceeding maximum RPC connections: remote_addr=172.16.1.101:42606 limit=100
    2023-05-15T05:42:01.196Z [ERROR] nomad.rpc: rejecting client for exceeding maximum RPC connections: remote_addr=172.16.1.101:42618 limit=100
    2023-05-15T05:42:01.890Z [ERROR] nomad.rpc: rejecting client for exceeding maximum RPC connections: remote_addr=172.16.1.101:42628 limit=100
    2023-05-15T05:42:02.557Z [ERROR] nomad.rpc: rejecting client for exceeding maximum RPC connections: remote_addr=172.16.1.101:42636 limit=100
    2023-05-15T05:42:02.991Z [ERROR] nomad.rpc: rejecting client for exceeding maximum RPC connections: remote_addr=172.16.1.101:42648 limit=100
    2023-05-15T05:42:03.191Z [ERROR] nomad.rpc: rejecting client for exceeding maximum RPC connections: remote_addr=172.16.1.101:42656 limit=100
    2023-05-15T05:42:03.881Z [ERROR] nomad.rpc: rejecting client for exceeding maximum RPC connections: remote_addr=172.16.1.101:42666 limit=100
    2023-05-15T05:42:04.590Z [ERROR] client.rpc: error performing RPC to server: error="rpc error: rpc error: failed to get conn: read tcp 172.19.1.103:35126->172.16.1.101:4647: read: connection reset by peer" rpc=Node.UpdateAlloc server=172.19.1.103:4647
    2023-05-15T05:42:04.590Z [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: rpc error: failed to get conn: read tcp 172.19.1.103:35126->172.16.1.101:4647: read: connection reset by peer" rpc=Node.UpdateAlloc server=172.19.1.103:4647
    2023-05-15T05:42:04.590Z [ERROR] client: error updating allocations: error="rpc error: rpc error: failed to get conn: read tcp 172.19.1.103:35126->172.16.1.101:4647: read: connection reset by peer"
    2023-05-15T05:42:04.597Z [ERROR] nomad.rpc: rejecting client for exceeding maximum RPC connections: remote_addr=172.16.1.101:42678 limit=100
    2023-05-15T05:42:04.991Z [ERROR] nomad.rpc: rejecting client for exceeding maximum RPC connections: remote_addr=172.16.1.101:42690 limit=100
    2023-05-15T05:42:05.248Z [ERROR] nomad.rpc: rejecting client for exceeding maximum RPC connections: remote_addr=172.16.1.101:42694 limit=100
    2023-05-15T05:42:05.270Z [ERROR] worker: failed to dequeue evaluation: worker_id=c01644e8-bc43-8a0c-c54d-e2dea770f13a error="rpc error: failed to get conn: EOF"
    2023-05-15T05:42:05.271Z [ERROR] worker: failed to dequeue evaluation: worker_id=4575e013-30bc-6f29-766e-8171ed0355d2 error="rpc error: failed to get conn: rpc error: lead thread didn't get connection"
    2023-05-15T05:42:05.949Z [ERROR] nomad.rpc: rejecting client for exceeding maximum RPC connections: remote_addr=172.16.1.101:42704 limit=100
    2023-05-15T05:42:06.602Z [ERROR] nomad.rpc: rejecting client for exceeding maximum RPC connections: remote_addr=172.16.1.101:42714 limit=100
    2023-05-15T05:42:06.993Z [ERROR] nomad.rpc: rejecting client for exceeding maximum RPC connections: remote_addr=172.16.1.101:42728 limit=100
    2023-05-15T05:42:07.239Z [ERROR] nomad.rpc: rejecting client for exceeding maximum RPC connections: remote_addr=172.16.1.101:42732 limit=100
    2023-05-15T05:42:07.923Z [ERROR] nomad.rpc: rejecting client for exceeding maximum RPC connections: remote_addr=172.16.1.101:42734 limit=100
    2023-05-15T05:42:08.643Z [ERROR] nomad.rpc: rejecting client for exceeding maximum RPC connections: remote_addr=172.16.1.101:38350 limit=100
    2023-05-15T05:42:08.991Z [ERROR] nomad.rpc: rejecting client for exceeding maximum RPC connections: remote_addr=172.16.1.101:38364 limit=100
    2023-05-15T05:42:09.300Z [ERROR] nomad.rpc: rejecting client for exceeding maximum RPC connections: remote_addr=172.16.1.101:38368 limit=100
    2023-05-15T05:42:10.019Z [ERROR] nomad.rpc: rejecting client for exceeding maximum RPC connections: remote_addr=172.16.1.101:38376 limit=100
    2023-05-15T05:42:10.698Z [ERROR] nomad.rpc: rejecting client for exceeding maximum RPC connections: remote_addr=172.16.1.101:38380 limit=100
    2023-05-15T05:42:10.991Z [ERROR] nomad.rpc: rejecting client for exceeding maximum RPC connections: remote_addr=172.16.1.101:38384 limit=100
    2023-05-15T05:42:11.100Z [ERROR] client.rpc: error performing RPC to server: error="rpc error: failed to get conn: read tcp 172.18.1.102:49290->172.16.1.101:4647: read: connection reset by peer" rpc=Node.UpdateAlloc server=172.16.1.101:4647
    2023-05-15T05:42:11.101Z [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: failed to get conn: read tcp 172.18.1.102:49290->172.16.1.101:4647: read: connection reset by peer" rpc=Node.UpdateAlloc server=172.16.1.101:4647
    2023-05-15T05:42:11.101Z [ERROR] client: error updating allocations: error="rpc error: failed to get conn: read tcp 172.18.1.102:49290->172.16.1.101:4647: read: connection reset by peer"
    2023-05-15T05:42:11.406Z [ERROR] nomad.rpc: rejecting client for exceeding maximum RPC connections: remote_addr=172.16.1.101:38396 limit=100
root@hashivault01-ham: /etc/nomad.d
# nomad server members    [5:44:55]
Name                     Address       Port  Status  Leader  Raft Version  Build  Datacenter  Region
hashivault02-del.global  172.18.1.102  4648  alive   false   3             1.5.5  del         global
hashivault03-haj.global  172.19.1.103  4648  alive   false   3             1.5.5  haj         global
hashivault01-ham.global  172.16.1.101  4648  alive   true    3             1.5.5  ham         global

On the sending end (this is 172.16.1.101):

    2023-05-15T05:42:34.994Z [WARN]  nomad.stats_fetcher: error getting server health: server=hashivault02-del.global error="rpc error: failed to get conn: read tcp 172.16.1.101:56592->172.18.1.102:4647: read: connection reset by peer"
    2023-05-15T05:42:35.349Z [ERROR] nomad.raft: failed to heartbeat to: peer=172.19.1.103:4647 backoff time=500ms error="read tcp 172.16.1.101:53538->172.19.1.103:4647: read: connection reset by peer"
    2023-05-15T05:42:35.352Z [ERROR] nomad.rpc: rejecting client for exceeding maximum RPC connections: remote_addr=172.18.1.102:52060 limit=100
    2023-05-15T05:42:35.552Z [ERROR] nomad.raft: failed to heartbeat to: peer=172.18.1.102:4647 backoff time=500ms error="read tcp 172.16.1.101:56600->172.18.1.102:4647: read: connection reset by peer"
    2023-05-15T05:42:35.923Z [ERROR] nomad.rpc: rejecting client for exceeding maximum RPC connections: remote_addr=172.19.1.103:47358 limit=100
    2023-05-15T05:42:35.964Z [WARN]  nomad.stats_fetcher: failed retrieving server health: server=hashivault02-del.global error="context deadline exceeded"
    2023-05-15T05:42:35.964Z [WARN]  nomad.stats_fetcher: failed retrieving server health: server=hashivault03-haj.global error="context deadline exceeded"
    2023-05-15T05:42:36.005Z [ERROR] nomad.raft: failed to heartbeat to: peer=172.19.1.103:4647 backoff time=500ms error="read tcp 172.16.1.101:53546->172.19.1.103:4647: read: connection reset by peer"
    2023-05-15T05:42:36.255Z [ERROR] nomad.raft: failed to heartbeat to: peer=172.18.1.102:4647 backoff time=500ms error="read tcp 172.16.1.101:56602->172.18.1.102:4647: read: connection reset by peer"
    2023-05-15T05:42:36.653Z [ERROR] nomad.raft: failed to heartbeat to: peer=172.19.1.103:4647 backoff time=500ms error="read tcp 172.16.1.101:53556->172.19.1.103:4647: read: connection reset by peer"
    2023-05-15T05:42:36.937Z [ERROR] nomad.raft: failed to heartbeat to: peer=172.18.1.102:4647 backoff time=500ms error="read tcp 172.16.1.101:56616->172.18.1.102:4647: read: connection reset by peer"
    2023-05-15T05:42:36.993Z [WARN]  nomad.stats_fetcher: error getting server health: server=hashivault03-haj.global error="rpc error: failed to get conn: read tcp 172.16.1.101:53568->172.19.1.103:4647: read: connection reset by peer"
    2023-05-15T05:42:36.995Z [WARN]  nomad.stats_fetcher: error getting server health: server=hashivault02-del.global error="rpc error: failed to get conn: read tcp 172.16.1.101:56630->172.18.1.102:4647: read: connection reset by peer"
    2023-05-15T05:42:37.299Z [ERROR] nomad.raft: failed to heartbeat to: peer=172.19.1.103:4647 backoff time=500ms error="read tcp 172.16.1.101:53574->172.19.1.103:4647: read: connection reset by peer"
    2023-05-15T05:42:37.663Z [ERROR] nomad.raft: failed to heartbeat to: peer=172.18.1.102:4647 backoff time=500ms error="read tcp 172.16.1.101:56636->172.18.1.102:4647: read: connection reset by peer"
    2023-05-15T05:42:37.933Z [ERROR] nomad.raft: failed to heartbeat to: peer=172.19.1.103:4647 backoff time=500ms error="read tcp 172.16.1.101:53580->172.19.1.103:4647: read: connection reset by peer"
    2023-05-15T05:42:37.964Z [WARN]  nomad.stats_fetcher: failed retrieving server health: server=hashivault02-del.global error="context deadline exceeded"
    2023-05-15T05:42:37.964Z [WARN]  nomad.stats_fetcher: failed retrieving server health: server=hashivault03-haj.global error="context deadline exceeded"
    2023-05-15T05:42:38.325Z [ERROR] nomad.raft: failed to heartbeat to: peer=172.18.1.102:4647 backoff time=500ms error="read tcp 172.16.1.101:58590->172.18.1.102:4647: read: connection reset by peer"
    2023-05-15T05:42:38.413Z [ERROR] nomad.rpc: rejecting client for exceeding maximum RPC connections: remote_addr=172.18.1.102:52066 limit=100
    2023-05-15T05:42:38.552Z [ERROR] nomad.rpc: rejecting client for exceeding maximum RPC connections: remote_addr=172.18.1.102:52080 limit=100
    2023-05-15T05:42:38.572Z [ERROR] nomad.raft: failed to heartbeat to: peer=172.19.1.103:4647 backoff time=500ms error="read tcp 172.16.1.101:52072->172.19.1.103:4647: read: connection reset by peer"
    2023-05-15T05:42:38.993Z [WARN]  nomad.stats_fetcher: error getting server health: server=hashivault03-haj.global error="rpc error: failed to get conn: read tcp 172.16.1.101:52074->172.19.1.103:4647: read: connection reset by peer"
    2023-05-15T05:42:38.994Z [WARN]  nomad.stats_fetcher: error getting server health: server=hashivault02-del.global error="rpc error: failed to get conn: read tcp 172.16.1.101:58600->172.18.1.102:4647: read: connection reset by peer"
    2023-05-15T05:42:39.022Z [ERROR] nomad.raft: failed to heartbeat to: peer=172.18.1.102:4647 backoff time=500ms error="read tcp 172.16.1.101:58606->172.18.1.102:4647: read: connection reset by peer"
    2023-05-15T05:42:39.222Z [ERROR] nomad.raft: failed to heartbeat to: peer=172.19.1.103:4647 backoff time=500ms error="read tcp 172.16.1.101:52076->172.19.1.103:4647: read: connection reset by peer"
    2023-05-15T05:42:39.458Z [ERROR] nomad.rpc: rejecting client for exceeding maximum RPC connections: remote_addr=172.19.1.103:34052 limit=100
    2023-05-15T05:42:39.723Z [ERROR] nomad.raft: failed to heartbeat to: peer=172.18.1.102:4647 backoff time=500ms error="read tcp 172.16.1.101:58614->172.18.1.102:4647: read: connection reset by peer"
    2023-05-15T05:42:39.944Z [ERROR] nomad.raft: failed to heartbeat to: peer=172.19.1.103:4647 backoff time=500ms error="read tcp 172.16.1.101:52092->172.19.1.103:4647: read: connection reset by peer"
    2023-05-15T05:42:39.964Z [WARN]  nomad.stats_fetcher: failed retrieving server health: server=hashivault02-del.global error="context deadline exceeded"
    2023-05-15T05:42:39.964Z [WARN]  nomad.stats_fetcher: failed retrieving server health: server=hashivault03-haj.global error="context deadline exceeded"
    2023-05-15T05:42:40.402Z [ERROR] nomad.raft: failed to heartbeat to: peer=172.18.1.102:4647 backoff time=500ms error="read tcp 172.16.1.101:58626->172.18.1.102:4647: read: connection reset by peer"
    2023-05-15T05:42:40.586Z [ERROR] nomad.raft: failed to heartbeat to: peer=172.19.1.103:4647 backoff time=500ms error="read tcp 172.16.1.101:52098->172.19.1.103:4647: read: connection reset by peer"
    2023-05-15T05:42:40.993Z [WARN]  nomad.stats_fetcher: error getting server health: server=hashivault03-haj.global error="rpc error: failed to get conn: read tcp 172.16.1.101:52108->172.19.1.103:4647: read: connection reset by peer"
    2023-05-15T05:42:40.995Z [WARN]  nomad.stats_fetcher: error getting server health: server=hashivault02-del.global error="rpc error: failed to get conn: read tcp 172.16.1.101:58640->172.18.1.102:4647: read: connection reset by peer"
    2023-05-15T05:42:41.106Z [ERROR] nomad.raft: failed to heartbeat to: peer=172.18.1.102:4647 backoff time=500ms error="read tcp 172.16.1.101:58654->172.18.1.102:4647: read: connection reset by peer"
    2023-05-15T05:42:41.111Z [ERROR] nomad.rpc: rejecting client for exceeding maximum RPC connections: remote_addr=172.18.1.102:52086 limit=100
    2023-05-15T05:42:41.283Z [ERROR] nomad.raft: failed to heartbeat to: peer=172.19.1.103:4647 backoff time=500ms error="read tcp 172.16.1.101:52124->172.19.1.103:4647: read: connection reset by peer"
    2023-05-15T05:42:41.802Z [ERROR] nomad.raft: failed to heartbeat to: peer=172.18.1.102:4647 backoff time=500ms error="read tcp 172.16.1.101:58660->172.18.1.102:4647: read: connection reset by peer"
    2023-05-15T05:42:41.959Z [ERROR] nomad.raft: failed to heartbeat to: peer=172.19.1.103:4647 backoff time=500ms error="read tcp 172.16.1.101:52136->172.19.1.103:4647: read: connection reset by peer"
    2023-05-15T05:42:41.964Z [WARN]  nomad.stats_fetcher: failed retrieving server health: server=hashivault03-haj.global error="context deadline exceeded"
    2023-05-15T05:42:41.964Z [WARN]  nomad.stats_fetcher: failed retrieving server health: server=hashivault02-del.global error="context deadline exceeded"
    2023-05-15T05:42:42.440Z [ERROR] nomad.raft: failed to heartbeat to: peer=172.18.1.102:4647 backoff time=500ms error="read tcp 172.16.1.101:58676->172.18.1.102:4647: read: connection reset by peer"
    2023-05-15T05:42:42.686Z [ERROR] nomad.raft: failed to heartbeat to: peer=172.19.1.103:4647 backoff time=500ms error="read tcp 172.16.1.101:52152->172.19.1.103:4647: read: connection reset by peer"
    2023-05-15T05:42:42.993Z [WARN]  nomad.stats_fetcher: error getting server health: server=hashivault03-haj.global error="rpc error: failed to get conn: read tcp 172.16.1.101:52154->172.19.1.103:4647: read: connection reset by peer"
    2023-05-15T05:42:42.995Z [WARN]  nomad.stats_fetcher: error getting server health: server=hashivault02-del.global error="rpc error: failed to get conn: read tcp 172.16.1.101:58688->172.18.1.102:4647: read: connection reset by peer"
root@hashivault02-del: ~
# nomad server members    [5:45:34]
Name                     Address       Port  Status  Leader  Raft Version  Build  Datacenter  Region
hashivault02-del.global  172.18.1.102  4648  alive   false   3             1.5.5  del         global
hashivault03-haj.global  172.19.1.103  4648  alive   false   3             1.5.5  haj         global
hashivault01-ham.global  172.16.1.101  4648  alive   false   3             1.5.5  ham         global

Error determining leaders: 1 error occurred:
    * Region "global": Unexpected response code: 500 (rpc error: failed to get conn: read tcp 172.18.1.102:40442->172.16.1.101:4647: read: connection reset by peer)

FAIL

Nomad Client logs (if appropriate)

The client is apparently also unhappy.

Note that there's a delay of a few seconds between these log messages, unlike the messages at the server end.

This is 172.16.1.102:

    2023-05-15T05:42:14.853Z [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: rpc error: failed to get conn: read tcp 172.19.1.103:38606->172.16.1.101:4647: read: connection reset by peer" rpc=Node.UpdateAlloc server=172.19.1.103:4647
    2023-05-15T05:42:14.853Z [ERROR] client: error updating allocations: error="rpc error: rpc error: failed to get conn: read tcp 172.19.1.103:38606->172.16.1.101:4647: read: connection reset by peer"
    2023-05-15T05:42:21.602Z [ERROR] client.rpc: error performing RPC to server: error="rpc error: rpc error: failed to get conn: read tcp 172.18.1.102:40378->172.16.1.101:4647: read: connection reset by peer" rpc=Node.UpdateAlloc server=172.18.1.102:4647
    2023-05-15T05:42:21.603Z [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: rpc error: failed to get conn: read tcp 172.18.1.102:40378->172.16.1.101:4647: read: connection reset by peer" rpc=Node.UpdateAlloc server=172.18.1.102:4647
    2023-05-15T05:42:21.603Z [ERROR] client: error updating allocations: error="rpc error: rpc error: failed to get conn: read tcp 172.18.1.102:40378->172.16.1.101:4647: read: connection reset by peer"
    2023-05-15T05:42:31.010Z [ERROR] client.rpc: error performing RPC to server: error="rpc error: failed to get conn: read tcp 172.16.1.102:55998->172.16.1.101:4647: read: connection reset by peer" rpc=Node.UpdateAlloc server=172.16.1.101:4647
    2023-05-15T05:42:31.010Z [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: failed to get conn: read tcp 172.16.1.102:55998->172.16.1.101:4647: read: connection reset by peer" rpc=Node.UpdateAlloc server=172.16.1.101:4647
    2023-05-15T05:42:31.010Z [ERROR] client: error updating allocations: error="rpc error: failed to get conn: read tcp 172.16.1.102:55998->172.16.1.101:4647: read: connection reset by peer"
    2023-05-15T05:42:38.569Z [ERROR] client.rpc: error performing RPC to server: error="rpc error: rpc error: failed to get conn: read tcp 172.18.1.102:52080->172.16.1.101:4647: read: connection reset by peer" rpc=Node.UpdateAlloc server=172.18.1.102:4647
    2023-05-15T05:42:38.569Z [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: rpc error: failed to get conn: read tcp 172.18.1.102:52080->172.16.1.101:4647: read: connection reset by peer" rpc=Node.UpdateAlloc server=172.18.1.102:4647
    2023-05-15T05:42:38.569Z [ERROR] client: error updating allocations: error="rpc error: rpc error: failed to get conn: read tcp 172.18.1.102:52080->172.16.1.101:4647: read: connection reset by peer"
    2023-05-15T05:42:39.473Z [ERROR] client.rpc: error performing RPC to server: error="rpc error: rpc error: failed to get conn: read tcp 172.19.1.103:34052->172.16.1.101:4647: read: connection reset by peer" rpc=Node.UpdateStatus server=172.19.1.103:4647
    2023-05-15T05:42:39.474Z [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: rpc error: failed to get conn: read tcp 172.19.1.103:34052->172.16.1.101:4647: read: connection reset by peer" rpc=Node.UpdateStatus server=172.19.1.103:4647
    2023-05-15T05:42:39.474Z [ERROR] client: error heartbeating. retrying: error="failed to update status: rpc error: rpc error: failed to get conn: read tcp 172.19.1.103:34052->172.16.1.101:4647: read: connection reset by peer" period=28.78956854s
    2023-05-15T05:42:39.489Z [DEBUG] client.consul: bootstrap contacting Consul DCs: consul_dcs=["ham"]
    2023-05-15T05:42:39.505Z [INFO]  client.consul: discovered following servers: servers=[172.16.1.101:4647, 172.19.1.103:4647, 172.18.1.102:4647, 172.18.1.102:4647]
    2023-05-15T05:42:44.657Z [ERROR] client.rpc: error performing RPC to server: error="rpc error: rpc error: failed to get conn: EOF" rpc=Node.UpdateAlloc server=172.18.1.102:4647
    2023-05-15T05:42:44.658Z [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: rpc error: failed to get conn: EOF" rpc=Node.UpdateAlloc server=172.18.1.102:4647
    2023-05-15T05:42:44.658Z [ERROR] client: error updating allocations: error="rpc error: rpc error: failed to get conn: EOF"
    2023-05-15T05:42:46.573Z [DEBUG] client: updated allocations: index=478200 total=0 pulled=0 filtered=0
    2023-05-15T05:42:46.573Z [DEBUG] client: allocation updates: added=0 removed=0 updated=0 ignored=0
    2023-05-15T05:42:46.573Z [DEBUG] client: allocation updates applied: added=0 removed=0 updated=0 ignored=0 errors=0
    2023-05-15T05:42:46.574Z [ERROR] client.rpc: error performing RPC to server: error="rpc error: failed to get conn: read tcp 172.16.1.102:33392->172.16.1.101:4647: read: connection reset by peer" rpc=Node.UpdateStatus server=172.16.1.101:4647
    2023-05-15T05:42:46.574Z [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: failed to get conn: read tcp 172.16.1.102:33392->172.16.1.101:4647: read: connection reset by peer" rpc=Node.UpdateStatus server=172.16.1.101:4647
    2023-05-15T05:42:46.574Z [ERROR] client: error heartbeating. retrying: error="failed to update status: rpc error: failed to get conn: read tcp 172.16.1.102:33392->172.16.1.101:4647: read: connection reset by peer" period=26.170791746s
    2023-05-15T05:42:46.574Z [ERROR] client.rpc: error performing RPC to server: error="rpc error: failed to get conn: rpc error: lead thread didn't get connection" rpc=Node.GetClientAllocs server=172.16.1.101:4647
    2023-05-15T05:42:46.575Z [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: failed to get conn: rpc error: lead thread didn't get connection" rpc=Node.GetClientAllocs server=172.16.1.101:4647
    2023-05-15T05:42:46.575Z [ERROR] client: error querying node allocations: error="rpc error: failed to get conn: rpc error: lead thread didn't get connection"
    2023-05-15T05:42:46.590Z [DEBUG] client.consul: bootstrap contacting Consul DCs: consul_dcs=["ham"]
    2023-05-15T05:42:46.606Z [INFO]  client.consul: discovered following servers: servers=[172.16.1.101:4647, 172.19.1.103:4647, 172.18.1.102:4647, 172.18.1.102:4647]
    2023-05-15T05:42:54.106Z [ERROR] client.rpc: error performing RPC to server: error="rpc error: rpc error: failed to get conn: read tcp 172.18.1.102:50228->172.16.1.101:4647: read: connection reset by peer" rpc=Node.UpdateAlloc server=172.18.1.102:4647
    2023-05-15T05:42:54.106Z [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: rpc error: failed to get conn: read tcp 172.18.1.102:50228->172.16.1.101:4647: read: connection reset by peer" rpc=Node.UpdateAlloc server=172.18.1.102:4647
    2023-05-15T05:42:54.106Z [ERROR] client: error updating allocations: error="rpc error: rpc error: failed to get conn: read tcp 172.18.1.102:50228->172.16.1.101:4647: read: connection reset by peer"
    2023-05-15T05:42:59.768Z [ERROR] client.rpc: error performing RPC to server: error="rpc error: rpc error: failed to get conn: read tcp 172.19.1.103:59424->172.16.1.101:4647: read: connection reset by peer" rpc=Node.UpdateAlloc server=172.19.1.103:4647
    2023-05-15T05:42:59.768Z [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: rpc error: failed to get conn: read tcp 172.19.1.103:59424->172.16.1.101:4647: read: connection reset by peer" rpc=Node.UpdateAlloc server=172.19.1.103:4647
    2023-05-15T05:42:59.768Z [ERROR] client: error updating allocations: error="rpc error: rpc error: failed to get conn: read tcp 172.19.1.103:59424->172.16.1.101:4647: read: connection reset by peer"

Current workaround

We have now set up a Unix cron job that restarts Nomad on every server once a day, to prevent these kinds of issues.

tgross commented 1 year ago

Hi @EtienneBruines! If we look at the log message you're seeing:

nomad.rpc: rejecting client for exceeding maximum RPC connections: remote_addr=172.16.1.101:42584 limit=100

We can see that rpc_max_conns_per_client is set to its default of 100. This should usually be more than enough, because:

Nomad clients multiplex many RPC calls over a single TCP connection, except for streaming endpoints such as log streaming which require their own connection when routed through servers

I would only expect you to hit the limit of 100 if all of your clients reach the servers through a load balancer/proxy (which we would not recommend, and which would probably affect many clients at once), or if you're running a lot of streaming endpoints like nomad alloc exec, nomad alloc logs, or nomad alloc fs. If you're running a lot of streaming endpoints, you'll want to increase the limit.

If you're not running your clients thru a load balancer and not running a lot of streaming endpoints, then it's possible we've got a bug somewhere in ensuring all requests are being correctly multiplexed on the same TCP connection. My money would be on the template block, because of how we've implemented consul-template as a library. The nomad.rpc.service_registration.read/list or nomad.rpc.variables.read/list rate metrics might give a clue as to whether that's the case -- see if there are a lot of those for a given client IP.

EtienneBruines commented 1 year ago

Thank you @tgross for your reply!

If you're not running your clients thru a load balancer and not running a lot of streaming endpoints

We are not using a load balancer (all direct connections to the IPs) and are not running any streaming endpoints.

My money would be on the template block, because of how we've implemented consul-template as a library

We do like using the template block, so that might be the culprit.

The nomad.rpc.service_registration.read/list or nomad.rpc.variables.read/list rate metrics might give a clue as to whether that's the case -- see if there are a lot of those for a given client IP.

I was not able to find any of these metrics in the nomad operator metrics output. We are using Consul rather than Nomad's internal service registration.

# nomad operator metrics -pretty | grep rpc | grep Name    [14:30:40]
            "Name": "nomad.nomad.rpc.accept_conn",
            "Name": "nomad.nomad.rpc.acl.read",
            "Name": "nomad.nomad.rpc.alloc.list",
            "Name": "nomad.nomad.rpc.node.list",
            "Name": "nomad.nomad.rpc.query",
            "Name": "nomad.nomad.rpc.request",
            "Name": "nomad.nomad.rpc.status.read",
            "Name": "nomad.raft.net.rpcDecode",
            "Name": "nomad.raft.net.rpcDecode",
            "Name": "nomad.raft.net.rpcEnqueue",
            "Name": "nomad.raft.net.rpcEnqueue",
            "Name": "nomad.raft.net.rpcRespond",
            "Name": "nomad.raft.net.rpcRespond",
            "Name": "nomad.raft.rpc.appendEntries",
            "Name": "nomad.raft.rpc.processHeartbeat",
tgross commented 1 year ago

We do like using the template block, so that might be the culprit. ... I was not able to find any of these metrics in the nomad operator metrics output. We are using Consul and not the internal Nomad for service registration.

Ah, you'd only see template creating RPCs to Nomad if you were using Nomad's native services or Nomad Variables. For Consul the API requests go directly to Consul without talking to Nomad.

But I'm realizing I misunderstood the initial problem: you're seeing limits being hit on server-to-server communication as well. Looking at the logs in detail, it doesn't look like the problem is the limiter malfunctioning -- these are indeed separate remote addresses. So the RPC client is opening new TCP connections between the servers without closing the old ones. Servers have similar requirements:

A server needs at least 2 TCP connections (1 Raft, 1 RPC) per peer server locally and in any federated region. Servers also need a TCP connection per routed streaming endpoint concurrently in use.

But as with the clients, the server streaming endpoints only come from operator use like nomad alloc logs or the Event Stream API. Do you have any kind of infrastructure metrics you can look at around TCP connection state leading up to the time of the rate limit? If the servers were leaking connections I'd expect to see them increase right before (or worse, gradually increase leading up to) the event where the number of connections exceeds the limit.

EtienneBruines commented 1 year ago

A server needs at least 2 TCP connections (1 Raft, 1 RPC) per peer server locally and in any federated region. Servers also need a TCP connection per routed streaming endpoint concurrently in use.

It does seem like there are a few more connections going on than that (note that the output below is not from when things went haywire, but from the current state right now):

# From the receiving end: netstat -natp
tcp        0      0 172.18.1.102:47346      172.16.1.101:4647       ESTABLISHED 3419458/nomad       
tcp        0      0 172.18.1.102:51474      172.16.1.101:4647       ESTABLISHED 3419458/nomad       
tcp        0      0 172.18.1.102:59724      172.16.1.101:4647       ESTABLISHED 3419458/nomad       
tcp        0      0 172.18.1.102:54524      172.16.1.101:4647       ESTABLISHED 3419458/nomad       
tcp        0      0 172.18.1.102:54448      172.16.1.101:4647       ESTABLISHED 3419458/nomad       
tcp        0      0 172.18.1.102:37494      172.16.1.101:4647       ESTABLISHED 3419458/nomad       
tcp        0      0 172.18.1.102:36148      172.16.1.101:4647       ESTABLISHED 3419458/nomad       
tcp        0      0 172.18.1.102:54366      172.16.1.101:4647       ESTABLISHED 3419458/nomad       
tcp        0      0 172.18.1.102:60724      172.16.1.101:4647       ESTABLISHED 3419458/nomad       
tcp        0      0 172.18.1.102:43054      172.16.1.101:4647       ESTABLISHED 3419458/nomad       
tcp        0      0 172.18.1.102:53778      172.16.1.101:4647       ESTABLISHED 3419458/nomad       
tcp        0      0 172.18.1.102:53740      172.16.1.101:4647       ESTABLISHED 3419458/nomad       
tcp        0      0 172.18.1.102:46842      172.16.1.101:4647       ESTABLISHED 3419458/nomad       
tcp        0      0 172.18.1.102:55594      172.16.1.101:4647       ESTABLISHED 3419458/nomad       
tcp        0      0 172.18.1.102:37366      172.16.1.101:4647       ESTABLISHED 3419458/nomad       
tcp        0      0 172.18.1.102:36144      172.18.1.102:4647       ESTABLISHED 3419458/nomad       
tcp        0      0 172.18.1.102:51686      172.16.1.101:4647       ESTABLISHED 3419458/nomad       
tcp        0      0 172.18.1.102:39760      172.19.1.103:4647       ESTABLISHED 3419458/nomad       
tcp        0      0 172.18.1.102:43716      172.16.1.101:4647       ESTABLISHED 3419458/nomad       
tcp        0      0 172.18.1.102:47002      172.16.1.101:4647       ESTABLISHED 3419458/nomad       
tcp        0      0 172.18.1.102:60428      172.16.1.101:4647       ESTABLISHED 3419458/nomad       
tcp        0      0 172.18.1.102:34068      172.16.1.101:4647       ESTABLISHED 3419458/nomad       
tcp        0      0 172.18.1.102:58742      172.16.1.101:4647       ESTABLISHED 3419458/nomad       
tcp        0      0 172.18.1.102:46896      172.16.1.101:4647       ESTABLISHED 3419458/nomad       
tcp        0      0 172.18.1.102:56106      172.16.1.101:4647       ESTABLISHED 3419458/nomad       
tcp        0      0 172.18.1.102:56662      172.16.1.101:4647       ESTABLISHED 3419458/nomad       
tcp        0      0 172.18.1.102:41012      172.19.1.103:4647       ESTABLISHED 3419458/nomad       
tcp        0      0 172.18.1.102:42860      172.16.1.101:4647       ESTABLISHED 3419458/nomad       
tcp        0      0 172.18.1.102:34496      172.16.1.101:4647       ESTABLISHED 3419458/nomad       
tcp        0      0 172.18.1.102:51148      172.16.1.101:4647       ESTABLISHED 3419458/nomad       
tcp        0      0 172.18.1.102:52446      172.16.1.101:4647       ESTABLISHED 3419458/nomad       
tcp        0      0 172.18.1.102:41040      172.19.1.103:4647       ESTABLISHED 3419458/nomad       
tcp        0      0 172.18.1.102:40878      172.16.1.101:4647       ESTABLISHED 3419458/nomad       
tcp        0      0 172.18.1.102:51776      172.16.1.101:4647       ESTABLISHED 3419458/nomad       
tcp        0      0 172.18.1.102:38002      172.16.1.101:4647       ESTABLISHED 3419458/nomad       
tcp        0      0 172.18.1.102:34330      172.16.1.101:4647       ESTABLISHED 3419458/nomad       
tcp        0      0 172.18.1.102:41028      172.19.1.103:4647       ESTABLISHED 3419458/nomad       
tcp        0      0 172.18.1.102:45764      172.18.1.102:4647       ESTABLISHED 3419458/nomad       
tcp        0      0 172.18.1.102:59404      172.16.1.101:4647       ESTABLISHED 3419458/nomad       
tcp        0      0 172.18.1.102:41284      172.16.1.101:4647       ESTABLISHED 3419458/nomad       
tcp        0      0 172.18.1.102:37578      172.16.1.101:4647       ESTABLISHED 3419458/nomad       
tcp        0      0 172.18.1.102:49608      172.16.1.101:4647       ESTABLISHED 3419458/nomad       
tcp       25      0 172.18.1.102:46550      172.16.1.101:4647       CLOSE_WAIT  3419458/nomad       
tcp        0      0 172.18.1.102:45278      172.16.1.101:4647       ESTABLISHED 3419458/nomad       
tcp        0      0 172.18.1.102:46040      172.16.1.101:4647       ESTABLISHED 3419458/nomad       
tcp        0      0 172.18.1.102:60652      172.16.1.101:4647       ESTABLISHED 3419458/nomad       
tcp        0      0 172.18.1.102:39420      172.16.1.101:4647       ESTABLISHED 3419458/nomad       
tcp        1      0 172.18.1.102:46552      172.16.1.101:4647       CLOSE_WAIT  3419458/nomad       
tcp        0      0 172.18.1.102:59942      172.16.1.101:4647       ESTABLISHED 3419458/nomad       
tcp        0      0 172.18.1.102:49622      172.16.1.101:4647       ESTABLISHED 3419458/nomad       
tcp        0      0 172.18.1.102:38490      172.16.1.101:4647       ESTABLISHED 3419458/nomad       
tcp        0      0 172.18.1.102:33176      172.16.1.101:4647       ESTABLISHED 3419458/nomad       
tcp        0      0 172.18.1.102:51328      172.16.1.101:4647       ESTABLISHED 3419458/nomad       
tcp        0      0 172.18.1.102:49136      172.16.1.101:4647       ESTABLISHED 3419458/nomad       
tcp        0      0 172.18.1.102:59006      172.16.1.101:4647       ESTABLISHED 3419458/nomad       
tcp        0      0 172.18.1.102:49030      172.16.1.101:4647       ESTABLISHED 3419458/nomad       
tcp        0      0 172.18.1.102:53172      172.16.1.101:4647       ESTABLISHED 3419458/nomad       
tcp        0      0 172.18.1.102:38746      172.16.1.101:4647       ESTABLISHED 3419458/nomad       

# From the sending end, netstat -natp
tcp6       0      0 172.16.1.101:4647       172.18.1.102:54448      ESTABLISHED 81399/nomad         
tcp6       0      0 172.16.1.101:4647       172.19.1.103:59042      ESTABLISHED 81399/nomad         
tcp6       0      0 172.16.1.101:4647       172.16.1.102:43046      ESTABLISHED 81399/nomad         
tcp6       0      0 172.16.1.101:4647       172.18.1.102:60428      ESTABLISHED 81399/nomad         
tcp6       0      0 172.16.1.101:4647       172.18.1.102:37494      ESTABLISHED 81399/nomad         
tcp6       0      0 172.16.1.101:4647       172.18.1.102:38746      ESTABLISHED 81399/nomad         
tcp6       0      0 172.16.1.101:4647       172.19.1.103:51454      ESTABLISHED 81399/nomad         
tcp6       0      0 172.16.1.101:4647       172.18.1.102:42860      ESTABLISHED 81399/nomad         
tcp6       0      0 172.16.1.101:4647       172.16.1.101:41664      ESTABLISHED 81399/nomad         
tcp6       0      0 172.16.1.101:4647       172.18.1.102:60724      ESTABLISHED 81399/nomad         
tcp6       0      0 172.16.1.101:4647       172.19.1.103:49376      ESTABLISHED 81399/nomad         
tcp6       0      0 172.16.1.101:4647       172.18.1.102:53740      ESTABLISHED 81399/nomad         
tcp6       0      0 172.16.1.101:4647       172.16.1.102:46434      ESTABLISHED 81399/nomad         
tcp6       0      0 172.16.1.101:4647       172.16.1.102:46254      ESTABLISHED 81399/nomad         
tcp6       0      0 172.16.1.101:4647       172.18.1.102:46040      ESTABLISHED 81399/nomad         
tcp6       0      0 172.16.1.101:4647       172.19.1.103:43114      ESTABLISHED 81399/nomad         
tcp6       0      0 172.16.1.101:4647       172.19.1.103:35864      ESTABLISHED 81399/nomad         
tcp6       0      0 172.16.1.101:4647       172.18.1.102:43716      ESTABLISHED 81399/nomad         
tcp6       0      0 172.16.1.101:4647       172.16.1.102:37796      ESTABLISHED 81399/nomad         
tcp6       0      0 172.16.1.101:4647       172.18.1.102:59942      ESTABLISHED 81399/nomad         
tcp6       0      0 172.16.1.101:4647       172.18.1.102:60652      ESTABLISHED 81399/nomad         
tcp6       0      0 172.16.1.101:4647       172.16.1.102:44920      ESTABLISHED 81399/nomad         
tcp6       0      0 172.16.1.101:4647       172.18.1.102:54366      ESTABLISHED 81399/nomad         
tcp6       0      0 172.16.1.101:4647       172.18.1.102:49608      ESTABLISHED 81399/nomad         
tcp6       0      0 172.16.1.101:4647       172.18.1.102:36148      ESTABLISHED 81399/nomad         
tcp6       0      0 172.16.1.101:4647       172.18.1.102:34496      ESTABLISHED 81399/nomad         
tcp6       0      0 172.16.1.101:4647       172.19.1.103:36992      ESTABLISHED 81399/nomad         
tcp6       0      0 172.16.1.101:4647       172.18.1.102:53778      ESTABLISHED 81399/nomad         
tcp6       0      0 172.16.1.101:4647       172.18.1.102:39420      ESTABLISHED 81399/nomad         
tcp6       0      0 172.16.1.101:4647       172.18.1.102:59724      ESTABLISHED 81399/nomad         
tcp6       0      0 172.16.1.101:4647       172.19.1.103:56602      ESTABLISHED 81399/nomad         
tcp6       0      0 172.16.1.101:4647       172.18.1.102:46896      ESTABLISHED 81399/nomad         
tcp6       0      0 172.16.1.101:4647       172.19.1.103:44656      ESTABLISHED 81399/nomad         
tcp6       0      0 172.16.1.101:4647       172.18.1.102:40878      ESTABLISHED 81399/nomad         
tcp6       0      0 172.16.1.101:4647       172.18.1.102:38490      ESTABLISHED 81399/nomad         
tcp6       0      0 172.16.1.101:4647       172.19.1.103:50958      ESTABLISHED 81399/nomad         
tcp6       0      0 172.16.1.101:4647       172.18.1.102:43506      ESTABLISHED 81399/nomad         
tcp6       0      0 172.16.1.101:4647       172.19.1.103:52994      ESTABLISHED 81399/nomad         
tcp6       0      0 172.16.1.101:4647       172.16.1.102:34398      ESTABLISHED 81399/nomad         
tcp6       0      0 172.16.1.101:4647       172.19.1.103:52816      ESTABLISHED 81399/nomad         
tcp6       0      0 172.16.1.101:4647       172.19.1.103:38432      ESTABLISHED 81399/nomad         
tcp6       0      0 172.16.1.101:4647       172.18.1.102:52446      ESTABLISHED 81399/nomad         
tcp6       0      0 172.16.1.101:4647       172.19.1.103:58280      ESTABLISHED 81399/nomad         
tcp6       0      0 172.16.1.101:4647       172.16.1.102:44562      ESTABLISHED 81399/nomad         
tcp6       0      0 172.16.1.101:4647       172.18.1.102:43502      ESTABLISHED 81399/nomad         
tcp6       0      0 172.16.1.101:4647       172.16.1.102:40524      ESTABLISHED 81399/nomad         
tcp6       0      0 172.16.1.101:4647       172.18.1.102:47002      ESTABLISHED 81399/nomad         
tcp6       0      0 172.16.1.101:4647       172.18.1.102:53172      ESTABLISHED 81399/nomad         
tcp6       0      0 172.16.1.101:4647       172.19.1.103:54260      ESTABLISHED 81399/nomad         
tcp6       0      0 172.16.1.101:4647       172.18.1.102:34068      ESTABLISHED 81399/nomad         
tcp6       0      0 172.16.1.101:4647       172.18.1.102:37366      ESTABLISHED 81399/nomad         
tcp6       0      0 172.16.1.101:4647       172.18.1.102:46842      ESTABLISHED 81399/nomad         
tcp6       0      0 172.16.1.101:4647       172.18.1.102:49136      ESTABLISHED 81399/nomad         
tcp6       0      0 172.16.1.101:4647       172.18.1.102:51686      ESTABLISHED 81399/nomad         
tcp6       0      0 172.16.1.101:4647       172.19.1.103:53048      ESTABLISHED 81399/nomad         
tcp6       0      0 172.16.1.101:4647       172.18.1.102:51328      ESTABLISHED 81399/nomad         
tcp6       0      0 172.16.1.101:4647       172.19.1.103:55660      ESTABLISHED 81399/nomad         
tcp6       0      0 172.16.1.101:4647       172.18.1.102:43054      ESTABLISHED 81399/nomad         
tcp6       0      0 172.16.1.101:4647       172.18.1.101:54818      ESTABLISHED 81399/nomad         
tcp6       0      0 172.16.1.101:4647       172.18.1.102:38002      ESTABLISHED 81399/nomad         
tcp6       0      0 172.16.1.101:4647       172.18.1.102:55594      ESTABLISHED 81399/nomad         
tcp6       0      0 172.16.1.101:4647       172.16.1.102:36114      ESTABLISHED 81399/nomad         
tcp6       0      0 172.16.1.101:4647       172.18.1.102:36530      ESTABLISHED 81399/nomad         
tcp6       0      0 172.16.1.101:4647       172.18.1.102:51474      ESTABLISHED 81399/nomad         
tcp6       0      0 172.16.1.101:4647       172.16.1.102:53928      ESTABLISHED 81399/nomad         
tcp6       0      0 172.16.1.101:4647       172.18.1.102:56106      ESTABLISHED 81399/nomad         
tcp6       0      0 172.16.1.101:4647       172.16.1.102:58964      ESTABLISHED 81399/nomad         
tcp6       0      0 172.16.1.101:4647       172.19.1.103:41252      ESTABLISHED 81399/nomad         
tcp6       0      0 172.16.1.101:4647       172.18.1.102:41284      ESTABLISHED 81399/nomad         
tcp6       0      0 172.16.1.101:4647       172.16.1.102:44364      ESTABLISHED 81399/nomad         
tcp6       0      0 172.16.1.101:4647       172.18.1.102:51148      ESTABLISHED 81399/nomad         
tcp6       0      0 172.16.1.101:4647       172.18.1.102:49030      ESTABLISHED 81399/nomad         
tcp6       0      0 172.16.1.101:4647       172.18.1.102:56662      ESTABLISHED 81399/nomad         
tcp6       0      0 172.16.1.101:4647       172.18.1.102:45278      ESTABLISHED 81399/nomad         
tcp6       0      0 172.16.1.101:4647       172.19.1.103:33778      ESTABLISHED 81399/nomad         
tcp6       0      0 172.16.1.101:4647       172.18.1.102:49622      ESTABLISHED 81399/nomad         
tcp6       0      0 172.16.1.101:4647       172.18.1.102:59404      ESTABLISHED 81399/nomad         
tcp6       0      0 172.16.1.101:4647       172.18.1.102:33176      ESTABLISHED 81399/nomad         
tcp6       0      0 172.16.1.101:4647       172.18.1.102:37578      ESTABLISHED 81399/nomad         
tcp6       0      0 172.16.1.101:4647       172.18.1.102:47464      ESTABLISHED 81399/nomad         
tcp6       0      0 172.16.1.101:4647       172.16.1.102:60076      ESTABLISHED 81399/nomad         
tcp6       0      0 172.16.1.101:4647       172.18.1.102:51776      ESTABLISHED 81399/nomad         
tcp6       0      0 172.16.1.101:4647       172.19.1.103:60856      ESTABLISHED 81399/nomad         
tcp6       0      0 172.16.1.101:4647       172.19.1.103:56654      ESTABLISHED 81399/nomad         
tcp6       0      0 172.16.1.101:4647       172.18.1.102:47346      ESTABLISHED 81399/nomad         
tcp6       0      0 172.16.1.101:4647       172.18.1.102:34330      ESTABLISHED 81399/nomad         
tcp6       0      0 172.16.1.101:4647       172.18.1.102:54524      ESTABLISHED 81399/nomad         
tcp6       0      0 172.16.1.101:4647       172.18.1.102:59006      ESTABLISHED 81399/nomad         
tcp6       0      0 172.16.1.101:4647       172.16.1.102:38596      ESTABLISHED 81399/nomad         
tcp6       0      0 172.16.1.101:4647       172.19.1.103:36948      ESTABLISHED 81399/nomad        

# Process ID 81399 seems to be:
81399 /usr/bin/nomad agent -config /etc/nomad.d 

We're talking about roughly 13 Nomad jobs in total at the moment, with no nomad alloc logs in use. However, our monitoring does call the API at "/v1/client/fs/logs/%s" -- but that is the HTTP client (using the official Go library), and it connects to :4646 instead of :4647.

It does seem to be initiated from the same Nomad process.

tgross commented 1 year ago

But... Our monitoring does call the API at "/v1/client/fs/logs/%s". But that is the HTTP client (using the official Go library) and will be connecting to :4646 instead of :4647.

Ah ha! When you send an HTTP API request to any Nomad agent and it can't be served by that specific agent, it gets mapped to RPC calls and bounced to the appropriate node in the cluster. So, for example, suppose you've got the following topology:

flowchart TD
    HTTPClient("HTTP client")
    subgraph Servers
    L("Leader")
    A("Follower A")
    B("Follower B")
    L <--> A
    A <--> B
    B <--> L
    end
    HTTPClient --> B
    ClientA("Client A")
    ClientA <--> A

Suppose the HTTP client sends the /v1/client/fs/logs/%s API call for an allocation that's running on Client A, and suppose it hits Follower B. That server doesn't have a connection to Client A, so it has to find a server that does. Follower B opens a streaming RPC to Follower A, which in turn opens a streaming RPC to Client A.

Because streaming RPCs unfortunately can't be multiplexed over the same TCP connection today (this would be nice to do at some point), those are all independent connections, and that's likely why you're hitting RPC limits between servers. You'll also potentially hit them in an uneven pattern, depending on how allocations are distributed among clients, which specific servers those clients happen to be heartbeating to, and which server the HTTP client sends its request to.

So for your team, where you're monitoring allocation logs, there are a couple of options you might want to consider: raising rpc_max_conns_per_client in the servers' limits block, or pointing the logs request directly at the agent that is running the allocation so it doesn't have to be relayed through the servers (a sketch of the latter follows).
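
For the second option, something along these lines might work. This is a rough, untested sketch using the github.com/hashicorp/nomad/api package; the allocation ID is a placeholder, and TLS/ACL handling is left out:

    // Rough sketch (untested): send the logs request straight to the agent
    // that is running the allocation, instead of letting a server relay it.
    package main

    import (
        "log"

        "github.com/hashicorp/nomad/api"
    )

    // clientForAlloc returns an API client pointed at the agent running allocID.
    func clientForAlloc(root *api.Client, allocID string) (*api.Client, *api.Allocation, error) {
        alloc, _, err := root.Allocations().Info(allocID, nil)
        if err != nil {
            return nil, nil, err
        }

        node, _, err := root.Nodes().Info(alloc.NodeID, nil)
        if err != nil {
            return nil, nil, err
        }

        // node.HTTPAddr is the agent's advertised HTTP address (port 4646).
        // Plain HTTP is assumed here; use https plus TLS config if the agent has TLS enabled.
        cfg := api.DefaultConfig()
        cfg.Address = "http://" + node.HTTPAddr
        direct, err := api.NewClient(cfg)
        return direct, alloc, err
    }

    func main() {
        root, err := api.NewClient(api.DefaultConfig())
        if err != nil {
            log.Fatal(err)
        }

        // Placeholder allocation ID, for illustration only.
        direct, alloc, err := clientForAlloc(root, "11111111-2222-3333-4444-555555555555")
        if err != nil {
            log.Fatal(err)
        }
        _ = direct // from here, direct.AllocFS().Logs(alloc, ...) talks to the right agent
        _ = alloc
    }

In that case the stream should be served by the local agent and shouldn't need to be relayed between servers at all.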

EtienneBruines commented 1 year ago

Increasing the limit makes sense - that's an easy fix indeed.

But... Shouldn't these get garbage collected as well? We're talking about 5-13 HTTP calls per minute (all directed towards the leader), that are all closed within that minute. That shouldn't saturate the 100 RPC limit at all, right?

tgross commented 1 year ago

But... Shouldn't these get garbage collected as well? We're talking about 5-13 HTTP calls per minute (all directed towards the leader), that are all closed within that minute. That shouldn't saturate the 100 RPC limit at all, right?

Yeah, they really should be. I spent a bit of time this morning trying to find where we might be missing a call to close the connection, but without much luck. My suspicion is that the RPC client side (i.e. the leader opening the connection to a follower to connect the stream, in your case) is what's holding the connection open. I'll add some instrumentation to a build here and see if I can reproduce the behavior you're seeing.

tgross commented 1 year ago

@EtienneBruines I spent some time trying to reproduce what you're seeing and wasn't able to. You said there's monitoring calling the alloc logs endpoint. Is there any chance that monitoring tooling isn't closing the channel that gets passed to Alloc.Logs?

Alternately, it's entirely possible I'm barking up the wrong tree with the streaming RPCs. I'll try to see if I'm missing something there as well.

EtienneBruines commented 1 year ago

(Screenshot of the monitoring code's Alloc.Logs call.)

follow is set to false, so it should exit on its own once everything has been read. I am reading the returned frames channel until it closes, and no errors are being returned. As soon as the frames channel is closed, the queryClientNode is closed as well (not by me, but that happens in the Nomad API code). The process making the logs call also exits within that minute, and I'm guessing Go (or the Linux kernel?) would close any open TCP connections when the process exits.
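
For reference, the pattern is roughly the following -- a simplified sketch with a placeholder allocation ID and task name, using the github.com/hashicorp/nomad/api package, not the exact monitoring code:

    // Simplified sketch of the log-fetching pattern, using github.com/hashicorp/nomad/api.
    package main

    import (
        "fmt"
        "log"

        "github.com/hashicorp/nomad/api"
    )

    func main() {
        // Placeholders; the real code looks these up from the jobs it monitors.
        allocID := "11111111-2222-3333-4444-555555555555"
        task := "example-task"

        // Talks to the HTTP API on :4646, not the RPC port :4647.
        client, err := api.NewClient(api.DefaultConfig())
        if err != nil {
            log.Fatal(err)
        }

        alloc, _, err := client.Allocations().Info(allocID, nil)
        if err != nil {
            log.Fatal(err)
        }

        // Closing cancel ends the streaming request early; with follow=false the
        // frames channel is closed by the API client once the end of the log is reached.
        cancel := make(chan struct{})
        defer close(cancel)

        frames, errCh := client.AllocFS().Logs(alloc, false, task, "stdout", "start", 0, cancel, nil)
        for {
            select {
            case frame, ok := <-frames:
                if !ok {
                    return // end of log; stream is done
                }
                fmt.Print(string(frame.Data))
            case err := <-errCh:
                log.Fatal(err)
            }
        }
    }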

tgross commented 1 year ago

follow is set to false, so it should exit on its own once everything has been read. I am reading the returned frames channel until it closes, and no errors are being returned. As soon as the frames channel is closed, the queryClientNode is closed as well (not by me, but that happens in the Nomad API code).

Ok, yeah that all looks ok to me.

The process making the logs call also exits within that minute, and I'm guessing Go (or the Linux kernel?) would close any open TCP connections when the process exits.

Thanks for that clarification; I was thinking of a persistent process here. So yeah, when the process exits, the Go TCP client will send FIN/FIN-ACK. The connection will be moved to TIME_WAIT (usually for 60s), and you'll see those connections waiting around on the HTTP port 4646. That means that, regardless of what your client code is doing, we're being left with those extraneous connections on port 4647. The whole "streaming RPCs are different" angle could definitely be a red herring.

One more question: on 172.18.1.102 I see logs that I would normally only expect to see on the client, because they come from places like client/rpc.go#L100. Are you running mixed server/client nodes as servers?

EtienneBruines commented 1 year ago

One more question, on 172.18.1.102 I see logs that normally I would only ever expect to see on the client, because they come from places like client/rpc.go#L100. Are you running mixed server/client nodes as servers?

Yes.

tgross commented 1 year ago

Ok, that's going to complicate things quite a bit. We don't really encourage that kind of topology because then you have to carve out a lot of resources on the clients for the server, and that's non-trivial to do with CPU resources in particular.

But to the point of this issue, it also has some surprising behaviors in terms of the RPC system. The chunks of code that run as the client and as the server are basically independent, so a client might have its connected server on another host entirely! So, for example, you could send the alloc logs HTTP request to the leader; it would then look for the server that's connected to that client, open a streaming RPC to that server, and that server would then open a streaming RPC to the node running the client -- which is another server. There's a minor optimization if the alloc logs request happens to hit the server/client running the alloc in question, but otherwise all the traffic gets multiplied unnecessarily.

That being said, that still doesn't mean we should see >100 connections between servers. It just makes it a lot harder overall to debug why. I'll see if I can reproduce given that info. Thanks!