Open scalp42 opened 4 years ago
I'm experiencing the same on Nomad v0.12.3 (2db8abd9620dd41cb7bfe399551ba0f7824b3f61).
Confirming the problem on Nomad v0.11.3. Sometimes requests also end with an empty response (EOF error).
Yes, we get the EOF all the time as well:
By default, Nomad is bound to 127.0.0.1, so you can only connect to the API from that address. This can be reconfigured:
nomad agent -bind 0.0.0.0
Unfortunately, this has nothing to do with the issue.
I can confirm the API can be reached on that interface (you would get a connection refused otherwise) and, like I said, it works fine most of the time.
I seem to have a similar issue. I almost exclusively use the API to interact with Nomad, and occasionally it just stops responding (I have to restart Nomad to get it working again). I then get the `connection reset by peer` error.
I am running version v0.12.9.
Still happening on latest Nomad version unfortunately 😢
Hi folks,
We're seeing errors every day when trying to query (by hand or using the `nomad` CLI) any kind of HTTP endpoint really, `/v1/agent/self` or `/v1/acl/tokens` for example.

Here's an example on the actual clients:
```
root@nomad-compute-i-0316e38d5be06a803 [dev-usw2-dev1] ~ # date ; curl localhost:4646/v1/agent/self
Sun Aug 23 21:02:19 UTC 2020
curl: (56) Recv failure: Connection reset by peer
root@nomad-compute-i-0316e38d5be06a803 [dev-usw2-dev1] ~ # date ; curl localhost:4646/v1/agent/self
Sun Aug 23 21:02:19 UTC 2020
curl: (56) Recv failure: Connection reset by peer
root@nomad-compute-i-0316e38d5be06a803 [dev-usw2-dev1] ~ # date ; curl localhost:4646/v1/agent/self
Sun Aug 23 21:02:20 UTC 2020
curl: (56) Recv failure: Connection reset by peer
root@nomad-compute-i-0316e38d5be06a803 [dev-usw2-dev1] ~ # date ; curl localhost:4646/v1/agent/self
Sun Aug 23 21:02:20 UTC 2020
curl: (56) Recv failure: Connection reset by peer
root@nomad-compute-i-0316e38d5be06a803 [dev-usw2-dev1] ~ # date ; curl localhost:4646/v1/agent/self
Sun Aug 23 21:02:21 UTC 2020
{"config":{"ACL":{"Enabled":true,"PolicyTTL":30000000000,"ReplicationToken":"","TokenTTL":30000000000},"Addresses":{"HTTP":"0.0.0.0","RPC":"nomad-compute-i-0316e38d5be06a803.usw2-dev1.example.com","Serf":"nomad-compute-i-0316e38d5be06a803.usw2-dev1.example.com"},"AdvertiseAddrs":{"HTTP":"nomad-compute-i-0316e38d5be06a803.usw2-dev1.example.com:4646","RPC":"nomad-compute-i-0316e38d5be06a803.usw2-dev1.example.com:4647","Serf":"nomad-compute-i-0316e38d5be06a803.usw2-dev1.example.com"},"Audit":{"Enabled":null,"Filters":null,"Sinks":null},"Autopilot":{"CleanupDeadServers":null,"DisableUpgradeMigration":null,"EnableCustomUpgrades":null,"EnableRedundancyZones":null,"LastContactThreshold":200000000,"MaxTrailingLogs":250,"MinQuorum":0,"ServerStabilizationTime":10000000000},"BindAddr":"0.0.0.0","Client":{"AllocDir":"","BindWildcardDefaultHostNetwork":true,"BridgeNetworkName":"","BridgeNetworkSubnet":"","CNIConfigDir":"","CNIPath":"","ChrootEnv":{},"ClientMaxPort":14512,"ClientMinPort":14000,"CpuCompute":0,"DisableRemoteExec":false,"Enabled":true,"GCDiskUsageThreshold":80.0,"GCInodeUsageThreshold":70.0,"GCInterval":60000000000,"GCMaxAllocs":50,"GCParallelDestroys":2,"HostNetworks":null,"HostVolumes":null,"MaxKillTimeout":"30s","MemoryMB":0,"Meta":{"chef_role":"nomad-compute","role":"nomad-compute","connect.sidecar_image":"envoyproxy/envoy:v1.11.2@sha256:a7769160c9c1a55bb8d07a3b71ce5d64f72b1f665f10d81aa1581bc3cf850d09","connect.log_level":"info"},"NetworkInterface":"","NetworkSpeed":0,"NoHostUUID":false,"NodeClass":"compute","Options":{},"Reserved":{"CPU":440,"DiskMB":0,"MemoryMB":794,"ReservedPorts":""},"ServerJoin":{"RetryInterval":30000000000,"RetryJoin":["provider=aws tag_key=role tag_value=nomad-server region=us-west-2 addr_type=private_v4"],"RetryMaxAttempts":0,"StartJoin":null},"Servers":null,"StateDir":"","TemplateConfig":{"DisableSandbox":false,"FunctionBlacklist":["plugin"]}},"Consul":{"Addr":"127.0.0.1:8500","AllowUnauthenticated":true,"Auth":"","AutoAdvertise":true,"CAFile":"","CertFile":"","ChecksUseAdvertise":true,"ClientAutoJoin":true,"ClientHTTPCheckName":"Nomad Client HTTP Check","ClientServiceName":"nomad-client","EnableSSL":false,"GRPCAddr":"","KeyFile":"","ServerAutoJoin":true,"ServerHTTPCheckName":"Nomad Server HTTP Check","ServerRPCCheckName":"Nomad Server RPC Check","ServerSerfCheckName":"Nomad Server Serf Check","ServerServiceName":"nomad","ShareSSL":null,"Tags":null,"Timeout":5000000000,"Token":"
```

We use Sensu and run InSpec checks as well to monitor certain endpoints and it's flapping all the time:
![Screen Shot 2020-08-23 at 14 17 40](https://user-images.githubusercontent.com/1475276/90988970-ccd46180-e54b-11ea-80c8-52f31be9dc41.png)
![Screen Shot 2020-08-23 at 14 19 13](https://user-images.githubusercontent.com/1475276/90988975-cf36bb80-e54b-11ea-9a67-568f2e74710f.png)
![Screen Shot 2020-08-23 at 14 40 11](https://user-images.githubusercontent.com/1475276/90989456-b5e33e80-e54e-11ea-8a15-1ddb895b9c50.png)

One thing worth noting is that with `log_level` set to `DEBUG`, the failing requests don't make it to the Nomad log:
```
root@nomad-compute-i-0cdc8320aa6b1b1aa [dev-usw2-dev1] ~ # date; curl 127.0.0.1:4646/v1/agent/self
Mon Aug 24 02:51:26 UTC 2020
curl: (56) Recv failure: Connection reset by peer
root@nomad-compute-i-0cdc8320aa6b1b1aa [dev-usw2-dev1] ~ # date; curl 127.0.0.1:4646/v1/agent/self
Mon Aug 24 02:51:27 UTC 2020
curl: (56) Recv failure: Connection reset by peer
root@nomad-compute-i-0cdc8320aa6b1b1aa [dev-usw2-dev1] ~ # date; curl 127.0.0.1:4646/v1/agent/self
Mon Aug 24 02:51:28 UTC 2020
curl: (56) Recv failure: Connection reset by peer
root@nomad-compute-i-0cdc8320aa6b1b1aa [dev-usw2-dev1] ~ # date; curl 127.0.0.1:4646/v1/agent/self
Mon Aug 24 02:51:29 UTC 2020
curl: (56) Recv failure: Connection reset by peer
root@nomad-compute-i-0cdc8320aa6b1b1aa [dev-usw2-dev1] ~ # date; curl 127.0.0.1:4646/v1/agent/self
Mon Aug 24 02:51:29 UTC 2020
curl: (56) Recv failure: Connection reset by peer
root@nomad-compute-i-0cdc8320aa6b1b1aa [dev-usw2-dev1] ~ # date; curl 127.0.0.1:4646/v1/agent/self
Mon Aug 24 02:51:30 UTC 2020
curl: (56) Recv failure: Connection reset by peer
root@nomad-compute-i-0cdc8320aa6b1b1aa [dev-usw2-dev1] ~ # tail -f -n 200 /var/log/nomad.log
Aug 24 02:51:19 nomad-compute-i-0cdc8320aa6b1b1aa nomad[11090]: 2020-08-24T02:51:19.008Z [DEBUG] http: request complete: method=GET path=/v1/metrics duration=2.919233ms
Aug 24 02:51:20 nomad-compute-i-0cdc8320aa6b1b1aa nomad[11090]: 2020-08-24T02:51:20.736Z [DEBUG] http: request complete: method=GET path=/v1/status/leader duration=1.570177ms
Aug 24 02:51:24 nomad-compute-i-0cdc8320aa6b1b1aa nomad[11090]: 2020-08-24T02:51:24.115Z [DEBUG] http: request complete: method=GET path=/v1/metrics duration=4.058075ms
Aug 24 02:51:25 nomad-compute-i-0cdc8320aa6b1b1aa nomad[11090]: 2020-08-24T02:51:25.809Z [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=client duration=156.63µs
```

Nomad version:
This is our dev cluster (our production cluster does not have the same issue). The two are identical in configuration apart from the instance types and, technically, the dev payloads.
We looked through our Slack history and the first time it appeared was on August 3rd:
![Screen Shot 2020-08-23 at 14 28 06](https://user-images.githubusercontent.com/1475276/90989111-f2159f80-e54c-11ea-9921-86c6a486f9b2.png)

This lines up with us bumping the cluster from `0.12.0` to `0.12.1` (could be a coincidence or a v0.12.x issue, but then again our production cluster does not have this issue):

![Screen Shot 2020-08-23 at 14 29 08](https://user-images.githubusercontent.com/1475276/90989282-0908c180-e54e-11ea-8670-f37677cb88bf.png)

Now I imagine this must be hard to debug, so please let us know where we should start.
Thanks in advance for the help!
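
For anyone who wants to put numbers on the flapping rather than running `date ; curl` by hand, here is a minimal sketch of a probe loop. The function names `probe_once` and `probe_loop` are made up for illustration and are not part of Nomad. It only records curl's exit code per attempt (56 is the `Recv failure` shown above), so an HTTP-level error such as a 403 from ACLs still counts as a successful connection.

```shell
#!/bin/sh
# Sketch: probe an HTTP endpoint repeatedly and tally curl exit codes.
# Function names and defaults are illustrative, not part of Nomad.

# probe_once URL
# Prints curl's exit code: 0 when a response was received, 56 for
# "Recv failure" (e.g. connection reset by peer), 7 for connection refused.
probe_once() {
  probe_rc=0
  curl --silent --output /dev/null --max-time 5 "$1" || probe_rc=$?
  echo "$probe_rc"
}

# probe_loop URL COUNT [INTERVAL_SECONDS]
# Runs COUNT probes, prints one timestamped line per probe, then a summary.
probe_loop() {
  loop_url=$1
  loop_count=$2
  loop_interval=${3:-1}
  loop_ok=0
  loop_fail=0
  loop_i=0
  while [ "$loop_i" -lt "$loop_count" ]; do
    loop_rc=$(probe_once "$loop_url")
    if [ "$loop_rc" -eq 0 ]; then
      loop_ok=$((loop_ok + 1))
    else
      loop_fail=$((loop_fail + 1))
    fi
    printf '%s rc=%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$loop_rc"
    sleep "$loop_interval"
    loop_i=$((loop_i + 1))
  done
  printf 'ok=%d fail=%d\n' "$loop_ok" "$loop_fail"
}
```

After sourcing the file, something like `probe_loop http://127.0.0.1:4646/v1/agent/self 60 1` would run one probe per second for a minute. The UTC timestamps are there so the failed attempts can be lined up against the `request complete` lines in the Nomad debug log.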