hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Connection reset by peer when querying HTTP endpoints #8718

Open scalp42 opened 4 years ago

scalp42 commented 4 years ago

Hi folks,

We're seeing errors every day when querying (by hand or with the nomad CLI) pretty much any HTTP endpoint, for example `/v1/agent/self` or `/v1/acl/tokens`.

Here's an example on the actual clients:

```
root@nomad-compute-i-0316e38d5be06a803 [dev-usw2-dev1] ~ # date ; curl localhost:4646/v1/agent/self
Sun Aug 23 21:02:19 UTC 2020
curl: (56) Recv failure: Connection reset by peer
root@nomad-compute-i-0316e38d5be06a803 [dev-usw2-dev1] ~ # date ; curl localhost:4646/v1/agent/self
Sun Aug 23 21:02:19 UTC 2020
curl: (56) Recv failure: Connection reset by peer
root@nomad-compute-i-0316e38d5be06a803 [dev-usw2-dev1] ~ # date ; curl localhost:4646/v1/agent/self
Sun Aug 23 21:02:20 UTC 2020
curl: (56) Recv failure: Connection reset by peer
root@nomad-compute-i-0316e38d5be06a803 [dev-usw2-dev1] ~ # date ; curl localhost:4646/v1/agent/self
Sun Aug 23 21:02:20 UTC 2020
curl: (56) Recv failure: Connection reset by peer
root@nomad-compute-i-0316e38d5be06a803 [dev-usw2-dev1] ~ # date ; curl localhost:4646/v1/agent/self
Sun Aug 23 21:02:21 UTC 2020
{"config":{"ACL":{"Enabled":true,"PolicyTTL":30000000000,"ReplicationToken":"","TokenTTL":30000000000},"Addresses":{"HTTP":"0.0.0.0","RPC":"nomad-compute-i-0316e38d5be06a803.usw2-dev1.example.com","Serf":"nomad-compute-i-0316e38d5be06a803.usw2-dev1.example.com"},"AdvertiseAddrs":{"HTTP":"nomad-compute-i-0316e38d5be06a803.usw2-dev1.example.com:4646","RPC":"nomad-compute-i-0316e38d5be06a803.usw2-dev1.example.com:4647","Serf":"nomad-compute-i-0316e38d5be06a803.usw2-dev1.example.com"},"Audit":{"Enabled":null,"Filters":null,"Sinks":null},"Autopilot":{"CleanupDeadServers":null,"DisableUpgradeMigration":null,"EnableCustomUpgrades":null,"EnableRedundancyZones":null,"LastContactThreshold":200000000,"MaxTrailingLogs":250,"MinQuorum":0,"ServerStabilizationTime":10000000000},"BindAddr":"0.0.0.0","Client":{"AllocDir":"","BindWildcardDefaultHostNetwork":true,"BridgeNetworkName":"","BridgeNetworkSubnet":"","CNIConfigDir":"","CNIPath":"","ChrootEnv":{},"ClientMaxPort":14512,"ClientMinPort":14000,"CpuCompute":0,"DisableRemoteExec":false,"Enabled":true,"GCDiskUsageThreshold":80.0,"GCInodeUsageThreshold":70.0,"GCInterval":60000000000,"GCMaxAllocs":50,"GCParallelDestroys":2,"HostNetworks":null,"HostVolumes":null,"MaxKillTimeout":"30s","MemoryMB":0,"Meta":{"chef_role":"nomad-compute","role":"nomad-compute","connect.sidecar_image":"envoyproxy/envoy:v1.11.2@sha256:a7769160c9c1a55bb8d07a3b71ce5d64f72b1f665f10d81aa1581bc3cf850d09","connect.log_level":"info"},"NetworkInterface":"","NetworkSpeed":0,"NoHostUUID":false,"NodeClass":"compute","Options":{},"Reserved":{"CPU":440,"DiskMB":0,"MemoryMB":794,"ReservedPorts":""},"ServerJoin":{"RetryInterval":30000000000,"RetryJoin":["provider=aws tag_key=role tag_value=nomad-server region=us-west-2 addr_type=private_v4"],"RetryMaxAttempts":0,"StartJoin":null},"Servers":null,"StateDir":"","TemplateConfig":{"DisableSandbox":false,"FunctionBlacklist":["plugin"]}},"Consul":{"Addr":"127.0.0.1:8500","AllowUnauthenticated":true,"Auth":"","AutoAdvertise":true,"CAFile":"","CertFile":"","ChecksUseAdvertise":true,"ClientAutoJoin":true,"ClientHTTPCheckName":"Nomad Client HTTP Check","ClientServiceName":"nomad-client","EnableSSL":false,"GRPCAddr":"","KeyFile":"","ServerAutoJoin":true,"ServerHTTPCheckName":"Nomad Server HTTP Check","ServerRPCCheckName":"Nomad Server RPC Check","ServerSerfCheckName":"Nomad Server Serf Check","ServerServiceName":"nomad","ShareSSL":null,"Tags":null,"Timeout":5000000000,"Token":"","VerifySSL":true},"DataDir":"/var/lib/nomad","Datacenter":"dev-usw2-dev1","DevMode":false,"DisableAnonymousSignature":false,"DisableUpdateCheck":false,"EnableDebug":false,"EnableSyslog":false,"Files":["/etc/nomad/acl.json","/etc/nomad/client.json","/etc/nomad/consul.json","/etc/nomad/default.json","/etc/nomad/docker.json","/etc/nomad/exec.json","/etc/nomad/logging.json","/etc/nomad/raw_exec.json","/etc/nomad/telemetry.json"],"HTTPAPIResponseHeaders":{"Access-Control-Allow-Origin":"*"},"LeaveOnInt":false,"LeaveOnTerm":false,"Limits":{"HTTPMaxConnsPerClient":100,"HTTPSHandshakeTimeout":"5s","RPCHandshakeTimeout":"5s","RPCMaxConnsPerClient":100},"LogFile":"","LogJson":false,"LogLevel":"INFO","LogRotateBytes":0,"LogRotateDuration":"","LogRotateMaxFiles":0,"NodeName":"nomad-compute-i-0316e38d5be06a803","PluginDir":"/var/lib/nomad/plugins","Plugins":[{"Args":null,"Config":{"allow_caps":["CHOWN","DAC_OVERRIDE","FSETID","FOWNER","MKNOD","NET_RAW","SETGID","SETUID","SETFCAP","SETPCAP","NET_BIND_SERVICE","SYS_CHROOT","KILL","AUDIT_WRITE"],"auth":[{"helper":"","config":"/root/.docker/config.json"}],"tls":[{"ca":"","cert":"","key":""}],"gc":[{"image":true,"image_delay":"168h","container":true}],"volumes":[{"enabled":true,"selinuxlabel":""}],"endpoint":"","allow_privileged":false},"Name":"docker"},{"Args":null,"Config":{},"Name":"exec"},{"Args":null,"Config":{"enabled":true},"Name":"raw_exec"}],"Ports":{"HTTP":4646,"RPC":4647,"Serf":4648},"Region":"us-west-2","Sentinel":{"Imports":null},"Server":{"AuthoritativeRegion":"","BootstrapExpect":0,"CSIPluginGCThreshold":"","CSIVolumeClaimGCThreshold":"","DataDir":"","DefaultSchedulerConfig":null,"DeploymentGCThreshold":"","Enabled":false,"EnabledSchedulers":null,"EvalGCThreshold":"","HeartbeatGrace":0,"JobGCInterval":"","JobGCThreshold":"","MaxHeartbeatsPerSecond":0.0,"MinHeartbeatTTL":0,"NodeGCThreshold":"","NonVotingServer":false,"NumSchedulers":null,"ProtocolVersion":0,"RaftMultiplier":null,"RaftProtocol":0,"RedundancyZone":"","RejoinAfterLeave":false,"RetryInterval":0,"RetryJoin":[],"RetryMaxAttempts":0,"ServerJoin":{"RetryInterval":30000000000,"RetryJoin":[],"RetryMaxAttempts":0,"StartJoin":null},"StartJoin":[],"UpgradeVersion":""},"SyslogFacility":"LOCAL0","TLSConfig":{"CAFile":"","CertFile":"","Checksum":"","EnableHTTP":false,"EnableRPC":false,"KeyFile":"","KeyLoader":null,"RPCUpgradeMode":false,"TLSCipherSuites":"","TLSMinVersion":"","TLSPreferServerCipherSuites":false,"VerifyHTTPSClient":false,"VerifyServerHostname":false},"Telemetry":{"BackwardsCompatibleMetrics":false,"CirconusAPIApp":"","CirconusAPIToken":"","CirconusAPIURL":"","CirconusBrokerID":"","CirconusBrokerSelectTag":"","CirconusCheckDisplayName":"","CirconusCheckForceMetricActivation":"","CirconusCheckID":"","CirconusCheckInstanceID":"","CirconusCheckSearchTag":"","CirconusCheckSubmissionURL":"","CirconusCheckTags":"","CirconusSubmissionInterval":"","CollectionInterval":"10s","DataDogAddr":"","DataDogTags":[],"DisableDispatchedJobSummaryMetrics":true,"DisableHostname":true,"DisableTaggedMetrics":false,"FilterDefault":false,"PrefixFilter":["+nomad.client.consul","+nomad.client.allocs","+nomad.client.allocated.cpu","+nomad.client.allocated.disk","+nomad.client.allocated.iops","+nomad.client.allocated.memory","+nomad.client.allocations.blocked","+nomad.client.allocations.migrating","+nomad.client.allocations.pending","+nomad.client.allocations.running","+nomad.client.allocations.terminal","+nomad.client.host.cpu.idle","+nomad.client.host.cpu.system","+nomad.client.host.cpu.total","+nomad.client.host.cpu.user","+nomad.client.host.disk.available","+nomad.client.host.disk.inodes_percent","+nomad.client.host.disk.size","+nomad.client.host.disk.used","+nomad.client.host.disk.used_percent","+nomad.client.host.memory.available","+nomad.client.host.memory.free","+nomad.client.host.memory.total","+nomad.client.host.memory.used","+nomad.client.unallocated.cpu","+nomad.client.unallocated.disk","+nomad.client.unallocated.iops","+nomad.client.unallocated.memory","+nomad.nomad.blocked_evals.total_blocked","+nomad.nomad.blocked_evals.total_escaped","+nomad.nomad.blocked_evals.total_quota_limit","+nomad.nomad.broker._core.ready","+nomad.nomad.broker._core.unacked","+nomad.nomad.broker.total_blocked","+nomad.nomad.broker.total_ready","+nomad.nomad.broker.total_unacked","+nomad.nomad.broker.total_waiting","+nomad.nomad.heartbeat.active","+nomad.nomad.plan.queue_depth","+nomad.nomad.vault.distributed_tokens_revoking","+nomad.runtime.alloc_bytes","+nomad.runtime.free_count","+nomad.runtime.heap_objects","+nomad.runtime.malloc_count","+nomad.runtime.num_goroutines","+nomad.runtime.sys_bytes","+nomad.runtime.total_gc_pause_ns","+nomad.runtime.total_gc_runs","+nomad.uptime"],"PrometheusMetrics":true,"PublishAllocationMetrics":true,"PublishNodeMetrics":true,"StatsdAddr":"","StatsiteAddr":"","UseNodeName":false},"Vault":{"Addr":"https://vault.service.consul:8200","AllowUnauthenticated":true,"ConnectionRetryIntv":30000000000,"Enabled":null,"Namespace":"","Role":"","TLSCaFile":"","TLSCaPath":"","TLSCertFile":"","TLSKeyFile":"","TLSServerName":"","TLSSkipVerify":null,"TaskTokenTTL":"","Token":""},"Version":{"Revision":"ee69b3379aeced67e14943b86c4f621451e64e84","Version":"0.12.2","VersionMetadata":"","VersionPrerelease":""}},"member":{"Addr":null,"DelegateCur":0,"DelegateMax":0,"DelegateMin":0,"Name":"","Port":0,"ProtocolCur":0,"ProtocolMax":0,"ProtocolMin":0,"Status":"none","Tags":null},"stats":{"client":{"num_allocations":"42","last_heartbeat":"9.815527721s","heartbeat_ttl":"16.745483114s","node_id":"d858acd1-8f02-9683-a965-a86bda7808d9","known_servers":"10.22.206.108:4647,10.22.210.59:4647,10.22.228.130:4647"},"runtime":{"kernel.name":"linux","arch":"amd64","version":"go1.14.7","max_procs":"4","goroutines":"1975","cpu_count":"4"}}}
```

We also use Sensu and InSpec checks to monitor certain endpoints, and they flap all the time:

![Screen Shot 2020-08-23 at 14 17 40](https://user-images.githubusercontent.com/1475276/90988970-ccd46180-e54b-11ea-80c8-52f31be9dc41.png)

![Screen Shot 2020-08-23 at 14 19 13](https://user-images.githubusercontent.com/1475276/90988975-cf36bb80-e54b-11ea-9a67-568f2e74710f.png)

![Screen Shot 2020-08-23 at 14 40 11](https://user-images.githubusercontent.com/1475276/90989456-b5e33e80-e54e-11ea-8a15-1ddb895b9c50.png)

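To put a number on the flapping, a small probe can repeat the request and tally the outcomes. The following is a minimal Python sketch, not part of the original report: the names `classify`, `fetch_status`, and `probe` are hypothetical, and the host, port, and path defaults simply mirror the `curl` calls above.

```python
import http.client
import time
from collections import Counter


def classify(exc):
    """Map an exception from one request to a short label."""
    # RemoteDisconnected subclasses ConnectionResetError, so test it first.
    if isinstance(exc, http.client.RemoteDisconnected):
        return "eof"      # peer closed the connection without sending a response
    if isinstance(exc, ConnectionResetError):
        return "reset"    # "Connection reset by peer"
    if isinstance(exc, ConnectionRefusedError):
        return "refused"  # nothing accepted the connection
    return type(exc).__name__


def fetch_status(host="127.0.0.1", port=4646, path="/v1/agent/self", timeout=5.0):
    """Issue one GET over a fresh connection and return the HTTP status code."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        resp = conn.getresponse()
        resp.read()
        return resp.status
    finally:
        conn.close()


def probe(attempts=20, delay=0.5, fetch=fetch_status):
    """Call fetch() repeatedly and count outcomes by label."""
    tally = Counter()
    for _ in range(attempts):
        try:
            tally[f"http_{fetch()}"] += 1
        except (OSError, http.client.HTTPException) as exc:
            tally[classify(exc)] += 1
        time.sleep(delay)
    return tally
```

Running `print(dict(probe()))` on an affected client would show how many of the attempts ended in a reset, an empty reply, or a normal HTTP status.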
One thing worth noting: even with `log_level` set to `DEBUG`, the failing requests never show up in the Nomad log:

```
root@nomad-compute-i-0cdc8320aa6b1b1aa [dev-usw2-dev1] ~ # date; curl 127.0.0.1:4646/v1/agent/self
Mon Aug 24 02:51:26 UTC 2020
curl: (56) Recv failure: Connection reset by peer
root@nomad-compute-i-0cdc8320aa6b1b1aa [dev-usw2-dev1] ~ # date; curl 127.0.0.1:4646/v1/agent/self
Mon Aug 24 02:51:27 UTC 2020
curl: (56) Recv failure: Connection reset by peer
root@nomad-compute-i-0cdc8320aa6b1b1aa [dev-usw2-dev1] ~ # date; curl 127.0.0.1:4646/v1/agent/self
Mon Aug 24 02:51:28 UTC 2020
curl: (56) Recv failure: Connection reset by peer
root@nomad-compute-i-0cdc8320aa6b1b1aa [dev-usw2-dev1] ~ # date; curl 127.0.0.1:4646/v1/agent/self
Mon Aug 24 02:51:29 UTC 2020
curl: (56) Recv failure: Connection reset by peer
root@nomad-compute-i-0cdc8320aa6b1b1aa [dev-usw2-dev1] ~ # date; curl 127.0.0.1:4646/v1/agent/self
Mon Aug 24 02:51:29 UTC 2020
curl: (56) Recv failure: Connection reset by peer
root@nomad-compute-i-0cdc8320aa6b1b1aa [dev-usw2-dev1] ~ # date; curl 127.0.0.1:4646/v1/agent/self
Mon Aug 24 02:51:30 UTC 2020
curl: (56) Recv failure: Connection reset by peer
root@nomad-compute-i-0cdc8320aa6b1b1aa [dev-usw2-dev1] ~ # tail -f -n 200 /var/log/nomad.log
Aug 24 02:51:19 nomad-compute-i-0cdc8320aa6b1b1aa nomad[11090]: 2020-08-24T02:51:19.008Z [DEBUG] http: request complete: method=GET path=/v1/metrics duration=2.919233ms
Aug 24 02:51:20 nomad-compute-i-0cdc8320aa6b1b1aa nomad[11090]: 2020-08-24T02:51:20.736Z [DEBUG] http: request complete: method=GET path=/v1/status/leader duration=1.570177ms
Aug 24 02:51:24 nomad-compute-i-0cdc8320aa6b1b1aa nomad[11090]: 2020-08-24T02:51:24.115Z [DEBUG] http: request complete: method=GET path=/v1/metrics duration=4.058075ms
Aug 24 02:51:25 nomad-compute-i-0cdc8320aa6b1b1aa nomad[11090]: 2020-08-24T02:51:25.809Z [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=client duration=156.63µs
```
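Since the failing requests never reach the HTTP request log, the connections are presumably being dropped before a handler runs. One thing worth ruling out (a hypothesis, not a confirmed cause) is the per-client connection limit visible in the config dump above (`"HTTPMaxConnsPerClient":100`): if local tools hold many connections open from the same IP, new ones could be turned away. Below is a rough Python sketch that counts established connections to port 4646 per peer address from `ss -tn` output; `conns_per_peer` and `read_ss` are hypothetical helper names, and the column layout is assumed to be the usual iproute2 one.

```python
import subprocess
from collections import Counter


def conns_per_peer(ss_output, local_port=4646):
    """Count established TCP connections to local_port, grouped by peer IP.

    Expects the text printed by `ss -tn`; the header line is ignored.
    """
    counts = Counter()
    for line in ss_output.splitlines():
        fields = line.split()
        # Assumed columns: State Recv-Q Send-Q Local:Port Peer:Port [Process]
        if len(fields) < 5 or fields[0] != "ESTAB":
            continue
        local, peer = fields[3], fields[4]
        if local.rsplit(":", 1)[-1] != str(local_port):
            continue  # not a connection terminating on the agent's HTTP port
        counts[peer.rsplit(":", 1)[0]] += 1
    return counts


def read_ss():
    """Return the output of `ss -tn` (Linux, iproute2)."""
    result = subprocess.run(
        ["ss", "-tn"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

On an affected client, `print(conns_per_peer(read_ss()).most_common(5))` lists the busiest peers. A count close to 100 for a single address would support the hypothesis; much lower counts would point elsewhere.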

Nomad version:

Nomad v0.12.2 (ee69b3379aeced67e14943b86c4f621451e64e84)

This is our dev cluster (our production cluster does not have the same issue). The two are identical in configuration apart from the instance types and the dev payloads.

We looked through our Slack history, and the first time it appeared was on August 3rd:

![Screen Shot 2020-08-23 at 14 28 06](https://user-images.githubusercontent.com/1475276/90989111-f2159f80-e54c-11ea-9921-86c6a486f9b2.png)

That lines up with us bumping the cluster from `0.12.0` to `0.12.1` (it could be a coincidence or a v0.12.x issue, but then again our production cluster does not have this problem):

![Screen Shot 2020-08-23 at 14 29 08](https://user-images.githubusercontent.com/1475276/90989282-0908c180-e54e-11ea-8670-f37677cb88bf.png)

Now I imagine this must be hard to debug so please let us know where we should start.

Thanks in advance for the help!

brbva commented 4 years ago

I'm experiencing the same on Nomad v0.12.3 (2db8abd9620dd41cb7bfe399551ba0f7824b3f61).

porshkevich commented 4 years ago

Confirming the problem on Nomad v0.11.3. Sometimes requests also end with an empty response (EOF error).

scalp42 commented 4 years ago

Yes, we get the EOF all the time as well:

(Screenshot: Screen Shot 2020-09-10 at 14 34 08)

alienvspredator commented 4 years ago

By default, nomad is bound to 127.0.0.1. Thus, you can only connect to the API from this address. It can be reconfigured:

`nomad agent -bind 0.0.0.0`

scalp42 commented 4 years ago

> By default, nomad is bound to 127.0.0.1. Thus, you can only connect to the API from this address. It can be reconfigured:
>
> `nomad agent -bind 0.0.0.0`

Unfortunately, this has nothing to do with the issue.

I can confirm the API can be reached on that interface (you would get a connection refused otherwise), and like I said, it works fine most of the time.
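The point made here, that a bind-address problem would fail at connect time rather than with a reset after the connection is established, can be checked at the TCP level. This is a minimal Python sketch under that assumption; `tcp_connect_outcome` is a hypothetical helper name, and the address and port to test are whatever the agent is expected to listen on.

```python
import socket


def tcp_connect_outcome(host, port, timeout=2.0):
    """Try a plain TCP connect and report how it ended."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "accepted"   # a listener completed the handshake
    except ConnectionRefusedError:
        return "refused"        # nothing is listening on this address and port
    except socket.timeout:
        return "timeout"        # no answer within the timeout
    except OSError as exc:
        return f"error: {exc}"  # any other socket-level failure
```

If `tcp_connect_outcome("127.0.0.1", 4646)` returns `"accepted"` while the HTTP request over that connection is still reset, the listener is reachable and the bind address is not the problem.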

stevevandermerwe commented 3 years ago

I seem to have a similar issue. I almost exclusively use the API to interact with Nomad and occasionally it just stops responding (I have to restart Nomad to get it working again). I then get the connection reset by peer error.

I am running v0.12.9.

scalp42 commented 3 years ago

Still happening on latest Nomad version unfortunately 😢