hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Client node missed heartbeat and its task was transferred #8646

Closed Bingqiang-Jing closed 4 years ago

Bingqiang-Jing commented 4 years ago

Nomad version

v0.11.2

Operating system and Environment details

win10 server

Issue

I have 3 server nodes: 10.69.177.3, 10.69.180.4, and 10.69.176.12. When 10.69.177.3 shut down unexpectedly, the client node 10.69.176.8 logged a missed heartbeat and its running task was transferred to another node. The other client nodes were fine.

Reproduction steps

Job file (if appropriate)

group "stitch-grp-xita-01" {
    count = 1

    restart {
        attempts = 2
        interval = "10s"
        delay = "1s"
        mode = "fail"
    }

    task "stitch-xita-01-exe" {
        env {
            "STITCH_INSTALL_DIR" = "D:\\Program Files\\iseetech\\stitch"
        }
        driver = "raw_exec"
        config {
            command = "${env["STITCH_INSTALL_DIR"]}\\StitchService\\StitchServiceProcess.exe"
            args = ["-p", "${NOMAD_PORT_http_server}",
                    "-n", "xita-01-stitch",
                    "-s", "D:\\Program Files\\iseetech\\stitch\\setting1\\xita1-main;D:\\Program Files\\iseetech\\stitch\\setting1\\xita1-backup1;D:\\Program Files\\iseetech\\stitch\\setting1\\xita1-backup2"]
        }
    }
}

Nomad Client logs (if appropriate)

176.8

2020-07-31T16:48:18.612+0800 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=client duration=0s
2020-07-31T16:48:28.615+0800 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=client duration=1.0007ms
2020-07-31T16:48:38.617+0800 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=client duration=1.0008ms
2020-07-31T16:48:48.620+0800 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=client duration=1.0007ms
2020-07-31T16:48:58.621+0800 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=client duration=1.0007ms
2020-07-31T16:49:04.169+0800 [ERROR] client: yamux: Failed to read header: read tcp 10.69.176.8:12118->10.69.177.3:4647: wsarecv: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
2020-07-31T16:49:04.170+0800 [ERROR] client.rpc: error performing RPC to server: error="rpc error: EOF" rpc=Node.GetClientAllocs server=10.69.177.3:4647
2020-07-31T16:49:04.170+0800 [ERROR] client: error querying node allocations: error="rpc error: EOF"
2020-07-31T16:49:04.170+0800 [ERROR] client.rpc: error performing RPC to server: error="rpc error: EOF" rpc=Node.UpdateStatus server=10.69.177.3:4647
2020-07-31T16:49:04.170+0800 [ERROR] client: error heartbeating. retrying: error="failed to update status: rpc error: EOF" period=1.518890993s
2020-07-31T16:49:04.171+0800 [DEBUG] client.consul: bootstrap contacting Consul DCs: consul_dcs=[dc1]
2020-07-31T16:49:04.175+0800 [INFO] client.consul: discovered following servers: servers=[10.69.180.4:4647, 10.69.177.3:4647, 10.69.176.12:4647]
2020-07-31T16:49:05.909+0800 [DEBUG] client: evaluations triggered by node update: num_evals=1
2020-07-31T16:49:05.909+0800 [DEBUG] client: state updated: node_status=ready
2020-07-31T16:49:05.910+0800 [WARN] client: missed heartbeat: req_latency=220.1642ms heartbeat_ttl=6.509786071s since_last_heartbeat=27.1512474s
2020-07-31T16:49:05.912+0800 [DEBUG] client: updated allocations: index=156250 total=1 pulled=1 filtered=0
2020-07-31T16:49:05.912+0800 [DEBUG] client: allocation updates: added=0 removed=0 updated=1 ignored=0
2020-07-31T16:49:06.036+0800 [DEBUG] client: allocation updates applied: added=0 removed=0 updated=1 ignored=0 errors=0
2020-07-31T16:49:06.118+0800 [DEBUG] consul.sync: sync complete: registered_services=0 deregistered_services=1 registered_checks=0 deregistered_checks=1
2020-07-31T16:49:06.486+0800 [DEBUG] client: updated allocations: index=156261 total=1 pulled=0 filtered=1
2020-07-31T16:49:06.487+0800 [DEBUG] client: allocation updates: added=0 removed=0 updated=0 ignored=1
2020-07-31T16:49:06.487+0800 [DEBUG] client: allocation updates applied: added=0 removed=0 updated=0 ignored=1 errors=0
2020-07-31T16:49:06.733+0800 [ERROR] client.driver_mgr.raw_exec: error receiving stream from Stats executor RPC, closing stream: alloc_id=437a776f-cf80-c6b4-d1b4-9e819bd7ba84 driver=raw_exec task_name=xita-dbce error="rpc error: code = Unavailable desc = transport is closing"
2020-07-31T16:49:06.734+0800 [ERROR] client.alloc_runner.task_runner.task_hook.stats_hook: failed to start stats collection for task: alloc_id=437a776f-cf80-c6b4-d1b4-9e819bd7ba84 task=xita-dbce error="rpc error: code = Canceled desc = grpc: the client connection is closing"
2020-07-31T16:49:06.863+0800 [DEBUG] client.driver_mgr.raw_exec.executor: plugin process exited: alloc_id=437a776f-cf80-c6b4-d1b4-9e819bd7ba84 driver=raw_exec task_name=xita-dbce path="D:\Program Files\iseetech\Commander\HashiCorp\nomad.exe" pid=10376
2020-07-31T16:49:06.863+0800 [DEBUG] client.driver_mgr.raw_exec.executor: plugin exited: alloc_id=437a776f-cf80-c6b4-d1b4-9e819bd7ba84 driver=raw_exec task_name=xita-dbce
2020-07-31T16:49:07.126+0800 [DEBUG] client.alloc_runner.task_runner.task_hook.logmon: plugin process exited: alloc_id=437a776f-cf80-c6b4-d1b4-9e819bd7ba84 task=xita-dbce path="D:\Program Files\iseetech\Commander\HashiCorp\nomad.exe" pid=10792
2020-07-31T16:49:07.126+0800 [DEBUG] client.alloc_runner.task_runner.task_hook.logmon: plugin exited: alloc_id=437a776f-cf80-c6b4-d1b4-9e819bd7ba84 task=xita-dbce
2020-07-31T16:49:07.127+0800 [DEBUG] client.alloc_runner.task_runner: task run loop exiting: alloc_id=437a776f-cf80-c6b4-d1b4-9e819bd7ba84 task=xita-dbce
2020-07-31T16:49:07.132+0800 [DEBUG] client: updated allocations: index=156262 total=1 pulled=0 filtered=1
2020-07-31T16:49:07.132+0800 [DEBUG] client: allocation updates: added=0 removed=0 updated=0 ignored=1
2020-07-31T16:49:07.132+0800 [DEBUG] client: allocation updates applied: added=0 removed=0 updated=0 ignored=1 errors=0
2020-07-31T16:49:07.321+0800 [DEBUG] client: updated allocations: index=156263 total=1 pulled=0 filtered=1
2020-07-31T16:49:07.321+0800 [DEBUG] client: allocation updates: added=0 removed=0 updated=0 ignored=1
2020-07-31T16:49:07.322+0800 [DEBUG] client: allocation updates applied: added=0 removed=0 updated=0 ignored=1 errors=0
2020-07-31T16:49:08.283+0800 [ERROR] client.driver_mgr.raw_exec: error receiving stream from Stats executor RPC, closing stream: alloc_id=437a776f-cf80-c6b4-d1b4-9e819bd7ba84 driver=raw_exec task_name=xita-dbce-rtspserver error="rpc error: code = Unavailable desc = transport is closing"
2020-07-31T16:49:08.284+0800 [ERROR] client.alloc_runner.task_runner.task_hook.stats_hook: failed to start stats collection for task: alloc_id=437a776f-cf80-c6b4-d1b4-9e819bd7ba84 task=xita-dbce-rtspserver error="rpc error: code = Canceled desc = grpc: the client connection is closing"
2020-07-31T16:49:08.419+0800 [DEBUG] client.driver_mgr.raw_exec.executor: plugin process exited: alloc_id=437a776f-cf80-c6b4-d1b4-9e819bd7ba84 driver=raw_exec task_name=xita-dbce-rtspserver path="D:\Program Files\iseetech\Commander\HashiCorp\nomad.exe" pid=9212
2020-07-31T16:49:08.419+0800 [DEBUG] client.driver_mgr.raw_exec.executor: plugin exited: alloc_id=437a776f-cf80-c6b4-d1b4-9e819bd7ba84 driver=raw_exec task_name=xita-dbce-rtspserver
2020-07-31T16:49:08.563+0800 [INFO] client.gc: marking allocation for GC: alloc_id=437a776f-cf80-c6b4-d1b4-9e819bd7ba84
2020-07-31T16:49:08.623+0800 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=client duration=1.0008ms
2020-07-31T16:49:08.695+0800 [DEBUG] client.alloc_runner.task_runner.task_hook.logmon: plugin process exited: alloc_id=437a776f-cf80-c6b4-d1b4-9e819bd7ba84 task=xita-dbce-rtspserver path="D:\Program Files\iseetech\Commander\HashiCorp\nomad.exe" pid=5972
2020-07-31T16:49:08.695+0800 [DEBUG] client.alloc_runner.task_runner.task_hook.logmon: plugin exited: alloc_id=437a776f-cf80-c6b4-d1b4-9e819bd7ba84 task=xita-dbce-rtspserver
2020-07-31T16:49:08.696+0800 [DEBUG] client.alloc_runner.task_runner: task run loop exiting: alloc_id=437a776f-cf80-c6b4-d1b4-9e819bd7ba84 task=xita-dbce-rtspserver
2020-07-31T16:49:08.808+0800 [DEBUG] client: updated allocations: index=156264 total=1 pulled=0 filtered=1
2020-07-31T16:49:08.808+0800 [DEBUG] client: allocation updates: added=0 removed=0 updated=0 ignored=1
2020-07-31T16:49:08.809+0800 [DEBUG] client: allocation updates applied: added=0 removed=0 updated=0 ignored=1 errors=0
2020-07-31T16:49:09.145+0800 [DEBUG] client: updated allocations: index=156265 total=1 pulled=0 filtered=1
2020-07-31T16:49:09.145+0800 [DEBUG] client: allocation updates: added=0 removed=0 updated=0 ignored=1
2020-07-31T16:49:09.146+0800 [DEBUG] client: allocation updates applied: added=0 removed=0 updated=0 ignored=1 errors=0
2020-07-31T16:49:18.626+0800 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=client duration=0s
2020-07-31T16:49:25.846+0800 [DEBUG] client.server_mgr: new server list: new_servers=[10.69.176.12:4647, 10.69.180.4:4647] old_servers=[10.69.176.12:4647, 10.69.180.4:4647, 10.69.177.3:4647]
2020-07-31T16:49:28.629+0800 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=client duration=2.0015ms
2020-07-31T16:49:38.633+0800 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=client duration=1.0008ms
2020-07-31T16:49:48.635+0800 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=client duration=0s
2020-07-31T16:49:58.638+0800 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=client duration=0s
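
To make the gap in the client's "missed heartbeat" warning explicit, here is a minimal Python sketch that pulls the duration fields out of that one log line and compares them. The helper name `parse_durations` is made up for illustration, and the parser assumes only the `ms` and `s` suffixes that appear in this log; other Go duration units are not handled.

```python
import re

# The "missed heartbeat" warning copied from the 176.8 client log above.
LINE = (
    "2020-07-31T16:49:05.910+0800 [WARN] client: missed heartbeat: "
    "req_latency=220.1642ms heartbeat_ttl=6.509786071s "
    "since_last_heartbeat=27.1512474s"
)

# Assumption: only millisecond and second suffixes occur in this log line.
UNIT_SECONDS = {"ms": 1e-3, "s": 1.0}


def parse_durations(line):
    """Return {field: seconds} for every key=<number><ms|s> pair in a log line."""
    pairs = re.findall(r"(\w+)=([0-9.]+)(ms|s)\b", line)
    return {key: float(value) * UNIT_SECONDS[unit] for key, value, unit in pairs}


fields = parse_durations(LINE)
# How far past the TTL the client was when it finally reached a server.
overdue = fields["since_last_heartbeat"] - fields["heartbeat_ttl"]
print(f"heartbeat overdue by {overdue:.2f}s")
```

With the values from the log, the client was roughly 20.6 seconds past its heartbeat TTL by the time the retry against another server succeeded.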

Nomad Server logs (if appropriate)

180.4

2020-07-31T16:48:31.620+0800 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=server duration=1.0007ms
2020-07-31T16:48:36.672+0800 [DEBUG] nomad: memberlist: Stream connection from=127.0.0.1:52421
2020-07-31T16:48:41.622+0800 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=server duration=1.0013ms
2020-07-31T16:48:43.130+0800 [WARN] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=500.373ms
2020-07-31T16:48:43.626+0800 [WARN] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=995.7413ms
2020-07-31T16:48:44.117+0800 [WARN] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=1.4871072s
2020-07-31T16:48:44.523+0800 [DEBUG] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=1.8934097s
2020-07-31T16:48:44.980+0800 [DEBUG] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=2.3497494s
2020-07-31T16:48:45.461+0800 [DEBUG] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=2.8311078s
2020-07-31T16:48:45.872+0800 [DEBUG] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=3.242414s
2020-07-31T16:48:46.364+0800 [DEBUG] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=3.7337799s
2020-07-31T16:48:46.672+0800 [DEBUG] nomad: memberlist: Stream connection from=127.0.0.1:52441
2020-07-31T16:48:46.859+0800 [DEBUG] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=4.2291487s
2020-07-31T16:48:47.380+0800 [DEBUG] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=4.7495361s
2020-07-31T16:48:47.834+0800 [DEBUG] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=5.2038744s
2020-07-31T16:48:48.312+0800 [DEBUG] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=5.6822305s
2020-07-31T16:48:48.748+0800 [DEBUG] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=6.1175546s
2020-07-31T16:48:49.188+0800 [DEBUG] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=6.5578825s
2020-07-31T16:48:49.662+0800 [DEBUG] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=7.0322356s
2020-07-31T16:48:50.139+0800 [DEBUG] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=7.5085908s
2020-07-31T16:48:50.635+0800 [DEBUG] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=8.0049599s
2020-07-31T16:48:51.133+0800 [DEBUG] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=8.5033309s
2020-07-31T16:48:51.610+0800 [DEBUG] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=8.9796856s
2020-07-31T16:48:51.625+0800 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=server duration=1.0007ms
2020-07-31T16:48:52.068+0800 [DEBUG] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=9.4380268s
2020-07-31T16:48:52.543+0800 [DEBUG] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=9.9133807s
2020-07-31T16:48:52.691+0800 [ERROR] nomad.raft: failed to heartbeat to: peer=10.69.177.3:4647 error="msgpack decode error [pos 41038706]: read tcp 10.69.180.4:61152->10.69.177.3:4647: i/o timeout"
2020-07-31T16:48:52.697+0800 [INFO] nomad.raft: aborting pipeline replication: peer="{Voter 10.69.177.3:4647 10.69.177.3:4647}"
2020-07-31T16:48:52.880+0800 [DEBUG] nomad: memberlist: Initiating push/pull sync with: 10.69.176.12:4648
2020-07-31T16:48:53.197+0800 [WARN] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=500.3725ms
2020-07-31T16:48:53.653+0800 [WARN] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=956.7123ms
2020-07-31T16:48:54.138+0800 [WARN] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=1.4410729s
2020-07-31T16:48:54.230+0800 [DEBUG] nomad: memberlist: Failed ping: xq-dt-decoding1-m.global (timeout reached)
2020-07-31T16:48:54.564+0800 [DEBUG] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=1.8673903s
2020-07-31T16:48:55.042+0800 [DEBUG] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=2.3457465s
2020-07-31T16:48:55.506+0800 [DEBUG] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=2.8090919s
2020-07-31T16:48:55.961+0800 [DEBUG] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=3.2644305s
2020-07-31T16:48:56.230+0800 [INFO] nomad: memberlist: Suspect xq-dt-decoding1-m.global has failed, no acks received
2020-07-31T16:48:56.379+0800 [DEBUG] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=3.6827419s
2020-07-31T16:48:56.673+0800 [DEBUG] nomad: memberlist: Stream connection from=127.0.0.1:52461
2020-07-31T16:48:56.840+0800 [DEBUG] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=4.1430846s
2020-07-31T16:48:57.319+0800 [DEBUG] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=4.6224415s
2020-07-31T16:48:57.808+0800 [DEBUG] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=5.1118059s
2020-07-31T16:48:58.293+0800 [DEBUG] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=5.5961669s
2020-07-31T16:48:58.753+0800 [DEBUG] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=6.0565092s
2020-07-31T16:48:59.212+0800 [DEBUG] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=6.5158512s
2020-07-31T16:48:59.710+0800 [DEBUG] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=7.0132215s
2020-07-31T16:49:00.150+0800 [DEBUG] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=7.4535493s
2020-07-31T16:49:00.268+0800 [WARN] nomad.heartbeat: node TTL expired: node_id=6448fbd6-c38d-040d-41ed-a013052b3aaf
2020-07-31T16:49:00.632+0800 [DEBUG] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=7.9359089s
2020-07-31T16:49:00.920+0800 [DEBUG] nomad: memberlist: Stream connection from=10.69.176.12:12644
2020-07-31T16:49:01.093+0800 [DEBUG] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=8.3962512s
2020-07-31T16:49:01.551+0800 [ERROR] nomad.rpc: yamux: Failed to read header: read tcp 10.69.180.4:4647->10.69.177.3:65498: wsarecv: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
2020-07-31T16:49:01.552+0800 [ERROR] nomad.rpc: multiplex_v2 conn accept failed: error="read tcp 10.69.180.4:4647->10.69.177.3:65498: wsarecv: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond."
2020-07-31T16:49:01.573+0800 [DEBUG] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=8.8766088s
2020-07-31T16:49:01.626+0800 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=server duration=1.0007ms
2020-07-31T16:49:02.038+0800 [DEBUG] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=9.3409546s
2020-07-31T16:49:02.474+0800 [DEBUG] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=9.7772799s
2020-07-31T16:49:02.793+0800 [ERROR] nomad.raft: failed to appendEntries to: peer="{Voter 10.69.177.3:4647 10.69.177.3:4647}" error="dial tcp 10.69.177.3:4647: i/o timeout"
2020-07-31T16:49:02.852+0800 [ERROR] nomad.raft: failed to heartbeat to: peer=10.69.177.3:4647 error="dial tcp 10.69.177.3:4647: i/o timeout"
2020-07-31T16:49:02.891+0800 [DEBUG] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=10.1945901s
2020-07-31T16:49:03.380+0800 [DEBUG] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=10.6829537s
2020-07-31T16:49:03.832+0800 [DEBUG] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=11.1352905s
2020-07-31T16:49:04.010+0800 [WARN] nomad.heartbeat: node TTL expired: node_id=6df2ade6-67a6-61a4-d19b-aba2fc8fa217
2020-07-31T16:49:04.230+0800 [DEBUG] nomad: memberlist: Failed ping: xq-dt-decoding1-m.global (timeout reached)
2020-07-31T16:49:04.264+0800 [DEBUG] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=11.5676124s
2020-07-31T16:49:04.725+0800 [DEBUG] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=12.0279551s
2020-07-31T16:49:05.162+0800 [DEBUG] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=12.4652807s
2020-07-31T16:49:05.616+0800 [DEBUG] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=12.919619s
2020-07-31T16:49:06.057+0800 [DEBUG] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=13.3599468s
2020-07-31T16:49:06.122+0800 [WARN] nomad.heartbeat: node TTL expired: node_id=fb18b32e-64a5-1117-8fae-561b05b3b527
2020-07-31T16:49:06.231+0800 [INFO] nomad: memberlist: Suspect xq-dt-decoding1-m.global has failed, no acks received
2020-07-31T16:49:06.541+0800 [DEBUG] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=13.8443074s
2020-07-31T16:49:06.673+0800 [DEBUG] nomad: memberlist: Stream connection from=127.0.0.1:52505
2020-07-31T16:49:07.039+0800 [DEBUG] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=14.3426785s
2020-07-31T16:49:07.421+0800 [DEBUG] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=14.7239623s
2020-07-31T16:49:07.920+0800 [DEBUG] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=15.2233341s
2020-07-31T16:49:08.105+0800 [ERROR] nomad.rpc: yamux: keepalive failed: i/o deadline reached
2020-07-31T16:49:08.105+0800 [ERROR] nomad.rpc: multiplex_v2 conn accept failed: error="keepalive timeout"
2020-07-31T16:49:08.395+0800 [DEBUG] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=15.698688s
2020-07-31T16:49:08.843+0800 [DEBUG] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=16.1460211s
2020-07-31T16:49:09.307+0800 [DEBUG] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=16.6103668s
2020-07-31T16:49:09.742+0800 [DEBUG] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=17.0456909s
2020-07-31T16:49:10.214+0800 [DEBUG] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=17.5170418s
2020-07-31T16:49:10.703+0800 [DEBUG] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=18.0064062s
2020-07-31T16:49:11.162+0800 [DEBUG] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=18.4657482s
2020-07-31T16:49:11.629+0800 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=server duration=1.0008ms
2020-07-31T16:49:11.651+0800 [DEBUG] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=18.9541118s
2020-07-31T16:49:12.141+0800 [DEBUG] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=19.4444769s
2020-07-31T16:49:12.633+0800 [DEBUG] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=19.9368434s
2020-07-31T16:49:12.804+0800 [ERROR] nomad.raft: failed to appendEntries to: peer="{Voter 10.69.177.3:4647 10.69.177.3:4647}" error="dial tcp 10.69.177.3:4647: i/o timeout"
2020-07-31T16:49:13.002+0800 [ERROR] nomad.raft: failed to heartbeat to: peer=10.69.177.3:4647 error="dial tcp 10.69.177.3:4647: i/o timeout"
2020-07-31T16:49:13.114+0800 [DEBUG] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=20.4172015s
2020-07-31T16:49:13.593+0800 [DEBUG] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=20.896558s
2020-07-31T16:49:14.036+0800 [DEBUG] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=21.339888s
2020-07-31T16:49:14.230+0800 [DEBUG] nomad: memberlist: Failed ping: xq-dt-decoding1-m.global (timeout reached)
2020-07-31T16:49:14.460+0800 [DEBUG] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=21.7632032s
2020-07-31T16:49:14.909+0800 [DEBUG] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=22.2125377s
2020-07-31T16:49:15.389+0800 [DEBUG] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=22.6928954s
2020-07-31T16:49:15.867+0800 [DEBUG] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=23.1702508s
2020-07-31T16:49:16.230+0800 [INFO] nomad: memberlist: Suspect xq-dt-decoding1-m.global has failed, no acks received
2020-07-31T16:49:16.359+0800 [DEBUG] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=23.6626174s
2020-07-31T16:49:16.673+0800 [DEBUG] nomad: memberlist: Stream connection from=127.0.0.1:52531
2020-07-31T16:49:16.824+0800 [DEBUG] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=24.1269631s
2020-07-31T16:49:17.300+0800 [DEBUG] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=24.6033177s
2020-07-31T16:49:17.799+0800 [DEBUG] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=25.1026895s
2020-07-31T16:49:18.264+0800 [DEBUG] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=25.5670352s
2020-07-31T16:49:18.728+0800 [DEBUG] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=26.0313809s
2020-07-31T16:49:19.198+0800 [DEBUG] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=26.5017311s
2020-07-31T16:49:19.231+0800 [DEBUG] nomad: memberlist: Failed ping: xq-dt-decoding1-m.global (timeout reached)
2020-07-31T16:49:19.647+0800 [DEBUG] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=26.9500649s
2020-07-31T16:49:20.096+0800 [DEBUG] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=27.3993995s
2020-07-31T16:49:20.573+0800 [DEBUG] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=27.8767549s
2020-07-31T16:49:21.047+0800 [DEBUG] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=28.3501073s
2020-07-31T16:49:21.518+0800 [DEBUG] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=28.8214578s
2020-07-31T16:49:21.631+0800 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=server duration=1.0007ms
2020-07-31T16:49:22.013+0800 [DEBUG] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=29.316827s
2020-07-31T16:49:22.469+0800 [DEBUG] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=29.772166s
2020-07-31T16:49:22.573+0800 [INFO] nomad: memberlist: Marking xq-dt-decoding1-m.global as failed, suspect timeout reached (0 peer confirmations)
2020-07-31T16:49:22.573+0800 [INFO] nomad: serf: EventMemberFailed: xq-dt-decoding1-m.global 10.69.177.3
2020-07-31T16:49:22.574+0800 [INFO] nomad: removing server: server="xq-dt-decoding1-m.global (Addr: 10.69.177.3:4647) (DC: dc1)"
2020-07-31T16:49:22.814+0800 [ERROR] nomad.raft: failed to appendEntries to: peer="{Voter 10.69.177.3:4647 10.69.177.3:4647}" error="dial tcp 10.69.177.3:4647: i/o timeout"
2020-07-31T16:49:22.934+0800 [DEBUG] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=30.2375125s
2020-07-31T16:49:23.135+0800 [ERROR] nomad.raft: failed to heartbeat to: peer=10.69.177.3:4647 error="dial tcp 10.69.177.3:4647: i/o timeout"
2020-07-31T16:49:23.407+0800 [DEBUG] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=30.7108649s
2020-07-31T16:49:23.879+0800 [DEBUG] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=31.1822159s
2020-07-31T16:49:24.372+0800 [DEBUG] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=31.6755832s
2020-07-31T16:49:24.848+0800 [DEBUG] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=32.1509371s
2020-07-31T16:49:25.309+0800 [DEBUG] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=32.612281s
2020-07-31T16:49:25.777+0800 [DEBUG] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=33.0806293s
2020-07-31T16:49:26.224+0800 [DEBUG] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=33.5269616s
2020-07-31T16:49:26.231+0800 [INFO] nomad: memberlist: Suspect xq-dt-decoding1-m.global has failed, no acks received
2020-07-31T16:49:26.674+0800 [DEBUG] nomad: memberlist: Stream connection from=127.0.0.1:52563
2020-07-31T16:49:26.679+0800 [DEBUG] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=33.9823006s
2020-07-31T16:49:27.119+0800 [DEBUG] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=34.4226284s
2020-07-31T16:49:27.607+0800 [DEBUG] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=34.9099913s
2020-07-31T16:49:28.055+0800 [DEBUG] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=35.3583251s
2020-07-31T16:49:28.524+0800 [DEBUG] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=35.8276745s
2020-07-31T16:49:29.022+0800 [DEBUG] nomad.raft: failed to contact: server-id=10.69.177.3:4647 time=36.3250448s
2020-07-31T16:49:29.186+0800 [INFO] nomad.autopilot: Attempting removal of failed server node: name=xq-dt-decoding1-m.global
2020-07-31T16:49:29.186+0800 [INFO] nomad: serf: EventMemberLeave (forced): xq-dt-decoding1-m.global 10.69.177.3
2020-07-31T16:49:29.187+0800 [INFO] nomad: removing server: server="xq-dt-decoding1-m.global (Addr: 10.69.177.3:4647) (DC: dc1)"
2020-07-31T16:49:29.187+0800 [INFO] nomad: removing server by address: address=10.69.177.3:4647
2020-07-31T16:49:29.187+0800 [INFO] nomad.raft: updating configuration: command=RemoveServer server-id=10.69.177.3:4647 server-addr= servers="[{Suffrage:Voter ID:10.69.180.4:4647 Address:10.69.180.4:4647} {Suffrage:Voter ID:10.69.176.12:4647 Address:10.69.176.12:4647}]"
2020-07-31T16:49:29.326+0800 [INFO] nomad.raft: removed peer, stopping replication: peer=10.69.177.3:4647 last-index=156268
2020-07-31T16:49:29.570+0800 [DEBUG] nomad: serf: messageLeaveType: xq-dt-decoding1-m.global
2020-07-31T16:49:30.070+0800 [DEBUG] nomad: serf: messageLeaveType: xq-dt-decoding1-m.global
2020-07-31T16:49:31.634+0800 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=server duration=1.0008ms
2020-07-31T16:49:32.920+0800 [ERROR] nomad.raft: failed to appendEntries to: peer="{Voter 10.69.177.3:4647 10.69.177.3:4647}" error="dial tcp 10.69.177.3:4647: i/o timeout"
2020-07-31T16:49:33.324+0800 [ERROR] nomad.raft: failed to heartbeat to: peer=10.69.177.3:4647 error="dial tcp 10.69.177.3:4647: i/o timeout"
2020-07-31T16:49:36.674+0800 [DEBUG] nomad: memberlist: Stream connection from=127.0.0.1:52591
2020-07-31T16:49:41.636+0800 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=server duration=1.0007ms
2020-07-31T16:49:42.960+0800 [ERROR] nomad.raft: failed to appendEntries to: peer="{Voter 10.69.177.3:4647 10.69.177.3:4647}" error="dial tcp 10.69.177.3:4647: i/o timeout"
2020-07-31T16:49:43.560+0800 [ERROR] nomad.raft: failed to heartbeat to: peer=10.69.177.3:4647 error="dial tcp 10.69.177.3:4647: i/o timeout"
2020-07-31T16:49:46.675+0800 [DEBUG] nomad: memberlist: Stream connection from=127.0.0.1:52614
2020-07-31T16:49:51.638+0800 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=server duration=1.0012ms
2020-07-31T16:49:52.882+0800 [DEBUG] nomad: memberlist: Initiating push/pull sync with: 10.69.176.12:4648
2020-07-31T16:49:53.041+0800 [ERROR] nomad.raft: failed to appendEntries to: peer="{Voter 10.69.177.3:4647 10.69.177.3:4647}" error="dial tcp 10.69.177.3:4647: i/o timeout"
2020-07-31T16:49:53.876+0800 [ERROR] nomad.raft: failed to heartbeat to: peer=10.69.177.3:4647 error="dial tcp 10.69.177.3:4647: i/o timeout"
2020-07-31T16:49:56.675+0800 [DEBUG] nomad: memberlist: Stream connection from=127.0.0.1:52643
2020-07-31T16:50:00.921+0800 [DEBUG] nomad: memberlist: Stream connection from=10.69.176.12:12890
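
For reading the 180.4 log above, it may help to line up a few key events against the first Raft contact failure. The following is a rough Python sketch: the timestamps and node IDs are copied from the log, while the selection of events, the labels, and the helper name `offsets` are my own choices for illustration. Which of the three node IDs belongs to client 10.69.176.8 is not stated in the issue, so the sketch does not assume it.

```python
from datetime import datetime

# Timestamp layout used by the Nomad logs in this issue.
TS_FORMAT = "%Y-%m-%dT%H:%M:%S.%f%z"

# (timestamp, label) pairs copied from the 180.4 server log above.
EVENTS = [
    ("2020-07-31T16:48:43.130+0800", "first raft 'failed to contact' for 10.69.177.3"),
    ("2020-07-31T16:49:00.268+0800", "node TTL expired: 6448fbd6-c38d-040d-41ed-a013052b3aaf"),
    ("2020-07-31T16:49:04.010+0800", "node TTL expired: 6df2ade6-67a6-61a4-d19b-aba2fc8fa217"),
    ("2020-07-31T16:49:06.122+0800", "node TTL expired: fb18b32e-64a5-1117-8fae-561b05b3b527"),
    ("2020-07-31T16:49:22.573+0800", "memberlist marks xq-dt-decoding1-m.global as failed"),
]


def offsets(events):
    """Return (seconds since the first event, label) for each event."""
    start = datetime.strptime(events[0][0], TS_FORMAT)
    return [
        ((datetime.strptime(ts, TS_FORMAT) - start).total_seconds(), label)
        for ts, label in events
    ]


timeline = offsets(EVENTS)
for seconds, label in timeline:
    print(f"+{seconds:7.3f}s  {label}")
```

Laid out this way, all three node TTL expirations on 180.4 fall before memberlist marks the 10.69.177.3 server as failed.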

176.12

2020-07-31T16:48:52.572+0800 [INFO] nomad: memberlist: Suspect xq-dt-decoding1-m.global has failed, no acks received
2020-07-31T16:48:52.881+0800 [DEBUG] nomad: memberlist: Stream connection from=10.69.180.4:52453
2020-07-31T16:48:56.453+0800 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=server duration=1.0007ms
2020-07-31T16:49:00.675+0800 [DEBUG] worker: dequeued evaluation: eval_id=c441766f-5940-239f-2ecc-838b8705618e
2020-07-31T16:49:00.695+0800 [DEBUG] worker.service_sched: reconciled current state with desired state: eval_id=c441766f-5940-239f-2ecc-838b8705618e job_id=stitch namespace=default results="Total changes: (place 1) (destructive 0) (inplace 0) (stop 1) Desired Changes for "a": (place 0) (inplace 0) (destructive 0) (stop 0) (migrate 0) (ignore 1) (canary 0) Desired Changes for "xita-dge": (place 0) (inplace 0) (destructive 0) (stop 0) (migrate 0) (ignore 1) (canary 0) Desired Changes for "t3c-f": (place 0) (inplace 0) (destructive 0) (stop 0) (migrate 0) (ignore 1) (canary 0) Desired Changes for "xita-b-g": (place 0) (inplace 0) (destructive 0) (stop 0) (migrate 0) (ignore 1) (canary 0) Desired Changes for "xita-dbce": (place 1) (inplace 0) (destructive 0) (stop 1) (migrate 0) (ignore 0) (canary 0) Desired Changes for "xita-e": (place 0) (inplace 0) (destructive 0) (stop 0) (migrate 0) (ignore 1) (canary 0) Desired Changes for "xita-d": (place 0) (inplace 0) (destructive 0) (stop 0) (migrate 0) (ignore 1) (canary 0) Desired Changes for "xita-c": (place 0) (inplace 0) (destructive 0) (stop 0) (migrate 0) (ignore 1) (canary 0)"
2020-07-31T16:49:00.837+0800 [DEBUG] worker: submitted plan for evaluation: eval_id=c441766f-5940-239f-2ecc-838b8705618e
2020-07-31T16:49:00.838+0800 [DEBUG] worker.service_sched: setting eval status: eval_id=c441766f-5940-239f-2ecc-838b8705618e job_id=stitch namespace=default status=complete
2020-07-31T16:49:00.917+0800 [DEBUG] worker: updated evaluation: eval="<Eval "c441766f-5940-239f-2ecc-838b8705618e" JobID: "stitch" Namespace: "default">"
2020-07-31T16:49:00.918+0800 [DEBUG] worker: ack evaluation: eval_id=c441766f-5940-239f-2ecc-838b8705618e
2020-07-31T16:49:00.920+0800 [DEBUG] nomad: memberlist: Initiating push/pull sync with: 10.69.180.4:4648
2020-07-31T16:49:01.890+0800 [DEBUG] nomad: memberlist: Stream connection from=127.0.0.1:12645
2020-07-31T16:49:05.572+0800 [DEBUG] nomad: memberlist: Failed ping: xq-dt-decoding1-m.global (timeout reached)
2020-07-31T16:49:05.911+0800 [DEBUG] worker: dequeued evaluation: eval_id=98734da6-273c-5d92-7d53-6fb9fc8546d1
2020-07-31T16:49:06.251+0800 [DEBUG] worker.service_sched: reconciled current state with desired state: eval_id=98734da6-273c-5d92-7d53-6fb9fc8546d1 job_id=stitch namespace=default results="Total changes: (place 0) (destructive 0) (inplace 0) (stop 0) Desired Changes for "a": (place 0) (inplace 0) (destructive 0) (stop 0) (migrate 0) (ignore 1) (canary 0) Desired Changes for "xita-dge": (place 0) (inplace 0) (destructive 0) (stop 0) (migrate 0) (ignore 1) (canary 0) Desired Changes for "t3c-f": (place 0) (inplace 0) (destructive 0) (stop 0) (migrate 0) (ignore 1) (canary 0) Desired Changes for "xita-b-g": (place 0) (inplace 0) (destructive 0) (stop 0) (migrate 0) (ignore 1) (canary 0) Desired Changes for "xita-dbce": (place 0) (inplace 0) (destructive 0) (stop 0) (migrate 0) (ignore 1) (canary 0) Desired Changes for "xita-e": (place 0) (inplace 0) (destructive 0) (stop 0) (migrate 0) (ignore 1) (canary 0) Desired Changes for "xita-d": (place 0) (inplace 0) (destructive 0) (stop 0) (migrate 0) (ignore 1) (canary 0) Desired Changes for "xita-c": (place 0) (inplace 0) (destructive 0) (stop 0) (migrate 0) (ignore 1) (canary 0)"
2020-07-31T16:49:06.253+0800 [DEBUG] worker.service_sched: setting eval status: eval_id=98734da6-273c-5d92-7d53-6fb9fc8546d1 job_id=stitch namespace=default status=complete
2020-07-31T16:49:06.316+0800 [DEBUG] worker: updated evaluation: eval="<Eval "98734da6-273c-5d92-7d53-6fb9fc8546d1" JobID: "stitch" Namespace: "default">"
2020-07-31T16:49:06.317+0800 [DEBUG] worker: ack evaluation: eval_id=98734da6-273c-5d92-7d53-6fb9fc8546d1
2020-07-31T16:49:06.455+0800 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=server duration=1.0007ms
2020-07-31T16:49:07.572+0800 [INFO] nomad: memberlist: Suspect xq-dt-decoding1-m.global has failed, no acks received
2020-07-31T16:49:08.075+0800 [ERROR] nomad.rpc: yamux: keepalive failed: i/o deadline reached
2020-07-31T16:49:08.075+0800 [ERROR] nomad.rpc: multiplex_v2 conn accept failed: error="keepalive timeout"
2020-07-31T16:49:11.890+0800 [DEBUG] nomad: memberlist: Stream connection from=127.0.0.1:12690
2020-07-31T16:49:15.572+0800 [DEBUG] nomad: memberlist: Failed ping: xq-dt-decoding1-m.global (timeout reached)
2020-07-31T16:49:16.458+0800 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=server duration=1.0007ms
2020-07-31T16:49:17.572+0800 [INFO] nomad: memberlist: Suspect xq-dt-decoding1-m.global has failed, no acks received
2020-07-31T16:49:21.891+0800 [DEBUG] nomad: memberlist: Stream connection from=127.0.0.1:12725
2020-07-31T16:49:22.573+0800 [INFO] nomad: memberlist: Marking xq-dt-decoding1-m.global as failed, suspect timeout reached (0 peer confirmations)
2020-07-31T16:49:22.573+0800 [INFO] nomad: serf: EventMemberFailed: xq-dt-decoding1-m.global 10.69.177.3
2020-07-31T16:49:22.574+0800 [INFO] nomad: removing server: server="xq-dt-decoding1-m.global (Addr: 10.69.177.3:4647) (DC: dc1)"
2020-07-31T16:49:25.571+0800 [DEBUG] nomad: memberlist: Failed ping: xq-dt-decoding1-m.global (timeout reached)
2020-07-31T16:49:26.460+0800 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=server duration=1.0008ms
2020-07-31T16:49:29.231+0800 [DEBUG] nomad: serf: messageLeaveType: xq-dt-decoding1-m.global
2020-07-31T16:49:29.231+0800 [INFO] nomad: serf: EventMemberLeave (forced): xq-dt-decoding1-m.global 10.69.177.3
2020-07-31T16:49:29.232+0800 [INFO] nomad: removing server: server="xq-dt-decoding1-m.global (Addr: 10.69.177.3:4647) (DC: dc1)"
2020-07-31T16:49:29.730+0800 [DEBUG] nomad: serf: messageLeaveType: xq-dt-decoding1-m.global
2020-07-31T16:49:31.891+0800 [DEBUG] nomad: memberlist: Stream connection from=127.0.0.1:12768
2020-07-31T16:49:32.572+0800 [INFO] nomad: memberlist: Suspect xq-dt-decoding1-m.global has failed, no acks received
2020-07-31T16:49:36.462+0800 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=server duration=1.0008ms
2020-07-31T16:49:41.891+0800 [DEBUG] nomad: memberlist: Stream connection from=127.0.0.1:12809
2020-07-31T16:49:46.465+0800 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=server duration=1.0007ms
2020-07-31T16:49:51.892+0800 [DEBUG] nomad: memberlist: Stream connection from=127.0.0.1:12840
2020-07-31T16:49:52.885+0800 [DEBUG] nomad: memberlist: Stream connection from=10.69.180.4:52635
2020-07-31T16:49:56.467+0800 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=server duration=1.0021ms
2020-07-31T16:50:00.923+0800 [DEBUG] nomad: memberlist: Initiating push/pull sync with: 10.69.180.4:4648
2020-07-31T16:50:01.892+0800 [DEBUG] nomad: memberlist: Stream connection from=127.0.0.1:12894
2020-07-31T16:50:06.470+0800 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=server duration=1.0008ms
2020-07-31T16:50:11.893+0800 [DEBUG] nomad: memberlist: Stream connection from=127.0.0.1:12940
2020-07-31T16:50:16.472+0800 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=server duration=1.0008ms
2020-07-31T16:50:21.893+0800 [DEBUG] nomad: memberlist: Stream connection from=127.0.0.1:12971
2020-07-31T16:50:26.475+0800 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=server duration=1.0007ms
2020-07-31T16:50:31.894+0800 [DEBUG] nomad: memberlist: Stream connection from=127.0.0.1:13010
2020-07-31T16:50:36.477+0800 [DEBUG] http: 
request complete: method=GET path=/v1/agent/health?type=server duration=1.0007ms 2020-07-31T16:50:41.894+0800 [DEBUG] nomad: memberlist: Stream connection from=127.0.0.1:13043 2020-07-31T16:50:46.480+0800 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=server duration=1.0008ms 2020-07-31T16:50:51.895+0800 [DEBUG] nomad: memberlist: Stream connection from=127.0.0.1:13078 2020-07-31T16:50:52.888+0800 [DEBUG] nomad: memberlist: Stream connection from=10.69.180.4:52774 2020-07-31T16:50:56.482+0800 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=server duration=1.0008ms 2020-07-31T16:51:00.925+0800 [DEBUG] nomad: memberlist: Initiating push/pull sync with: 10.69.180.4:4648 2020-07-31T16:51:01.895+0800 [DEBUG] nomad: memberlist: Stream connection from=127.0.0.1:13113 2020-07-31T16:51:06.484+0800 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=server duration=1.0007ms 2020-07-31T16:51:11.895+0800 [DEBUG] nomad: memberlist: Stream connection from=127.0.0.1:13147 2020-07-31T16:51:16.487+0800 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=server duration=1.0007ms 2020-07-31T16:51:21.896+0800 [DEBUG] nomad: memberlist: Stream connection from=127.0.0.1:13178 2020-07-31T16:51:26.489+0800 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=server duration=1.0007ms 2020-07-31T16:51:31.896+0800 [DEBUG] nomad: memberlist: Stream connection from=127.0.0.1:13223 2020-07-31T16:51:36.492+0800 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=server duration=1.0008ms 2020-07-31T16:51:39.189+0800 [DEBUG] worker: dequeued evaluation: eval_id=91505e8c-8bfa-410f-42b1-8762704c5a8f 2020-07-31T16:51:39.189+0800 [DEBUG] worker: dequeued evaluation: eval_id=9e1efca7-dd66-c873-505b-e65d2239b5a0 2020-07-31T16:51:39.189+0800 [DEBUG] worker: dequeued evaluation: eval_id=0598955a-8484-96b0-ebe5-765e223c7bab 2020-07-31T16:51:39.189+0800 [DEBUG] worker: dequeued evaluation: 
eval_id=1d515814-3fcd-9ff9-a274-fbb37bbf88e5 2020-07-31T16:51:39.189+0800 [DEBUG] worker: dequeued evaluation: eval_id=514f9fa8-759b-fb06-2263-d0153eb1abf4 2020-07-31T16:51:39.189+0800 [DEBUG] core.sched: node GC scanning before cutoff index: index=154672 node_gc_threshold=24h0m0s 2020-07-31T16:51:39.190+0800 [DEBUG] core.sched: CSI volume claim GC scanning before cutoff index: index=155847 csi_volume_claim_gc_threshold=1h0m0s 2020-07-31T16:51:39.190+0800 [DEBUG] core.sched: job GC scanning before cutoff index: index=155847 job_gc_threshold=4h0m0s 2020-07-31T16:51:39.190+0800 [DEBUG] core.sched: eval GC scanning before cutoff index: index=155847 eval_gc_threshold=1h0m0s 2020-07-31T16:51:39.191+0800 [DEBUG] worker: ack evaluation: eval_id=91505e8c-8bfa-410f-42b1-8762704c5a8f 2020-07-31T16:51:39.190+0800 [DEBUG] core.sched: CSI plugin GC scanning before cutoff index: index=155847 csi_plugin_gc_threshold=1h0m0s 2020-07-31T16:51:39.191+0800 [DEBUG] worker: ack evaluation: eval_id=9e1efca7-dd66-c873-505b-e65d2239b5a0 2020-07-31T16:51:39.191+0800 [DEBUG] worker: ack evaluation: eval_id=0598955a-8484-96b0-ebe5-765e223c7bab 2020-07-31T16:51:39.191+0800 [DEBUG] worker: ack evaluation: eval_id=1d515814-3fcd-9ff9-a274-fbb37bbf88e5 2020-07-31T16:51:39.191+0800 [DEBUG] worker: ack evaluation: eval_id=514f9fa8-759b-fb06-2263-d0153eb1abf4 2020-07-31T16:51:41.897+0800 [DEBUG] nomad: memberlist: Stream connection from=127.0.0.1:13258 2020-07-31T16:51:46.494+0800 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=server duration=1.0007ms 2020-07-31T16:51:51.897+0800 [DEBUG] nomad: memberlist: Stream connection from=127.0.0.1:13288 2020-07-31T16:51:52.891+0800 [DEBUG] nomad: memberlist: Stream connection from=10.69.180.4:52912

tgross commented 4 years ago

Hi @Bingqiang-Jing! It looks like this behavior is what we'd expect to see.

The client connects to a specific server when it registers and maintains that connection. It will only retry a request against another server for "read only" RPCs, and the heartbeat RPC that updates the node's status (`Node.UpdateStatus` in your client logs) is not one of them. So when the client couldn't reach the failed server, it did not retry the heartbeat against another server. When the remaining servers then saw that the client had missed its heartbeats, they rescheduled its workloads on another client.

Settler commented 3 years ago

@tgross Hi!

Could you please expand on your answer? We see similar behavior, and I want to clarify the client-server communication logic. As I understand it, when a client node selects a server for communication, it keeps its connection to that server. If that server becomes unavailable, the node won't reconnect to the other available servers. Will it always be marked as lost and then rejoin the cluster? Is there no way to avoid allocation rescheduling in that case? Will the node not reconnect to another available server (if the leader is still available)? Does Nomad have any settings to adjust that behavior?

tgross commented 3 years ago

@Settler my comment there could have been clearer, but I'm specifically referring to retries of an RPC. The failed heartbeat RPC is not retried, so the client has to heartbeat again on its next interval, and by then it may already have missed the heartbeat timeout on the server.

But if you have more questions about this, please open a new issue or, better yet, post to Discuss.
