jorgemarey opened this issue 4 years ago
Hi @jorgemarey, there were a number of changes in 0.9.6 (released yesterday) around the client and allocations, but I don't have any good reason to believe they will address the issue you're seeing. I'm going to look at your report and see if I can reproduce it, but in the meantime, if you are able to try out 0.9.6, I would be curious to see whether it also exhibits this behavior.
Hi @cgbaker , thanks for the reply. I'll change the nomad version and see if this happens again.
Just to add more info: I took a look at the logs at the moment when the allocation in the image (e6ee3260-a1c9-c283-ad17-d1cb6950774d) got the first kill signal.
Oct 9 16:00:37 w-008a19cc-0017 nomad: {"@level":"debug","@message":"allocation updates","@module":"client","@timestamp":"2019-10-09T16:00:37.433289Z","added":0,"ignored":73,"removed":0,"updated":13}
Oct 9 16:00:37 w-008a19cc-0017 nomad: {"@level":"debug","@message":"allocation updates applied","@module":"client","@timestamp":"2019-10-09T16:00:37.479736Z","added":0,"errors":0,"ignored":73,"removed":0,"updated":13}
Oct 9 16:00:37 w-008a19cc-0017 nomad: {"@level":"debug","@message":"updated allocations","@module":"client","@timestamp":"2019-10-09T16:00:37.703560Z","filtered":74,"index":29722784,"pulled":12,"total":86}
Oct 9 16:00:37 w-008a19cc-0017 nomad: {"@level":"debug","@message":"allocation updates","@module":"client","@timestamp":"2019-10-09T16:00:37.703838Z","added":0,"ignored":74,"removed":0,"updated":12}
Oct 9 16:00:37 w-008a19cc-0017 nomad: {"@level":"debug","@message":"allocation updates applied","@module":"client","@timestamp":"2019-10-09T16:00:37.756186Z","added":0,"errors":0,"ignored":74,"removed":0,"updated":12}
Oct 9 16:00:39 w-008a19cc-0017 nomad: {"@level":"debug","@message":"error fetching stats of task","@module":"client.alloc_runner.task_runner.task_hook.stats_hook","@timestamp":"2019-10-09T16:00:39.442363Z","alloc_id":"e6ee3260-a1c9-c283-ad17-d1cb6950774d","error":"container stopped","task":"pdp"}
Oct 9 16:00:39 w-008a19cc-0017 nomad: {"@level":"debug","@message":"plugin process exited","@module":"client.driver_mgr.docker.docker_logger","@timestamp":"2019-10-09T16:00:39.446006Z","driver":"docker","path":"/opt/nomad/bin/nomad","pid":22377}
Oct 9 16:00:39 w-008a19cc-0017 nomad: {"@level":"debug","@message":"plugin exited","@module":"client.driver_mgr.docker.docker_logger","@timestamp":"2019-10-09T16:00:39.446093Z","driver":"docker"}
Oct 9 16:00:39 w-008a19cc-0017 nomad: {"@level":"debug","@message":"image id reference count decremented","@module":"client.driver_mgr.docker","@timestamp":"2019-10-09T16:00:39.495588Z","driver":"docker","image_id":"sha256:cb101cb1872bdda9104a0e39c3822af6dd939c44ab18ed4996dcf5250668ed59","references":0}
Oct 9 16:00:39 w-008a19cc-0017 nomad: {"@level":"info","@message":"marking allocation for GC","@module":"client.gc","@timestamp":"2019-10-09T16:00:39.501933Z","alloc_id":"e6ee3260-a1c9-c283-ad17-d1cb6950774d"}
Oct 9 16:00:39 w-008a19cc-0017 nomad: {"@level":"debug","@message":"updated allocations","@module":"client","@timestamp":"2019-10-09T16:00:39.663174Z","filtered":74,"index":29722786,"pulled":12,"total":86}
Oct 9 16:00:39 w-008a19cc-0017 nomad: {"@level":"debug","@message":"allocation updates","@module":"client","@timestamp":"2019-10-09T16:00:39.663424Z","added":0,"ignored":74,"removed":0,"updated":12}
Oct 9 16:00:39 w-008a19cc-0017 nomad: {"@level":"debug","@message":"allocation updates applied","@module":"client","@timestamp":"2019-10-09T16:00:39.694808Z","added":0,"errors":0,"ignored":74,"removed":0,"updated":12}
I don't see anything weird, but maybe you do. Maybe the allocation was killed but internally it wasn't removed from the GC heap, it was garbage collected a second time, and that produced a lock in the destroyAllocRunner method? That would mean that when gc_max_allocs is reached, the MakeRoomFor method (which calls destroyAllocRunner) blocks, and no allocations get placed on the node after that.
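To make that theory concrete, here is a minimal, hypothetical Go sketch of the suspected failure mode (this is not Nomad's actual code; the destroySlots channel and the stuck flag are invented for illustration): a fixed number of destroy slots stands in for gc_parallel_destroys, a destroy that never returns pins a slot forever, and once all slots are pinned any later MakeRoomFor call blocks, so new allocations never get placed.

```go
package main

import (
	"fmt"
	"time"
)

// clientGC is a toy stand-in for the client's allocation GC.
type clientGC struct {
	destroySlots chan struct{} // capacity models gc_parallel_destroys
}

// destroyAllocRunner takes a destroy slot; if the kill never completes,
// the slot is held forever.
func (g *clientGC) destroyAllocRunner(id string, stuck bool) {
	g.destroySlots <- struct{}{}        // acquire a destroy slot
	defer func() { <-g.destroySlots }() // release it when destroy returns
	if stuck {
		select {} // simulate a task kill that never finishes
	}
}

// MakeRoomFor models GC destroying a terminal alloc before a new placement.
func (g *clientGC) MakeRoomFor(newAlloc, victim string, victimStuck bool) {
	g.destroyAllocRunner(victim, victimStuck)
	fmt.Println("room made, placing", newAlloc)
}

func main() {
	g := &clientGC{destroySlots: make(chan struct{}, 2)} // gc_parallel_destroys = 2

	// Two destroys hang (e.g. a blocked task kill) and pin both slots.
	go g.MakeRoomFor("alloc-a", "stuck-victim-1", true)
	go g.MakeRoomFor("alloc-b", "stuck-victim-2", true)
	time.Sleep(100 * time.Millisecond)

	// Every later placement blocks acquiring a slot and stays "pending".
	go g.MakeRoomFor("alloc-c", "healthy-terminal", false)
	time.Sleep(time.Second)
	fmt.Println("alloc-c is still waiting for a destroy slot")
}
```

If something like this is happening, it would also explain why restarting the client recovers the node: the restart drops the stuck destroys and frees the slots.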
@jorgemarey, yeah, that's sort of what I was theorizing as well.
let me know how 0.9.6 works out.
I, too, often notice this unexplained behavior where the cluster seems to be waiting for something, though I don't know what.
In my experiments I have suspected the default backoff timings, which cause it to go into some sort of wait.
To get over it quickly I usually use nomad-helper with reevaluate-all. I know it's not a solution, just a workaround.
Hi @cgbaker. This happened to me again on v0.9.6.
Here is some information about the node:
ID = 1e3b2610-2de1-f3d3-6278-10f9442c5859
Name = w-8af25c22-0011
Class = default
DC = my-dc
Drain = true; 2019-11-27T08:50:54Z deadline; ignoring system jobs
Eligibility = ineligible
Status = ready
Uptime = 311h9m59s
Drivers
Driver Detected Healthy Message Time
docker true true Healthy 2019-11-14T08:56:50Z
exec true true Healthy 2019-11-14T08:56:50Z
java false false <none> 2019-11-14T08:56:50Z
qemu false false <none> 2019-11-14T08:56:50Z
raw_exec false false disabled 2019-11-14T08:56:50Z
rkt false false Failed to execute rkt version: exec: "rkt": executable file not found in $PATH 2019-11-14T08:56:50Z
Node Events
Time Subsystem Message Details
2019-11-27T07:50:54Z Drain Node drain strategy set <none>
2019-11-14T08:42:55Z Cluster Node re-registered <none>
2019-11-14T08:42:09Z Cluster Node heartbeat missed <none>
2019-10-30T14:07:12Z Cluster Node registered <none>
Allocated Resources
CPU Memory Disk
19766/24746 MHz 24 GiB/30 GiB 8.6 GiB/27 GiB
Allocation Resource Utilization
CPU Memory
2157/24746 MHz 5.9 GiB/30 GiB
Host Resource Utilization
CPU Memory Disk
6822/25496 MHz 17 GiB/31 GiB 4.4 GiB/30 GiB
Allocations
ID Eval ID Node ID Task Group Version Desired Status Created Modified
ee940ff5 016b0c20 1e3b2610 my-awesome-task-group-01 824638234992 run pending 2019-11-26T11:42:59+01:00 2019-11-26T11:42:59+01:00
7df0280f 97934eb2 1e3b2610 my-awesome-task-group-02 824637944160 stop pending 2019-11-26T10:48:22+01:00 2019-11-26T10:57:08+01:00
50a645e9 75cb8b8e 1e3b2610 my-awesome-task-group-03 824635333456 stop pending 2019-11-26T10:34:34+01:00 2019-11-26T10:41:16+01:00
59170ea3 bb7b8702 1e3b2610 my-awesome-task-group-04 824638153008 run pending 2019-11-26T10:29:55+01:00 2019-11-26T10:29:55+01:00
486653be 8aeeebee 1e3b2610 my-awesome-task-group-05 824635331440 stop pending 2019-11-25T13:00:00+01:00 2019-11-27T07:16:59+01:00
47478c55 79e7bbda 1e3b2610 my-awesome-task-group-05 824635329888 stop pending 2019-11-23T16:53:03+01:00 2019-11-26T10:27:55+01:00
4585de6b 55da26ce 1e3b2610 my-awesome-task-group-06 824636173440 run pending 2019-11-23T16:05:32+01:00 2019-11-23T16:05:32+01:00
8c2c8616 31559a46 1e3b2610 my-awesome-task-group-07 824638154928 run pending 2019-11-23T01:00:00+01:00 2019-11-23T01:00:00+01:00
1de288e4 399f5d60 1e3b2610 my-awesome-task-group-08 824636578784 stop pending 2019-11-23T00:14:27+01:00 2019-11-26T10:28:00+01:00
6013f2b3 9895fcb5 1e3b2610 my-awesome-task-group-09 824639410272 stop running 2019-11-22T13:54:48+01:00 2019-11-27T08:50:56+01:00
5226161d 3f9cbc28 1e3b2610 my-awesome-task-group-10 824635312832 stop running 2019-11-22T13:49:37+01:00 2019-11-27T08:50:55+01:00
8da1b5f1 33c5685a 1e3b2610 my-awesome-task-group-11 824641962080 stop running 2019-11-22T13:10:40+01:00 2019-11-27T08:50:55+01:00
ddafdeb2 4d16c207 1e3b2610 my-awesome-task-group-12 824639033344 stop running 2019-11-20T11:18:49+01:00 2019-11-27T08:50:55+01:00
cdb29e98 74991fdc 1e3b2610 my-awesome-task-group-13 824641502776 stop running 2019-11-19T18:34:24+01:00 2019-11-27T08:50:56+01:00
02c47c6a 166dafe1 1e3b2610 my-awesome-task-group-14 824640186736 stop running 2019-11-19T18:34:24+01:00 2019-11-27T08:50:55+01:00
dd9e1cae 3fe37dd9 1e3b2610 my-awesome-task-group-15 824639807072 stop running 2019-11-19T18:34:24+01:00 2019-11-27T08:50:55+01:00
0bd892eb 658e3bd9 1e3b2610 my-awesome-task-group-16 824640188816 stop running 2019-11-19T18:34:24+01:00 2019-11-27T08:50:55+01:00
813155e1 805a515e 1e3b2610 my-awesome-task-group-17 824637945552 stop running 2019-11-19T18:34:24+01:00 2019-11-27T08:50:55+01:00
22a86288 b747ec55 1e3b2610 my-awesome-task-group-18 824635953776 stop running 2019-11-19T18:34:24+01:00 2019-11-27T08:50:55+01:00
85ad09a4 43c4a47e 1e3b2610 my-awesome-task-group-19 824637947632 stop running 2019-11-19T18:34:24+01:00 2019-11-27T08:50:55+01:00
0d0a66f6 e4fe735b 1e3b2610 my-awesome-task-group-20 824640190880 stop running 2019-11-19T18:34:24+01:00 2019-11-27T08:50:55+01:00
3c4b2b41 bdc4e0fd 1e3b2610 my-awesome-task-group-21 824635327488 stop running 2019-11-19T18:34:24+01:00 2019-11-27T08:50:55+01:00
85e1283b 61c62a1e 1e3b2610 my-awesome-task-group-22 824640026720 stop running 2019-11-19T18:34:24+01:00 2019-11-27T08:50:55+01:00
0eadac30 a9bf5d48 1e3b2610 my-awesome-task-group-23 824638155584 stop running 2019-11-19T18:34:24+01:00 2019-11-27T08:50:55+01:00
5e24e0d8 71013171 1e3b2610 my-awesome-task-group-24 824639407736 stop running 2019-11-19T18:34:24+01:00 2019-11-27T08:50:55+01:00
18f76384 8075b3ca 1e3b2610 my-awesome-task-group-25 824636572880 stop running 2019-11-19T18:29:57+01:00 2019-11-27T08:50:55+01:00
aa93b0ec 1912e991 1e3b2610 my-awesome-task-group-26 824641499936 stop running 2019-11-19T18:29:05+01:00 2019-11-27T08:50:55+01:00
9f81fd0f 1912e991 1e3b2610 my-awesome-task-group-27 824641967168 stop running 2019-11-19T18:29:05+01:00 2019-11-27T08:50:55+01:00
a395e909 ee381d2c 1e3b2610 my-awesome-task-group-28 824641969840 stop running 2019-11-19T18:29:05+01:00 2019-11-27T08:50:55+01:00
d63ee7a7 4a7b9dbd 1e3b2610 my-awesome-task-group-29 824639803520 stop running 2019-11-19T14:36:30+01:00 2019-11-27T08:50:55+01:00
57a6c5c0 82b3dfe6 1e3b2610 my-awesome-task-group-30 824635317792 stop running 2019-11-19T14:36:30+01:00 2019-11-27T08:50:55+01:00
e663560f b745ec07 1e3b2610 my-awesome-task-group-31 824642736560 stop running 2019-11-19T14:36:30+01:00 2019-11-27T08:50:55+01:00
85b14ee2 b422baae 1e3b2610 my-awesome-task-group-32 824640022512 stop running 2019-11-19T14:36:30+01:00 2019-11-27T08:50:55+01:00
e62b96b5 82b3dfe6 1e3b2610 my-awesome-task-group-33 824642732288 stop running 2019-11-19T14:36:30+01:00 2019-11-27T08:50:55+01:00
62fe2923 663196a3 1e3b2610 my-awesome-task-group-34 824643776240 stop running 2019-11-19T14:36:30+01:00 2019-11-27T08:50:55+01:00
65a363c8 26c4f2e9 1e3b2610 my-awesome-task-group-35 824643779616 stop running 2019-11-19T14:36:30+01:00 2019-11-27T08:50:55+01:00
5c1114c1 4ccdda9c 1e3b2610 my-awesome-task-group-36 824633981688 stop running 2019-11-19T14:36:30+01:00 2019-11-27T08:50:55+01:00
c4c25cbf 74b2b79d 1e3b2610 my-awesome-task-group-37 824638158584 run running 2019-11-14T09:42:55+01:00 2019-11-24T03:21:30+01:00
4432dfb2 aab128e3 1e3b2610 my-awesome-task-group-38 824636171160 run running 2019-11-14T09:42:55+01:00 2019-11-14T09:43:18+01:00
As you can see, the node is draining. I tried to move the allocations to another node, but they stayed running; I had to restart nomad to move them.
At the moment the first allocation entered the pending state, the logs again show:
Nov 22 23:14:27 w-8af25c22-0011 nomad: {"@level":"info","@message":"garbage collecting allocation","@module":"client.gc","@timestamp":"2019-11-22T23:14:27.767394Z","alloc_id":"235243f0-f863-5600-bd2b-cf97a9c255b7","reason":"new allocations and over max (50)"}
Nov 22 23:14:27 w-8af25c22-0011 nomad: {"@level":"info","@message":"marking allocation for GC","@module":"client.gc","@timestamp":"2019-11-22T23:14:27.768906Z","alloc_id":"235243f0-f863-5600-bd2b-cf97a9c255b7"}
That allocation doesn't appear in the node's allocation list (I guess it was garbage collected as the log says, but maybe not correctly). Searching the logs for that allocation, I saw a few entries referencing it after this point in time (which is weird).
Nov 23 04:42:15 w-8af25c22-0011 nomad: {"@level":"error","@message":"error renewing token","@module":"client.vault","@timestamp":"2019-11-23T04:42:15.755652Z","error":"failed to renew the vault token: Error making API request.\n\nURL: PUT https://vault.service.consul:8200/v1/auth/token/renew-self\nCode: 403. Errors:\n\n* permission denied"}
Nov 23 04:42:15 w-8af25c22-0011 nomad: {"@level":"error","@message":"failed to renew Vault token","@module":"client.alloc_runner.task_runner.task_hook.vault","@timestamp":"2019-11-23T04:42:15.755704Z","alloc_id":"235243f0-f863-5600-bd2b-cf97a9c255b7","error":"failed to renew the vault token: Error making API request.\n\nURL: PUT https://vault.service.consul:8200/v1/auth/token/renew-self\nCode: 403. Errors:\n\n* permission denied","task":"my-awesome-task-50"}
Nov 23 04:42:15 w-8af25c22-0011 nomad: {"@level":"error","@message":"error deriving vault tokens","@module":"client.vault","@timestamp":"2019-11-23T04:42:15.757211Z","error":"Can't request Vault token for terminal allocation"}
Nov 23 04:42:15 w-8af25c22-0011 nomad: {"@level":"error","@message":"error deriving token","@module":"client.vault","@timestamp":"2019-11-23T04:42:15.757264Z","alloc_id":"235243f0-f863-5600-bd2b-cf97a9c255b7","error":"Can't request Vault token for terminal allocation","task_names":["my-awesome-task-50"]}
Nov 23 04:42:15 w-8af25c22-0011 nomad: {"@level":"error","@message":"failed to derive Vault token","@module":"client.alloc_runner.task_runner.task_hook.vault","@timestamp":"2019-11-23T04:42:15.757298Z","alloc_id":"235243f0-f863-5600-bd2b-cf97a9c255b7","error":"Can't request Vault token for terminal allocation","server_side":true,"task":"my-awesome-task-50"}
Following are the other log entries I found for that allocation.
When it started:
Nov 21 16:42:15 w-8af25c22-0011 nomad: {"@level":"error","@message":"update hook failed","@module":"client.alloc_runner.task_runner","@timestamp":"2019-11-21T16:42:15.746594Z","alloc_id":"235243f0-f863-5600-bd2b-cf97a9c255b7","error":"unable to get address for service \"my-awesome-task-50\": cannot use address_mode=\"driver\": no driver network exists","name":"consul_services","task":"my-awesome-task-50"}
Nov 21 16:42:15 w-8af25c22-0011 nomad: 2019/11/21 16:42:15.746836 [INFO] (runner) creating new runner (dry: false, once: false)
Nov 21 16:42:15 w-8af25c22-0011 nomad: 2019/11/21 16:42:15.747233 [INFO] (runner) creating watcher
Nov 21 16:42:15 w-8af25c22-0011 nomad: 2019/11/21 16:42:15.747464 [INFO] (runner) starting
Nov 21 16:42:15 w-8af25c22-0011 nomad: 2019/11/21 16:42:15.846437 [INFO] (runner) rendered "(dynamic)" => "/opt/nomad/data/alloc/235243f0-f863-5600-bd2b-cf97a9c255b7/openidm-work-aws/secrets/file.env"
Nov 21 16:44:16 w-8af25c22-0011 nomad: {"@level":"info","@message":"created container","@module":"client.driver_mgr.docker","@timestamp":"2019-11-21T16:44:16.914962Z","container_id":"43c9b4d838ea1bb1c3213a69b9b1b3b77080c0e6e2e417ac3eddb69e4ec6b120","driver":"docker"}
Nov 21 16:44:18 w-8af25c22-0011 nomad: {"@level":"info","@message":"started container","@module":"client.driver_mgr.docker","@timestamp":"2019-11-21T16:44:18.530035Z","container_id":"43c9b4d838ea1bb1c3213a69b9b1b3b77080c0e6e2e417ac3eddb69e4ec6b120","driver":"docker"}
Nov 21 16:44:18 w-8af25c22-0011 consul: 2019/11/21 16:44:18 [INFO] agent: Synced service "_nomad-task-235243f0-f863-5600-bd2b-cf97a9c255b7-my-awesome-task-50-my-awesome-task-50-1"
Nov 21 16:44:18 w-8af25c22-0011 consul: 2019/11/21 16:44:18 [INFO] agent: Synced service "_nomad-task-235243f0-f863-5600-bd2b-cf97a9c255b7-my-awesome-task-50-my-awesome-task-50-2"
At another moment:
Nov 21 17:05:37 w-8af25c22-0011 nomad: {"@level":"info","@message":"task exec session starting","@module":"client","@timestamp":"2019-11-21T17:05:37.051190Z","access_token_id":"","access_token_name":"","alloc_id":"235243f0-f863-5600-bd2b-cf97a9c255b7","command":["bash"],"exec_id":"fb5eaf11-ec7e-cdda-326d-42e1430e75b0","task":"my-awesome-task-50","tty":true}
Nov 21 17:10:08 w-8af25c22-0011 nomad: {"@level":"info","@message":"task exec session starting","@module":"client","@timestamp":"2019-11-21T17:10:08.602232Z","access_token_id":"","access_token_name":"","alloc_id":"235243f0-f863-5600-bd2b-cf97a9c255b7","command":["bash"],"exec_id":"e94c44d0-05fb-5ae9-5535-283cc5effbd3","task":"my-awesome-task-50","tty":true}
Nov 22 09:55:52 w-8af25c22-0011 consul: 2019/11/22 09:55:52 [INFO] agent: Deregistered service "_nomad-task-235243f0-f863-5600-bd2b-cf97a9c255b7-my-awesome-task-50-my-awesome-task-50-1"
Nov 22 09:55:52 w-8af25c22-0011 consul: 2019/11/22 09:55:52 [INFO] agent: Deregistered check "_nomad-check-8321040470f60a32220eab36e0d2db5778a5be4b"
Nov 22 09:55:52 w-8af25c22-0011 consul: 2019/11/22 09:55:52 [INFO] agent: Deregistered service "_nomad-task-235243f0-f863-5600-bd2b-cf97a9c255b7-my-awesome-task-50-my-awesome-task-50-2"
Nov 22 09:55:52 w-8af25c22-0011 consul: 2019/11/22 09:55:52 [INFO] agent: Deregistered check "_nomad-check-0312618a6f394f78168db7261452fe0ad71d915c"
Nov 22 09:55:52 w-8af25c22-0011 consul: 2019/11/22 09:55:52 [ERR] http: Request PUT /v1/agent/check/deregister/_nomad-check-8321040470f60a32220eab36e0d2db5778a5be4b, error: Unknown check "_nomad-check-8321040470f60a32220eab36e0d2db5778a5be4b" from=127.0.0.1:49318
Nov 22 09:55:52 w-8af25c22-0011 nomad: {"@level":"warn","@message":"failed to update services in Consul","@module":"consul.sync","@timestamp":"2019-11-22T09:55:52.918465Z","error":"Unexpected response code: 500 (Unknown check \"_nomad-check-8321040470f60a32220eab36e0d2db5778a5be4b\")"}
Nov 22 09:55:52 w-8af25c22-0011 nomad: {"@level":"info","@message":"successfully updated services in Consul","@module":"consul.sync","@timestamp":"2019-11-22T09:55:52.925661Z"}
Nov 22 09:55:57 w-8af25c22-0011 dockerd: time="2019-11-22T09:55:57.880753963Z" level=info msg="Container 43c9b4d838ea1bb1c3213a69b9b1b3b77080c0e6e2e417ac3eddb69e4ec6b120 failed to exit within 0 seconds of signal 15 - using the force"
Nov 22 09:55:57 w-8af25c22-0011 nomad: {"@level":"info","@message":"task exec session ended with an error","@module":"client","@timestamp":"2019-11-22T09:55:57.947800Z","code":500,"error":"stream closed"}
Nov 22 09:55:57 w-8af25c22-0011 nomad: {"@level":"info","@message":"task exec session ended with an error","@module":"client","@timestamp":"2019-11-22T09:55:57.948122Z","code":500,"error":"stream closed"}
Nov 22 09:55:58 w-8af25c22-0011 nomad: {"@level":"info","@message":"stopped container","@module":"client.driver_mgr.docker","@timestamp":"2019-11-22T09:55:58.961653Z","container_id":"43c9b4d838ea1bb1c3213a69b9b1b3b77080c0e6e2e417ac3eddb69e4ec6b120","driver":"docker"}
Nov 22 09:55:58 w-8af25c22-0011 nomad: {"@level":"error","@message":"failed to start stats collection for task with unrecoverable error","@module":"client.alloc_runner.task_runner.task_hook.stats_hook","@timestamp":"2019-11-22T09:55:58.963746Z","alloc_id":"235243f0-f863-5600-bd2b-cf97a9c255b7","error":"container stopped","task":"my-awesome-task-50"}
Nov 22 09:55:59 w-8af25c22-0011 nomad: {"@level":"info","@message":"marking allocation for GC","@module":"client.gc","@timestamp":"2019-11-22T09:55:59.126528Z","alloc_id":"235243f0-f863-5600-bd2b-cf97a9c255b7"}
As you can see, here there is another message about garbage collecting this allocation, which I guess shouldn't happen.
That's all I could find. I'll keep the logs anyway in case they help in the future. I'll be happy to provide any other info if needed.
Hi again. I think I found the root cause of our problems (at least the ones reported here). I revisited the goroutine traces and found the following:
goroutine 4793402 [semacquire, 611 minutes]:
sync.runtime_Semacquire(0xc0031afc08)
/usr/local/go/src/runtime/sema.go:56 +0x39
sync.(*WaitGroup).Wait(0xc0031afc00)
/usr/local/go/src/sync/waitgroup.go:130 +0x64
github.com/hashicorp/nomad/client/allocrunner.(*allocRunner).killTasks(0xc00890cf00, 0xc008d65020)
/opt/gopath/src/github.com/hashicorp/nomad/client/allocrunner/alloc_runner.go:525 +0x4b2
github.com/hashicorp/nomad/client/allocrunner.(*allocRunner).destroyImpl(0xc00890cf00)
/opt/gopath/src/github.com/hashicorp/nomad/client/allocrunner/alloc_runner.go:738 +0x2f
created by github.com/hashicorp/nomad/client/allocrunner.(*allocRunner).Destroy
/opt/gopath/src/github.com/hashicorp/nomad/client/allocrunner/alloc_runner.go:806 +0xe2
goroutine 4793403 [select, 611 minutes]:
github.com/hashicorp/nomad/client/allocrunner/taskrunner.(*TaskRunner).Kill(0xc006798b40, 0x24828a0, 0xc0000c6038, 0xc006ca3400, 0x0, 0x0)
/opt/gopath/src/github.com/hashicorp/nomad/client/allocrunner/taskrunner/lifecycle.go:80 +0x225
github.com/hashicorp/nomad/client/allocrunner.(*allocRunner).killTasks.func1(0xc0031afc00, 0xc00890cf00, 0xc0031afbf0, 0xc008d3d410, 0xc0065f3e03, 0x3, 0xc006798b40)
/opt/gopath/src/github.com/hashicorp/nomad/client/allocrunner/alloc_runner.go:514 +0xa4
created by github.com/hashicorp/nomad/client/allocrunner.(*allocRunner).killTasks
/opt/gopath/src/github.com/hashicorp/nomad/client/allocrunner/alloc_runner.go:512 +0x49c
goroutine 850953 [semacquire, 17054 minutes]:
sync.runtime_Semacquire(0xc005714558)
/usr/local/go/src/runtime/sema.go:56 +0x39
sync.(*WaitGroup).Wait(0xc005714550)
/usr/local/go/src/sync/waitgroup.go:130 +0x64
github.com/hashicorp/nomad/client/allocrunner.(*allocRunner).killTasks(0xc00cd63200, 0xc007ae5da0)
/opt/gopath/src/github.com/hashicorp/nomad/client/allocrunner/alloc_runner.go:525 +0x4b2
github.com/hashicorp/nomad/client/allocrunner.(*allocRunner).destroyImpl(0xc00cd63200)
/opt/gopath/src/github.com/hashicorp/nomad/client/allocrunner/alloc_runner.go:738 +0x2f
created by github.com/hashicorp/nomad/client/allocrunner.(*allocRunner).Destroy
/opt/gopath/src/github.com/hashicorp/nomad/client/allocrunner/alloc_runner.go:806 +0xe2
goroutine 850954 [select, 17054 minutes]:
github.com/hashicorp/nomad/client/allocrunner/taskrunner.(*TaskRunner).Kill(0xc002354d80, 0x24828a0, 0xc0000c6038, 0xc00cb9e000, 0xc0001aa590, 0xc002cbf7b0)
/opt/gopath/src/github.com/hashicorp/nomad/client/allocrunner/taskrunner/lifecycle.go:80 +0x225
github.com/hashicorp/nomad/client/allocrunner.(*allocRunner).killTasks.func1(0xc005714550, 0xc00cd63200, 0xc005714540, 0xc00601b2c0, 0xc00066df5b, 0x3, 0xc002354d80)
/opt/gopath/src/github.com/hashicorp/nomad/client/allocrunner/alloc_runner.go:514 +0xa4
created by github.com/hashicorp/nomad/client/allocrunner.(*allocRunner).killTasks
/opt/gopath/src/github.com/hashicorp/nomad/client/allocrunner/alloc_runner.go:512 +0x49c
Two goroutines were blocked waiting for task killing to finish. That exhausted gc_parallel_destroys, which made new allocations block once the node had more than gc_max_allocs.
The lifecycle manager calls killCtxCancel and then waits for either the task to end or the context to be done. The problem is that neither of those things happened, and that is what caused the block.
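For illustration, here is a hedged, simplified sketch of that blocking chain, shaped after the traces above but not the real task runner: Kill cancels the task's kill context and then selects on "task finished" versus a shutdown context, while killTasks waits on a WaitGroup for every Kill to return. If the task never signals completion and the shutdown context is never cancelled, the WaitGroup.Wait in killTasks (the semacquire seen in the dump) never returns.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// taskRunner is a toy model of the task runner, not the real one.
type taskRunner struct {
	killCtxCancel context.CancelFunc
	waitCh        chan struct{} // closed when the task's Run loop exits
}

// Kill asks the task to stop, then waits for it to finish or for shutdown.
func (tr *taskRunner) Kill(shutdownCtx context.Context) {
	tr.killCtxCancel() // signal the task to stop
	select {
	case <-tr.waitCh: // never closed if a stop hook hangs
	case <-shutdownCtx.Done(): // never fires during a normal destroy
	}
}

// killTasks kills every task and waits for all Kill calls to return.
func killTasks(shutdownCtx context.Context, tasks []*taskRunner) {
	var wg sync.WaitGroup
	for _, tr := range tasks {
		wg.Add(1)
		go func(tr *taskRunner) {
			defer wg.Done()
			tr.Kill(shutdownCtx)
		}(tr)
	}
	wg.Wait() // the sync.WaitGroup.Wait seen blocking in the goroutine dump
}

func main() {
	killCtx, killCtxCancel := context.WithCancel(context.Background())
	_ = killCtx // the task's stop hooks should observe this, but (as in the bug) one does not
	tr := &taskRunner{killCtxCancel: killCtxCancel, waitCh: make(chan struct{})}

	done := make(chan struct{})
	go func() {
		killTasks(context.Background(), []*taskRunner{tr})
		close(done)
	}()

	select {
	case <-done:
		fmt.Println("destroy finished")
	case <-time.After(time.Second):
		fmt.Println("destroy is stuck: the task never exited and the shutdown context is never cancelled")
	}
}
```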
Inspecting the goroutines, I found two that started in task_runner and were blocking:
goroutine 387964 [select, 18503 minutes]:
github.com/hashicorp/nomad/vendor/google.golang.org/grpc/internal/transport.(*Stream).waitOnHeader(0xc001484c00, 0xc006ad7180, 0x1)
/opt/gopath/src/github.com/hashicorp/nomad/vendor/google.golang.org/grpc/internal/transport/transport.go:242 +0xcc
github.com/hashicorp/nomad/vendor/google.golang.org/grpc/internal/transport.(*Stream).RecvCompress(0xc001484c00, 0xc000051900, 0x963ce1)
/opt/gopath/src/github.com/hashicorp/nomad/vendor/google.golang.org/grpc/internal/transport/transport.go:253 +0x2b
github.com/hashicorp/nomad/vendor/google.golang.org/grpc.(*csAttempt).recvMsg(0xc000d34dc0, 0x1fd93e0, 0xc00d226080, 0x0, 0x0)
/opt/gopath/src/github.com/hashicorp/nomad/vendor/google.golang.org/grpc/stream.go:741 +0x5c0
github.com/hashicorp/nomad/vendor/google.golang.org/grpc.(*clientStream).RecvMsg.func1(0xc000d34dc0, 0x0, 0x10)
/opt/gopath/src/github.com/hashicorp/nomad/vendor/google.golang.org/grpc/stream.go:648 +0x3d
github.com/hashicorp/nomad/vendor/google.golang.org/grpc.(*clientStream).withRetry(0xc004a47d40, 0xc00c701838, 0xc00c701828, 0x9699c0, 0xc004a47d40)
/opt/gopath/src/github.com/hashicorp/nomad/vendor/google.golang.org/grpc/stream.go:534 +0x7f
github.com/hashicorp/nomad/vendor/google.golang.org/grpc.(*clientStream).RecvMsg(0xc004a47d40, 0x1fd93e0, 0xc00d226080, 0x0, 0x0)
/opt/gopath/src/github.com/hashicorp/nomad/vendor/google.golang.org/grpc/stream.go:647 +0x8d
github.com/hashicorp/nomad/vendor/google.golang.org/grpc.invoke(0x24828a0, 0xc0000c6030, 0x221ee96, 0x30, 0x1fd9300, 0xc00d226060, 0x1fd93e0, 0xc00d226080, 0xc002043200, 0x0, ...)
/opt/gopath/src/github.com/hashicorp/nomad/vendor/google.golang.org/grpc/call.go:73 +0x141
github.com/hashicorp/nomad/vendor/google.golang.org/grpc.(*ClientConn).Invoke(0xc002043200, 0x24828a0, 0xc0000c6030, 0x221ee96, 0x30, 0x1fd9300, 0xc00d226060, 0x1fd93e0, 0xc00d226080, 0x0, ...)
/opt/gopath/src/github.com/hashicorp/nomad/vendor/google.golang.org/grpc/call.go:37 +0x1b3
github.com/hashicorp/nomad/client/logmon/proto.(*logMonClient).Stop(0xc00020ea08, 0x24828a0, 0xc0000c6030, 0xc00d226060, 0x0, 0x0, 0x0, 0x0, 0x0, 0x22635e8)
/opt/gopath/src/github.com/hashicorp/nomad/client/logmon/proto/logmon.pb.go:268 +0xd2
github.com/hashicorp/nomad/client/logmon.(*logmonClient).Stop(0xc006803a40, 0x113f467, 0xc00c701ac8)
/opt/gopath/src/github.com/hashicorp/nomad/client/logmon/client.go:40 +0x7a
github.com/hashicorp/nomad/client/allocrunner/taskrunner.(*logmonHook).Stop(0xc0014e9080, 0x2482860, 0xc008881ec0, 0xc000a44400, 0x3af5418, 0xc0031c2901, 0x5)
/opt/gopath/src/github.com/hashicorp/nomad/client/allocrunner/taskrunner/logmon_hook.go:195 +0x6f
github.com/hashicorp/nomad/client/allocrunner/taskrunner.(*TaskRunner).stop(0xc002354d80, 0x0, 0x0)
/opt/gopath/src/github.com/hashicorp/nomad/client/allocrunner/taskrunner/task_runner_hooks.go:366 +0x2a5
github.com/hashicorp/nomad/client/allocrunner/taskrunner.(*TaskRunner).Run(0xc002354d80)
/opt/gopath/src/github.com/hashicorp/nomad/client/allocrunner/taskrunner/task_runner.go:529 +0x5f1
created by github.com/hashicorp/nomad/client/allocrunner.(*allocRunner).runTasks
/opt/gopath/src/github.com/hashicorp/nomad/client/allocrunner/alloc_runner.go:297 +0xa6
goroutine 424573 [select, 915 minutes]:
github.com/hashicorp/nomad/vendor/google.golang.org/grpc/internal/transport.(*Stream).waitOnHeader(0xc00d3b5600, 0xc00cb98af0, 0x1)
/opt/gopath/src/github.com/hashicorp/nomad/vendor/google.golang.org/grpc/internal/transport/transport.go:242 +0xcc
github.com/hashicorp/nomad/vendor/google.golang.org/grpc/internal/transport.(*Stream).RecvCompress(0xc00d3b5600, 0xc000046000, 0x963ce1)
/opt/gopath/src/github.com/hashicorp/nomad/vendor/google.golang.org/grpc/internal/transport/transport.go:253 +0x2b
github.com/hashicorp/nomad/vendor/google.golang.org/grpc.(*csAttempt).recvMsg(0xc00b064160, 0x1fd93e0, 0xc00b58d820, 0x0, 0x0)
/opt/gopath/src/github.com/hashicorp/nomad/vendor/google.golang.org/grpc/stream.go:741 +0x5c0
github.com/hashicorp/nomad/vendor/google.golang.org/grpc.(*clientStream).RecvMsg.func1(0xc00b064160, 0x0, 0x10)
/opt/gopath/src/github.com/hashicorp/nomad/vendor/google.golang.org/grpc/stream.go:648 +0x3d
github.com/hashicorp/nomad/vendor/google.golang.org/grpc.(*clientStream).withRetry(0xc000d857a0, 0xc004ac9838, 0xc004ac9828, 0x9699c0, 0xc000d857a0)
/opt/gopath/src/github.com/hashicorp/nomad/vendor/google.golang.org/grpc/stream.go:534 +0x7f
github.com/hashicorp/nomad/vendor/google.golang.org/grpc.(*clientStream).RecvMsg(0xc000d857a0, 0x1fd93e0, 0xc00b58d820, 0x0, 0x0)
/opt/gopath/src/github.com/hashicorp/nomad/vendor/google.golang.org/grpc/stream.go:647 +0x8d
github.com/hashicorp/nomad/vendor/google.golang.org/grpc.invoke(0x24828a0, 0xc0000c6030, 0x221ee96, 0x30, 0x1fd9300, 0xc00b58d800, 0x1fd93e0, 0xc00b58d820, 0xc00a60a900, 0x0, ...)
/opt/gopath/src/github.com/hashicorp/nomad/vendor/google.golang.org/grpc/call.go:73 +0x141
github.com/hashicorp/nomad/vendor/google.golang.org/grpc.(*ClientConn).Invoke(0xc00a60a900, 0x24828a0, 0xc0000c6030, 0x221ee96, 0x30, 0x1fd9300, 0xc00b58d800, 0x1fd93e0, 0xc00b58d820, 0x0, ...)
/opt/gopath/src/github.com/hashicorp/nomad/vendor/google.golang.org/grpc/call.go:37 +0x1b3
github.com/hashicorp/nomad/client/logmon/proto.(*logMonClient).Stop(0xc0061d80a8, 0x24828a0, 0xc0000c6030, 0xc00b58d800, 0x0, 0x0, 0x0, 0x0, 0x0, 0x22635e8)
/opt/gopath/src/github.com/hashicorp/nomad/client/logmon/proto/logmon.pb.go:268 +0xd2
github.com/hashicorp/nomad/client/logmon.(*logmonClient).Stop(0xc003122ea0, 0x113f467, 0xc004ac9ac8)
/opt/gopath/src/github.com/hashicorp/nomad/client/logmon/client.go:40 +0x7a
github.com/hashicorp/nomad/client/allocrunner/taskrunner.(*logmonHook).Stop(0xc002732300, 0x2482860, 0xc003965a40, 0xc00000fbd8, 0x3af5418, 0xc00212f001, 0x5)
/opt/gopath/src/github.com/hashicorp/nomad/client/allocrunner/taskrunner/logmon_hook.go:195 +0x6f
github.com/hashicorp/nomad/client/allocrunner/taskrunner.(*TaskRunner).stop(0xc006798b40, 0x0, 0x0)
/opt/gopath/src/github.com/hashicorp/nomad/client/allocrunner/taskrunner/task_runner_hooks.go:366 +0x2a5
github.com/hashicorp/nomad/client/allocrunner/taskrunner.(*TaskRunner).Run(0xc006798b40)
/opt/gopath/src/github.com/hashicorp/nomad/client/allocrunner/taskrunner/task_runner.go:529 +0x5f1
created by github.com/hashicorp/nomad/client/allocrunner.(*allocRunner).runTasks
/opt/gopath/src/github.com/hashicorp/nomad/client/allocrunner/alloc_runner.go:297 +0xa6
Looking at the code, I saw that the lifecycle manager calls killCtxCancel, which cancels the context, but the one stop hook that does not use that context is logmon (there the context is ignored), leading to all the blocking explained earlier.
The problem (besides the context being ignored) seems to be in the grpc library. I tracked the goroutine to this, but I don't really know if that's what's happening here. Maybe upgrading the grpc dependencies will fix this.
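To illustrate the context-ignoring pattern (a simplified sketch; stopRPC here is a hypothetical stand-in for the logmon plugin's gRPC Stop call, not the real client): the hook accepts the caller's context but passes context.Background() down, so cancelling the kill context can never unblock a hung RPC.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// stopRPC is a stand-in for a gRPC Stop call that may never get a reply.
func stopRPC(ctx context.Context) error {
	select {
	case <-ctx.Done(): // the only way out if the server never answers
		return ctx.Err()
	case <-time.After(10 * time.Minute): // simulate a reply that never comes
		return nil
	}
}

// stopHookIgnoringCtx mirrors the problematic pattern: the caller's context
// is accepted but not threaded through to the RPC.
func stopHookIgnoringCtx(ctx context.Context) error {
	_ = ctx
	return stopRPC(context.Background()) // cancellation can never reach the RPC
}

// stopHookUsingCtx shows the obvious fix: pass the caller's context down.
func stopHookUsingCtx(ctx context.Context) error {
	return stopRPC(ctx)
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	cancel() // pretend killCtxCancel has already fired

	done := make(chan error, 1)
	go func() { done <- stopHookIgnoringCtx(ctx) }()
	select {
	case err := <-done:
		fmt.Println("hook returned:", err)
	case <-time.After(time.Second):
		fmt.Println("hook is stuck even though the kill context is cancelled")
	}

	fmt.Println("with the context threaded through:", stopHookUsingCtx(ctx))
}
```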
Hope this helps to identify the problem completely.
@jorgemarey That's some great sleuthing! I'll dig into it next week and follow up. Thank you very much!
Hey there
Since this issue hasn't had any activity in a while - we're going to automatically close it in 30 days. If you're still seeing this issue with the latest version of Nomad, please respond here and we'll keep this open and take another look at this.
Thanks!
This issue will be auto-closed because there hasn't been any activity for a few months. Feel free to open a new one if you still experience this problem :+1:
I think I hit this too now. I think this is what happened: I had a job that may or may not fail due to certain conditions (still in development, this is not a production environment).
I added job.update.auto_revert = true.
The job was failing, just like the previous allocation.
The Nomad UI now says "Failed due to progress deadline - no stable job version to auto revert to". Attempting to start the job again does nothing, but the following shows up in the logs:
http: request failed: method=GET path=/v1/job/testjob/deployment?index=226130 error="rpc error: EOF" code=500
I am also seeing this in the log (searching for this error is what led me here):
nomad.client: DeriveVaultToken failed: recoverable=false error="Can't request Vault token for terminal allocation"
There is a Vault integration on the nomad cluster, but not for this job.
Attempting to trigger GC through the API does remove all the allocations, but it does not remove the job, and new attempts at running new revisions have no effect.
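For reference, this is roughly how a GC can be triggered through the API; a minimal sketch assuming the official Nomad Go API client (the same thing can be done with a PUT to /v1/system/gc via curl). Whether that exact client was used here is an assumption; the sketch is illustrative only.

```go
package main

import (
	"log"

	"github.com/hashicorp/nomad/api"
)

func main() {
	// DefaultConfig honours NOMAD_ADDR / NOMAD_TOKEN from the environment.
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatalf("creating client: %v", err)
	}
	// Forces server-side garbage collection of eligible jobs, evaluations,
	// allocations and nodes (PUT /v1/system/gc).
	if err := client.System().GarbageCollect(); err != nil {
		log.Fatalf("triggering GC: %v", err)
	}
	log.Println("system GC triggered")
}
```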
When restarting the leader, a batch of logs appears on another node:
http: request failed: method=GET path=/v1/job/testjob/allocations?index=226209 error="rpc error: EOF" code=500
http: request failed: method=GET path=/v1/job/testjob/evaluations?index=226204 error="rpc error: EOF" code=500
http: request failed: method=GET path=/v1/job/testjob/deployment?index=226205 error="rpc error: EOF" code=500
http: request failed: method=GET path=/v1/job/testjob/allocations?index=226204 error="rpc error: EOF" code=500
http: request failed: method=GET path=/v1/job/testjob/summary?index=226056 error="rpc error: EOF" code=500
http: request failed: method=GET path=/v1/job/testjob?index=226202 error="rpc error: EOF" code=500
http: request failed: method=GET path=/v1/job/testjob/evaluations?index=226204 error="rpc error: EOF" code=500
http: request failed: method=GET path=/v1/job/testjob/summary?index=226056 error="rpc error: EOF" code=500
http: request failed: method=GET path=/v1/job/testjob/evaluations?index=226204 error="rpc error: EOF" code=500
http: request failed: method=GET path=/v1/job/testjob/deployment?index=226205 error="rpc error: EOF" code=500
http: request failed: method=GET path=/v1/job/testjob/summary?index=226056 error="rpc error: EOF" code=500
http: request failed: method=GET path=/v1/job/testjob/evaluations?index=226204 error="rpc error: EOF" code=500
http: request failed: method=GET path=/v1/job/testjob/evaluations?index=226204 error="rpc error: EOF" code=500
http: request failed: method=GET path=/v1/job/testjob/deployment?index=226205 error="rpc error: EOF" code=500
http: request failed: method=GET path=/v1/job/testjob/allocations?index=226209 error="rpc error: EOF" code=500
http: request failed: method=GET path=/v1/job/testjob/deployment?index=226205 error="rpc error: EOF" code=500
http: request failed: method=GET path=/v1/job/testjob/evaluations?index=226204 error="rpc error: EOF" code=500
http: request failed: method=GET path=/v1/job/testjob/deployment?index=226205 error="rpc error: EOF" code=500
http: request failed: method=GET path=/v1/job/testjob/evaluations?index=226204 error="rpc error: EOF" code=500
http: request failed: method=GET path=/v1/job/testjob/summary?index=226056 error="rpc error: EOF" code=500
http: request failed: method=GET path=/v1/job/testjob/summary?index=226056 error="rpc error: EOF" code=500
http: request failed: method=GET path=/v1/job/testjob/allocations?index=226209 error="rpc error: EOF" code=500
http: request failed: method=GET path=/v1/job/testjob/evaluations?index=226204 error="rpc error: EOF" code=500
http: request failed: method=GET path=/v1/job/testjob?index=226202 error="rpc error: EOF" code=500
During the failing attempts there was no log output from the job, though I would expect it, so it looks like it could be related to the logging, as you mentioned @notnoop.
Nomad version 0.11.3.
Nomad version
Nomad v0.9.3 (c5e8b66c3789e4e7f9a83b4e188e9a937eea43ce)
Operating system and Environment details
RHEL 7.5
Issue
At some point all allocations placed on a node remain in pending state.
Reproduction steps
I'm not able to reproduce this behaviour, but it happened several times on different nodes of our cluster once we updated from 0.8.4 to 0.9.3.
I found out that this happens after GC runs when allocations reach gc_max_allocs, but it doesn't happen every time.
Nomad Client logs (if appropriate)
As you can see in the logs, the allocation with ID e6ee3260-a1c9-c283-ad17-d1cb6950774d doesn't get garbage collected. That allocation shows the following (note that times in the log are UTC and on my PC are GMT+2):
The last event is when I restarted the nomad service on the instance. It seems like nomad is trying to garbage collect an already dead allocation.
As this happened to us several times before, I enabled the profiler on the node to get some data. I saw the following (taken at Thu Oct 10 09:00:00 CEST 2019):
The time of the first one (595 minutes) matches the time the first pending allocation was placed on the node (Oct 9 21:05:00). The other tracks back to this moment:
It seems that the allocation ec8a075b-7946-7fee-303f-a02fdf53ae1a doesn't get correctly garbage collected either. It seems that when the node reaches gc_max_allocs and the gc_parallel_destroys workers are blocked, all allocations placed on that node remain in the pending state until the nomad service is restarted. At least that is what I feel is happening here.
Weirdly, I thought that pending allocations would be blocking here: https://github.com/hashicorp/nomad/blob/v0.9.3/client/gc.go#L180
But no goroutines appeared there, so they must be waiting someplace else...