apenney closed this issue 6 years ago
Would you be able to test Nomad 0.7.1-rc1 that was just released Monday? Link to Linux amd64 binary: https://releases.hashicorp.com/nomad/0.7.1-rc1/nomad_0.7.1-rc1_linux_amd64.zip Release notes: https://groups.google.com/d/msg/nomad-tool/-aUbGM2_ou0/FMawwLq1AgAJ
I believe #3445 fixes your issue.
@apenney Can you hit the GC endpoint:
curl -XPUT http://127.0.0.1:4646/v1/system/gc
and collect the logs from all the servers. Also grab the allocations for one of the affected jobs:
curl http://127.0.0.1:4646/v1/job/<job>/allocations
0.7.1 doesn't seem to have helped @schmichael, I'm still in the same situation.
1: Most of them are dead:
root@ip-10-30-1-43:/var/nomad/alloc# nomad job status | grep dead | wc -l
167
2: Here's the logs from the 0.7.1-rc1 box:
That job:
uniporter-uat/periodic-1511175600 batch 50 dead (stopped) 11/20/17 11:00:00 UTC
I managed to run nomad stop -purge $job on all these jobs, and they no longer exist:
curl http://10.30.1.43:4646/v1/job/uniporter-uat/periodic-1511953200
job not found
root@ip-10-30-1-43:/var/nomad# curl http://10.30.1.43:4646/v1/job/uniporter-uat/periodic-1511953200/allocations
[{"ID":"3b81f235-c6c2-0a04-9dca-28e9652fa43d","EvalID":"cafede1e-575e-b099-22ce-e6cb837acdb8","Name":"uniporter-uat/periodic-1511953200.ibm-wfo[0]","NodeID":"7cd5a435-ae09-ab0a-1f90-c4228be55cce","JobID":"uniporter-uat/periodic-1511953200","JobVersion":0,"TaskGroup":"ibm-wfo","DesiredStatus":"stop","DesiredDescription":"alloc not needed due to job update","ClientStatus":"complete","ClientDescription":"","TaskStates":{"jtcc":{"State":"dead","Failed":false,"Restarts":23,"LastRestart":"2017-12-13T19:52:37.06046043Z","StartedAt":"0001-01-01T00:00:00Z","FinishedAt":"0001-01-01T00:00:00Z","Events":[{"Type":"Driver","Time":1513194722835388017,"Message":"","DisplayMessage":"Downloading image cotalabs/uniporter:0.1.1-SNAPSHOT","Details":null,"FailsTask":false,"RestartReason":"","SetupError":"","DriverError":"","ExitCode":0,"Signal":0,"KillTimeout":0,"KillError":"","KillReason":"","StartDelay":0,"DownloadError":"","ValidationError":"","DiskLimit":0,"FailedSibling":"","VaultError":"","TaskSignalReason":"","TaskSignal":"","DriverMessage":"Downloading image cotalabs/uniporter:0.1.1-SNAPSHOT","GenericSource":""},{"Type":"Driver Failure","Time":1513194722918026138,"Message":"","DisplayMessage":"failed to initialize task \"jtcc\" for alloc \"3b81f235-c6c2-0a04-9dca-28e9652fa43d\": Failed to pull `cotalabs/uniporter:0.1.1-SNAPSHOT`: API error (404): {\"message\":\"manifest for cotalabs/uniporter:0.1.1-SNAPSHOT not found\"}\n","Details":null,"FailsTask":false,"RestartReason":"","SetupError":"","DriverError":"failed to initialize task \"jtcc\" for alloc \"3b81f235-c6c2-0a04-9dca-28e9652fa43d\": Failed to pull `cotalabs/uniporter:0.1.1-SNAPSHOT`: API error (404): {\"message\":\"manifest for cotalabs/uniporter:0.1.1-SNAPSHOT not 
found\"}\n","ExitCode":0,"Signal":0,"KillTimeout":0,"KillError":"","KillReason":"","StartDelay":0,"DownloadError":"","ValidationError":"","DiskLimit":0,"FailedSibling":"","VaultError":"","TaskSignalReason":"","TaskSignal":"","DriverMessage":"","GenericSource":""},{"Type":"Restarting","Time":1513194722918074121,"Message":"","DisplayMessage":"Task restarting in 15.960969333s","Details":null,"FailsTask":false,"RestartReason":"Restart within policy","SetupError":"","DriverError":"","ExitCode":0,"Signal":0,"KillTimeout":0,"KillError":"","KillReason":"","StartDelay":15960969333,"DownloadError":"","ValidationError":"","DiskLimit":0,"FailedSibling":"","VaultError":"","TaskSignalReason":"","TaskSignal":"","DriverMessage":"","GenericSource":""},{"Type":"Driver","Time":1513194738881431161,"Message":"","DisplayMessage":"Downloading image cotalabs/uniporter:0.1.1-SNAPSHOT","Details":null,"FailsTask":false,"RestartReason":"","SetupError":"","DriverError":"","ExitCode":0,"Signal":0,"KillTimeout":0,"KillError":"","KillReason":"","StartDelay":0,"DownloadError":"","ValidationError":"","DiskLimit":0,"FailedSibling":"","VaultError":"","TaskSignalReason":"","TaskSignal":"","DriverMessage":"Downloading image cotalabs/uniporter:0.1.1-SNAPSHOT","GenericSource":""},{"Type":"Driver Failure","Time":1513194738962467969,"Message":"","DisplayMessage":"failed to initialize task \"jtcc\" for alloc \"3b81f235-c6c2-0a04-9dca-28e9652fa43d\": Failed to pull `cotalabs/uniporter:0.1.1-SNAPSHOT`: API error (404): {\"message\":\"manifest for cotalabs/uniporter:0.1.1-SNAPSHOT not found\"}\n","Details":null,"FailsTask":false,"RestartReason":"","SetupError":"","DriverError":"failed to initialize task \"jtcc\" for alloc \"3b81f235-c6c2-0a04-9dca-28e9652fa43d\": Failed to pull `cotalabs/uniporter:0.1.1-SNAPSHOT`: API error (404): {\"message\":\"manifest for cotalabs/uniporter:0.1.1-SNAPSHOT not 
found\"}\n","ExitCode":0,"Signal":0,"KillTimeout":0,"KillError":"","KillReason":"","StartDelay":0,"DownloadError":"","ValidationError":"","DiskLimit":0,"FailedSibling":"","VaultError":"","TaskSignalReason":"","TaskSignal":"","DriverMessage":"","GenericSource":""},{"Type":"Restarting","Time":1513194738962563682,"Message":"","DisplayMessage":"Task restarting in 17.911645299s","Details":null,"FailsTask":false,"RestartReason":"Restart within policy","SetupError":"","DriverError":"","ExitCode":0,"Signal":0,"KillTimeout":0,"KillError":"","KillReason":"","StartDelay":17911645299,"DownloadError":"","ValidationError":"","DiskLimit":0,"FailedSibling":"","VaultError":"","TaskSignalReason":"","TaskSignal":"","DriverMessage":"","GenericSource":""},{"Type":"Driver","Time":1513194756879242992,"Message":"","DisplayMessage":"Downloading image cotalabs/uniporter:0.1.1-SNAPSHOT","Details":null,"FailsTask":false,"RestartReason":"","SetupError":"","DriverError":"","ExitCode":0,"Signal":0,"KillTimeout":0,"KillError":"","KillReason":"","StartDelay":0,"DownloadError":"","ValidationError":"","DiskLimit":0,"FailedSibling":"","VaultError":"","TaskSignalReason":"","TaskSignal":"","DriverMessage":"Downloading image cotalabs/uniporter:0.1.1-SNAPSHOT","GenericSource":""},{"Type":"Driver Failure","Time":1513194757059836499,"Message":"","DisplayMessage":"failed to initialize task \"jtcc\" for alloc \"3b81f235-c6c2-0a04-9dca-28e9652fa43d\": Failed to pull `cotalabs/uniporter:0.1.1-SNAPSHOT`: API error (404): {\"message\":\"manifest for cotalabs/uniporter:0.1.1-SNAPSHOT not found\"}\n","Details":null,"FailsTask":false,"RestartReason":"","SetupError":"","DriverError":"failed to initialize task \"jtcc\" for alloc \"3b81f235-c6c2-0a04-9dca-28e9652fa43d\": Failed to pull `cotalabs/uniporter:0.1.1-SNAPSHOT`: API error (404): {\"message\":\"manifest for cotalabs/uniporter:0.1.1-SNAPSHOT not 
found\"}\n","ExitCode":0,"Signal":0,"KillTimeout":0,"KillError":"","KillReason":"","StartDelay":0,"DownloadError":"","ValidationError":"","DiskLimit":0,"FailedSibling":"","VaultError":"","TaskSignalReason":"","TaskSignal":"","DriverMessage":"","GenericSource":""},{"Type":"Restarting","Time":1513194757060460430,"Message":"","DisplayMessage":"Task restarting in 15.453403262s","Details":null,"FailsTask":false,"RestartReason":"Restart within policy","SetupError":"","DriverError":"","ExitCode":0,"Signal":0,"KillTimeout":0,"KillError":"","KillReason":"","StartDelay":15453403262,"DownloadError":"","ValidationError":"","DiskLimit":0,"FailedSibling":"","VaultError":"","TaskSignalReason":"","TaskSignal":"","DriverMessage":"","GenericSource":""},{"Type":"Killed","Time":1513194765024101601,"Message":"","DisplayMessage":"Task successfully killed","Details":null,"FailsTask":false,"RestartReason":"","SetupError":"","DriverError":"","ExitCode":0,"Signal":0,"KillTimeout":0,"KillError":"","KillReason":"","StartDelay":0,"DownloadError":"","ValidationError":"","DiskLimit":0,"FailedSibling":"","VaultError":"","TaskSignalReason":"","TaskSignal":"","DriverMessage":"","GenericSource":""}]}},"DeploymentStatus":null,"CreateIndex":607390,"ModifyIndex":607806,"CreateTime":1513194279915270907,"ModifyTime":0},{"ID":"8c83f1cb-8589-7773-b88a-82f63ad1caf9","EvalID":"d2258389-43f3-8ab7-e738-0e5b57e1b2c6","Name":"uniporter-uat/periodic-1511953200.ibm-wfo[0]","NodeID":"7cdec421-166e-04da-07e1-2b37858357a3","JobID":"uniporter-uat/periodic-1511953200","JobVersion":0,"TaskGroup":"ibm-wfo","DesiredStatus":"run","DesiredDescription":"","ClientStatus":"failed","ClientDescription":"","TaskStates":{"jtcc":{"State":"dead","Failed":true,"Restarts":0,"LastRestart":"0001-01-01T00:00:00Z","StartedAt":"0001-01-01T00:00:00Z","FinishedAt":"0001-01-01T00:00:00Z","Events":[{"Type":"Received","Time":1513115825621384244,"Message":"","DisplayMessage":"Task received by 
client","Details":null,"FailsTask":false,"RestartReason":"","SetupError":"","DriverError":"","ExitCode":0,"Signal":0,"KillTimeout":0,"KillError":"","KillReason":"","StartDelay":0,"DownloadError":"","ValidationError":"","DiskLimit":0,"FailedSibling":"","VaultError":"","TaskSignalReason":"","TaskSignal":"","DriverMessage":"","GenericSource":""},{"Type":"Task Setup","Time":1513115825621496358,"Message":"Building Task Directory","DisplayMessage":"Building Task Directory","Details":null,"FailsTask":false,"RestartReason":"","SetupError":"","DriverError":"","ExitCode":0,"Signal":0,"KillTimeout":0,"KillError":"","KillReason":"","StartDelay":0,"DownloadError":"","ValidationError":"","DiskLimit":0,"FailedSibling":"","VaultError":"","TaskSignalReason":"","TaskSignal":"","DriverMessage":"","GenericSource":""},{"Type":"Killing","Time":1513115825636789224,"Message":"","DisplayMessage":"Killing task: vault: failed to derive token: failed to create token for task \"jtcc\" on alloc \"8c83f1cb-8589-7773-b88a-82f63ad1caf9\": Connection to Vault failed: failed to lookup Vault periodic token: Error making API request.\n\nURL: POST http://vault.service.consul:8200/v1/auth/token/lookup\nCode: 403. Errors:\n\n* permission denied","Details":null,"FailsTask":true,"RestartReason":"","SetupError":"","DriverError":"","ExitCode":0,"Signal":0,"KillTimeout":0,"KillError":"","KillReason":"vault: failed to derive token: failed to create token for task \"jtcc\" on alloc \"8c83f1cb-8589-7773-b88a-82f63ad1caf9\": Connection to Vault failed: failed to lookup Vault periodic token: Error making API request.\n\nURL: POST http://vault.service.consul:8200/v1/auth/token/lookup\nCode: 403. 
Errors:\n\n* permission denied","StartDelay":0,"DownloadError":"","ValidationError":"","DiskLimit":0,"FailedSibling":"","VaultError":"","TaskSignalReason":"","TaskSignal":"","DriverMessage":"","GenericSource":""}]}},"DeploymentStatus":null,"CreateIndex":595105,"ModifyIndex":595302,"CreateTime":1513115818001542460,"ModifyTime":0},{"ID":"94936d21-87ec-68da-ac40-31fc7d8f74fd","EvalID":"5039797e-5886-cef2-b7b0-3c85efec7108","Name":"uniporter-uat/periodic-1511953200.ibm-wfo[0]","NodeID":"7cd5a435-ae09-ab0a-1f90-c4228be55cce","JobID":"uniporter-uat/periodic-1511953200","JobVersion":0,"TaskGroup":"ibm-wfo","DesiredStatus":"run","DesiredDescription":"","ClientStatus":"failed","ClientDescription":"","TaskStates":{"jtcc":{"State":"dead","Failed":true,"Restarts":0,"LastRestart":"0001-01-01T00:00:00Z","StartedAt":"0001-01-01T00:00:00Z","FinishedAt":"0001-01-01T00:00:00Z","Events":[{"Type":"Received","Time":1513112912870960502,"Message":"","DisplayMessage":"Task received by client","Details":null,"FailsTask":false,"RestartReason":"","SetupError":"","DriverError":"","ExitCode":0,"Signal":0,"KillTimeout":0,"KillError":"","KillReason":"","StartDelay":0,"DownloadError":"","ValidationError":"","DiskLimit":0,"FailedSibling":"","VaultError":"","TaskSignalReason":"","TaskSignal":"","DriverMessage":"","GenericSource":""},{"Type":"Task Setup","Time":1513112912871048881,"Message":"Building Task Directory","DisplayMessage":"Building Task Directory","Details":null,"FailsTask":false,"RestartReason":"","SetupError":"","DriverError":"","ExitCode":0,"Signal":0,"KillTimeout":0,"KillError":"","KillReason":"","StartDelay":0,"DownloadError":"","ValidationError":"","DiskLimit":0,"FailedSibling":"","VaultError":"","TaskSignalReason":"","TaskSignal":"","DriverMessage":"","GenericSource":""},{"Type":"Killing","Time":1513112912901171716,"Message":"","DisplayMessage":"Killing task: vault: failed to derive token: failed to create token for task \"jtcc\" on alloc \"94936d21-87ec-68da-ac40-31fc7d8f74fd\": 
Connection to Vault failed: failed to lookup Vault periodic token: Error making API request.\n\nURL: POST http://vault.service.consul:8200/v1/auth/token/lookup\nCode: 403. Errors:\n\n* permission denied","Details":null,"FailsTask":true,"RestartReason":"","SetupError":"","DriverError":"","ExitCode":0,"Signal":0,"KillTimeout":0,"KillError":"","KillReason":"vault: failed to derive token: failed to create token for task \"jtcc\" on alloc \"94936d21-87ec-68da-ac40-31fc7d8f74fd\": Connection to Vault failed: failed to lookup Vault periodic token: Error making API request.\n\nURL: POST http://vault.service.consul:8200/v1/auth/token/lookup\nCode: 403. Errors:\n\n* permission denied","StartDelay":0,"DownloadError":"","ValidationError":"","DiskLimit":0,"FailedSibling":"","VaultError":"","TaskSignalReason":"","TaskSignal":"","DriverMessage":"","GenericSource":""}]}},"DeploymentStatus":null,"CreateIndex":592027,"ModifyIndex":607786,"CreateTime":1513112903271182797,"ModifyTime":0},{"ID":"b73bba18-671e-5f9d-4217-1c9b2b22f723","EvalID":"9d143393-2a24-5f75-7517-7146997074e2","Name":"uniporter-uat/periodic-1511953200.ibm-wfo[0]","NodeID":"7cd5a435-ae09-ab0a-1f90-c4228be55cce","JobID":"uniporter-uat/periodic-1511953200","JobVersion":0,"TaskGroup":"ibm-wfo","DesiredStatus":"run","DesiredDescription":"","ClientStatus":"failed","ClientDescription":"","TaskStates":{"jtcc":{"State":"dead","Failed":true,"Restarts":0,"LastRestart":"0001-01-01T00:00:00Z","StartedAt":"0001-01-01T00:00:00Z","FinishedAt":"0001-01-01T00:00:00Z","Events":[{"Type":"Received","Time":1513115803904081416,"Message":"","DisplayMessage":"Task received by client","Details":null,"FailsTask":false,"RestartReason":"","SetupError":"","DriverError":"","ExitCode":0,"Signal":0,"KillTimeout":0,"KillError":"","KillReason":"","StartDelay":0,"DownloadError":"","ValidationError":"","DiskLimit":0,"FailedSibling":"","VaultError":"","TaskSignalReason":"","TaskSignal":"","DriverMessage":"","GenericSource":""},{"Type":"Task 
Setup","Time":1513115803904151831,"Message":"Building Task Directory","DisplayMessage":"Building Task Directory","Details":null,"FailsTask":false,"RestartReason":"","SetupError":"","DriverError":"","ExitCode":0,"Signal":0,"KillTimeout":0,"KillError":"","KillReason":"","StartDelay":0,"DownloadError":"","ValidationError":"","DiskLimit":0,"FailedSibling":"","VaultError":"","TaskSignalReason":"","TaskSignal":"","DriverMessage":"","GenericSource":""},{"Type":"Killing","Time":1513115803931710036,"Message":"","DisplayMessage":"Killing task: vault: failed to derive token: failed to create token for task \"jtcc\" on alloc \"b73bba18-671e-5f9d-4217-1c9b2b22f723\": Connection to Vault failed: failed to lookup Vault periodic token: Error making API request.\n\nURL: POST http://vault.service.consul:8200/v1/auth/token/lookup\nCode: 403. Errors:\n\n* permission denied","Details":null,"FailsTask":true,"RestartReason":"","SetupError":"","DriverError":"","ExitCode":0,"Signal":0,"KillTimeout":0,"KillError":"","KillReason":"vault: failed to derive token: failed to create token for task \"jtcc\" on alloc \"b73bba18-671e-5f9d-4217-1c9b2b22f723\": Connection to Vault failed: failed to lookup Vault periodic token: Error making API request.\n\nURL: POST http://vault.service.consul:8200/v1/auth/token/lookup\nCode: 403. 
Errors:\n\n* permission denied","StartDelay":0,"DownloadError":"","ValidationError":"","DiskLimit":0,"FailedSibling":"","VaultError":"","TaskSignalReason":"","TaskSignal":"","DriverMessage":"","GenericSource":""}]}},"DeploymentStatus":null,"CreateIndex":594855,"ModifyIndex":607786,"CreateTime":1513115803876082994,"ModifyTime":0},{"ID":"c7d1093e-ca1b-f385-0502-34d67cf7b9d3","EvalID":"1b9981c7-fcbf-ea3a-1bf9-ded95897aadc","Name":"uniporter-uat/periodic-1511953200.ibm-wfo[0]","NodeID":"7cdec421-166e-04da-07e1-2b37858357a3","JobID":"uniporter-uat/periodic-1511953200","JobVersion":0,"TaskGroup":"ibm-wfo","DesiredStatus":"run","DesiredDescription":"","ClientStatus":"failed","ClientDescription":"","TaskStates":{"jtcc":{"State":"dead","Failed":true,"Restarts":0,"LastRestart":"0001-01-01T00:00:00Z","StartedAt":"0001-01-01T00:00:00Z","FinishedAt":"0001-01-01T00:00:00Z","Events":[{"Type":"Received","Time":1513115398637824126,"Message":"","DisplayMessage":"Task received by client","Details":null,"FailsTask":false,"RestartReason":"","SetupError":"","DriverError":"","ExitCode":0,"Signal":0,"KillTimeout":0,"KillError":"","KillReason":"","StartDelay":0,"DownloadError":"","ValidationError":"","DiskLimit":0,"FailedSibling":"","VaultError":"","TaskSignalReason":"","TaskSignal":"","DriverMessage":"","GenericSource":""},{"Type":"Task Setup","Time":1513115398637986165,"Message":"Building Task Directory","DisplayMessage":"Building Task Directory","Details":null,"FailsTask":false,"RestartReason":"","SetupError":"","DriverError":"","ExitCode":0,"Signal":0,"KillTimeout":0,"KillError":"","KillReason":"","StartDelay":0,"DownloadError":"","ValidationError":"","DiskLimit":0,"FailedSibling":"","VaultError":"","TaskSignalReason":"","TaskSignal":"","DriverMessage":"","GenericSource":""},{"Type":"Killing","Time":1513115398653332452,"Message":"","DisplayMessage":"Killing task: vault: failed to derive token: failed to create token for task \"jtcc\" on alloc \"c7d1093e-ca1b-f385-0502-34d67cf7b9d3\": 
Connection to Vault failed: failed to lookup Vault periodic token: Error making API request.\n\nURL: POST http://vault.service.consul:8200/v1/auth/token/lookup\nCode: 403. Errors:\n\n* permission denied","Details":null,"FailsTask":true,"RestartReason":"","SetupError":"","DriverError":"","ExitCode":0,"Signal":0,"KillTimeout":0,"KillError":"","KillReason":"vault: failed to derive token: failed to create token for task \"jtcc\" on alloc \"c7d1093e-ca1b-f385-0502-34d67cf7b9d3\": Connection to Vault failed: failed to lookup Vault periodic token: Error making API request.\n\nURL: POST http://vault.service.consul:8200/v1/auth/token/lookup\nCode: 403. Errors:\n\n* permission denied","StartDelay":0,"DownloadError":"","ValidationError":"","DiskLimit":0,"FailedSibling":"","VaultError":"","TaskSignalReason":"","TaskSignal":"","DriverMessage":"","GenericSource":""}]}},"DeploymentStatus":null,"CreateIndex":594082,"ModifyIndex":594373,"CreateTime":1513115383337841929,"ModifyTime":0},{"ID":"d16d4240-ee9a-9bea-4eac-23a032b75336","EvalID":"7fbe95da-90eb-f40d-cdbb-7ab7ca716a4d","Name":"uniporter-uat/periodic-1511953200.ibm-wfo[0]","NodeID":"7cd5a435-ae09-ab0a-1f90-c4228be55cce","JobID":"uniporter-uat/periodic-1511953200","JobVersion":0,"TaskGroup":"ibm-wfo","DesiredStatus":"run","DesiredDescription":"","ClientStatus":"failed","ClientDescription":"","TaskStates":{"jtcc":{"State":"dead","Failed":true,"Restarts":0,"LastRestart":"0001-01-01T00:00:00Z","StartedAt":"0001-01-01T00:00:00Z","FinishedAt":"0001-01-01T00:00:00Z","Events":[{"Type":"Received","Time":1513112735189272618,"Message":"","DisplayMessage":"Task received by client","Details":null,"FailsTask":false,"RestartReason":"","SetupError":"","DriverError":"","ExitCode":0,"Signal":0,"KillTimeout":0,"KillError":"","KillReason":"","StartDelay":0,"DownloadError":"","ValidationError":"","DiskLimit":0,"FailedSibling":"","VaultError":"","TaskSignalReason":"","TaskSignal":"","DriverMessage":"","GenericSource":""},{"Type":"Task 
Setup","Time":1513112735189349266,"Message":"Building Task Directory","DisplayMessage":"Building Task Directory","Details":null,"FailsTask":false,"RestartReason":"","SetupError":"","DriverError":"","ExitCode":0,"Signal":0,"KillTimeout":0,"KillError":"","KillReason":"","StartDelay":0,"DownloadError":"","ValidationError":"","DiskLimit":0,"FailedSibling":"","VaultError":"","TaskSignalReason":"","TaskSignal":"","DriverMessage":"","GenericSource":""},{"Type":"Killing","Time":1513112735204538511,"Message":"","DisplayMessage":"Killing task: vault: failed to derive token: failed to create token for task \"jtcc\" on alloc \"d16d4240-ee9a-9bea-4eac-23a032b75336\": Connection to Vault failed: failed to lookup Vault periodic token: Error making API request.\n\nURL: POST http://vault.service.consul:8200/v1/auth/token/lookup\nCode: 403. Errors:\n\n* permission denied","Details":null,"FailsTask":true,"RestartReason":"","SetupError":"","DriverError":"","ExitCode":0,"Signal":0,"KillTimeout":0,"KillError":"","KillReason":"vault: failed to derive token: failed to create token for task \"jtcc\" on alloc \"d16d4240-ee9a-9bea-4eac-23a032b75336\": Connection to Vault failed: failed to lookup Vault periodic token: Error making API request.\n\nURL: POST http://vault.service.consul:8200/v1/auth/token/lookup\nCode: 403. Errors:\n\n* permission denied","StartDelay":0,"DownloadError":"","ValidationError":"","DiskLimit":0,"FailedSibling":"","VaultError":"","TaskSignalReason":"","TaskSignal":"","DriverMessage":"","GenericSource":""}]}},"DeploymentStatus":null,"CreateIndex":591541,"ModifyIndex":607786,"CreateTime":1513112730585011087,"ModifyTime":0}]
Then the allocs still can't gc:
Dec 13 19:59:34 ip-10-30-1-43 nomad[28460]: client: allocs: (added 0) (removed 0) (updated 4) (ignore 265)
Dec 13 19:59:34 ip-10-30-1-43 nomad[28460]: client: dropping update to terminal alloc '3b81f235-c6c2-0a04-9dca-28e9652fa43d'
Dec 13 19:59:34 ip-10-30-1-43 nomad[28460]: client: dropping update to terminal alloc '0b1e29f2-976e-1c85-2ab5-0689915a6244'
Dec 13 19:59:34 ip-10-30-1-43 nomad[28460]: client: dropping update to terminal alloc '7180fee2-c804-7484-ad90-5bb6f8727039'
Dec 13 19:59:34 ip-10-30-1-43 nomad[28460]: client: dropping update to terminal alloc '44ae7315-243e-cfdb-3f34-d95b277a9565'
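As an aside, the stop-purge pass over all the dead jobs mentioned earlier can be scripted. A rough dry-run sketch is below; the sample `nomad job status` lines are made up for illustration, and against a real cluster you would pipe the live command output instead of the inlined sample:

```shell
#!/bin/sh
# Dry run: select dead jobs from captured `nomad job status` output and
# print the purge command for each. Remove the surrounding quoting of the
# sample and the final `echo`-style print to act on a live cluster.
sample='uniporter-uat/periodic-1511175600  batch    50  dead (stopped)
uniporter-uat/periodic-1511953200  batch    50  dead (stopped)
some-service                       service  10  running'

# Field 4 of the status table is the job status; print a purge command
# for every job whose status is "dead".
printf '%s\n' "$sample" | awk '$4 == "dead" { print "nomad stop -purge " $1 }'
```

Piping the real output (`nomad job status | awk ...`) into `sh` would then execute the purges.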
We ended up bringing down all masters, wiping all raft/client.db, and starting from scratch. GC'ing was broken across all nodes, and it looks like it had been broken since the 0.5->0.6 upgrade.
@apenney Unfortunately the logs you gave are client logs not the server logs. So there really isn't enough information to try to track this down.
I am going to close this until we do get more information or you can provide repro steps. If it does happen again, following the steps I outlined in my response should provide the information we need.
Not sure if I can reopen this but here are the server logs during a gc, as this issue is now reoccurring. Server3 is the master in the below paste.
Symptoms, just to recap: nodes fill up with allocs and can't GC. I have ~3000 jobs when I should have about 20; most of those are "dead" batch jobs that never go away. On the servers I can see Nomad constantly trying to allocate and failing to get Vault tokens for jobs that are dead and shouldn't be doing anything.
Server1:
Server2:
Server3:
I only have debug enabled on the first two, sorry. Probably makes it easier to read at least.
@dadgar direct ping as you closed it, you may be able to open it :)
Added this as a txt file because it's too big, but this is the allocs on a single master node, all unable to GC. You asked for this on #3604 where two other people are experiencing this.
I have a similar (though not identical) issue about nodes not collecting garbage correctly.
I am using Nomad 0.7.1/Consul 1.0.2/CentOS 7.4+/Docker CE.
My issue is that "/var/lib/docker" bloats up over time (I have image cleanup purposely set to 'false').
This fills up the disk; Nomad still reports the machine "ready", but no allocations go to the machine as the disk capacity is exhausted.
I don't have the logs with me right now, but it keeps emitting a message along the following lines:
disk usage of 98 more than the threshold of 80 but not triggering a gc as no allocations are terminal
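The message above implies a simple threshold check that is skipped when no allocations on the node are terminal. A minimal sketch of that logic follows; the numbers mirror the log line and the variables are illustrative assumptions, not Nomad's actual implementation:

```shell
#!/bin/sh
# Illustrative sketch of the client-side disk GC check implied by the log
# message. All values here are hard-coded assumptions for demonstration.
used=98             # percent used, e.g. parsed from: df /var/lib/docker
threshold=80        # the client's disk-usage GC threshold
terminal_allocs=0   # assumption: no allocations on the node are terminal

if [ "$used" -gt "$threshold" ] && [ "$terminal_allocs" -eq 0 ]; then
    # Over the threshold, but nothing is eligible to collect:
    echo "disk usage of $used more than the threshold of $threshold but not triggering a gc as no allocations are terminal"
fi
```

Which is exactly the dead end being described: the node is over threshold, but client GC has nothing terminal to reclaim.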
So I have a node which is basically unusable but not being indicated as such by Nomad.
I know that '/var/lib/docker' bloating up is not really Nomad's problem, but just wanted to put this out there.
Regards, Shantanu
@apenney Thanks for the additional logs! In the future please attach them as text files or link a gist to make browsing this issue easier. We'll almost always want to download them to search/grep/etc.
@shantanugadgil As far as we can tell right now @apenney's issue is related to the servers GCing dead jobs. Your issue is specific to clients. Ensure you have not set docker.cleanup.image=true in your configs, and reopen #3560 if you think it's the same issue. There may be something we're missing and closed too soon!
@apenney Hey I am sorry but I just can't reproduce this. I will reopen when there are clear reproduction steps. The reproduction steps will have to include the configuration and how to get to this state from a fresh cluster.
I have even tested by creating a periodic job that fails in the same fashion as your attached allocations but as soon as I hit the GC endpoint, the jobs and allocations get removed.
@apenney And if you post logs, can you please post the Nomad server logs and not the client's. After you do the curl -XPUT .../v1/system/gc you should see logs like:
2018/01/16 23:10:11.956340 [DEBUG] http: Request /v1/system/gc (71.546µs)
2018/01/16 23:10:11.956464 [DEBUG] worker: dequeued evaluation fd8b7e4c-ca04-0781-91f9-b3b847638f8d
2018/01/16 23:10:11.956494 [DEBUG] sched.core: forced job GC
2018/01/16 23:10:11.956653 [DEBUG] sched.core: job GC: 13 jobs, 13 evaluations, 13 allocs eligible
2018/01/16 23:10:11.962373 [DEBUG] sched.core: forced eval GC
2018/01/16 23:10:11.962441 [DEBUG] sched.core: forced deployment GC
2018/01/16 23:10:11.962450 [DEBUG] sched.core: forced node GC
This should be logged at DEBUG level regardless of whether anything gets garbage collected. So if you don't see that, something is odd with your setup.
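One quick way to check for those lines after forcing a GC is to grep the server log. The sketch below runs against an inlined sample rather than a real log file, since the log location varies by setup:

```shell
#!/bin/sh
# After `curl -XPUT .../v1/system/gc`, look for the forced-GC lines in the
# server log. The sample below is inlined for illustration; substitute
# your actual server log file.
log='2018/01/16 23:10:11.956340 [DEBUG] http: Request /v1/system/gc (71.546us)
2018/01/16 23:10:11.956494 [DEBUG] sched.core: forced job GC
2018/01/16 23:10:11.962373 [DEBUG] sched.core: forced eval GC'

if printf '%s\n' "$log" | grep -q 'sched.core: forced job GC'; then
    echo "core scheduler is running GC"
else
    echo "no forced GC lines: check log level and enabled_schedulers"
fi
```

If the GC request line appears but none of the sched.core lines do, that points at the core scheduler never running, which is exactly the failure mode discussed later in this thread.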
@dadgar thanks to the wonderful @jippi we've fixed this!
On our server nodes we had:
enabled_schedulers = ["service","batch","system"]
num_schedulers = 2
Removing these two lines immediately fixed the issue, with it starting to delete thousands of piled up batch jobs. We don't really understand -why-, unless it was starved by a low num_schedulers stopping it ever getting to gc?
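For context, a sketch of where those settings sat in our config; the surrounding server stanza and enabled flag are assumed here, not quoted from our actual file:

```hcl
server {
  enabled = true

  # Removing these two lines restored garbage collection:
  enabled_schedulers = ["service", "batch", "system"]
  num_schedulers     = 2
}
```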
I just wanted to let you know, and let the world know, in case someone else stumbles over this bug report.
@apenney glad this was fixed for you.
And you guessed right; see the docs for num_schedulers. Setting it to 2 severely reduced the Nomad servers' ability to run scheduler workers in parallel. Garbage collection is done as an internal scheduled job as well, so that explains why.
Had the exact same problem, running only 1 scheduler, on 1vCPU test nodes. After removing both lines mentioned above, gc started immediately. After putting both config lines back, GC still works.
On 1vCPU, the default is still 1 scheduler, so it doesn't seem to be num_schedulers that's creating the issue (or not alone).
My test setup was updated in-place/with rolling redeploys from 0.5.6 and went through almost all versions up to 0.7.1.
Haven't restarted all my nodes yet, but I'll report back here if I'm able to reproduce the issue after restarting all nodes again.
@dadgar @preetapan I was able to reproduce the problem again in my setup by performing a rolling re-deploy. Removing the enabled_schedulers setting on one of the nodes and restarting Nomad fixed the issue again.
When the issue is active, server logs (in debug mode) show the ../system/gc call coming in, but there are no other log entries about GC.
I think I found the issue: the default value for enabled_schedulers adds the _core scheduler type, but if a custom value is provided, the _core scheduler type is never added to the list by the config parser.
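If that's the bug, one possible (untested) workaround sketch would be to list _core explicitly when overriding the schedulers, assuming the config parser accepts it in the list:

```hcl
server {
  enabled = true

  # Hypothetical workaround: include Nomad's internal _core scheduler
  # explicitly when providing a custom enabled_schedulers value.
  enabled_schedulers = ["service", "batch", "system", "_core"]
}
```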
@groggemans Great find. This was definitely not intended and we will get this fixed up for 0.8
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.
Nomad version
Nomad v0.7.0
Operating system and Environment details
Ubuntu 16.04 consul 0.9.2/0.9.3
Issue
Nodes won't GC old jobs:
I see submits back as far as September. I've enabled debug logging, and my client settings are:
I see zero log lines matching 'garbage', which is what I expect to see based on gc.go.
Reproduction steps
This one is hard; I don't know why the GC gets skipped/ignored, so I'm not sure what to say about reproduction.
Nomad Server logs (if appropriate)
Nomad Client logs (if appropriate)
I deleted all the lines matching 'secret' and snipped out some company-name stuff. Hopefully this shows the lack of garbage/gc though.
Job file (if appropriate)