[Open] prologic opened this issue 5 years ago
Full nomad alloc status logs:
Recent Events:
Time Type Description
2019-03-22T01:13:27Z Killing Template failed to send signals [user defined signal 1]: 1 error(s) occurred:
* Task not running
2019-03-22T01:13:23Z Restarting Task restarting in 31.906118053s
2019-03-22T01:13:23Z Terminated Exit Code: 0
2019-03-22T01:13:23Z Signaling Template re-rendered
2019-03-22T01:12:32Z Started Task started by client
2019-03-22T01:12:31Z Downloading Artifacts Client is downloading artifacts
2019-03-22T01:12:31Z Task Setup Building Task Directory
2019-03-22T01:12:31Z Received Task received by client
The overall state also seems wrong; the UI is showing "Running" but all task groups are in a "failed" state and not going anywhere.
In fact the Job and its tasks just continue to fail even after a restart:
ID = 2dd17b65
Eval ID = 61ea2f42
Name = haproxy.haproxy[0]
Node ID = 37cb9dd1
Job ID = haproxy
Job Version = 2
Client Status = failed
Client Description = Failed tasks
Desired Status = run
Desired Description = <none>
Created = 3m33s ago
Modified = 3m4s ago
Task "haproxy" is "dead"
Task Resources
CPU Memory Disk Addresses
0/200 MHz 6.9 MiB/128 MiB 300 MiB haproxy_api: 10.0.64.160:4443
haproxy_cloud_admin: 10.0.64.160:7559
haproxy_mcfe: 10.0.64.160:443
haproxy_reporter: 10.0.64.160:5555
Task Events:
Started At = 2019-03-22T01:48:15Z
Finished At = 2019-03-22T01:48:44Z
Total Restarts = 1
Last Restart = 2019-03-22T01:48:43Z
Recent Events:
Time Type Description
2019-03-22T01:48:44Z Killing Template failed to send signals [user defined signal 1]: 1 error(s) occurred:
* Task not running
2019-03-22T01:48:43Z Restarting Task restarting in 34.83377889s
2019-03-22T01:48:43Z Terminated Exit Code: 0
2019-03-22T01:48:43Z Signaling Template re-rendered
2019-03-22T01:48:15Z Started Task started by client
2019-03-22T01:48:14Z Downloading Artifacts Client is downloading artifacts
2019-03-22T01:48:14Z Task Setup Building Task Directory
2019-03-22T01:48:14Z Received Task received by client
I don't actually expect the template to have changed in this case (the cluster is stable), but it apparently does anyway?
Changing to change_mode = "restart" seems to work for me as a workaround. I'd still love to dig into why change_mode = "signal" is so brittle here, but I don't have the bandwidth today, so I'm filing this here for visibility.
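For anyone searching later, a minimal sketch of that workaround (the source and destination paths are illustrative, not taken from the original job):

template {
  source      = "local/haproxy.cfg.tpl"   # illustrative path
  destination = "local/haproxy.cfg"

  # Restart the task when the template re-renders instead of signaling it,
  # which sidesteps the "failed to send signals ... Task not running" failure.
  change_mode = "restart"
}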
That's quite intriguing; I'm investigating this now. Would you mind posting the haproxy config file as well? I haven't been able to reproduce the "Task not running" part. Also, I assume this is running on CentOS 7 as well?
It's very peculiar that the task was terminated immediately when the template was re-rendered but the event about signaling failure occurred 4 seconds later:
2019-03-22T01:13:27Z Killing Template failed to send signals [user defined signal 1]: 1 error(s) occurred:
* Task not running
2019-03-22T01:13:23Z Restarting Task restarting in 31.906118053s
2019-03-22T01:13:23Z Terminated Exit Code: 0
2019-03-22T01:13:23Z Signaling Template re-rendered
The overall state also seems wrong; the UI is showing "Running" but all task groups are in a "failed" state and not going anywhere.
This is the confusing UX problem raised in https://github.com/hashicorp/nomad/issues/5408#issuecomment-475083537 . When an alloc fails and is expected to be restarted (note the "Task restarting in..." event), the overall job is marked as running.
Hey there
Since this issue hasn't had any activity in a while - we're going to automatically close it in 30 days. If you're still seeing this issue with the latest version of Nomad, please respond here and we'll keep this open and take another look at this.
Thanks!
This issue will be auto-closed because there hasn't been any activity for a few months. Feel free to open a new one if you still experience this problem :+1:
Nomad 0.9.3 - same problem:
Killing | Template failed to send signals [hangup]: 1 error(s) occurred: * Task not running
After a few failures the task eventually starts, but it's annoying that it fails with this error several times first.
Same issue here using Nomad 0.9.3:
Template failed to send signals [hangup]: 1 error(s) occurred: * Task not running
Two out of three allocations fail with this error; one runs.
If the signal fails to send for any reason (including because the task isn't running yet) then the task is marked as failed.
Maybe a new issue should be opened for this.
We are using nginx and consul-template for service discovery and ran into the same problem: while the nginx Nomad job is starting, the service index in Consul updates, which results in several signals being sent. This makes the nginx Nomad job impossible to start.
Ran into this with many consul templates (nginx configs) on 0.11.3.
Hi, we encountered this same problem. It seems that if the template makes the task fail (some formatting problem or similar that kills the process), fixing the template doesn't make the allocation start again, because Nomad can't apply the configuration change while the allocation isn't running. Could there be a collision between the template signaling and the restart behaviour?
When the template render kicks in, the task is set to be killed (it already is dead), and Nomad doesn't try to restart it again.
Hi, we encountered the same problem on 1.0.3. When we start our cluster from scratch, all our nginx jobs with consul-templated configs have errors, are marked failed, and show the same event list:
We found that the problem occurs only when several dependent tasks start at the same time, so we put some delay between them to avoid this (one way to do that is sketched below).
That's a bit annoying, and I don't like reds on the job's dashboard :)
We hope you will manage this somehow. For example, why not make Nomad ignore that kind of error, or not let templates send signals to tasks that aren't running yet?
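One way to add such a delay (this is my assumption about the approach; the commenter did not share their config) is a prestart helper task that sleeps before the dependent task starts:

task "wait-for-deps" {
  driver = "docker"   # hypothetical helper task; exec or raw_exec would work as well

  lifecycle {
    hook    = "prestart"
    sidecar = false
  }

  config {
    image   = "busybox:1.36"
    command = "sleep"
    args    = ["15"]   # give dependencies time to register in Consul
  }
}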
Dropping a note here that came from an observation in a support case; the problem appears to be associated with the template rendering happening while a task is stopped waiting for the restart.delay window.
Just dropping a note that I'm seeing this with the reference Nginx Tutorial here against the latest stable release of Nomad and Consul in my lab environment (amd64 instances): https://learn.hashicorp.com/tutorials/nomad/load-balancing-nginx
Just another note that I'm encountering this with Nomad v1.1.6 (b83d623fb5ff475d5e40df21e9e7a61834071078)
Still happening in Nomad 1.3.0-rc.1. Different signal, same problem. (Coincidentally, also with HAProxy.) At the very least, I think that the template rendering failure should not be fatal for the task/group/job.
Here are the "Recent Events" from a task that suffered from this today:
Time Type Description
May 18, '22 16:48:28 -0600 Killing Sent interrupt. Waiting 5s before force killing
May 18, '22 16:48:28 -0600 Not Restarting Error was unrecoverable
May 18, '22 16:48:28 -0600 Killing Template failed to send signals [user defined signal 2]: 1 error occurred: * Task not running
May 18, '22 16:46:16 -0600 Restarting Task restarting in 0s
May 18, '22 16:46:16 -0600 Terminated Exit Code: 137, Signal: 9
May 18, '22 16:46:11 -0600 Restart Signaled Template with change_mode restart re-rendered
May 18, '22 16:41:15 -0600 Signaling Template re-rendered
May 18, '22 16:36:10 -0600 Started Task started by client
May 18, '22 16:36:09 -0600 Task Setup Building Task Directory
May 18, '22 16:36:03 -0600 Received Task received by client
I have not yet found a way to reproduce the problem consistently. If I do, I will post my findings.
Also seeing this with 1.3.1. If it happens to a job, even with the following restart block:
restart {
  attempts = 5
  delay    = "5s"
  mode     = "delay"
  interval = "30s"
}
Nomad will actually kill the task and not attempt to restart it again. I see this as a bug and unexpected behavior.
$ nomad alloc status 1854d2d9-f0fb-9c78-15c6-7c7aedf390ba
Recent Events:
Time Type Description
2022-08-06T15:49:17-07:00 Killing Template failed to send signals [hangup]: 1 error occurred:
* Task not running
2022-08-06T15:49:11-07:00 Restarting Task restarting in 5.669470174s
2022-08-06T15:49:11-07:00 Terminated Exit Code: 1, Exit Message: "Docker container exited with non-zero exit code: 1"
2022-08-06T15:49:11-07:00 Started Task started by client
2022-08-06T15:49:10-07:00 Restarting Task restarting in 5.422057036s
2022-08-06T15:49:10-07:00 Terminated Exit Code: 1, Exit Message: "Docker container exited with non-zero exit code: 1"
2022-08-06T15:49:10-07:00 Started Task started by client
2022-08-06T15:49:05-07:00 Driver Downloading image
2022-08-06T15:49:04-07:00 Task Setup Building Task Directory
2022-08-06T15:49:04-07:00 Received Task received by client
If the task isn't running, then shouldn't the task be allowed to be retried as defined in the task's restart stanza? In that case, I feel the kill at https://github.com/hashicorp/nomad/blob/main/client/allocrunner/taskrunner/template/template.go#L517-L520 is unnecessary. It kills the task even when the task is defined with a restart.delay parameter.
Had the same problem today: when starting the virtual machines, Nomad attempted to send SIGHUP for the template change because it thought the process was still running, but it actually was not. The task then went into "failed" and Nomad never tried to start the job at all.
I'm having this problem (nginx with a dynamic template). It sure would be nice if there was some pattern or code fix to work around this.
Has anyone found any workarounds to this problem? I'd love to be able to use signals for configuration updates without downtime, but Nomad marking the job as failed when rendering/signalling fails is blocking this.
I suspect this has something to do with the fact that the jobs are of type 'system', so they are not configurable with the usual reschedule{} stanza. On node boot, if the signal is sent before Nomad registers itself with Consul, the job is properly restarted after the logs print "client node registered". I think the node is re-evaluated and so the system jobs get placed again.
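For context, the reschedule block that service (and batch) jobs can use, and that system jobs ignore, looks roughly like this (values are illustrative, not a recommendation):

group "nginx" {
  reschedule {
    delay          = "30s"
    delay_function = "exponential"
    max_delay      = "1h"
    unlimited      = true
  }
}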
A potential workaround that I am exploring at the moment:
# Renders a small helper script that reloads the main process (pid 1).
template {
  data = <<EOF
#!/bin/sh
echo "Reloading pid 1"
kill -HUP 1
EOF
  destination = "/local/reload.sh"
  perms       = "777"
}

# Renders the actual config and runs the helper script on change instead of signaling.
template {
  data = <<EOF
{{ key "config" }}
EOF
  destination = "/local/config.cfg"
  change_mode = "script"

  change_script {
    command = "/local/reload.sh"
  }
}
According to the documentation, it should not kill the task if the script fails: https://developer.hashicorp.com/nomad/docs/job-specification/change_script#fail_on_error
EDIT: I've been running this for close to a month and it works fine; allocations always start up when the VM is booted.
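For reference, a sketch of the same template with the documented change_script options spelled out explicitly (the timeout value is illustrative):

template {
  data = <<EOF
{{ key "config" }}
EOF
  destination = "/local/config.cfg"
  change_mode = "script"

  change_script {
    command       = "/local/reload.sh"
    timeout       = "20s"    # illustrative
    fail_on_error = false    # the default: a failing reload script should not fail the task
  }
}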
Same here; the workaround with change_script has been working OK for two weeks, much longer than I ever managed with signal.
Note that you cannot use variable interpolation here: the change_script command must be "/local/reload.sh" and not "${NOMAD_TASK_DIR}/reload.sh".
I have the exact issue mentioned in https://github.com/hashicorp/nomad/issues/5459#issuecomment-475767591 and I seem to have exact steps to reproduce it.
The nginx task depends on the app, because the nginx template depends on the application job:
job "portal" {
type = "service"
group "nginx" {
count = 1
restart {
attempts = 5
delay = "15s"
interval = "3m"
mode = "delay"
}
task "nginx-task" {
driver = "docker"
config {
volumes = [
"custom/nginx.conf:/etc/nginx/sites-enabled/internal",
]
}
template {
source = "/xxx/internal.nginx.conf.tpl"
destination = "custom/nginx.conf"
change_mode = "signal"
change_signal = "SIGHUP"
}
}
}
group "api" {
count = 1
restart {
attempts = 5
delay = "15s"
interval = "3m"
mode = "delay"
}
task "api" { ... }
}
The nginx.conf template contains:
{{- $has_api := service "api" -}}
{{- if $has_api }}
upstream api {
{{- range $has_api }}
server {{ .Address }}:{{ .Port }};
{{- end }}
keepalive 32;
}
{{- end -}}
To reproduce, stop the api container:

docker stop api-task-XXXX

The nginx-task then reports these events:
Jul 17, '24 15:37:58 -0400 Killing Template failed to send signals [hangup]: 1 error occurred: * Task not running
Jul 17, '24 15:37:41 -0400 Restarting Task restarting in 18s
Jul 17, '24 15:37:41 -0400 Terminated Exit Code: 129, Exit Message: "Docker container exited with non-zero exit code: 129"
Jul 17, '24 15:37:40 -0400 Signaling Template re-rendered
Here we see the expected 'hangup' signal, but it shows up too late, after the task has already been terminated. And we see an unexpected Terminated event: nginx should not have been terminated.
Let's see what happens during docker stop api-task-xxx.
Nginx is stopped unexpectedly:
Jul 17 19:30:52 ip-172-31-6-94 dockerd[1445]: time="2024-07-17T19:30:52.762755889Z" level=info msg="ignoring event" container=9c4de1228188b214db797f107c1eeda54b35c51741f7d3aa4771664d05d7ba76 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Jul 17 19:31:02 ip-172-31-6-94 dockerd[1445]: time="2024-07-17T19:31:02.555128397Z" level=info msg="ignoring event" container=feb8828be81b6aa1e07ee4b0555490fb1e5f320863d914851100dda4419e15ca module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Here is the docker ps output:
docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
feb8828be81b xxx nginx-task-541cadde-da2e-7542-cf81-0cadf2520c20
9c4de1228188 xxx api-task-7d84118c-cff5-16fe-78f9-855342828b86
When the system comes back to a normal state, we see that the nginx-task container ID has changed, but the api-task stays the same.
docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
e1897895f288 xxx nginx-task-564f7092-755c-1e5c-5fa3-e44b3036eb99
4b81a28522f6 xxx api-task-7d84118c-cff5-16fe-78f9-855342828b86
Consul reports that nginx was deregistered:
agent: Deregistered service: service=_nomad-task-564f7092-755c-1e5c-5fa3-e44b3036eb99-nginx-task-nginx-http-http
Day 2
Logs from nginx when the api container stopped:
2024/07/18 14:51:48 [notice] 23#23: http file cache: /tmp/viewercache 0.000M, bsize: 4096
2024/07/18 14:51:48 [notice] 23#23: http file cache: /tmp/portalcache 0.000M, bsize: 4096
2024/07/18 14:51:48 [notice] 17#17: signal 17 (SIGCHLD) received from 23
2024/07/18 14:51:48 [notice] 17#17: cache loader process 23 exited with code 0
2024/07/18 14:51:48 [notice] 17#17: signal 29 (SIGIO) received
2024/07/18 14:53:05 [notice] 17#17: using the "epoll" event method
2024/07/18 14:53:05 [notice] 17#17: openresty/1.19.3.1
2024/07/18 14:53:05 [notice] 17#17: built by gcc 13.2.1 20231014 (Alpine 13.2.1_git20231014)
2024/07/18 14:53:05 [notice] 17#17: OS: Linux 6.8.0-1010-aws
2024/07/18 14:53:05 [notice] 17#17: getrlimit(RLIMIT_NOFILE): 1048576:1048576
Another try:
2024/07/18 15:09:47 [notice] 17#17: using the "epoll" event method
2024/07/18 15:09:47 [notice] 17#17: openresty/1.19.3.1
2024/07/18 15:09:47 [notice] 17#17: built by gcc 13.2.1 20231014 (Alpine 13.2.1_git20231014)
2024/07/18 15:09:47 [notice] 17#17: OS: Linux 6.8.0-1010-aws
2024/07/18 15:09:47 [notice] 17#17: getrlimit(RLIMIT_NOFILE): 1048576:1048576
2024/07/18 15:09:47 [notice] 17#17: start worker processes
2024/07/18 15:09:47 [notice] 17#17: start worker process 18
2024/07/18 15:09:47 [notice] 17#17: start worker process 19
2024/07/18 15:09:47 [notice] 17#17: start worker process 20
2024/07/18 15:09:47 [notice] 17#17: start worker process 21
2024/07/18 15:09:47 [notice] 17#17: start cache manager process 22
2024/07/18 15:09:47 [notice] 17#17: start cache loader process 23
Docker logs:
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
2a9332e2b06d *** wi-nginx-task-67b5bbdf-42c4-6f11-b554-bcd5ff9cd3e0
e3c37b9bc2bc *** wi-api-task-c5cd5053-7011-ca7b-0795-73901d33120e
...
Jul 18 15:09:22 ip-172-31-6-94 dockerd[1445]: time="2024-07-18T15:09:22.024003081Z" level=info msg="ignoring event" container=e3c37b9bc2bce229d45a8369c4572d2ecf26d5ef93c10231a78e202dec6344d4 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Jul 18 15:09:29 ip-172-31-6-94 dockerd[1445]: time="2024-07-18T15:09:29.901144364Z" level=error msg="collecting stats for container /wi-nginx-task-67b5bbdf-42c4-6f11-b554-bcd5ff9cd3e0: no metrics received"
Jul 18 15:09:29 ip-172-31-6-94 dockerd[1445]: time="2024-07-18T15:09:29.911716221Z" level=info msg="ignoring event" container=2a9332e2b06d7de628189faabea5e3c492d137acd2fad47bc490898d2ae6f986 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
This time in Consul I see that nginx is still alive one second after Docker received the signal (I do not know from where) to stop nginx:
Jul 18 15:09:30 ip-172-31-6-94 consul[1696]: agent: Service in sync: service=_nomad-task-67b5bbdf-42c4-6f11-b554-bcd5ff9cd3e0-wi-nginx-task-wi-nginx-https
...
Jul 18 15:09:30 ip-172-31-6-94 consul[1696]: agent.proxycfg.agent-state: syncing proxy services from local state
Jul 18 15:09:30 ip-172-31-6-94 consul[1696]: 2024-07-18T15:09:30.298Z [INFO] agent: Deregistered service: service=_nomad-task-67b5bbdf-42c4-6f11-b554-bcd5ff9cd3e0-wi-nginx-task-wi-nginx-http-http
Jul 18 15:09:30 ip-172-31-6-94 consul[1696]: agent: Deregistered service: service=_nomad-task-67b5bbdf-42c4-6f11-b554-bcd5ff9cd3e0-wi-nginx-task-wi-nginx-http-http
Jul 18 15:09:30 ip-172-31-6-94 consul[1696]: agent: Service in sync: service=consul
Jul 18 15:09:30 ip-172-31-6-94 consul[1696]: 2024-07-18T15:09:30.298Z [DEBUG] agent: Service in sync: service=consul
From the logs above we do not see who kills nginx (at least I do not see it). So I went back to what the Web UI reported:
Jul 18, '24 12:10:17 -0400 Killing Template failed to send signals [hangup]: 1 error occurred: * Task not running
Jul 18, '24 12:10:04 -0400 Restarting Task restarting in 18s
Jul 18, '24 12:10:04 -0400 Terminated Exit Code: 129, Exit Message: "Docker container exited with non-zero exit code: 129"
Jul 18, '24 12:10:03 -0400 Signaling Template re-rendered
From the link here: https://www.tencentcloud.com/document/product/457/35758#standard-linux-interruption-signals we know that exit code 129 means SIGHUP (128 + signal number 1).
So Nomad sends the signal as expected, but the container exits unexpectedly. Thus we should probably look at Docker, or at the entrypoint script, to dig into why the container restarts:
c8b13dbf0245:/# ps
PID USER TIME COMMAND
1 root 0:00 /sbin/tini -- /opt/planitar/script/run.sh
7 root 0:00 {run.sh} /bin/bash /opt/planitar/script/run.sh
17 root 0:00 nginx: master process nginx -p /etc/nginx
18 www 0:00 nginx: worker process
19 www 0:00 nginx: worker process
20 www 0:00 nginx: worker process
21 www 0:00 nginx: worker process
22 www 0:00 nginx: cache manager process
24 root 0:00 bash
34 root 0:00 ps
It is still unclear to me why the nginx log has no information about the received signal.
@tgross this investigation will probably be of interest to you, because this issue is already 5 years old.
The solution was to add a signal handler to the Docker ENTRYPOINT script:

# Reload nginx on SIGHUP instead of letting the signal terminate the entrypoint
reload_nginx() {
  nginx -s reload
}
trap 'reload_nginx' SIGHUP
Using Nomad 0.9.0-beta3 here.
I have a haproxy task that I'm trying to restart on template changes (from Consul Template) but the entire task/group fails with:
And the job remains in a failed state.
Job spec looks something like this:
Couple of questions: