Closed: tantra35 closed this issue 5 years ago
Perhaps one way to reproduce this is to drain a node while a rolling update is in progress. However, when we previously saw this behavior, we had not drained any node.
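For example, a minimal reproduction sketch using the commands that appear later in this thread (the job file name and node ID are placeholders):

nomad run -detach ./job.nomad              # submit a modified job to start a rolling update
nomad node drain -enable -yes <node-id>    # drain a node while the update is in flight
nomad node drain -disable -yes <node-id>   # optionally make the node eligible again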
@tantra35 system jobs can only have a count of 1. Do you have a full job spec you can post here?
@preetapan Sorry, I misspoke. I meant a job running on 3 instances. Here is the job spec:
job "msgserver"
{
# region = "global"
datacenters = ["aws"]
type = "system"
priority = 70
constraint
{
attribute = "${attr.kernel.name}"
value = "linux"
}
constraint
{
attribute = "${node.class}"
value = "msgserver"
}
update
{
stagger = "30m"
max_parallel = 1
}
group "msgserver"
{
task "msgserver"
{
leader = true
driver = "raw_exec"
kill_timeout = "1m"
artifact
{
source = "s3::https://s3.amazonaws.com/nomad-artifacts/msgserver/msgserver.14.tar.gz"
}
config
{
command = "start.sh"
}
service
{
name = "srvname"
port = "MSGSERVER"
}
logs
{
max_files = 3
max_file_size = 10
}
resources
{
memory = 1024
network
{
mbits = 10
port "appport"
{
static = 8192
}
port "MSGSERVER"
{
}
}
}
}
task "fluend"
{
driver = "raw_exec"
artifact
{
source = "s3::https://s3.amazonaws.com/nomad-artifacts/td-agent/td-agent.3.tar.gz"
}
config
{
command = "bin/td-agent"
args = ["--no-supervisor", "-c", "${NOMAD_TASK_DIR}/td-agent.conf"]
}
template {
data = <<EOH
<source>
@type tail
read_from_head true
path "{{ env "NOMAD_ALLOC_DIR" }}/logs/msgserver.stdout.*"
pos_file "{{ env "NOMAD_ALLOC_DIR" }}/logs/msgserver.stdout-log.pos"
tag "prod.msgserver.{{ env "APPNAME" }}.out"
format multiline
format_firstline /\d{4}\.\d{2}\.\d{2} \d{2}:\d{2}:\d{2} \d+ \|/
format1 /(?<time>\d{4}\.\d{2}\.\d{2} \d{2}:\d{2}:\d{2}) (?<pid>\d+) \| (?<message>.*)/
time_format "%Y.%m.%d %H:%M:%S"
</source>
<source>
@type tail
read_from_head true
path "{{ env "NOMAD_ALLOC_DIR" }}/logs/msgserver.stderr.*"
pos_file "{{ env "NOMAD_ALLOC_DIR" }}/logs/msgserver.stderr-log.pos"
tag "prod.msgserver.{{ env "APPNAME" }}.err"
multiline_flush_interval 3s
format multiline
format_firstline /^\S+/
format1 /(?<message>.*)/
</source>
<filter prod.msgserver.**>
@type record_transformer
<record>
appname "{{ env "APPNAME" }}"
hostname ${hostname}
</record>
</filter>
<match **>
@type forward
heartbeat_interval 20s
phi_threshold 16
send_timeout 5s
recover_wait 30s
flush_interval 5s
retry_wait 10s
max_retry_wait 5m
<server>
name graphite
host fluentd.service.consul
port 24224
</server>
</match>
EOH
destination = "local/td-agent.conf"
}
env
{
APPNAME = "appname"
}
logs
{
max_files = 3
max_file_size = 10
}
resources
{
memory = 150
cpu = 50
}
}
}
}
@preetapan @cgbaker We just checked this on a test stand, and we concluded that rolling updates are completely broken for all job types. We tested with the following job definition:
job "vault_debug-00"
{
datacenters = ["test"]
priority = 50
constraint
{
attribute = "${attr.kernel.name}"
value = "linux"
}
constraint
{
distinct_hosts = true
}
update
{
stagger = "1800s"
max_parallel = 1
}
meta
{
CONTAINER_VERSION = "02"
}
group test
{
count = 3
task "vault_debug_task"
{
driver = "docker"
kill_timeout = "1m"
artifact
{
source = "http://docker.service.consul/playrix-vault_debug-${NOMAD_META_CONTAINER_VERSION}.tar.gz"
options
{
archive=false
}
}
config
{
command = "/sbin/init_plrx"
args = ["-c", "/sbin/runsvdir -P /etc/service"]
}
vault
{
policies = ["bind_password","vault_debug-consul"]
change_mode = "noop"
}
template
{
data = <<EOH
CONSUL_TOKEN="{{with secret "secrets/consul/plr/global/creds/vault_debug-consul"}}{{.Data.token}}{{end}}"
EOH
destination = "secrets/consul_token.env"
env = true
}
logs
{
max_files = 3
max_file_size = 10
}
resources
{
memory = 500
cpu = 200
}
}
}
}
Then, to test rolling updates, we reduced memory step by step from 500 to 400, then to 300, then to 200, re-running the job after each change.
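Each update step was only an edit to the memory value in the resources stanza (illustrative, based on the spec above):

resources
{
  memory = 400 # was 500; the change creates a new job version and triggers a rolling update
  cpu = 200
}

After the third update we saw the following: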
vagrant@consulnomad-1:~$ nomad status vault_debug-00
ID            = vault_debug-00
Name          = vault_debug-00
Submit Date   = 2018-10-17T23:42:32+03:00
Type          = service
Priority      = 50
Datacenters   = test
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
test        0       0         3        0       9         0

Latest Deployment
ID          = dd1173e2
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group  Desired  Placed  Healthy  Unhealthy  Progress Deadline
test        3        3       3        0          2018-10-17T23:53:32+03:00

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created     Modified
4c29e949  79e25c0e  test        3        run      running   6m33s ago   6m15s ago
d248228b  51d21e3d  test        3        run      running   6m56s ago   6m34s ago
67b721a6  f6f07f1f  test        3        run      running   7m15s ago   6m57s ago
7c1f6d1d  a7e6f34b  test        2        stop     complete  9m7s ago    6m28s ago
8cbe7324  79e25c0e  test        2        stop     complete  9m26s ago   6m51s ago
8f6e8db7  51d21e3d  test        2        stop     complete  9m45s ago   7m11s ago
800287bf  f6f07f1f  test        1        stop     complete  12m27s ago  9m3s ago
8aa67916  a7e6f34b  test        1        stop     complete  12m46s ago  9m22s ago
171a7060  79e25c0e  test        1        stop     complete  13m9s ago   9m41s ago
25ace39b  f6f07f1f  test        0        stop     complete  49m13s ago  12m41s ago
cd147c95  51d21e3d  test        0        stop     complete  49m13s ago  12m21s ago
7fc23361  a7e6f34b  test        0        stop     complete  49m13s ago  13m4s ago
The distance between allocation launches is not 30 minutes; it is much less.
https://www.nomadproject.io/docs/job-specification/update.html
stagger (string: "30s") - Specifies the delay between migrating allocations off nodes marked for draining. This is specified using a label suffix like "30s" or "1h".
You probably care about min_healthy_time for your service jobs :)

min_healthy_time (string: "10s") - Specifies the minimum time the allocation must be in the healthy state before it is marked as healthy and unblocks further allocations from being updated. This is specified using a label suffix like "30s" or "15m".
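For illustration, a minimal sketch of an update stanza that uses min_healthy_time (the values here are illustrative, not a recommendation):

update
{
  max_parallel = 1
  min_healthy_time = "30s" # an allocation must stay healthy this long before the next one is updated
  healthy_deadline = "5m"  # deadline for an allocation to become healthy before it counts as failed
}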
@jippi yes, you are right: for the service job type the new update block works and min_healthy_time is respected (in my last post I drew an incorrect conclusion), but for the system job type these new parameters are ignored entirely and only stagger applies to it (unfortunately, the documentation does not indicate the true state of things).
On the test stand we can reproduce, 100% of the time, a situation where one of the nodes is draining and the system scheduler restarts all allocations within a short time, without waiting for the required timeouts. We also found that rolling updates for service job types behave incorrectly as well when one of the nodes is drained during the process.
To demonstrate what actually goes wrong, we launched Nomad on the test stand with debug logging. The test stand consists of 3 Nomad servers and 4 Nomad clients. We launched a system job with the following update stanza:
update
{
  stagger = "30m"
  max_parallel = 1
}
then made a change to it and ran nomad run -detach ./vault_debug.test-05.nomad to trigger a rolling update. After that, the logs on the Nomad server looked like this:
Oct 19 14:23:59 consulnomad-3 nomad.sh[20930]: 2018/10/19 14:23:59.907400 [DEBUG] worker: dequeued evaluation e75af621-7ffc-5e35-239f-8623515bc939
Oct 19 14:23:59 consulnomad-3 nomad.sh[20930]: 2018/10/19 14:23:59.921771 [DEBUG] sched: <Eval "e75af621-7ffc-5e35-239f-8623515bc939" JobID: "vault_debug-05" Namespace: "default">: allocs: (place 0) (update 4) (migrate 0) (stop 0) (ignore 0) (lost 0)
Oct 19 14:23:59 consulnomad-3 nomad.sh[20930]: 2018/10/19 14:23:59.921881 [DEBUG] sched: <Eval "e75af621-7ffc-5e35-239f-8623515bc939" JobID: "vault_debug-05" Namespace: "default">: 0 in-place updates of 4
Oct 19 14:23:59 consulnomad-3 nomad.sh[20930]: 2018/10/19 14:23:59.923550 [DEBUG] worker: created evaluation <Eval "97ef3105-79dc-9249-a230-736e8783079b" JobID: "vault_debug-05" Namespace: "default">
Oct 19 14:23:59 consulnomad-3 nomad.sh[20930]: 2018/10/19 14:23:59.923574 [DEBUG] sched: <Eval "e75af621-7ffc-5e35-239f-8623515bc939" JobID: "vault_debug-05" Namespace: "default">: rolling update limit reached, next eval '97ef3105-79dc-9249-a230-736e8783079b' created
Oct 19 14:23:59 consulnomad-3 nomad.sh[20930]: 2018/10/19 14:23:59.929813 [DEBUG] worker: submitted plan at index 83 for evaluation e75af621-7ffc-5e35-239f-8623515bc939
Oct 19 14:23:59 consulnomad-3 nomad.sh[20930]: 2018/10/19 14:23:59.930056 [DEBUG] sched: <Eval "e75af621-7ffc-5e35-239f-8623515bc939" JobID: "vault_debug-05" Namespace: "default">: setting status to complete
Oct 19 14:23:59 consulnomad-3 nomad.sh[20930]: 2018/10/19 14:23:59.932338 [DEBUG] worker: updated evaluation <Eval "e75af621-7ffc-5e35-239f-8623515bc939" JobID: "vault_debug-05" Namespace: "default">
Oct 19 14:23:59 consulnomad-3 nomad.sh[20930]: 2018/10/19 14:23:59.932584 [DEBUG] worker: ack for evaluation e75af621-7ffc-5e35-239f-8623515bc939
Then we drained one of the nodes (nomad node drain -enable -yes c45751e3), and additional log output appeared:
Oct 19 14:25:06 consulnomad-3 nomad.sh[20930]: 2018/10/19 14:25:06.476138 [DEBUG] worker: dequeued evaluation be59677d-5400-42f4-0ca0-b698d0002662
Oct 19 14:25:06 consulnomad-3 nomad.sh[20930]: 2018/10/19 14:25:06.476654 [DEBUG] sched: <Eval "be59677d-5400-42f4-0ca0-b698d0002662" JobID: "vault_debug-05" Namespace: "default">: allocs: (place 0) (update 2) (migrate 1) (stop 0) (ignore 1) (lost 0)
Oct 19 14:25:06 consulnomad-3 nomad.sh[20930]: 2018/10/19 14:25:06.476988 [DEBUG] sched: <Eval "be59677d-5400-42f4-0ca0-b698d0002662" JobID: "vault_debug-05" Namespace: "default">: 0 in-place updates of 2
Oct 19 14:25:06 consulnomad-3 nomad.sh[20930]: 2018/10/19 14:25:06.478065 [INFO] nomad.drain: node "c45751e3-1c5e-e529-7215-172d74db0063" completed draining at index 94
Oct 19 14:25:06 consulnomad-3 nomad.sh[20930]: 2018/10/19 14:25:06.478721 [DEBUG] worker: created evaluation <Eval "22500713-3736-4df0-f619-17c544c75237" JobID: "vault_debug-05" Namespace: "default">
Oct 19 14:25:06 consulnomad-3 nomad.sh[20930]: 2018/10/19 14:25:06.478966 [DEBUG] sched: <Eval "be59677d-5400-42f4-0ca0-b698d0002662" JobID: "vault_debug-05" Namespace: "default">: rolling update limit reached, next eval '22500713-3736-4df0-f619-17c544c75237' created
Oct 19 14:25:06 consulnomad-3 nomad.sh[20930]: 2018/10/19 14:25:06.490021 [DEBUG] worker: submitted plan at index 96 for evaluation be59677d-5400-42f4-0ca0-b698d0002662
Oct 19 14:25:06 consulnomad-3 nomad.sh[20930]: 2018/10/19 14:25:06.490042 [DEBUG] sched: <Eval "be59677d-5400-42f4-0ca0-b698d0002662" JobID: "vault_debug-05" Namespace: "default">: setting status to complete
Oct 19 14:25:06 consulnomad-3 nomad.sh[20930]: 2018/10/19 14:25:06.492300 [DEBUG] worker: updated evaluation <Eval "be59677d-5400-42f4-0ca0-b698d0002662" JobID: "vault_debug-05" Namespace: "default">
Oct 19 14:25:06 consulnomad-3 nomad.sh[20930]: 2018/10/19 14:25:06.492698 [DEBUG] worker: ack for evaluation be59677d-5400-42f4-0ca0-b698d0002662
And, as expected, one of the allocations was updated, but much too early (in this job, rolling updates are supposed to happen at 30m intervals, one allocation at a time).
Then we undrained the previously drained node (nomad node drain -disable -yes c45751e3), and again additional log output appeared:
Oct 19 14:25:26 consulnomad-3 nomad.sh[20930]: 2018/10/19 14:25:26.332766 [DEBUG] worker: dequeued evaluation 73071556-fd12-0ede-f979-dbb376fa795f
Oct 19 14:25:26 consulnomad-3 nomad.sh[20930]: 2018/10/19 14:25:26.333250 [DEBUG] sched: <Eval "73071556-fd12-0ede-f979-dbb376fa795f" JobID: "vault_debug-05" Namespace: "default">: allocs: (place 1) (update 1) (migrate 0) (stop 0) (ignore 2) (lost 0)
Oct 19 14:25:26 consulnomad-3 nomad.sh[20930]: 2018/10/19 14:25:26.333427 [DEBUG] sched: <Eval "73071556-fd12-0ede-f979-dbb376fa795f" JobID: "vault_debug-05" Namespace: "default">: 0 in-place updates of 1
Oct 19 14:25:26 consulnomad-3 nomad.sh[20930]: 2018/10/19 14:25:26.336965 [DEBUG] worker: submitted plan at index 110 for evaluation 73071556-fd12-0ede-f979-dbb376fa795f
Oct 19 14:25:26 consulnomad-3 nomad.sh[20930]: 2018/10/19 14:25:26.336990 [DEBUG] sched: <Eval "73071556-fd12-0ede-f979-dbb376fa795f" JobID: "vault_debug-05" Namespace: "default">: setting status to complete
Oct 19 14:25:26 consulnomad-3 nomad.sh[20930]: 2018/10/19 14:25:26.339395 [DEBUG] worker: updated evaluation <Eval "73071556-fd12-0ede-f979-dbb376fa795f" JobID: "vault_debug-05" Namespace: "default">
Oct 19 14:25:26 consulnomad-3 nomad.sh[20930]: 2018/10/19 14:25:26.339639 [DEBUG] worker: ack for evaluation 73071556-fd12-0ede-f979-dbb376fa795f
and again one of the allocations was updated too early.
So, in the end, we have the following picture:
vagrant@consulnomad-1:~$ nomad status vault_debug-05
ID            = vault_debug-05
Name          = vault_debug-05
Submit Date   = 2018-10-19T14:23:59+03:00
Type          = system
Priority      = 50
Datacenters   = test
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
test        0       0         4        0       4         0

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created     Modified
8be8ad29  c45751e3  test        1        run      running   14m11s ago  14m8s ago
0c7d54de  e722631d  test        1        run      running   14m11s ago  14m2s ago
033d869d  b98fe747  test        1        run      running   14m31s ago  14m20s ago
6afae84f  c80b0cb2  test        1        run      running   15m37s ago  15m29s ago
286e7662  b98fe747  test        0        stop     complete  18m11s ago  14m24s ago
2dd59904  e722631d  test        0        stop     complete  18m11s ago  14m6s ago
36a4618f  c45751e3  test        0        stop     complete  18m11s ago  14m24s ago
a717ab2f  c80b0cb2  test        0        stop     complete  18m11s ago  15m32s ago
The distance between allocation creation times is very small (unexpected); it should be about 30m.
But I must say that in our production this situation happens without any node drain (I mention the drain here only as a way to reproduce).
The key reason why this happens is clear from this log line:
Oct 19 14:25:06 consulnomad-3 nomad.sh[20930]: 2018/10/19 14:25:06.476654 [DEBUG] sched: <Eval "be59677d-5400-42f4-0ca0-b698d0002662" JobID: "vault_debug-05" Namespace: "default">: allocs: (place 0) (update 2) (migrate 1) (stop 0) (ignore 1) (lost 0)
The Nomad server should ignore all remaining allocations and not update them when no update is due. But if any event happens in the cluster involving the nodes that host allocations still waiting for their rolling-update slot (for example, node flapping due to a bad network, which emits node-update evaluations and now happens often in AWS environments), those allocations will be relaunched too early.
So rolling updates partially work incorrectly (they behave unstably compared with previous versions of Nomad).
@tantra35 sorry for the late reply. This is a known issue with the system scheduler: the stagger parameter creates a follow-up evaluation for the scheduler at a later time, but any extraneous event, such as a node drain or a lost node, that causes the scheduler to look at that job again will still cause an alloc update.
We are tracking improvements to the system scheduler in #4740, where we will add all the recent enhancements to deployments for batch and service jobs to the system scheduler. I am closing this issue in favor of #4740.
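A simplified illustration of the mechanism described above (hypothetical pseudocode, not Nomad source): each evaluation independently picks up to max_parallel out-of-date allocations and defers the rest to a follow-up evaluation scheduled stagger later, but nothing records when the previous batch was rolled, so an event-triggered evaluation updates the next batch immediately.

package main

import "fmt"

// alloc is a stand-in for a scheduled allocation; illustrative only.
type alloc struct {
    id         string
    jobVersion uint64
}

// reconcile sketches the per-evaluation rolling-update decision: any
// out-of-date allocation is eligible, and nothing checks how recently the
// previous batch was updated, so a drain- or flap-triggered evaluation
// rolls the next batch right away. The stagger interval only delays the
// follow-up evaluation the scheduler creates for itself.
func reconcile(allocs []alloc, latest uint64, maxParallel int) (update, ignore []alloc) {
    for _, a := range allocs {
        if a.jobVersion < latest && len(update) < maxParallel {
            update = append(update, a)
        } else {
            ignore = append(ignore, a)
        }
    }
    return update, ignore
}

func main() {
    allocs := []alloc{{"a1", 0}, {"a2", 0}, {"a3", 0}, {"a4", 1}}
    upd, ign := reconcile(allocs, 1, 1)
    // Mirrors the "(update N) ... (ignore N)" counters in the eval log lines.
    fmt.Printf("(update %d) (ignore %d)\n", len(upd), len(ign))
}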
Nomad version: 0.8.6

We have a system job with the following rolling update config (the same update stanza as in the full job spec posted earlier in the thread):
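update
{
  stagger = "30m"
  max_parallel = 1
}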
The job has one task group with a count of 3. When we update it, sometimes Nomad doesn't wait 30m between task restarts and updates all tasks at the same time. In the output above we can see that the distance between task restarts is much less than it should be (30m).