hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Reserved port ranges can result in nodes being many MBs #3072

Closed dansteen closed 4 years ago

dansteen commented 7 years ago

Nomad version

Nomad v0.6.0

Operating system and Environment details

Archlinux

Issue

I have a number of nomad client boxes that were shut down, but nomad still shows each node as being in the "ready" state when I run nomad node-status. When I list the allocations on such a node, some show as lost and others show as pending. When I do an alloc-status on one of the allocations in the lost state, it correctly tells me alloc is lost since its node is down. However, when I do an alloc-status on one of the allocations in the pending state, it still thinks the node is up.

Here are the details:

nomad node-status bdd69afa                                                                          

error fetching node stats (HINT: ensure Client.Advertise.HTTP is set): Get https://billing-03d.stag:4646/v1/client/stats: dial tcp 10.1.200.39:4646: i/o timeout
ID      = bdd69afa
Name    = billing-03d.stag.awse
Class   = <none>
DC      = awse
Drain   = false
Status  = ready
Drivers = exec

Allocated Resources
CPU         Memory       Disk        IOPS
0/2494 MHz  0 B/3.7 GiB  0 B/16 GiB  0/0

Allocation Resource Utilization
CPU         Memory
0/2494 MHz  0 B/3.7 GiB

error fetching node stats (HINT: ensure Client.Advertise.HTTP is set): actual resource usage not present

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created At
7a649641  bdd69afa  app         19       run      pending   08/21/17 17:34:28 EDT
10434734  bdd69afa  app         17       stop     lost      08/21/17 11:00:59 EDT
a70ff120  bdd69afa  app         15       stop     lost      08/18/17 12:29:45 EDT
a2db807f  bdd69afa  app         14       stop     complete  08/18/17 11:59:47 EDT
8cae4200  bdd69afa  app         14       stop     complete  08/18/17 11:30:22 EDT
5d6def5d  bdd69afa  app         13       stop     complete  08/18/17 10:30:17 EDT

Notice that it thinks the node's status is ready. However, when I do an alloc-status on the allocation in the lost state, it knows the node is down:

nomad alloc-status 10434734                                                                         
ID                  = 10434734
Eval ID             = b17ca76d
Name                = billingsvc-stag.app[0]
Node ID             = bdd69afa
Job ID              = billingsvc-stag
Job Version         = 17
Client Status       = lost
Client Description  = <none>
Desired Status      = stop
Desired Description = alloc is lost since its node is down
Created At          = 08/21/17 11:00:59 EDT

Couldn't retrieve stats (HINT: ensure Client.Advertise.HTTP is set): Get https://billing-03d.stag:4646/v1/client/allocation/10434734-1262-5890-506c-53dc60af2b35/stats: dial tcp 10.1.200.39:4646: i/o timeout

But when I do an alloc-status on an allocation in the pending state, it still thinks the box is up:

nomad alloc-status 7a649641
ID                  = 7a649641
Eval ID             = 91eb73c3
Name                = billingsvc-stag.app[0]
Node ID             = bdd69afa
Job ID              = billingsvc-stag
Job Version         = 19
Client Status       = pending
Client Description  = <none>
Desired Status      = run
Desired Description = <none>
Created At          = 08/21/17 17:34:28 EDT

Couldn't retrieve stats (HINT: ensure Client.Advertise.HTTP is set): Get https://billing-03d.stag:4646/v1/client/allocation/7a649641-d25f-6fc6-b559-66f345e83c4c/stats: dial tcp 10.1.200.39:4646: i/o timeout

Thanks!

dadgar commented 7 years ago

Hey @dansteen,

Do you have reproduction steps? Did you do anything of note? I have never seen this before, and that code hasn't changed in a very long time.

If you have server logs in debug mode, could you share them? (Grep for the TTL expired message for that node: https://github.com/hashicorp/nomad/blob/master/nomad/heartbeat.go#L92.)
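
Something like the following would pull those lines out (a quick Go sketch, equivalent to a plain grep; the log path and node ID prefix here are assumptions for illustration):

package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
	"strings"
)

func main() {
	// Assumed log location; substitute wherever your servers write their logs.
	const logPath = "/var/log/nomad/server.log"
	const nodeID = "bdd69afa" // prefix of the node in question

	f, err := os.Open(logPath)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// heartbeat.go logs: [WARN] nomad.heartbeat: node '<id>' TTL expired
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		line := sc.Text()
		if strings.Contains(line, "TTL expired") && strings.Contains(line, nodeID) {
			fmt.Println(line)
		}
	}
	if err := sc.Err(); err != nil {
		log.Fatal(err)
	}
}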

dansteen commented 7 years ago

Hi @dadgar!

It happened for any box that was removed for a while. Since then I did a full nomad server cluster shutdown and restart, and it is no longer happening for new removals (though it is still an issue for boxes that were removed prior to the restart).

I have put my cluster into DEBUG mode, and it is definitely throwing those error lines. Here is the one for the node mentioned above:

2017-08-17T18:05:21.010249-04:00 nomad-0ea.stag.awse nomad[6161]:     2017/08/17 18:05:21.009753 [WARN] nomad.heartbeat: node 'bdd69afa-28b0-0966-ccd7-abb6a2bcf9a3' TTL expired

I don't see a whole lot of additional information around that though. Is there something specific that you would like to see?

I've even tried to completely stop the nomad job that is running on that node (that would be bad in production, but we are not running nomad in production yet). It registers the stop request and changes the "desired" state to stop, but the allocation still stays in pending.

As I mentioned, this is our testing environment, so I can do any sort of destructive testing you think would be helpful. Obviously, I could stop all jobs, clear the data folder, and start nomad up again, but that would not really resolve the underlying issue, and I would not have that option if this happens again once we move to production.

Thanks!

dadgar commented 7 years ago

@dansteen So even when you see the message that the TTL expired, if you do node-status you see the node as being alive?

dansteen commented 7 years ago

@dadgar yes correct.

dadgar commented 7 years ago

Ok so the things that happened were:

  1. Have a functional cluster (version 0.6.0?)
  2. Run jobs and everything is good
  3. Kill certain nodes
  4. Wait a while
  5. Restart just the nomad servers
  6. Dead nodes are marked as alive?

Can you reproduce this, or has it only happened once?

dansteen commented 7 years ago

Mostly. More holistically, from the very start of the issues, the sequence was:

  1. Have a functioning cluster (Nomad 0.5.6, Consul 0.8.2, Vault 0.7.2)
  2. Run jobs and everything is good
  3. Update to 0.6.0
  4. Run jobs and everything is good for several weeks
  5. Have some strange leader election issues resulting in job creation and allocation failures. (Nomad would tell me Error deregistering job: Unexpected response code: 500 (rpc error: leadership lost while committing log). At the time, I thought they were vault errors since I was also seeing messages about not being able to get secrets from vault, and I was seeing some odd messages in the vault logs as well. I've since discovered that they were nomad issues, and I've pasted output from the logs below. However, since I thought they were vault errors at the time, I did the next step.)
  6. Update consul from 0.8.2 to 0.9.2 and vault from 0.7.2 to 0.8.1
  7. Run jobs and everything is good - this is without restarting nomad
  8. Kill a bunch of nodes
  9. Wait a while
  10. Notice they don't die
  11. Try a bunch of things like stopping other clients, stopping just consul on a client box, stopping just nomad on a client box, and stopping both consul and nomad on a client box. Nothing resulted in a box being marked "down".
  12. Restart just the nomad servers
  13. Kill a bunch of nodes
  14. Newly killed nodes show up as "down", but nodes that were killed earlier and were still showing as alive continue to show up as alive.

Here is the log from step 5 above. There were many thousands of lines like this:

2017-08-19T00:28:15.011279-04:00 nomad-0ea.stag.awse nomad[32067]:     2017/08/19 00:28:15.011021 [INFO] nomad: cluster leadership acquired
2017-08-19T00:28:15.915208-04:00 nomad-0ea.stag.awse nomad[32067]:     2017/08/19 00:28:15.914615 [INFO] nomad: cluster leadership lost
2017-08-19T00:28:15.916221-04:00 nomad-0ea.stag.awse nomad[32067]:     2017/08/19 00:28:15.915657 [ERR] nomad: failed to wait for barrier: leadership lost while committing log
2017-08-19T00:28:17.138894-04:00 nomad-0ea.stag.awse nomad[32067]:     2017/08/19 00:28:17.138687 [INFO] nomad: cluster leadership acquired
2017-08-19T00:28:45.527617-04:00 nomad-0ea.stag.awse nomad[32067]:     2017/08/19 00:28:45.525341 [INFO] nomad: cluster leadership lost
2017-08-19T00:28:47.276698-04:00 nomad-0ea.stag.awse nomad[32067]:     2017/08/19 00:28:47.276243 [INFO] nomad: cluster leadership acquired
2017-08-19T00:28:48.237477-04:00 nomad-0ea.stag.awse nomad[32067]:     2017/08/19 00:28:48.236104 [INFO] nomad: cluster leadership lost
2017-08-19T00:28:48.238539-04:00 nomad-0ea.stag.awse nomad[32067]:     2017/08/19 00:28:48.236436 [ERR] nomad: failed to establish leadership: unable to reconcile job summaries: reconciliation of job summaries failed: leadership lost while committing log

Note, that I am still seeing log messages like this fairly frequently:

2017-08-22T09:40:24.268922-04:00 nomad-0ea.stag.awse nomad[26440]:     2017/08/22 09:40:24.268817 [INFO] nomad: cluster leadership acquired
2017-08-22T09:40:29.319804-04:00 nomad-0ea.stag.awse nomad[26440]:     2017/08/22 09:40:29.319764 [INFO] nomad: cluster leadership lost
2017-08-22T09:43:48.082172-04:00 nomad-0ea.stag.awse nomad[26440]:     2017/08/22 09:43:48.082030 [INFO] nomad: cluster leadership acquired
2017-08-22T09:44:01.479908-04:00 nomad-0ea.stag.awse nomad[26440]:     2017/08/22 09:44:01.479869 [INFO] nomad: cluster leadership lost
2017-08-22T09:45:35.304028-04:00 nomad-0ea.stag.awse nomad[26440]:     2017/08/22 09:45:35.303825 [INFO] nomad: cluster leadership acquired
2017-08-22T09:45:48.789384-04:00 nomad-0ea.stag.awse nomad[26440]:     2017/08/22 09:45:48.789346 [INFO] nomad: cluster leadership lost
2017-08-22T09:48:59.592430-04:00 nomad-0ea.stag.awse nomad[26440]:     2017/08/22 09:48:59.592345 [INFO] nomad: cluster leadership acquired
2017-08-22T09:49:20.159920-04:00 nomad-0ea.stag.awse nomad[26440]:     2017/08/22 09:49:20.159881 [INFO] nomad: cluster leadership lost
2017-08-22T09:52:00.311925-04:00 nomad-0ea.stag.awse nomad[26440]:     2017/08/22 09:52:00.311785 [INFO] nomad: cluster leadership acquired
2017-08-22T09:52:17.884194-04:00 nomad-0ea.stag.awse nomad[26440]:     2017/08/22 09:52:17.881930 [INFO] nomad: cluster leadership lost
2017-08-22T09:52:19.581303-04:00 nomad-0ea.stag.awse nomad[26440]:     2017/08/22 09:52:19.580852 [INFO] nomad: cluster leadership acquired
2017-08-22T09:52:53.549984-04:00 nomad-0ea.stag.awse nomad[26440]:     2017/08/22 09:52:53.545483 [INFO] nomad: cluster leadership lost
2017-08-22T09:58:28.950476-04:00 nomad-0ea.stag.awse nomad[26440]:     2017/08/22 09:58:28.950355 [INFO] nomad: cluster leadership acquired
2017-08-22T09:58:49.019037-04:00 nomad-0ea.stag.awse nomad[26440]:     2017/08/22 09:58:49.018996 [INFO] nomad: cluster leadership lost
2017-08-22T09:58:50.883344-04:00 nomad-0ea.stag.awse nomad[26440]:     2017/08/22 09:58:50.883263 [INFO] nomad: cluster leadership acquired
2017-08-22T09:59:24.440112-04:00 nomad-0ea.stag.awse nomad[26440]:     2017/08/22 09:59:24.440067 [INFO] nomad: cluster leadership lost
2017-08-22T09:59:26.053275-04:00 nomad-0ea.stag.awse nomad[26440]:     2017/08/22 09:59:26.053220 [INFO] nomad: cluster leadership acquired
2017-08-22T10:01:09.950390-04:00 nomad-0ea.stag.awse nomad[26440]:     2017/08/22 10:01:09.947525 [INFO] nomad: cluster leadership lost

But I don't see the ERR messages anymore; those went away after I updated consul and vault, before I did the full restart of nomad.

Sorry about not providing the full context earlier. -Dan

dadgar commented 7 years ago

Thanks for the additional detail. It may be that the nodes went down during the leader elections and then nothing handled them!

dansteen commented 7 years ago

Seems very likely, considering that leader elections were happening every couple of seconds. Is there any way to force nodes into the "down" state? In the meantime, I have put those nodes into the node-drain state so that my allocations don't fail, but I'd like to get rid of them if I can.

dadgar commented 7 years ago

@dansteen There isn't really an easy way (no API). Can you also give me the output of nomad node-status -json <id> for one of the bad nodes? If you shut down the leader node, it really should just handle this on the next leader election.

On leader election, all timers for healthy nodes are reset, and if a heartbeat doesn't occur before the timer fires, the node is transitioned to down: https://github.com/hashicorp/nomad/blob/master/nomad/heartbeat.go#L15-L106
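
For anyone following along, a rough sketch of that bookkeeping (simplified Go, not the actual heartbeat.go code; the TTL and node IDs here are made up) looks something like this:

package main

import (
	"fmt"
	"sync"
	"time"
)

type heartbeater struct {
	mu     sync.Mutex
	ttl    time.Duration
	timers map[string]*time.Timer
}

// resetAll is roughly what happens when a server becomes leader: every
// known node gets a full TTL before it would be considered down.
func (h *heartbeater) resetAll(nodes []string) {
	h.mu.Lock()
	defer h.mu.Unlock()
	for _, id := range nodes {
		h.resetLocked(id)
	}
}

// heartbeat is invoked when a client checks in, pushing its deadline out.
func (h *heartbeater) heartbeat(nodeID string) {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.resetLocked(nodeID)
}

func (h *heartbeater) resetLocked(nodeID string) {
	if t, ok := h.timers[nodeID]; ok {
		t.Stop()
	}
	id := nodeID
	h.timers[id] = time.AfterFunc(h.ttl, func() {
		// In Nomad this would submit a status update marking the node
		// "down"; here we only report it.
		fmt.Printf("node %q TTL expired, marking down\n", id)
	})
}

func main() {
	h := &heartbeater{ttl: 2 * time.Second, timers: map[string]*time.Timer{}}

	// Pretend a new leader just took over and is tracking two nodes.
	h.resetAll([]string{"bdd69afa", "ec28080d"})

	// ec28080d keeps checking in; bdd69afa (the dead box) never does, so
	// only bdd69afa should be reported as expired after ~2 seconds.
	tick := time.NewTicker(1 * time.Second)
	defer tick.Stop()
	stop := time.After(5 * time.Second)
	for {
		select {
		case <-tick.C:
			h.heartbeat("ec28080d")
		case <-stop:
			return
		}
	}
}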

dansteen commented 7 years ago

@dadgar here is the node information:

{
    "Attributes": {
        "driver.java.runtime": "Java(TM) SE Runtime Environment (build 1.8.0_31-b13)",
        "unique.platform.aws.local-ipv4": "10.1.200.172",
        "cpu.arch": "amd64",
        "platform.aws.placement.availability-zone": "us-east-1a",
        "vault.cluster_id": "982fc9ce-c941-053d-5c72-ad096b1feb4c",
        "unique.network.ip-address": "10.1.200.172",
        "unique.cgroup.mountpoint": "/sys/fs/cgroup",
        "os.signals": "SIGPROF,SIGSEGV,SIGXFSZ,SIGCONT,SIGUSR2,SIGSTOP,SIGTERM,SIGTRAP,SIGTTIN,SIGUSR1,SIGFPE,SIGQUIT,SIGTSTP,SIGCHLD,SIGXCPU,SIGILL,SIGINT,SIGSYS,SIGKILL,SIGPIPE,SIGABRT,SIGALRM,SIGBUS,SIGHUP,SIGIO,SIGIOT,SIGTTOU,SIGURG",
        "unique.storage.bytestotal": "16775008256",
        "consul.revision": "6017484",
        "cpu.modelname": "Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz",
        "cpu.frequency": "2500",
        "driver.exec": "1",
        "vault.accessible": "true",
        "cpu.totalcompute": "2500",
        "os.name": "debian",
        "kernel.name": "linux",
        "vault.version": "0.7.2",
        "os.version": "8.6",
        "unique.platform.aws.instance-id": "i-079fca5d9940a7a78",
        "nomad.revision": "139a96aee385e82c789789be228343d4d52349b6+CHANGES",
        "consul.datacenter": "awse",
        "unique.platform.aws.hostname": "ip-10-1-200-172.ec2.internal",
        "unique.storage.bytesfree": "11136200704",
        "unique.hostname": "utility-079",
        "driver.java.version": "1.8.0_31",
        "platform.aws.instance-type": "m3.medium",
        "consul.server": "false",
        "nomad.version": "0.6.0",
        "kernel.version": "4.7.0-0.bpo.1-amd64",
        "unique.storage.volume": "/dev/xvda2",
        "driver.java.vm": "Java HotSpot(TM) 64-Bit Server VM (build 25.31-b07, mixed mode)",
        "unique.platform.aws.local-hostname": "ip-10-1-200-172.ec2.internal",
        "driver.java": "1",
        "memory.totalbytes": "3949461504",
        "vault.cluster_name": "vault-cluster-955c9a4c",
        "cpu.numcores": "1",
        "consul.version": "0.8.2",
        "unique.platform.aws.ami-id": "ami-2c662e3b",
        "unique.consul.name": "utility-079.stag.awse"
    },
    "CreateIndex": 90304,
    "Datacenter": "awse",
    "Drain": true,
    "HTTPAddr": "utility-079.stag.awse:4646",
    "ID": "ec28080d-22cf-feda-ba84-b32783b18a5c",
    "Links": {
        "consul": "awse.utility-079.stag.awse",
        "aws.ec2": "us-east-1a.i-079fca5d9940a7a78"
    },
    "Meta": {
        "role": "utility",
        "env": "stag",
        "security_posture": "stag"
    },
    "ModifyIndex": 346790,
    "Name": "utility-079.stag.awse",
    "NodeClass": "",
    "Reserved": {
        "CPU": 0,
        "DiskMB": 0,
        "IOPS": 0,
        "MemoryMB": 0,
        "Networks": null
    },
    "Resources": {
        "CPU": 2500,
        "DiskMB": 10620,
        "IOPS": 0,
        "MemoryMB": 3766,
        "Networks": [
            {
                "CIDR": "10.1.200.172/32",
                "Device": "eth0",
                "DynamicPorts": null,
                "IP": "10.1.200.172",
                "MBits": 400,
                "ReservedPorts": null
            }
        ]
    },
    "Status": "ready",
    "StatusDescription": "",
    "StatusUpdatedAt": 1503188876,
    "TLSEnabled": true
}

dansteen commented 7 years ago

@dadgar Ok, interesting bit of additional information. In some cases I am using reserved ports; however, I build my configs using chef, so, by way of standardization, I included the following in my nomad client config even in cases where I did not have any specific port reservations:

reserved {
    reserved_ports = "0-20000,32000-65535"
}

(I realize this just allows the same default set of ports that nomad was allowing anyway, but it was purely for standardization, and I figured it wouldn't make any difference.)

While generating the node information for you in the response above, I realized that this actually resulted in ~53000 entries in the Resources: { Networks: { ReservedPorts: ... } } array in the node data, that it bumped the size of the node data from 5k to 6M, and that nomad was then trying to pass around 6M of data for every single node in my cluster!
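
To put rough numbers on it, here is a back-of-the-envelope sketch of that expansion (the Port struct is illustrative, not Nomad's exact type, and the JSON encoding is just a stand-in for however the node data is actually serialized):

package main

import (
	"encoding/json"
	"fmt"
	"log"
	"strconv"
	"strings"
)

// Port mirrors the idea of one labeled entry per reserved port.
type Port struct {
	Label string
	Value int
}

// expand turns a spec like "0-20000,32000-65535" into one Port per port number.
func expand(spec string) ([]Port, error) {
	var ports []Port
	for _, part := range strings.Split(spec, ",") {
		bounds := strings.SplitN(part, "-", 2)
		lo, err := strconv.Atoi(strings.TrimSpace(bounds[0]))
		if err != nil {
			return nil, err
		}
		hi := lo
		if len(bounds) == 2 {
			if hi, err = strconv.Atoi(strings.TrimSpace(bounds[1])); err != nil {
				return nil, err
			}
		}
		for p := lo; p <= hi; p++ {
			ports = append(ports, Port{Label: "reserved", Value: p})
		}
	}
	return ports, nil
}

func main() {
	ports, err := expand("0-20000,32000-65535")
	if err != nil {
		log.Fatal(err)
	}
	blob, _ := json.Marshal(ports)
	fmt.Printf("entries: %d, encoded size: ~%d KB\n", len(ports), len(blob)/1024)
	// Expect on the order of 53k entries and a couple of MB for the port
	// list alone, which is in the same ballpark as the 5k -> 6M jump above.
}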

Once I removed that from the config, the node data size went back down, and my cluster cleaned itself up on its own - no more zombie nodes!

One final interesting thing is that it was the "live" nodes that had the 6M of node data. The dead nodes that wouldn't go away all had the usual 5k (because they had been shut down as part of the same update that added all those reserved ports).

Anyway, this was an interesting lesson in how this all works, but my sense is that, given the syntax for port reservations, this is the sort of thing other people will run into, so maybe some sort of warning might be in order?

Thanks for your help!

dadgar commented 7 years ago

@dansteen Nice debugging. I am going to retitle the issue to track that problem. We need a way to pass ranges rather than actually creating an object per port.
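
Something along these lines, as a sketch of the idea (the types are illustrative only, not what Nomad eventually shipped):

package main

import "fmt"

// PortRange stores an inclusive range as two ints instead of one object
// per port, so "0-20000,32000-65535" is two entries rather than ~53k.
type PortRange struct {
	Start, End int
}

// Contains reports whether the port falls inside the range.
func (r PortRange) Contains(port int) bool {
	return port >= r.Start && port <= r.End
}

type ReservedPorts []PortRange

// Contains reports whether any of the ranges reserves the port.
func (rp ReservedPorts) Contains(port int) bool {
	for _, r := range rp {
		if r.Contains(port) {
			return true
		}
	}
	return false
}

func main() {
	reserved := ReservedPorts{{Start: 0, End: 20000}, {Start: 32000, End: 65535}}
	fmt.Println(reserved.Contains(8080))  // true, inside 0-20000
	fmt.Println(reserved.Contains(25000)) // false, in the unreserved window
}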

evandam commented 4 years ago

@dadgar are there any updates on this issue?

dadgar commented 4 years ago

@evandam This should be fixed in 0.9.0. Are you facing this issue in a recent release?

Fix https://github.com/hashicorp/nomad/pull/4750/commits/f96929885492e1fc3020321554b12df9f2b19fc0#diff-396d9428285e4245041a3d0b15546d5dR2101

evandam commented 4 years ago

Gotcha, thanks! I'm not seeing it, but I saw this issue was still open and wanted to verify before running it in production 😄

dadgar commented 4 years ago

@evandam Awesome! Thanks for the update. I am going to close this issue then; I think we just missed it when the linked PR got merged!

github-actions[bot] commented 1 year ago

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.