hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

nomad allocates all RAM on server and crashes the box #1817

Closed jippi closed 7 years ago

jippi commented 8 years ago

Hi,

Running 0.4.1, I'm seeing some weird behavior from Nomad. It happens roughly daily on our web servers.

The result is that the server becomes unresponsive for ~10 minutes while things OOM, and then slowly recovers. No allocations or changes are made during these outages; the one in the logs here happened on a Saturday with no one working or logged into the systems.

I'm also observing that the node is in ready status, but system jobs will not actually restart on it (Job + Allocation) - restarting nomad makes the allocation succeed again - https://gist.github.com/jippi/046840d5c6c65b4e0e1ea32ea2424242

Log (with debug on) https://gist.github.com/jippi/95b88ef66fd592206406ba9d312ca228

Interestingly enough, the two clients this behavior happens on are physical servers, while the $x other clients in the cluster, running inside KVM, don't act up like this.

They are provisioned identically with Puppet; their only major differences are physical vs. virtual machine, and that the web boxes (which see this issue) also have active Docker jobs running, whereas the other servers have Docker running but nothing allocated on it.

Allocation executor logs

https://gist.github.com/jippi/83a32fce9d409a32fa6175b5793d7c2c

config.hcl

bind_addr = "0.0.0.0"
datacenter = "production"
region = "global"
data_dir = "/opt/nomad/data"
log_level = "DEBUG"

advertise {
  http = "???.???.91.111:4646"
  rpc = "???.???.91.111:4647"
  serf = "???.???.91.111:4648"
}

addresses {
  http = "0.0.0.0"
  rpc = "0.0.0.0"
  serf = "0.0.0.0"
}

client {
  enabled = true
  servers = ["nomad.service.bownty:4647"]

  options = {
    "driver.raw_exec.enable" = "1"
  }

  node_class = "web"

  meta {
    "web" = "1"
  }
}

consul {
  address               = "127.0.0.1:8500"

  server_service_name   = "nomad"
  server_auto_join      = true

  client_service_name   = "nomad-client"
  client_auto_join      = true
}

http_api_response_headers {
  Access-Control-Allow-Origin   = "*"
  Access-Control-Expose-Headers = "x-nomad-index"
  Access-Control-Allow-Methods  = "GET, POST, OPTIONS"
}

nomad agent-info

-> nomad agent-info
client
  heartbeat_ttl = 12.593495744s
  known_servers = 3
  last_heartbeat = 10.843486073s
  node_id = 9ff0ea83-ede6-9143-adca-aaed5c3e6553
  num_allocations = 7
runtime
  arch = amd64
  cpu_count = 8
  goroutines = 85
  kernel.name = linux
  max_procs = 5
  version = go1.7

node as seen from /v1/node/:id

{
  "ID": "9ff0ea83-ede6-9143-adca-aaed5c3e6553",
  "Datacenter": "production",
  "Name": "web02",
  "HTTPAddr": "xxx.zzz.91.111:4646",
  "Attributes": {
    "unique.storage.volume": "/dev/disk/by-uuid/0cfc07c4-4b8f-4709-aaad-2ee1a1854762",
    "unique.network.ip-address": "xxx.yyy.91.111",
    "cpu.totalcompute": "27992",
    "driver.java.version": "1.8.0_72",
    "cpu.modelname": "Intel(R) Xeon(R) CPU E3-1230 v3 @ 3.30GHz",
    "driver.exec": "1",
    "os.version": "7.9",
    "unique.cgroup.mountpoint": "/sys/fs/cgroup",
    "driver.java.runtime": "Java(TM) SE Runtime Environment (build 1.8.0_72-b15)",
    "driver.java.vm": "Java HotSpot(TM) 64-Bit Server VM (build 25.72-b15, mixed mode)",
    "driver.docker": "1",
    "unique.storage.bytestotal": "1876063666176",
    "driver.raw_exec": "1",
    "driver.java": "1",
    "unique.hostname": "web02",
    "kernel.name": "linux",
    "arch": "amd64",
    "cpu.numcores": "8",
    "kernel.version": "4.7.6",
    "nomad.revision": "'8fdc55e16b54f176a711c115966ba234e8bb7879+CHANGES'",
    "cpu.frequency": "3499",
    "os.name": "debian",
    "unique.storage.bytesfree": "1763363778560",
    "nomad.version": "0.4.1",
    "driver.docker.version": "1.12.1",
    "memory.totalbytes": "33569710080"
  },
  "Resources": {
    "CPU": 27992,
    "MemoryMB": 32014,
    "DiskMB": 1681674,
    "IOPS": 0,
    "Networks": [
      {
        "Device": "eth0",
        "CIDR": "xxx.zzz.91.111/32",
        "IP": "xxx.zzz.91.111",
        "MBits": 1000,
        "ReservedPorts": null,
        "DynamicPorts": null
      }
    ]
  },
  "Reserved": {
    "CPU": 0,
    "MemoryMB": 0,
    "DiskMB": 0,
    "IOPS": 0,
    "Networks": null
  },
  "Links": {

  },
  "Meta": {
    "web": "1"
  },
  "NodeClass": "web",
  "ComputedClass": "v1:1445644767665653020",
  "Drain": false,
  "Status": "ready",
  "StatusDescription": "",
  "StatusUpdatedAt": 1476548513,
  "CreateIndex": 22,
  "ModifyIndex": 12451
}

Example allocation from the server

{
  "ID": "b38a1355-f949-fa7b-1271-06fff182e6c2",
  "EvalID": "e1b5e417-f82e-0347-1d8a-e5344eb5d80e",
  "Name": "insights-web.php-fpm[0]",
  "NodeID": "9ff0ea83-ede6-9143-adca-aaed5c3e6553",
  "JobID": "insights-web",
  "Job": {
    "Region": "global",
    "ID": "insights-web",
    "ParentID": "",
    "Name": "insights-web",
    "Type": "system",
    "Priority": 50,
    "AllAtOnce": false,
    "Datacenters": [
      "production"
    ],
    "Constraints": [
      {
        "LTarget": "${meta.web}",
        "RTarget": "1",
        "Operand": "="
      },
      {
        "LTarget": "",
        "RTarget": "",
        "Operand": "distinct_hosts"
      }
    ],
    "TaskGroups": [
      {
        "Name": "php-fpm",
        "Count": 1,
        "Constraints": null,
        "RestartPolicy": {
          "Attempts": 2,
          "Interval": 60000000000,
          "Delay": 15000000000,
          "Mode": "delay"
        },
        "Tasks": [
          {
            "Name": "server",
            "Driver": "raw_exec",
            "User": "www-data",
            "Config": {
              "args": [
                "--fpm-config=/etc/bownty/insights/php-fpm/manager.conf"
              ],
              "command": "/usr/sbin/php-fpm7.0"
            },
            "Env": null,
            "Services": [
              {
                "Name": "insights-web-php-fpm-server",
                "PortLabel": "",
                "Tags": null,
                "Checks": null
              }
            ],
            "Constraints": null,
            "Resources": {
              "CPU": 500,
              "MemoryMB": 128,
              "DiskMB": 300,
              "IOPS": 0,
              "Networks": null
            },
            "Meta": null,
            "KillTimeout": 5000000000,
            "LogConfig": {
              "MaxFiles": 10,
              "MaxFileSizeMB": 10
            },
            "Artifacts": null
          }
        ],
        "Meta": null
      }
    ],
    "Update": {
      "Stagger": 10000000000,
      "MaxParallel": 1
    },
    "Periodic": null,
    "Meta": null,
    "Status": "running",
    "StatusDescription": "",
    "CreateIndex": 91,
    "ModifyIndex": 99,
    "JobModifyIndex": 91
  },
  "TaskGroup": "php-fpm",
  "Resources": {
    "CPU": 500,
    "MemoryMB": 128,
    "DiskMB": 300,
    "IOPS": 0,
    "Networks": null
  },
  "TaskResources": {
    "server": {
      "CPU": 500,
      "MemoryMB": 128,
      "DiskMB": 300,
      "IOPS": 0,
      "Networks": null
    }
  },
  "Metrics": {
    "NodesEvaluated": 1,
    "NodesFiltered": 0,
    "NodesAvailable": {
      "production": 6
    },
    "ClassFiltered": null,
    "ConstraintFiltered": null,
    "NodesExhausted": 0,
    "ClassExhausted": null,
    "DimensionExhausted": null,
    "Scores": {
      "9ff0ea83-ede6-9143-adca-aaed5c3e6553.binpack": 0.479497471987969
    },
    "AllocationTime": 48246,
    "CoalescedFailures": 0
  },
  "DesiredStatus": "stop",
  "DesiredDescription": "alloc is lost since its node is down",
  "ClientStatus": "failed",
  "ClientDescription": "",
  "TaskStates": {
    "server": {
      "State": "dead",
      "Events": [
        {
          "Type": "Received",
          "Time": 1476434000544104192,
          "RestartReason": "",
          "DriverError": "",
          "ExitCode": 0,
          "Signal": 0,
          "Message": "",
          "KillTimeout": 0,
          "KillError": "",
          "StartDelay": 0,
          "DownloadError": "",
          "ValidationError": ""
        },
        {
          "Type": "Started",
          "Time": 1476434000557040481,
          "RestartReason": "",
          "DriverError": "",
          "ExitCode": 0,
          "Signal": 0,
          "Message": "",
          "KillTimeout": 0,
          "KillError": "",
          "StartDelay": 0,
          "DownloadError": "",
          "ValidationError": ""
        },
        {
          "Type": "Terminated",
          "Time": 1476547483098828050,
          "RestartReason": "",
          "DriverError": "",
          "ExitCode": 0,
          "Signal": 0,
          "Message": "unexpected EOF",
          "KillTimeout": 0,
          "KillError": "",
          "StartDelay": 0,
          "DownloadError": "",
          "ValidationError": ""
        },
        {
          "Type": "Restarting",
          "Time": 1476547541518148797,
          "RestartReason": "Restart within policy",
          "DriverError": "",
          "ExitCode": 0,
          "Signal": 0,
          "Message": "",
          "KillTimeout": 0,
          "KillError": "",
          "StartDelay": 17869710516,
          "DownloadError": "",
          "ValidationError": ""
        },
        {
          "Type": "Driver Failure",
          "Time": 1476547619247235399,
          "RestartReason": "",
          "DriverError": "failed to start task 'server' for alloc 'b38a1355-f949-fa7b-1271-06fff182e6c2': unable to dispense the executor plugin: EOF",
          "ExitCode": 0,
          "Signal": 0,
          "Message": "",
          "KillTimeout": 0,
          "KillError": "",
          "StartDelay": 0,
          "DownloadError": "",
          "ValidationError": ""
        },
        {
          "Type": "Not Restarting",
          "Time": 1476547619247389681,
          "RestartReason": "Error was unrecoverable",
          "DriverError": "",
          "ExitCode": 0,
          "Signal": 0,
          "Message": "",
          "KillTimeout": 0,
          "KillError": "",
          "StartDelay": 0,
          "DownloadError": "",
          "ValidationError": ""
        }
      ]
    }
  },
  "CreateIndex": 5301,
  "ModifyIndex": 12460,
  "AllocModifyIndex": 12418,
  "CreateTime": 1476434000485948951
}

Observed from Datadog (image)

Observed from NewRelic (1) (image)

Observed from NewRelic (2) (image)

From NewRelic, the data includes both the nomad agent and the different nomad executor instances; I'm unable to split them apart.

dadgar commented 8 years ago

@jippi Fairly positive you hit this bug: https://github.com/hashicorp/nomad/pull/1762.

A short-term fix would be to use exec instead of raw_exec. I also suggest you reserve some CPU and memory on the nodes; otherwise you are allowing Nomad to allocate the whole machine's memory.
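For reference, such a reservation goes in the client stanza of the agent config; a rough sketch (the values here are examples, not recommendations):

```hcl
client {
  enabled = true

  # Keep this much CPU/memory/disk out of Nomad's scheduling pool so the
  # OS and the Nomad agent itself always have headroom.
  reserved {
    cpu    = 500  # MHz
    memory = 512  # MB
    disk   = 1024 # MB
  }
}
```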

jippi commented 8 years ago

@dadgar Okay, I've reserved some CPU / RAM for Nomad now (2.5 GHz and 512 MB).

Any ETA for a release containing that fix? Also, any suggestion on how I could verify it's indeed that issue?
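(One rough way to check whether it's the executor leak would be to sample the resident memory of the nomad agent and executor processes over time and watch for growth; a sketch, where `sample_rss` is just a hypothetical helper, not part of Nomad:)

```shell
#!/bin/sh
# Print PID and resident memory (RSS, in KiB) of every process whose
# command line matches a pattern. Running this periodically makes steady
# RSS growth in the nomad agent/executor processes easy to spot.
sample_rss() {
  pattern="$1"
  ps -eo pid=,rss=,args= | awk -v p="$pattern" '$0 ~ p && $0 !~ /awk/ { print $1, $2 }'
}

sample_rss nomad
```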

dadgar commented 8 years ago

@jippi Did you end up verifying? Hopefully in 1-2 weeks.

jippi commented 8 years ago

@dadgar The error happened again today, even though I set an allocation limit on Nomad.

I'm honestly not good enough at Go to be confident a custom build would be production-grade. If you have time, it would be amazing to get an amd64 Linux build with the cherry-picked commit so I can test it out - or guidance on how to make a production-grade build for amd64 :)

The super odd thing is that it's only two out of seven boxes that have the issue. Same kernel version and everything; the only difference is physical hardware vs. virtual KVM servers.
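(For the record, building a patched binary roughly amounts to the following; this is only a sketch that assumes a working Go toolchain/GOPATH setup for Nomad 0.4.x, and the commit hash from #1762 is left as a placeholder:)

```shell
# Sketch: build a linux/amd64 Nomad with the fix from #1762 cherry-picked
# onto the v0.4.1 tag. <commit-from-pr-1762> is a placeholder.
git clone https://github.com/hashicorp/nomad.git && cd nomad
git checkout v0.4.1
git cherry-pick <commit-from-pr-1762>   # resolve any conflicts
GOOS=linux GOARCH=amd64 go build -o bin/nomad .
```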

jippi commented 7 years ago

@dadgar Since I cherry-picked the commit you suggested from #1762, I've not observed the issue! :)

github-actions[bot] commented 1 year ago

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.