hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Template: Fail job on missing keys #3462

Open felka opened 6 years ago

felka commented 6 years ago

Nomad version

0.7.0-rc3

Operating system and Environment details

ubuntu 16.04

Issue

We are moving our cron jobs to Nomad. This includes scheduling Scala JARs using the native exec driver (no containers). The job contains a template stanza that renders a job config file with keys from Consul before execution. If a key is missing, the job gets stuck in the pending state and prevents other allocations from running. However, newly added jobs seem to run while older pending jobs don't get allocated. There is also an issue in the UI, which shows the stuck jobs in the "running" state.

Reproduction steps

1. Add 100 new jobs with a missing key, so that there are both running and pending jobs.
2. Wait until some of the failing jobs are allocated.
3. Add new working jobs, without a missing Consul key, to test allocation.

Error in UI

rv-job pending Missing: kv.block(path_to_key/producerConfigs)

Job file

{
  "Job": {
    "AllAtOnce": false,
    "Constraints": null,
    "CreateIndex": 34811,
    "Datacenters": [
      "us-east-1b",
      "us-east-1d",
      "us-east-1e"
    ],
    "ID": "dev118-rv-job",
    "JobModifyIndex": 34811,
    "Meta": null,
    "ModifyIndex": 34811,
    "Name": "dev118-rv-job",
    "Namespace": "default",
    "ParameterizedJob": null,
    "ParentID": "",
    "Payload": null,
    "Periodic": null,
    "Priority": 50,
    "Region": "global",
    "Stable": false,
    "Status": "pending",
    "StatusDescription": "",
    "Stop": false,
    "SubmitTime": 1509292961781137149,
    "TaskGroups": [
      {
        "Constraints": null,
        "Count": 1,
        "EphemeralDisk": {
          "Migrate": false,
          "SizeMB": 300,
          "Sticky": false
        },
        "Meta": null,
        "Name": "rv-job",
        "RestartPolicy": {
          "Attempts": 0,
          "Delay": 25,
          "Interval": 600000000000,
          "Mode": "fail"
        },
        "Tasks": [
          {
            "Artifacts": [
              {
                "GetterMode": "file",
                "GetterOptions": null,
                "GetterSource": "http://localhost:8500/v1/kv/nomad-jobs/dev42/rv-job/sonic-templates/c5cbd9b297af0cecc23c02dbba1440fe6e4b182a?raw",
                "RelativeDest": "local/rv-job.conf.tpl"
              },
              {
                "GetterMode": "any",
                "GetterOptions": null,
                "GetterSource": "s3::https://s3.amazonaws.com/file.jar",
                "RelativeDest": "local/"
              }
            ],
            "Config": {
              "command": "/usr/bin/java",
              "args": [
                "-noverify",
                "-Dlogback.configurationFile=local/logback.xml",
                "-DAPPNAME=${appname}",
                "-Xms8192m",
                "-Xmx8192m",
                "-DBRANCH=${appenv}",
                "-Djava.library.path=/var/lib/sonic/lib",
                "-Djava.io.tmpdir=/tmp/",
                "-Dconfig.file=local/${appname}.conf",
                "-cp",
                "local/${commit}.jar",
                "com.supersonic.allocator.AllocatorUpdateProcessMain",
                "rv"
              ]
            },
            "Constraints": null,
            "DispatchPayload": null,
            "Driver": "exec",
            "Env": {
              "appname": "rv-job",
              "submitted": "2017-10-29T16:01:09Z",
              "appenv": "dev42",
              "commit": "c5cbd9b297af0cecc23c02dbba1440fe6e4b182a"
            },
            "KillTimeout": 5000000000,
            "Leader": false,
            "LogConfig": {
              "MaxFileSizeMB": 10,
              "MaxFiles": 10
            },
            "Meta": null,
            "Name": "rv-job",
            "Resources": {
              "CPU": 2048,
              "DiskMB": 0,
              "IOPS": 0,
              "MemoryMB": 8192,
              "Networks": null
            },
            "Services": null,
            "ShutdownDelay": 0,
            "Templates": [
              {
                "ChangeMode": "restart",
                "ChangeSignal": "",
                "DestPath": "local/rv-job.conf",
                "EmbeddedTmpl": "",
                "Envvars": false,
                "LeftDelim": "{{",
                "Perms": "0644",
                "RightDelim": "}}",
                "SourcePath": "local/rv-job.conf.tpl",
                "Splay": 5000000000,
                "VaultGrace": 15000000000
              }
            ],
            "User": "",
            "Vault": null
          }
        ],
        "Update": null
      }
    ],
    "Type": "batch",
    "Update": {
      "AutoRevert": false,
      "Canary": 0,
      "HealthCheck": "",
      "HealthyDeadline": 0,
      "MaxParallel": 0,
      "MinHealthyTime": 0,
      "Stagger": 0
    },
    "VaultToken": "",
    "Version": 0
  }
}

preetapan commented 6 years ago

@felka Can you provide the template (even a sanitized version is fine)? Nomad uses consul-template, which does a blocking query to read the key. I suspect this is causing the template to never finish rendering. This is expected behavior, though, given how consul-template works.

felka commented 6 years ago

@preetapan Thanks for the prompt answer! I understand it is expected behavior. Also, we are using consul-template 0.14, which has a non-blocking query in "key". However, I think there should be a config option that allows the job to fail instead of staying in the pending/running state and holding an allocation. Blocking allocations because of a missing key seems too aggressive as a default.

mikesimons commented 6 years ago

I would definitely like the option for the nomad job to fail if keys are missing (maybe as an extra param to the template stanza).

I had the same issue while evaluating nomad and resorted to doing a pre-flight check on the manifest using this script.

If the output is not empty then you have missing keys and can throw an error before attempting the deploy.
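For readers without access to the linked script, here is a minimal sketch of what such a pre-flight check could look like. It is not the script referenced above. The function names (`referenced_keys`, `missing_keys`, `consul_key_exists`) are hypothetical, and it assumes a Consul agent at `http://localhost:8500`, as in the job file's `GetterSource`. It only detects blocking `key` calls with a literal string path, so paths built from variables are not covered.

```python
import re
import urllib.error
import urllib.parse
import urllib.request
from typing import Callable, List

# A template action such as {{ key "a/b" }} or {{ with key "a/b" }}.
ACTION = re.compile(r"\{\{(.*?)\}\}", re.DOTALL)
# A blocking `key` call with a literal path. The required whitespace after
# `key` keeps keyOrDefault and keyExists from matching.
KEY_CALL = re.compile(r'\bkey\s+"([^"]+)"')


def referenced_keys(template_text: str) -> List[str]:
    """Return the KV paths read by blocking `key` calls, de-duplicated, in order."""
    paths: List[str] = []
    for action in ACTION.findall(template_text):
        for path in KEY_CALL.findall(action):
            if path not in paths:
                paths.append(path)
    return paths


def missing_keys(template_text: str, key_exists: Callable[[str], bool]) -> List[str]:
    """Return the referenced paths for which key_exists(path) is False."""
    return [p for p in referenced_keys(template_text) if not key_exists(p)]


def consul_key_exists(path: str, base_url: str = "http://localhost:8500") -> bool:
    """Check a key through Consul's KV HTTP endpoint; a 404 means it is absent."""
    url = f"{base_url}/v1/kv/{urllib.parse.quote(path)}"
    try:
        with urllib.request.urlopen(url, timeout=5):
            return True
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False
        raise  # any other HTTP error is a real failure, not a missing key


if __name__ == "__main__":
    template = """
    producer = {{ key "path_to_key/producerConfigs" }}
    consumer = {{ keyOrDefault "path_to_key/consumerConfigs" "fallback" }}
    """
    # Stand-in lookup so the example runs without a Consul agent.
    known = {"path_to_key/consumerConfigs"}
    print(missing_keys(template, lambda p: p in known))
    # prints ['path_to_key/producerConfigs']
```

Against a live agent, pass `consul_key_exists` as the lookup and refuse to submit the job when the returned list is not empty.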

preetapan commented 6 years ago

@felka - Do you use keyOrDefault in your template? https://github.com/hashicorp/consul-template#keyordefault - that should let it progress with a default value instead of a blocking query.

As I understand it, the details of your template are opaque to Nomad, which only executes it. consul-template has three options: a blocking query, a keyExists check so that you can do flow control based on whether a key is present, and keyOrDefault, which lets you provide a default value. If the template execution does not terminate, Nomad cannot determine the right behavior. For example, what if you want to wait for something else to populate that key so that the config file for the job can be populated correctly, and the job can then run after that? In that case, it's acceptable to block allocations until it's ready to proceed.

That being said, we will discuss this internally to see what else we can do here.

burdandrei commented 5 years ago

How about adding an error_on_missing_key option to the template stanza?

Legogris commented 3 years ago

It seems this should be supported in consul-template now? I'm not sure whether the issue is here or in https://github.com/hashicorp/terraform-provider-nomad, but attempting to use it in a template stanza gives:

template -> invalid key: error_on_missing_key

mikenomitch commented 2 years ago

Since Consul Template supports error_on_missing_key, we should be able to support this by adding the key to the Consul Template config struct and threading the value through to Consul Template.

I am going to add the help-wanted and good-first-issue labels in case somebody wants to take this!