hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Plan returns an erroneous in-place update in diff when task group has a constraint #10836

Closed · mjm closed 3 years ago

mjm commented 3 years ago

Nomad version

Nomad v1.1.2 (60638a086ef9630e2a9ba1e237e8426192a44244)

Operating system and Environment details

Ubuntu 20.04.2 LTS

Issue

I have some deploy automation for my Nomad cluster that, for each job, first runs a plan to see whether the job has any changes that need to be applied. I've noticed that for some of my jobs, the plan always has type "Edited" even when there are no changes. If I look at the "Versions" tab for the job in the UI, it lists that version as having "0 changes".

Here's an example of the response from the plan endpoint:

{
   "Annotations" : {
      "DesiredTGUpdates" : {
         "tripplite-exporter" : {
            "Canary" : 0,
            "DestructiveUpdate" : 0,
            "Ignore" : 0,
            "InPlaceUpdate" : 1,
            "Migrate" : 0,
            "Place" : 0,
            "Preemptions" : 0,
            "Stop" : 0
         }
      },
      "PreemptedAllocs" : null
   },
   "CreatedEvals" : null,
   "Diff" : {
      "Fields" : null,
      "ID" : "tripplite-exporter",
      "Objects" : null,
      "TaskGroups" : [
         {
            "Fields" : null,
            "Name" : "tripplite-exporter",
            "Objects" : null,
            "Tasks" : [
               {
                  "Annotations" : null,
                  "Fields" : null,
                  "Name" : "tripplite-exporter",
                  "Objects" : null,
                  "Type" : "None"
               }
            ],
            "Type" : "Edited",
            "Updates" : {
               "in-place update" : 1
            }
         }
      ],
      "Type" : "Edited"
   },
   "FailedTGAllocs" : null,
   "JobModifyIndex" : 409749,
   "NextPeriodicLaunch" : "0001-01-01T00:00:00Z",
   "Warnings" : ""
}
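
For context, the check my automation does is roughly like this (a trimmed-down sketch using the github.com/hashicorp/nomad/api client; the file name "tripplite-exporter.json" is just a placeholder for however the transformed job spec gets loaded):

package main

import (
	"encoding/json"
	"fmt"
	"log"
	"os"

	"github.com/hashicorp/nomad/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Load the already-transformed JSON job spec.
	raw, err := os.ReadFile("tripplite-exporter.json")
	if err != nil {
		log.Fatal(err)
	}
	var job api.Job
	if err := json.Unmarshal(raw, &job); err != nil {
		log.Fatal(err)
	}

	// Ask Nomad for a plan (with a diff) and only deploy when something changed.
	plan, _, err := client.Jobs().Plan(&job, true, nil)
	if err != nil {
		log.Fatal(err)
	}
	if plan.Diff != nil && plan.Diff.Type != "None" {
		fmt.Printf("%s: diff type %q, submitting job\n", *job.ID, plan.Diff.Type)
	} else {
		fmt.Printf("%s: no changes, skipping\n", *job.ID)
	}
}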

This is happening with 3 of my jobs, and the one thing I've noticed they all have in common is that they all have a constraint to only place them on a particular node. My other jobs are not exhibiting this behavior and don't have this constraint on a task group.

        {
          "LTarget": "${node.unique.name}",
          "RTarget": "raspberrypi",
          "Operand": "="
        }

I tried digging through the job planning code, but I got a bit lost trying to figure out where this decision is made.

Reproduction steps

Register a job with a single task group with a constraint like the one shown above. Request a plan for the job without any changes.
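
A rough sketch of those two steps using the Go API client (the job ID, task, and "raspberrypi" node name are placeholders; the constraint matches the one shown above):

package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/nomad/api"
)

func strPtr(s string) *string { return &s }

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Single task group constrained to one node, like the affected jobs.
	task := api.NewTask("sleep", "exec")
	task.Config = map[string]interface{}{
		"command": "/bin/sleep",
		"args":    []string{"100"},
	}
	group := api.NewTaskGroup("sleep", 1).
		Constrain(api.NewConstraint("${node.unique.name}", "=", "raspberrypi")).
		AddTask(task)

	job := &api.Job{
		ID:          strPtr("constraint-repro"),
		Name:        strPtr("constraint-repro"),
		Type:        strPtr("system"),
		Datacenters: []string{"dc1"},
	}
	job.AddTaskGroup(group)

	// Register the job, then immediately plan the identical job again.
	if _, _, err := client.Jobs().Register(job, nil); err != nil {
		log.Fatal(err)
	}
	plan, _, err := client.Jobs().Plan(job, true, nil)
	if err != nil {
		log.Fatal(err)
	}
	// Expected "None"; on affected clusters this reports "Edited".
	fmt.Println("diff type:", plan.Diff.Type)
}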

Expected Result

The plan has diff type "None" because nothing has changed.

Actual Result

The plan has diff type "Edited" due to an in-place update.

Job file (if appropriate)

Nomad Server logs (if appropriate)

Nomad Client logs (if appropriate)

mjm commented 3 years ago

I just tried moving the constraint up to the job level instead of the task group, and it fixes this behavior. For me, that's a decent workaround, since these jobs only have a single task group anyway.
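
In Go API terms, the workaround amounts to something like this (a rough sketch; the package and function names are hypothetical, and it assumes the single-group shape of these jobs):

package deploy

import "github.com/hashicorp/nomad/api"

// hoistConstraints moves any task-group-level constraints up to the job
// level before the job is planned or registered, mirroring the manual
// change described above.
func hoistConstraints(job *api.Job) {
	for _, group := range job.TaskGroups {
		job.Constraints = append(job.Constraints, group.Constraints...)
		group.Constraints = nil
	}
}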

shoenig commented 3 years ago

Hi @mjm, thanks for reporting!

So far I haven't been able to reproduce what you're seeing - it may be that the group constraint isn't actually the problem on its own. It would help if you could post the CLI output from running nomad job plan -diff -verbose <job>, or just one of the symptomatic job files.

Here's the job I'm submitting:

job "example" {
  datacenters = ["dc1"]

  group "sleep" {

    constraint {
      operator  = "="
      attribute = "${node.unique.name}"
      value     = "laptop"
    }

    task "sleep" {
      driver = "exec"

      config {
        command = "/bin/sleep"
        args    = ["100"]
      }
    }
  }
}

mjm commented 3 years ago

Thanks for looking at this! Here's the JSON version of one of these jobs. I have some code that does some transformations on the original HCL, so this JSON version is what actually gets applied. I have to believe the constraint is somehow relevant (even if it's not the whole story), because moving it up to the job level does fix the problem.

{
  "Region": null,
  "Namespace": null,
  "ID": "tripplite-exporter",
  "Name": "tripplite-exporter",
  "Type": "system",
  "Priority": 70,
  "AllAtOnce": null,
  "Datacenters": [
    "dc1"
  ],
  "Constraints": null,
  "Affinities": null,
  "TaskGroups": [
    {
      "Name": "tripplite-exporter",
      "Count": 1,
      "Constraints": [
        {
          "LTarget": "${node.unique.name}",
          "RTarget": "raspberrypi",
          "Operand": "="
        }
      ],
      "Affinities": null,
      "Tasks": [
        {
          "Name": "tripplite-exporter",
          "Driver": "docker",
          "User": "",
          "Lifecycle": null,
          "Config": {
            "command": "/tripplite_exporter",
            "image": "index.docker.io/mmoriarity/tripplite-exporter@sha256:c955272aa83f9eccfe461a8b96ef8f299e13b3cb71a7a7bcad5db6376d27ace6",
            "logging": {
              "config": [
                {
                  "tag": "tripplite-exporter"
                }
              ],
              "type": "journald"
            },
            "mount": [
              {
                "source": "/dev/bus/usb",
                "target": "/dev/bus/usb",
                "type": "bind"
              }
            ],
            "ports": [
              "http"
            ],
            "privileged": true
          },
          "Constraints": null,
          "Affinities": null,
          "Env": {
            "HOSTNAME": "${attr.unique.hostname}",
            "HOST_IP": "${attr.unique.network.ip-address}",
            "NOMAD_CLIENT_ID": "${node.unique.id}"
          },
          "Services": null,
          "Resources": {
            "CPU": 30,
            "MemoryMB": 30,
            "DiskMB": null,
            "Networks": null,
            "Devices": null,
            "IOPS": null
          },
          "RestartPolicy": null,
          "Meta": null,
          "KillTimeout": null,
          "LogConfig": null,
          "Artifacts": null,
          "Vault": null,
          "Templates": null,
          "DispatchPayload": null,
          "VolumeMounts": null,
          "Leader": false,
          "ShutdownDelay": 0,
          "KillSignal": "",
          "Kind": "",
          "ScalingPolicies": null
        }
      ],
      "Spreads": null,
      "Volumes": null,
      "RestartPolicy": null,
      "ReschedulePolicy": null,
      "EphemeralDisk": null,
      "Update": null,
      "Migrate": null,
      "Networks": [
        {
          "Mode": "",
          "Device": "",
          "CIDR": "",
          "IP": "",
          "DNS": {
            "Servers": [
              "10.0.2.101"
            ],
            "Searches": null,
            "Options": null
          },
          "ReservedPorts": null,
          "DynamicPorts": [
            {
              "Label": "http",
              "Value": 0,
              "To": 8080,
              "HostNetwork": ""
            }
          ],
          "MBits": null
        }
      ],
      "Meta": null,
      "Services": [
        {
          "Id": "",
          "Name": "tripplite-exporter",
          "Tags": null,
          "CanaryTags": null,
          "EnableTagOverride": false,
          "PortLabel": "http",
          "AddressMode": "",
          "Checks": [
            {
              "Id": "",
              "Name": "",
              "Type": "http",
              "Command": "",
              "Args": null,
              "Path": "/healthz",
              "Protocol": "",
              "PortLabel": "",
              "Expose": false,
              "AddressMode": "",
              "Interval": 30000000000,
              "Timeout": 5000000000,
              "InitialStatus": "",
              "TLSSkipVerify": false,
              "Header": null,
              "Method": "",
              "CheckRestart": null,
              "GRPCService": "",
              "GRPCUseTLS": false,
              "TaskName": "",
              "SuccessBeforePassing": 3,
              "FailuresBeforeCritical": 0
            }
          ],
          "CheckRestart": null,
          "Connect": null,
          "Meta": {
            "metrics_path": "/metrics"
          },
          "CanaryMeta": null,
          "TaskName": ""
        }
      ],
      "ShutdownDelay": null,
      "StopAfterClientDisconnect": null,
      "Scaling": null
    }
  ],
  "Update": null,
  "Multiregion": null,
  "Spreads": null,
  "Periodic": null,
  "ParameterizedJob": null,
  "Reschedule": null,
  "Migrate": null,
  "Meta": null,
  "ConsulToken": null,
  "VaultToken": null,
  "Stop": null,
  "ParentID": null,
  "Dispatched": false,
  "Payload": null,
  "VaultNamespace": null,
  "NomadTokenID": null,
  "Status": null,
  "StatusDescription": null,
  "Stable": null,
  "Version": null,
  "SubmitTime": null,
  "CreateIndex": null,
  "ModifyIndex": null,
  "JobModifyIndex": null
}

shoenig commented 3 years ago

Hi @mjm, so far I still haven't reproduced what you're seeing; however, I did notice one interesting thing: in your plan output we see

"NextPeriodicLaunch" : "0001-01-01T00:00:00Z",

but when I submit a similar job, get the JSON from inspect, and submit it for planning, I always get

"NextPeriodicLaunch":null,

I don't know if that's actually related, but it seems suspicious. Did a job of this name once exist as a periodic job?

mjm commented 3 years ago

I had 3 different jobs affected by this. One is a periodic batch job, one is a batch job I trigger manually with a dispatch payload when necessary, and the other is a system job (that's the one I included here). They've all been those types of jobs from the beginning as far as I remember.

The JSON I posted comes from some Go code that interacts with the Nomad API to plan and submit jobs, rather than from the nomad CLI tool, so it's produced by json.Marshal-ing the JobPlanResponse type. The NextPeriodicLaunch field is a time.Time, not a pointer, so I think the value I'm seeing is just Go's zero value for that type.
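
For illustration, marshaling any struct with a non-pointer time.Time field produces exactly that timestamp for the zero value:

package main

import (
	"encoding/json"
	"fmt"
	"time"
)

func main() {
	// The zero value of a plain time.Time marshals as 0001-01-01T00:00:00Z,
	// which matches the NextPeriodicLaunch value in the plan output above.
	out, _ := json.Marshal(struct {
		NextPeriodicLaunch time.Time
	}{})
	fmt.Println(string(out)) // {"NextPeriodicLaunch":"0001-01-01T00:00:00Z"}
}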

luckymike commented 3 years ago

@shoenig this appears to be similar to the issue reported in #9804

shoenig commented 3 years ago

Thanks for pointing that out, @luckymike - there does indeed seem to be a problem when mixing system jobs with constraints. I'm finally able to reproduce the symptom here; in fact, all I needed to do was run my same sample job above, but on a cluster with more than one client :grimacing:

github-actions[bot] commented 1 year ago

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.