Closed — mjm closed this issue 3 years ago
I just tried moving the constraint up to the job level instead of the task group, and it fixes this behavior. For me, that's a decent workaround, since these jobs only have a single task group anyway.
Hi @mjm, thanks for reporting!
So far I haven't been able to reproduce what you're seeing, but perhaps the group constraint isn't actually the problem. It would help if you could post the CLI output from running nomad job plan -diff -verbose <job>, or just one of the symptomatic job files.
Here's the job I'm submitting:
job "example" {
  datacenters = ["dc1"]

  group "sleep" {
    constraint {
      operator  = "="
      attribute = "${node.unique.name}"
      value     = "laptop"
    }

    task "sleep" {
      driver = "exec"

      config {
        command = "/bin/sleep"
        args    = ["100"]
      }
    }
  }
}
Thanks for looking at this! Here's the JSON version of one of these jobs. I have some code that does some transformations on the original HCL, so this JSON version is what actually gets applied. I have to believe that the constraint is somehow relevant (maybe it's not the whole story), because moving it up to the job does fix the problem.
{
  "Region": null,
  "Namespace": null,
  "ID": "tripplite-exporter",
  "Name": "tripplite-exporter",
  "Type": "system",
  "Priority": 70,
  "AllAtOnce": null,
  "Datacenters": [
    "dc1"
  ],
  "Constraints": null,
  "Affinities": null,
  "TaskGroups": [
    {
      "Name": "tripplite-exporter",
      "Count": 1,
      "Constraints": [
        {
          "LTarget": "${node.unique.name}",
          "RTarget": "raspberrypi",
          "Operand": "="
        }
      ],
      "Affinities": null,
      "Tasks": [
        {
          "Name": "tripplite-exporter",
          "Driver": "docker",
          "User": "",
          "Lifecycle": null,
          "Config": {
            "command": "/tripplite_exporter",
            "image": "index.docker.io/mmoriarity/tripplite-exporter@sha256:c955272aa83f9eccfe461a8b96ef8f299e13b3cb71a7a7bcad5db6376d27ace6",
            "logging": {
              "config": [
                {
                  "tag": "tripplite-exporter"
                }
              ],
              "type": "journald"
            },
            "mount": [
              {
                "source": "/dev/bus/usb",
                "target": "/dev/bus/usb",
                "type": "bind"
              }
            ],
            "ports": [
              "http"
            ],
            "privileged": true
          },
          "Constraints": null,
          "Affinities": null,
          "Env": {
            "HOSTNAME": "${attr.unique.hostname}",
            "HOST_IP": "${attr.unique.network.ip-address}",
            "NOMAD_CLIENT_ID": "${node.unique.id}"
          },
          "Services": null,
          "Resources": {
            "CPU": 30,
            "MemoryMB": 30,
            "DiskMB": null,
            "Networks": null,
            "Devices": null,
            "IOPS": null
          },
          "RestartPolicy": null,
          "Meta": null,
          "KillTimeout": null,
          "LogConfig": null,
          "Artifacts": null,
          "Vault": null,
          "Templates": null,
          "DispatchPayload": null,
          "VolumeMounts": null,
          "Leader": false,
          "ShutdownDelay": 0,
          "KillSignal": "",
          "Kind": "",
          "ScalingPolicies": null
        }
      ],
      "Spreads": null,
      "Volumes": null,
      "RestartPolicy": null,
      "ReschedulePolicy": null,
      "EphemeralDisk": null,
      "Update": null,
      "Migrate": null,
      "Networks": [
        {
          "Mode": "",
          "Device": "",
          "CIDR": "",
          "IP": "",
          "DNS": {
            "Servers": [
              "10.0.2.101"
            ],
            "Searches": null,
            "Options": null
          },
          "ReservedPorts": null,
          "DynamicPorts": [
            {
              "Label": "http",
              "Value": 0,
              "To": 8080,
              "HostNetwork": ""
            }
          ],
          "MBits": null
        }
      ],
      "Meta": null,
      "Services": [
        {
          "Id": "",
          "Name": "tripplite-exporter",
          "Tags": null,
          "CanaryTags": null,
          "EnableTagOverride": false,
          "PortLabel": "http",
          "AddressMode": "",
          "Checks": [
            {
              "Id": "",
              "Name": "",
              "Type": "http",
              "Command": "",
              "Args": null,
              "Path": "/healthz",
              "Protocol": "",
              "PortLabel": "",
              "Expose": false,
              "AddressMode": "",
              "Interval": 30000000000,
              "Timeout": 5000000000,
              "InitialStatus": "",
              "TLSSkipVerify": false,
              "Header": null,
              "Method": "",
              "CheckRestart": null,
              "GRPCService": "",
              "GRPCUseTLS": false,
              "TaskName": "",
              "SuccessBeforePassing": 3,
              "FailuresBeforeCritical": 0
            }
          ],
          "CheckRestart": null,
          "Connect": null,
          "Meta": {
            "metrics_path": "/metrics"
          },
          "CanaryMeta": null,
          "TaskName": ""
        }
      ],
      "ShutdownDelay": null,
      "StopAfterClientDisconnect": null,
      "Scaling": null
    }
  ],
  "Update": null,
  "Multiregion": null,
  "Spreads": null,
  "Periodic": null,
  "ParameterizedJob": null,
  "Reschedule": null,
  "Migrate": null,
  "Meta": null,
  "ConsulToken": null,
  "VaultToken": null,
  "Stop": null,
  "ParentID": null,
  "Dispatched": false,
  "Payload": null,
  "VaultNamespace": null,
  "NomadTokenID": null,
  "Status": null,
  "StatusDescription": null,
  "Stable": null,
  "Version": null,
  "SubmitTime": null,
  "CreateIndex": null,
  "ModifyIndex": null,
  "JobModifyIndex": null
}
Hi @mjm, so far I still haven't reproduced what you're seeing, however I did notice one interesting thing: in your plan output we see
"NextPeriodicLaunch" : "0001-01-01T00:00:00Z",
but when I submit a similar job, get the JSON from inspect, and submit it for planning, I always get
"NextPeriodicLaunch":null,
I don't know if that's actually related, but it seems suspicious. Did a job of this name once exist as a periodic job?
I had 3 different jobs affected by this. One is a periodic batch job, one is a batch job I trigger manually with a dispatch payload when necessary, and the other is a system job (that's the one I included here). They've all been those types of jobs from the beginning as far as I remember.
The JSON I got there is coming from some Go code that interacts with the Nomad API to plan and submit jobs, rather than the nomad CLI tool. So that JSON is produced by json.Marshal-ing the JobPlanResponse type. The NextPeriodicLaunch field is a time.Time, not a pointer, so I think the value I have is just Go's zero value for that type.
@shoenig this appears to be similar to the issue reported in #9804
Thanks for pointing that out @luckymike; indeed, there does seem to be a problem when mixing system jobs with constraints. I'm finally able to reproduce the symptom here. In fact, all I needed to do was run my same sample job above, but on a cluster with more than one client :grimacing:
Nomad version
Operating system and Environment details
Ubuntu 20.04.2 LTS
Issue
I have some deploy automation for my Nomad cluster that for each job first runs a plan to see if the job has any changes that need to be applied. I've noticed that for some of my jobs, the plan always has type "Edited" even when there are no changes. If I look in the "Versions" tab for the job in the UI, it lists that version as having "0 changes".
Here's an example of the response from the plan endpoint:
This is happening with 3 of my jobs, and the one thing I've noticed they all have in common is that they all have a constraint to only place them on a particular node. My other jobs are not exhibiting this behavior and don't have this constraint on a task group.
I tried digging through the code for planning jobs but I got a bit lost trying to figure out where this kind of decision was made.
Reproduction steps
Register a job with a single task group with a constraint like the one shown above. Request a plan for the job without any changes.
Expected Result
The plan has diff type "None" because nothing has changed.
Actual Result
The plan has diff type "Edited" due to an in-place update.
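The automation's change check described above can be sketched as follows; planResponse is a hypothetical stand-in that assumes only what the report states, namely that the plan response carries a Diff object whose Type is "None" when nothing changed and "Edited" otherwise:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// planResponse models just the part of the plan response the deploy
// automation cares about (field names assumed, per the report above).
type planResponse struct {
	Diff struct {
		Type string
	}
}

// hasChanges reports whether a raw plan response indicates the job
// would change if submitted.
func hasChanges(body []byte) (bool, error) {
	var resp planResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return false, err
	}
	return resp.Diff.Type != "None", nil
}

func main() {
	// With this bug, an unchanged constrained job still reports "Edited",
	// so automation like this always redeploys it.
	changed, _ := hasChanges([]byte(`{"Diff":{"Type":"Edited"}}`))
	fmt.Println(changed) // true
}
```

A check like this is why the spurious "Edited" diff matters: it makes every deploy look like a change.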