**empikls** opened this issue 2 years ago (status: Open)
Hi @empikls! The evaluation data should have more details. Right now we don't have a nice and clean single-eval inspect, but if you use:

```shell
nomad eval list -json | jq '.[] | select(.ID | startswith("11fb88cc"))'
```

that should give you a JSON blob with a bunch more data about where the scheduler is doing something you might not expect.
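For illustration, here is the same filter run against a self-contained sample (the IDs and values are made up; the field shapes mirror the `nomad eval list -json` output shown later in this thread):

```shell
# Made-up sample input; in practice the JSON comes from `nomad eval list -json`.
echo '[{"ID":"11fb88cc-1111","FailedTGAllocs":{"web":{"NodesFiltered":3}}},
       {"ID":"deadbeef-2222","FailedTGAllocs":null}]' |
  jq '.[] | select(.ID | startswith("11fb88cc")) | .FailedTGAllocs'
# prints the FailedTGAllocs object of the matching eval
```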
That being said, I'm looking at this and have a few questions:
```hcl
datacenters = ["euc1", "fsn1"]

constraint {
  attribute = "${node.class}"
  operator  = "regexp"
  value     = "(cloud-cpu-worker|object-detection)"
}

constraint {
  operator  = "distinct_property"
  attribute = "${node.datacenter}"
  value     = "1"
}
```
Is this a `system` job? Having a `distinct_property` of one-per-DC seems to contradict having it as a system job, which is intended to place on every eligible node in the DC.

(Aside: I hope you don't mind, but I edited your question to use triple backticks for the code blocks and shell output; that'll make them easier to read.)
@tgross Yes, the job version is 0, and I didn't specify the version before (when I used nomad 1.0.1).
New output from gitlab:
```
$ for job in $(ls jobs/*); do echo "Deploying $job"; nomad job run $job; done
Deploying jobs/autoscaler.nomad
==> Monitoring evaluation "45d70a55"
    Evaluation triggered by job "autoscaler"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "45d70a55" finished with status "complete" but failed to place all allocations:
    Task Group "autoscaler" (failed to place 1 allocation):
      * Class "cache": 1 nodes excluded by filter
      * Class "cloud-cpu-worker": 1 nodes excluded by filter
      * Class "stream-processing": 1 nodes excluded by filter
      * Class "cloud-cache": 1 nodes excluded by filter
      * Constraint "${node.class} regexp (cloud-cpu-worker|object-detection)": 3 nodes excluded by filter
      * Constraint "distinct_property: ${node.datacenter}=euc1 used by 1 allocs": 1 nodes excluded by filter
Cleaning up file based variables
ERROR: Job failed: exit code 1
```
Here is the JSON blob with a bunch more details:

```json
  "ID": "45d70a55-e5b2-5e77-2f95-4c4cd143111e",
  "JobID": "autoscaler",
  "JobModifyIndex": 952215,
  "ModifyIndex": 962495,
  "ModifyTime": 1644256453590096000,
  "Namespace": "default",
  "NextEval": "",
  "NodeID": "",
  "NodeModifyIndex": 0,
  "PreviousEval": "",
  "Priority": 50,
  "QueuedAllocations": {
    "autoscaler": 0
  },
  "QuotaLimitReached": "",
  "SnapshotIndex": 962494,
  "Status": "complete",
  "StatusDescription": "",
  "TriggeredBy": "job-register",
  "Type": "system",
  "Wait": 0,
  "WaitUntil": null
}
```
```
nomad eval status 45d70a55
ID                 = 45d70a55
Create Time        = 23m59s ago
Modify Time        = 23m59s ago
Status             = complete
Status Description = complete
Type               = system
TriggeredBy        = job-register
Job ID             = autoscaler
Priority           = 50
Placement Failures = true

Failed Placements
Task Group "autoscaler" (failed to place 1 allocation):
  * Class "cache": 1 nodes excluded by filter
  * Class "cloud-cpu-worker": 1 nodes excluded by filter
  * Class "stream-processing": 1 nodes excluded by filter
  * Class "cloud-cache": 1 nodes excluded by filter
  * Constraint "${node.class} regexp (cloud-cpu-worker|object-detection)": 3 nodes excluded by filter
  * Constraint "distinct_property: ${node.datacenter}=euc1 used by 1 allocs": 1 nodes excluded by filter
```
Answering your questions:

1) I now have 2 allocs of the autoscaler job, with version 0 and eval ID 5c0c0297-2306-7642-fa5f-8378e7872cd1. Here is the JSON blob:
```json
{
  "AnnotatePlan": false,
  "BlockedEval": "",
  "ClassEligibility": null,
  "CreateIndex": 952215,
  "CreateTime": 1644242162351499800,
  "DeploymentID": "",
  "EscapedComputedClass": false,
  "FailedTGAllocs": null,
  "ID": "5c0c0297-2306-7642-fa5f-8378e7872cd1",
  "JobID": "autoscaler",
  "JobModifyIndex": 952215,
  "ModifyIndex": 952217,
  "ModifyTime": 1644242162370615000,
  "Namespace": "default",
  "NextEval": "",
  "NodeID": "",
  "NodeModifyIndex": 0,
  "PreviousEval": "",
  "Priority": 50,
  "QueuedAllocations": {
    "autoscaler": 0
  },
  "QuotaLimitReached": "",
  "SnapshotIndex": 952215,
  "Status": "complete",
  "StatusDescription": "",
  "TriggeredBy": "job-register",
  "Type": "system",
  "Wait": 0,
  "WaitUntil": null
}
```
An example of the alloc status of one of the allocs of the autoscaler job:
```
nomad alloc status -verbose 8a761570
ID                  = 8a761570-0895-cfb5-f44c-88c35702eff5
Eval ID             = 5c0c0297-2306-7642-fa5f-8378e7872cd1
Name                = autoscaler.autoscaler[0]
Node ID             = 58b148c7-4c48-5d62-211e-04e301a2e762
Node Name           = ip
Job ID              = autoscaler
Job Version         = 0
Client Status       = running
Client Description  = Tasks are running
Desired Status      = run
Desired Description = <none>
Created             = 2022-02-07T15:56:02+02:00
Modified            = 2022-02-07T15:56:04+02:00
Evaluated Nodes     = 1
Filtered Nodes      = 0
Exhausted Nodes     = 0
Allocation Time     = 63.591µs
Failures            = 0

Allocation Addresses (mode = "bridge")
Label     Dynamic  Address
*http     yes
*sidecar  yes
*stats    yes

Task "connect-proxy-autoscaler" (prestart sidecar) is "running"
Task Resources
CPU        Memory          Disk     Addresses
2/250 MHz  15 MiB/128 MiB  300 MiB

Task Events:
Started At     = 2022-02-07T13:56:03Z
Finished At    = N/A
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                       Type        Description
2022-02-07T15:56:03+02:00  Started     Task started by client
2022-02-07T15:56:03+02:00  Task Setup  Building Task Directory
2022-02-07T15:56:02+02:00  Received    Task received by client

Task "autoscaler" is "running"
Task Resources
CPU        Memory           Disk     Addresses
0/100 MHz  7.6 MiB/256 MiB  300 MiB

Task Events:
Started At     = 2022-02-07T13:56:04Z
Finished At    = N/A
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                       Type        Description
2022-02-07T15:56:04+02:00  Started     Task started by client
2022-02-07T15:56:03+02:00  Task Setup  Building Task Directory
2022-02-07T15:56:02+02:00  Received    Task received by client

Placement Metrics
Node                                  binpack  final score
58b148c7-4c48-5d62-211e-04e301a2e762  0.915    0.915
```
2) Yes, it's a system job.
P.S. I appreciate you editing the question, as the formatting didn't work for me initially.
Hi @empikls! I was able to reproduce this and it looks like it has something to do with adding and removing constraints. Suppose I have a jobspec like the following:
When I run this it works with no problem:
```
$ nomad job status example
...
Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created  Modified
043f7c0a  64862576  web         0        run      running  11s ago  11s ago
56fe29d8  689f56f1  web         0        run      running  11s ago  11s ago
```
But if I add this constraint, which all nodes should accept because my whole cluster is in this node class:
```hcl
constraint {
  attribute = "${node.class}"
  operator  = "="
  value     = "vagrant"
}
```
Then I get an error very similar to what you saw:
```
$ nomad job run ./jobs/system-per-dc.nomad
==> 2022-02-08T16:17:44-05:00: Monitoring evaluation "7d4d950e"
    2022-02-08T16:17:44-05:00: Evaluation triggered by job "example"
==> 2022-02-08T16:17:45-05:00: Monitoring evaluation "7d4d950e"
    2022-02-08T16:17:45-05:00: Allocation "56fe29d8" modified: node "689f56f1", group "web"
    2022-02-08T16:17:45-05:00: Allocation "043f7c0a" modified: node "64862576", group "web"
    2022-02-08T16:17:45-05:00: Evaluation status changed: "pending" -> "complete"
==> 2022-02-08T16:17:45-05:00: Evaluation "7d4d950e" finished with status "complete" but failed to place all allocations:
    2022-02-08T16:17:45-05:00: Task Group "web" (failed to place 1 allocation):
      * Class "vagrant": 1 nodes excluded by filter
      * Constraint "distinct_property: ${node.datacenter}=dc1 used by 2 allocs": 1 nodes excluded by filter
```
What's even stranger is that if I then remove the constraint again, I get the same error message! That persists until I change the resources or something else in the job.
This also happens in reverse. If I purge the job and start over with the constraint, it works fine. But then if I remove the constraint it fails in the same way.
Something to note for a workaround: if you only want 1 per DC, you could also run this as a `service` job. If I run the job with `count` == the number of DCs, the problem goes away. I can add and remove the node class constraint and it seems to work just fine every time.
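A minimal sketch of that workaround (a hypothetical jobspec with assumed names; task details omitted, adapt to your own job):

```hcl
job "autoscaler" {
  datacenters = ["euc1", "fsn1"]
  type        = "service"

  # one instance per datacenter, enforced by the constraint
  constraint {
    operator  = "distinct_property"
    attribute = "${node.datacenter}"
    value     = "1"
  }

  group "autoscaler" {
    count = 2 # == number of datacenters

    # ... tasks as in the original system job ...
  }
}
```

The service scheduler only tries to place `count` allocations, so there is no per-node expectation left over to trip the `distinct_property` filter.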
I've run into a similar issue on `spread` recently, so it may turn out this is related. I'll mark this issue for further investigation. But in the meanwhile, would the workaround of using a `service` job with `count = 2` work for you?
Hi @tgross! I did some research and found that when I changed the job type from `system` to `service`, there were no issues with the deployment, but when I switched back, the problem appeared again. I would like to leave everything as it is, and the option with `count = 2` is not suitable for me, since there are cases when the constraint cannot be satisfied, and the GitLab job must still continue to deploy the next job in the list. Example:
```hcl
variable "resources" {
  type = map(map(number))
  default = {
    kafka = {
      memory = 2048
      cpu    = 2000
    }
  }
}

job "kafka" {
  datacenters = ["euc1", "fsn1"]
  type        = "system"

  constraint {
    attribute = "${node.class}"
    operator  = "regexp"
    value     = "(cloud-)?kafka"
  }

  group "kafka" {
    network {
      mode = "host"
      port "public" {
        to           = 9093
        static       = 9093
        host_network = "public"
      }
      port "private" {
        to           = 9094
        static       = 9094
        host_network = "private"
      }
      port "exporter" {
        to           = 8080
        static       = 8080
        host_network = "private"
      }
    }

    constraint {
      operator = "distinct_hosts"
      value    = "true"
    }

    ...........
```
A node class like `stage-fsn1-kafka-2` can only exist in the prod and stage environments.
```
ID              = 8523c7ce-86d8-6756-714e-482593b29ef6
Name            = stage-fsn1-kafka-0
Class           = kafka
DC              = fsn1
Drain           = false
Eligibility     = eligible
Status          = ready
CSI Controllers = <none>
CSI Drivers     = <none>
Uptime          = 2139h14m55s
Host Volumes    = consul_socket,kafka_data,zookeeper_data,zookeeper_datalog
Host Networks   = <none>
CSI Volumes     = <none>
Driver Status   = docker,exec

Node Events
Time                  Subsystem  Message
2022-02-04T13:05:47Z  Cluster    Node reregistered by heartbeat
2022-02-04T13:05:20Z  Cluster    Node heartbeat missed
2021-11-22T15:01:28Z  Cluster    Node reregistered by heartbeat
2021-11-22T15:00:44Z  Cluster    Node heartbeat missed
2021-11-17T18:34:11Z  Cluster    Node reregistered by heartbeat
2021-11-17T18:32:15Z  Cluster    Node heartbeat missed
2021-11-17T18:31:27Z  Cluster    Node reregistered by heartbeat
2021-11-17T18:31:27Z  Cluster    Node heartbeat missed
2021-11-12T10:16:23Z  Cluster    Node re-registered
2021-11-12T10:16:17Z  Cluster    Node heartbeat missed

Allocated Resources
CPU            Memory           Disk
3250/4588 MHz  3.6 GiB/3.6 GiB  900 MiB/32 GiB

Allocation Resource Utilization
CPU          Memory
44/4588 MHz  1.0 GiB/3.6 GiB

Host Resource Utilization
CPU           Memory           Disk
187/4588 MHz  1.4 GiB/3.6 GiB  3.7 GiB/38 GiB

Allocations
ID        Node ID   Task Group       Version  Desired  Status   Created    Modified
3998f938  8523c7ce  zookeeper        3        run      running  5d25m ago  5d25m ago
1de0689f  8523c7ce  kafka            1        run      running  5d25m ago  1d6h ago
1baf6218  8523c7ce  consul_exporter  0        run      running  5d25m ago  5d25m ago
```
And when I try to deploy the jobs through GitLab, I get an error like this (it doesn't matter whether it's a `system` or `service` job type; the error may differ slightly):
```
==> Evaluation "c3b5c9b7" finished with status "complete" but failed to place all allocations:
    Task Group "kafka" (failed to place 1 allocation):
      * Class "cache": 1 nodes excluded by filter
      * Class "stream-processing": 1 nodes excluded by filter
      * Class "object-detection": 1 nodes excluded by filter
      * Class "cloud-cache": 1 nodes excluded by filter
      * Class "cloud-cpu-worker": 2 nodes excluded by filter
      * Constraint "${node.class} regexp (cloud-)?kafka": 6 nodes excluded by filter
```
So I can't complete the GitLab job of deploying all the Nomad jobs (autoscaler, kafka, redis, etc.) that follow the failed one. Aside: when I was using Nomad version 1.0.1, I saw something like this in the qa-1 env:
```
ID            = kafka
Name          = kafka
Submit Date   = 2021-11-12T13:15:40+02:00
Type          = system
Priority      = 50
Datacenters   = euc1,fsn1
Namespace     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
kafka       0       0         0       0       0         0

Allocations
No allocations placed
```
and something like this in the stage env:
```
ID            = kafka
Name          = kafka
Submit Date   = 2021-11-18T13:36:18+02:00
Type          = system
Priority      = 50
Datacenters   = euc1,fsn1
Namespace     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
kafka       0       0         3        1       5         0

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created    Modified
1de0689f  8523c7ce  kafka       1        run      running  5d16m ago  1d6h ago
2d5f1fc1  4a044431  kafka       1        run      running  28d3h ago  1d6h ago
cb308c2b  f91f9b69  kafka       1        run      running  2mo8d ago  1d6h ago
```
I believe we are running into the same problem. We have system jobs with extra constraints, and when we run the job from our pipeline we get an error if there are no changes in the job file. Of course, we should not need to submit the exact same job twice, but I also think that Nomad should not give errors if we do. I have copied the jobspec that @tgross created on Feb 8 (here: https://github.com/hashicorp/nomad/issues/12016#issuecomment-1033082994) and saved it to a file called busybox.job. I needed to edit the datacenters, as we don't have a dc1 or dc2, and then sent this twice to our Nomad cluster. You can see the output below. The first time everything is OK; the second time we get an error about constraints and the nomad run command exits with error code 2:
```
nomad run busybox.job
==> 2022-02-24T16:31:56+01:00: Monitoring evaluation "4e579128"
    2022-02-24T16:31:56+01:00: Evaluation triggered by job "example"
==> 2022-02-24T16:31:57+01:00: Monitoring evaluation "4e579128"
    2022-02-24T16:31:57+01:00: Allocation "4487a627" created: node "44ef7387", group "web"
    2022-02-24T16:31:57+01:00: Evaluation status changed: "pending" -> "complete"
==> 2022-02-24T16:31:57+01:00: Evaluation "4e579128" finished with status "complete"

nomad run busybox.job
==> 2022-02-24T16:32:18+01:00: Monitoring evaluation "6c224097"
    2022-02-24T16:32:18+01:00: Evaluation triggered by job "example"
    2022-02-24T16:32:18+01:00: Evaluation status changed: "pending" -> "complete"
==> 2022-02-24T16:32:18+01:00: Evaluation "6c224097" finished with status "complete" but failed to place all allocations:
    2022-02-24T16:32:18+01:00: Task Group "web" (failed to place 1 allocation):
      * Constraint "distinct_property: ${node.datacenter}=dcf used by 1 allocs": 24 nodes excluded by filter
      * Constraint "missing drivers": 10 nodes excluded by filter

echo $?
2
```
This is the output of `nomad eval status` for the second run:
```
nomad eval status 6c224097
ID                 = 6c224097
Create Time        = 22s ago
Modify Time        = 22s ago
Status             = complete
Status Description = complete
Type               = system
TriggeredBy        = job-register
Job ID             = example
Priority           = 50
Placement Failures = true

Failed Placements
Task Group "web" (failed to place 1 allocation):
  * Constraint "distinct_property: ${node.datacenter}=dcf used by 1 allocs": 24 nodes excluded by filter
  * Constraint "missing drivers": 10 nodes excluded by filter
```
And this is part of the output from `nomad eval list -json`:
```
nomad eval list -json
  {
    "AnnotatePlan": false,
    "BlockedEval": "",
    "ClassEligibility": null,
    "CreateIndex": 3247051,
    "CreateTime": 1645716738898975777,
    "DeploymentID": "",
    "EscapedComputedClass": false,
    "FailedTGAllocs": {
      "web": {
        "AllocationTime": 240496,
        "ClassExhausted": null,
        "ClassFiltered": {},
        "CoalescedFailures": 0,
        "ConstraintFiltered": {
          "distinct_property: ${node.datacenter}=dcf used by 1 allocs": 24,
          "missing drivers": 10
        },
        "DimensionExhausted": null,
        "NodesAvailable": null,
        "NodesEvaluated": 34,
        "NodesExhausted": 0,
        "NodesFiltered": 34,
        "QuotaExhausted": null,
        "ResourcesExhausted": null,
        "ScoreMetaData": null,
        "Scores": null
      }
    },
    "ID": "6c224097-6ddb-1055-6767-ea0432934c8b",
    "JobID": "example",
    "JobModifyIndex": 3247045,
    "ModifyIndex": 3247052,
    "ModifyTime": 1645716738903368401,
    "Namespace": "default",
    "NextEval": "",
    "NodeID": "",
    "NodeModifyIndex": 0,
    "PreviousEval": "",
    "Priority": 50,
    "QueuedAllocations": {
      "web": 0
    },
    "QuotaLimitReached": "",
    "SnapshotIndex": 3247051,
    "Status": "complete",
    "StatusDescription": "",
    "TriggeredBy": "job-register",
    "Type": "system",
    "Wait": 0,
    "WaitUntil": null
  },
  {
    "AnnotatePlan": false,
    "BlockedEval": "",
    "ClassEligibility": null,
    "CreateIndex": 3247045,
    "CreateTime": 1645716716359085223,
    "DeploymentID": "",
    "EscapedComputedClass": false,
    "FailedTGAllocs": null,
    "ID": "4e579128-c83b-773e-6888-d45ffd7b4e09",
    "JobID": "example",
    "JobModifyIndex": 3247045,
    "ModifyIndex": 3247047,
    "ModifyTime": 1645716716712492756,
    "Namespace": "default",
    "NextEval": "",
    "NodeID": "",
    "NodeModifyIndex": 0,
    "PreviousEval": "",
    "Priority": 50,
    "QueuedAllocations": {
      "web": 0
    },
    "QuotaLimitReached": "",
    "SnapshotIndex": 3247045,
    "Status": "complete",
    "StatusDescription": "",
    "TriggeredBy": "job-register",
    "Type": "system",
    "Wait": 0,
    "WaitUntil": null
  }
```
I have tried to set up a small testing environment to reproduce this error, but I'm not yet able to get this message. If I send the same job twice to this Nomad testing cluster, Nomad just accepts the job and does nothing, which is to be expected since it's the same job. I'm not sure what the difference could be between this testing cluster and our other cluster that gives the error. We are running Nomad 1.2.4 on our servers and clients.
I don't know what more information to add, but the problem is still there. If there are no problems in a small test environment, that doesn't mean there are no problems. I believe changes introduced in version 1.2.4 (or an earlier version) are the reason, because I had this pipeline running stably on version 1.1.0.
> I've run into a similar issue on `spread` recently, so it may turn out this is related. I'll mark this issue for further investigation. But in the meanwhile, would the workaround of using a `service` job with `count = 2` work for you?
Consider the recommended use case for Ceph CSI nodes as a system job.

I appreciate the strictness of the system job enforcing placement on every node, but is it intentional that it's so heavy-handed that it ignores our explicit `constraint`, which is allowed on system job specs? If so, is there a use case at all for allowing the `constraint` stanza on system jobs? Anything other than `1==1` becomes useless, because the system job is so strict that anything making the allocation count < node count throws the same confusing message:
```
    2022-03-18T08:36:19-04:00: Evaluation status changed: "pending" -> "complete"
==> 2022-03-18T08:36:19-04:00: Evaluation "b9f5b83a" finished with status "complete" but failed to place all allocations:
    2022-03-18T08:36:19-04:00: Task Group "cephrbd" (failed to place 1 allocation):
      * Constraint "${attr.unique.hostname} != hc4-A": 1 nodes excluded by filter
```
When looking within the UI it becomes more clear; however, the CLI calls it lost:
```
Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
cephrbd     0       0         6        18      23        1
```
While I prefer the CLI over the UI, the former leaves me confused; we all seem just as Lost with these "run something everywhere except here" type jobs. One thought I had: I don't really care if it's service or system, but if I could use service with `count = -1`, where `-1` means all nodes, then I could keep my constraints; I'd have to add distinct hosts to achieve the "spread" effect to every node along with my desired "except this host" use case.
Noting that https://github.com/hashicorp/nomad/issues/12366 seems to be related, or at least may provide a clue that we're mutating the job definition unexpectedly.
Also noting that https://github.com/hashicorp/nomad/issues/12748 seems like it could be related.
After revisiting this issue, we were able to pinpoint it:

Description: When a system job is given a `distinct_property` constraint that excludes any of the nodes, every time the job is updated, the update will run correctly but it will return:
```
    2023-04-28T12:12:17+02:00: Evaluation status changed: "pending" -> "complete"
==> 2023-04-28T12:12:17+02:00: Evaluation "ee32c81f" finished with status "complete" but failed to place all allocations:
    2023-04-28T12:12:17+02:00: Task Group "cache" (failed to place 1 allocation):
      * Constraint "distinct_property: ${node.datacenter}=dc1 used by 2 allocs": 1 nodes excluded by filter
```
Giving the false impression the update didn’t happen.
Root Cause: Because this is a system job, every task group is required to be running on each node. When checking for the differences between what is running and what should be running, the scheduler iterates over all ready nodes, both with and without running allocations, and for each it will:

- Update any running allocation
- Verify there is at least one running allocation for each task group defined in the job

Since some of the nodes are filtered and don't have running allocations for that job in particular, they will all be marked as missing allocations. Then, when computing the placement of the missing allocations, the iterator will apply the filters, including the `distinct_property` one, and won't find any good node to place the "missing" allocation, throwing the error present in the output.
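The mechanism described above can be sketched as a toy simulation (hypothetical names and a deliberately simplified model, not Nomad's actual code): the system scheduler expects one allocation per ready node, so constraint-filtered nodes show up as "missing" and then fail placement, even though the job is fully deployed.

```python
# Toy model of the root cause described above; hypothetical names,
# deliberately simplified -- not Nomad's actual implementation.

def diff_system_job(ready_nodes, nodes_with_allocs):
    """A system job expects one alloc per ready node, so every node
    without a running alloc is reported as 'missing'."""
    return [n for n in ready_nodes if n not in nodes_with_allocs]

def place(missing, feasible):
    """Placing the 'missing' allocs applies the constraint filter;
    rejected nodes surface as placement failures."""
    placed = [n for n in missing if feasible(n)]
    failures = [n for n in missing if not feasible(n)]
    return placed, failures

# Two ready nodes in one DC; the distinct_property constraint is
# already satisfied by the alloc on node-a, so node-b is infeasible.
ready = ["node-a", "node-b"]
has_alloc = {"node-a"}
feasible = lambda n: n == "node-a"  # distinct_property filter

missing = diff_system_job(ready, has_alloc)  # -> ["node-b"]
placed, failures = place(missing, feasible)
print(failures)  # -> ['node-b']: reported as a failed placement
                 # even though nothing actually needed to change
```

In this model, the first possible solution listed below amounts to running `missing` through the same feasibility check before reporting, so never-eligible nodes don't surface as failures.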
Possible solutions:

- After the diff is done and all nodes are iterated over, run a new verification to exclude the "missing" allocations that would break the constraints, here.
- Filter out the excluded nodes before iterating and finding the differences, here.
- Check whether the constraint that caused the filtered-out nodes is `distinct_property` and ignore the error, though I'm not sure what side effects that could have.
Nomad version 1.5.1
Nomad job example
```hcl
job "example" {
  type = "system"

  constraint {
    operator  = "distinct_property"
    attribute = "${node.datacenter}"
    value     = "1"
  }

  group "cache" {
    count = 1

    network {
      port "db" {
        to = 6379
      }
    }

    service {
      name     = "cache"
      port     = "db"
      provider = "nomad"
    }

    restart {
      attempts = 2
      interval = "30m"
      delay    = "15s"
      mode     = "fail"
    }

    ephemeral_disk {
      size = 300
    }

    task "redis" {
      driver = "docker"

      config {
        image          = "redis:7"
        ports          = ["db"]
        auth_soft_fail = true
      }

      identity {
        env  = true
        file = true
      }
    }
  }
}
```
Running on a cluster with 3 servers and 2 clients.
Nomad version
1.2.5
Nomad job example
.gitlab-ci.yml
Issue
When trying to deploy a job via GitLab, the GitLab job fails when the allocation is already placed or when no node satisfies the constraints:

```
nomad-qa-2 nomad job status autoscaler
```

Output from gitlab: