tgross opened this issue 3 years ago
@tgross - to add to this: I ran some tests on Nomad 1.0.1 and saw the following.
We have a Zookeeper deployment that consists of the following (in the same group):
- 1x prestart task, NO sidecar (64 CPU)
- 1x task (64 CPU)
- 1x task (100 CPU)
- 1x task (512 CPU)
Nomad UI reports reserved CPU: 740
So line 2 and line 4 both seem to be incorrect in the "Allocated" behavior, as I would expect to see 676 (64 + 100 + 512, with the finished prestart task's 64 CPU not counted) rather than the reported 740 (64 + 64 + 100 + 512). Here's a screenshot of the allocation, and also the topology visualization.
(screenshots: Allocation Summary, Topology Visualization, Group Summary)
Thanks for those details @idrennanvmware. I did some further digging, and although it looks like the scheduler is doing what I'd expect according to the design, the reported resource allocation is incorrect.
I ran the current HEAD on a machine that Nomad fingerprints with 1.9GiB of memory:
$ nomad node status -self
...
Allocated Resources
CPU Memory Disk
0/18424 MHz 0 B/1.9 GiB 0 B/51 GiB
Allocation Resource Utilization
CPU Memory
0/18424 MHz 0 B/1.9 GiB
Host Resource Utilization
CPU Memory Disk
4606/18424 MHz 231 MiB/1.9 GiB (/dev/mapper/vagrant--vg-root)
Allocations
No allocations placed
The web UI agrees with what we see here, as we'd expect:
I have a job with a prestart task and a redis task, each of which takes 1000MB of memory. If I run the prestart task with sidecar = true, the scheduler correctly tells me I've exceeded the available resources:
==> Evaluation "ca5b1ddb" finished with status "complete" but failed to place all allocations:
Task Group "group" (failed to place 1 allocation):
* Resources exhausted on 1 nodes
* Dimension "memory" exhausted on 1 nodes
Evaluation "5a589e1f" waiting for additional capacity to place remainder
So instead I run the prestart task with sidecar = false. Here's that jobspec:
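(The jobspec itself didn't survive the copy here; below is a minimal sketch consistent with the allocation output further down. The busybox image, the sleep command, and the datacenter name are assumptions.)

job "example" {
  datacenters = ["dc1"]

  group "group" {

    # Non-sidecar prestart task: runs to completion before "redis" starts.
    task "init" {
      lifecycle {
        hook    = "prestart"
        sidecar = false
      }

      driver = "docker"
      config {
        image   = "busybox:1" # assumed image
        command = "sleep"
        args    = ["5"]
      }

      resources {
        cpu    = 500  # MHz
        memory = 1000 # MB
      }
    }

    # Main task.
    task "redis" {
      driver = "docker"
      config {
        image = "redis:6" # assumed image
      }

      resources {
        cpu    = 500  # MHz
        memory = 1000 # MB
      }
    }
  }
}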
I run it:
$ nomad job run ./example.nomad
==> Monitoring evaluation "7790e985"
Evaluation triggered by job "example"
==> Monitoring evaluation "7790e985"
Evaluation within deployment: "82f7c4cd"
Allocation "8c495e9f" created: node "691ad76c", group "group"
Evaluation status changed: "pending" -> "complete"
==> Evaluation "7790e985" finished with status "complete"
Wait a few seconds for the prestart task to finish and the main task to start, then check the allocation status:
$ nomad alloc status 8c4
ID = 8c495e9f-94aa-c964-b574-374addfe6d49
...
Task "init" (prestart) is "dead"
Task Resources
CPU Memory Disk Addresses
0/500 MHz 32 KiB/1000 MiB 300 MiB
...
Task "redis" is "running"
Task Resources
CPU Memory Disk Addresses
0/500 MHz 988 KiB/1000 MiB 300 MiB
Now if we look at the node status, we see we're reporting more memory allocated than is available on the host!
$ nomad node status -self
...
Allocated Resources
CPU Memory Disk
1000/18424 MHz 2.0 GiB/1.9 GiB 300 MiB/51 GiB
Allocation Resource Utilization
CPU Memory
2/18424 MHz 1016 KiB/1.9 GiB
Host Resource Utilization
CPU Memory Disk
49/18424 MHz 268 MiB/1.9 GiB (/dev/mapper/vagrant--vg-root)
Allocations
ID Node ID Task Group Version Desired Status Created Modified
8c495e9f 691ad76c group 0 run running 1m9s ago 52s ago
If we look at the allocation in the web UI we see a similar (incorrect) value:
But it turns out the scheduler still seems to have the correct view of the world: if we run a job that takes 500MB of memory, I'd expect it to fit.
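(That second jobspec isn't shown either; a minimal sketch of what example2.nomad presumably looks like, with the 500 MHz CPU figure inferred from the node status below:)

job "example2" {
  datacenters = ["dc1"]

  group "group" {
    task "redis" {
      driver = "docker"
      config {
        image = "redis:6" # assumed image
      }

      resources {
        cpu    = 500 # MHz (inferred: the node's allocated CPU grows from 1000 to 1500 below)
        memory = 500 # MB
      }
    }
  }
}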
Run that job and the scheduler accepts it:
$ nomad job run ./example2.nomad
==> Monitoring evaluation "e71d30af"
Evaluation triggered by job "example2"
==> Monitoring evaluation "e71d30af"
Evaluation within deployment: "fbad2107"
Allocation "deb460ad" created: node "691ad76c", group "group"
Evaluation status changed: "pending" -> "complete"
==> Evaluation "e71d30af" finished with status "complete"
And now the node reports even more allocated memory, again greater than what's available on the host:
$ nomad node status -self
...
Allocated Resources
CPU Memory Disk
1500/18424 MHz 2.4 GiB/1.9 GiB 600 MiB/51 GiB
Allocation Resource Utilization
CPU Memory
2/18424 MHz 2.0 MiB/1.9 GiB
Host Resource Utilization
CPU Memory Disk
75/18424 MHz 290 MiB/1.9 GiB (/dev/mapper/vagrant--vg-root)
Allocations
ID Node ID Task Group Version Desired Status Created Modified
deb460ad 691ad76c group 0 run running 45s ago 34s ago
8c495e9f 691ad76c group 0 run running 3m42s ago 3m25s ago
So this is not just a documentation issue but a bug in the way resources are reported for tasks with a non-sidecar prestart. At this point I'm going to tag my colleagues @jazzyfresh and @notnoop, who worked on this feature, to see if they have any insights.
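(To spell out the arithmetic of the repro above, under the design as I understand it the scheduler reserves the lifecycle phases separately, while the reporting sums them:)

expected (design): max(1000 MB prestart, 1000 MB main) = 1000 MB reserved, so 1000 + 500 = 1500 MB fits on the 1.9 GiB node
reported (bug):    1000 MB + 1000 MB = 2000 MB = 2.0 GiB, which overshoots the host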
"Now if we look at the node status, we see we're reporting more memory allocated than is available on the host!"
I think this explains a strange topology total count (resources) we were seeing but hadn't tracked down!
Thanks for the really detailed breakdown.
Following up on this, are there any new developments here?
I'll echo the thanks for the breakdown above; it has been really instructive in understanding how Nomad is mis-reporting actual usage. I was expecting we'd have to rework all our prestart tasks to run inside the main task's container, before the actual process starts, just to save on the RAM. (In our case the UI shows double the RAM allocation I expect us to use!)
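(For anyone weighing that workaround, a hedged sketch of folding the init step into the main task so only one task's memory is reserved; the image, script path, and command here are all hypothetical:)

task "app" {
  driver = "docker"
  config {
    image   = "myapp:latest" # hypothetical image
    command = "/bin/sh"
    # Run the former prestart step, then exec the real process.
    args    = ["-c", "/local/init.sh && exec /usr/local/bin/app"]
  }

  resources {
    cpu    = 500  # MHz
    memory = 1000 # MB, reserved once rather than per-task
  }
}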
We're seeing the same overprovisioning showing in the UI for memory: nodes reporting 8.7GB of memory used out of 7.7GB available. We don't have memory overprovisioning enabled (yet), which was confusing me as to how we could get into that state. free -m on the node happily reports we're actually using ~4.2GB of memory, which lines up with prestart tasks each showing an extra 1GB allocated in the UI.
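(For context, "memory overprovisioning" here refers to Nomad's opt-in memory oversubscription, which on Nomad 1.1+ is toggled via the scheduler config, e.g.:)

$ nomad operator scheduler set-config -memory-oversubscription=true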
Following @tgross's steps above, I tried to reproduce the same thing.
Here is the node status before I ran any jobs; it has 25 GiB of free memory.
nomad node status -self
ID = cddb7076-acd3-4c5c-7d0f-8faa72dc427b
Name = NB305032
Class = <none>
DC = dc1
Drain = false
Eligibility = eligible
Status = ready
CSI Controllers = <none>
CSI Drivers = <none>
Uptime = 1h34m37s
Host Volumes = <none>
Host Networks = <none>
CSI Volumes = <none>
Driver Status = docker,exec,raw_exec
Node Events
Time Subsystem Message
2023-09-27T11:36:21+05:30 Cluster Node registered
Allocated Resources
CPU Memory Disk
0/39920 MHz 0 B/25 GiB 0 B/72 GiB
Allocation Resource Utilization
CPU Memory
0/39920 MHz 0 B/25 GiB
Host Resource Utilization
CPU Memory Disk
48/39920 MHz 357 MiB/25 GiB 166 GiB/251 GiB
Allocations
No allocations placed
Now, I ran the example job mentioned above, consisting of a prestart task with sidecar = false and 21 GB of memory, and a main task with 1 GB of memory.
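(Presumably this is the same jobspec as above with the init task's memory bumped, roughly:)

resources {
  cpu    = 500   # MHz
  memory = 21000 # MB, ~21 GB; rendered as "20 GiB" in the alloc status below
}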
nomad job run example.nomad.hcl
==> 2023-09-27T11:47:12+05:30: Monitoring evaluation "6e216ee6"
2023-09-27T11:47:12+05:30: Evaluation triggered by job "example"
2023-09-27T11:47:12+05:30: Allocation "0c2ac837" created: node "cddb7076", group "group"
2023-09-27T11:47:13+05:30: Evaluation within deployment: "989599ff"
2023-09-27T11:47:13+05:30: Evaluation status changed: "pending" -> "complete"
==> 2023-09-27T11:47:13+05:30: Evaluation "6e216ee6" finished with status "complete"
==> 2023-09-27T11:47:13+05:30: Monitoring deployment "989599ff"
✓ Deployment "989599ff" successful
2023-09-27T11:47:35+05:30
ID = 989599ff
Job ID = example
Job Version = 0
Status = successful
Description = Deployment completed successfully
Deployed
Task Group Desired Placed Healthy Unhealthy Progress Deadline
group 1 1 1 0 2023-09-27T11:57:34+05:30
It deployed successfully, and here is the allocation status:
nomad alloc status 0c2ac
ID = 0c2ac837-c833-f63e-ca6b-e4d24c33d765
Eval ID = 6e216ee6
Name = example.group[0]
Node ID = cddb7076
Node Name = NB305032
Job ID = example
Job Version = 0
Client Status = running
Client Description = Tasks are running
Desired Status = run
Desired Description = <none>
Created = 5m38s ago
Modified = 5m16s ago
Deployment ID = 989599ff
Deployment Health = healthy
Task "init" (prestart) is "dead"
Task Resources:
CPU Memory Disk Addresses
0/500 MHz 32 KiB/20 GiB 300 MiB
Task Events:
Started At = 2023-09-27T06:17:18Z
Finished At = 2023-09-27T06:17:24Z
Total Restarts = 0
Last Restart = N/A
Recent Events:
Time Type Description
2023-09-27T11:47:24+05:30 Terminated Exit Code: 0
2023-09-27T11:47:18+05:30 Started Task started by client
2023-09-27T11:47:12+05:30 Driver Downloading image
2023-09-27T11:47:12+05:30 Task Setup Building Task Directory
2023-09-27T11:47:12+05:30 Received Task received by client
Task "redis" is "running"
Task Resources:
CPU Memory Disk Addresses
1/500 MHz 6.2 MiB/1000 MiB 300 MiB
Task Events:
Started At = 2023-09-27T06:17:24Z
Finished At = N/A
Total Restarts = 0
Last Restart = N/A
Recent Events:
Time Type Description
2023-09-27T11:47:24+05:30 Started Task started by client
2023-09-27T11:47:24+05:30 Task Setup Building Task Directory
2023-09-27T11:47:12+05:30 Received Task received by client
I checked the node status after the prestart task was dead.
nomad node status -self
ID = cddb7076-acd3-4c5c-7d0f-8faa72dc427b
Name = NB305032
Class = <none>
DC = dc1
Drain = false
Eligibility = eligible
Status = ready
CSI Controllers = <none>
CSI Drivers = <none>
Uptime = 2h1m54s
Host Volumes = <none>
Host Networks = <none>
CSI Volumes = <none>
Driver Status = docker,exec,raw_exec
Node Events
Time Subsystem Message
2023-09-27T11:36:21+05:30 Cluster Node registered
Allocated Resources
CPU Memory Disk
1000/39920 MHz 22 GiB/25 GiB 300 MiB/72 GiB
Allocation Resource Utilization
CPU Memory
3/39920 MHz 8.4 MiB/25 GiB
Host Resource Utilization
CPU Memory Disk
124/39920 MHz 462 MiB/25 GiB 166 GiB/251 GiB
Allocations
ID Node ID Task Group Version Desired Status Created Modified
0c2ac837 cddb7076 group 0 run running 18m28s ago 18m6s ago
It's showing 22 GiB of allocated memory out of 25 GiB. Since the prestart task was allocated 21 GB of memory and has finished, let's presume that 21 GB is released; about 24 GB of memory should then be free according to the scheduler.
So I tried to run the example2 job mentioned above with 5 GB of memory, but it failed to place due to memory exhaustion.
nomad job run example2.nomad.hcl
==> 2023-09-27T12:13:07+05:30: Monitoring evaluation "c4c39628"
2023-09-27T12:13:07+05:30: Evaluation triggered by job "example2"
2023-09-27T12:13:08+05:30: Evaluation within deployment: "4060348e"
2023-09-27T12:13:08+05:30: Evaluation status changed: "pending" -> "complete"
==> 2023-09-27T12:13:08+05:30: Evaluation "c4c39628" finished with status "complete" but failed to place all allocations:
2023-09-27T12:13:08+05:30: Task Group "group" (failed to place 1 allocation):
* Resources exhausted on 1 nodes
* Dimension "memory" exhausted on 1 nodes
2023-09-27T12:13:08+05:30: Evaluation "6fe7e6c4" waiting for additional capacity to place remainder
==> 2023-09-27T12:13:08+05:30: Monitoring deployment "4060348e"
⠋ Deployment "4060348e" in progress...
2023-09-27T12:13:08+05:30
ID = 4060348e
Job ID = example2
Job Version = 0
Status = running
Description = Deployment is running
Deployed
Task Group Desired Placed Healthy Unhealthy Progress Deadline
group 1 0 0 0 N/A
I am not able to understand why the scheduler couldn't place the 5 GB allocation.
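(One plausible reading of this result is that the scheduler here is working from the summed figure rather than the per-phase maximum:)

reported allocated: 21 GB (dead prestart) + 1 GB (redis) ≈ 22 GiB of 25 GiB
headroom:           25 - 22 ≈ 3 GiB, which is less than the 5 GB requested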
Over on Discuss we noted that we're missing documentation around how the scheduler allocates resources when there are prestart tasks in play: we should have documentation in the lifecycle docs about how resources are scheduled, probably cross-linked with the resources docs.