hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

resource allocation for non-sidecar prestart tasks #9725

Open tgross opened 3 years ago

tgross commented 3 years ago

Over on Discuss we noted that we're missing documentation around how the scheduler allocates resources when there are prestart tasks in play:

The scheduler should be “doing the right thing” inasmuch as it should be allocating the minimum amount of resources required for the entire allocation, taking into account what tasks are running concurrently due to lifecycle. So for an example using RAM resources:

| Prestart Task | Main Task | Allocated |
|---|---|---|
| 100MB (sidecar) | 200MB | 300MB |
| 100MB (no sidecar) | 200MB | 200MB |
| 200MB (sidecar) | 100MB | 300MB |
| 200MB (no sidecar) | 100MB | 200MB |

It looks like the last line in that table is the unfortunate case you’re running into?

I suspect that when we were designing this, there was an assumption that in the common case the main task would require more resources. And it seems that we're accounting for the entire allocation restarting, even though the only way that typically happens is if a user runs nomad alloc restart – the restart block of the jobspec controls the restarting of tasks, not the whole alloc.
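
For reference, the restart behavior described here is configured per group (or per task) with the restart block; a minimal illustrative example, with placeholder values:

```hcl
group "group" {
  # Applies to the tasks in this group individually; there is no
  # jobspec-level setting that restarts the whole allocation.
  restart {
    attempts = 2
    interval = "30m"
    delay    = "15s"
    mode     = "fail"
  }
}
```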

We should have documentation in the lifecycle docs about how resources are scheduled, probably cross-linked with the resources docs.
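
For reference, the sidecar distinction in the table above is controlled by the lifecycle block on the prestart task. A minimal fragment illustrating row 2 of the table (the task name and sizes are just placeholders):

```hcl
task "init" {
  lifecycle {
    hook    = "prestart"
    sidecar = false # non-sidecar: the task exits before the main task starts
  }

  resources {
    memory = 100 # MB; with a 200MB main task, the table above expects 200MB allocated
  }
}
```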

idrennanvmware commented 3 years ago

@tgross - to add to this. I ran some tests on Nomad 1.0.1 and see the following

We have a Zookeeper Deployment that consists of the following (in the same group):

- 1x prestart task, NO sidecar (64 CPU)
- 1x task (64 CPU)
- 1x task (100 CPU)
- 1x task (512 CPU)
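
In jobspec terms the group is shaped roughly like this (task and group names are placeholders, drivers and config omitted; only the lifecycle block and CPU values are the real ones):

```hcl
group "zookeeper" {
  # prestart task, not a sidecar: it exits before the main tasks run
  task "init" {
    lifecycle {
      hook    = "prestart"
      sidecar = false
    }
    resources {
      cpu = 64
    }
  }

  task "task-a" {
    resources {
      cpu = 64
    }
  }

  task "task-b" {
    resources {
      cpu = 100
    }
  }

  task "task-c" {
    resources {
      cpu = 512
    }
  }
}
```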

Nomad UI reports reserved CPU: 740

So both line 2 and line 4 of that table seem to be incorrect in the "Allocated" behavior, as I would expect to see 676 (64 + 100 + 512, i.e. just the main tasks, with the non-sidecar prestart not adding to the total), but the UI shows 740. Here's a screenshot of the allocation, and also the topology visualization.

Allocation Summary (screenshot)

Topology Visualization (screenshot)

Group Summary (screenshot)
tgross commented 3 years ago

Thanks for those details @idrennanvmware. I did some further digging, and although the scheduler looks to be doing what I'd expect according to the design, the reported resource allocation is incorrect.

I ran the current HEAD on a machine that Nomad fingerprints with 1.9GiB of memory:

$ nomad node status -self
...
Allocated Resources
CPU          Memory       Disk
0/18424 MHz  0 B/1.9 GiB  0 B/51 GiB

Allocation Resource Utilization
CPU          Memory
0/18424 MHz  0 B/1.9 GiB

Host Resource Utilization
CPU             Memory           Disk
4606/18424 MHz  231 MiB/1.9 GiB  (/dev/mapper/vagrant--vg-root)

Allocations
No allocations placed

The web UI agrees with what we see here, as we'd expect:

(screenshot)

I have a job with a prestart task and a redis task, each of which takes 1000MB of memory. If I run the prestart task with sidecar = true, the scheduler correctly tells me I've exceeded the available resources:

==> Evaluation "ca5b1ddb" finished with status "complete" but failed to place all allocations:
    Task Group "group" (failed to place 1 allocation):
      * Resources exhausted on 1 nodes
      * Dimension "memory" exhausted on 1 nodes
    Evaluation "5a589e1f" waiting for additional capacity to place remainder

So instead I run the prestart task with sidecar = false. Here's that jobspec:

jobspec (no sidecar):

```hcl
job "example" {
  datacenters = ["dc1"]

  group "group" {

    task "init" {
      lifecycle {
        hook    = "prestart"
        sidecar = false
      }

      driver = "docker"

      config {
        image   = "busybox:1"
        command = "/bin/sh"
        args    = ["-c", "echo ok; sleep 5"]
      }

      resources {
        cpu    = 500
        memory = 1000
      }
    }

    task "redis" {
      driver = "docker"

      config {
        image = "redis:3.2"
      }

      resources {
        cpu    = 500
        memory = 1000
      }
    }
  }
}
```

I run it:

$ nomad job run ./example.nomad
==> Monitoring evaluation "7790e985"
    Evaluation triggered by job "example"
==> Monitoring evaluation "7790e985"
    Evaluation within deployment: "82f7c4cd"
    Allocation "8c495e9f" created: node "691ad76c", group "group"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "7790e985" finished with status "complete"

Wait a few seconds for the prestart task to finish and the main task to start, then check the allocation status:

$ nomad alloc status 8c4
ID                  = 8c495e9f-94aa-c964-b574-374addfe6d49
...

Task "init" (prestart) is "dead"
Task Resources
CPU        Memory           Disk     Addresses
0/500 MHz  32 KiB/1000 MiB  300 MiB

...

Task "redis" is "running"
Task Resources
CPU        Memory            Disk     Addresses
0/500 MHz  988 KiB/1000 MiB  300 MiB

Now if we look at the node status, we see we're reporting more memory allocated than is available on the host!

$ nomad node status -self
...

Allocated Resources
CPU             Memory           Disk
1000/18424 MHz  2.0 GiB/1.9 GiB  300 MiB/51 GiB

Allocation Resource Utilization
CPU          Memory
2/18424 MHz  1016 KiB/1.9 GiB

Host Resource Utilization
CPU           Memory           Disk
49/18424 MHz  268 MiB/1.9 GiB  (/dev/mapper/vagrant--vg-root)

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created   Modified
8c495e9f  691ad76c  group       0        run      running  1m9s ago  52s ago

If we look at the allocation in the web UI we see a similar (incorrect) value:

(screenshot)

But it turns out the scheduler itself still seems to have the view of the world we'd expect from the design. If we run a job that takes 500MB of memory, I'd expect it to be placed.

small jobspec:

```hcl
job "example2" {
  datacenters = ["dc1"]

  group "group" {

    task "redis" {
      driver = "docker"

      config {
        image = "redis:3.2"
      }

      resources {
        cpu    = 500
        memory = 500
      }
    }
  }
}
```

Run that job and the scheduler accepts it:

$ nomad job run ./example2.nomad
==> Monitoring evaluation "e71d30af"
    Evaluation triggered by job "example2"
==> Monitoring evaluation "e71d30af"
    Evaluation within deployment: "fbad2107"
    Allocation "deb460ad" created: node "691ad76c", group "group"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "e71d30af" finished with status "complete"

And now the node reports even more allocated memory, again greater than what's available on the host:

$ nomad node status -self
...

Allocated Resources
CPU             Memory           Disk
1500/18424 MHz  2.4 GiB/1.9 GiB  600 MiB/51 GiB

Allocation Resource Utilization
CPU          Memory
2/18424 MHz  2.0 MiB/1.9 GiB

Host Resource Utilization
CPU           Memory           Disk
75/18424 MHz  290 MiB/1.9 GiB  (/dev/mapper/vagrant--vg-root)

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created    Modified
deb460ad  691ad76c  group       0        run      running  45s ago    34s ago
8c495e9f  691ad76c  group       0        run      running  3m42s ago  3m25s ago

So this is not just a documentation issue but a bug in the way that the resources are being reported for tasks with a non-sidecar prestart. At this point I'm going to tag my colleagues @jazzyfresh and @notnoop who worked on this feature and see if they have any insights.

idrennanvmware commented 3 years ago

"Now if we look at the node status, we see we're reporting more memory allocated than is available on the host!"

I think this explains a strange total resource count we were seeing in the topology view but hadn't tracked down!

Thanks for the really detailed breakdown.

idrennanvmware commented 3 years ago

Following up on this, are there any new developments here?

caius commented 2 years ago

I'll echo the thanks for the breakdown above; this has been really instructive in understanding how Nomad is mis-reporting the actual usage. I was expecting we'd have to rework all our prestart tasks to run inside the main task's container, before the main process starts, just to save on the RAM. (In our case the UI shows double the RAM allocation I expect the job to use!)

We're seeing the same over-provisioning showing in the UI for memory, with nodes reporting 8.7GB of memory used out of 7.7GB available. We don't have memory overprovisioning enabled (yet), which was confusing me as to how we could get into that state. free -m on the node happily reports that we're really using ~4.2GB of memory, which lines up with the prestart tasks each showing an extra 1GB allocated in the UI.
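
For what it's worth, the rework I was imagining is folding the init step into the main task's container instead of using a separate prestart task, something roughly like this (the image, script and command are hypothetical placeholders):

```hcl
task "app" {
  driver = "docker"

  config {
    image   = "example/app:latest"
    command = "/bin/sh"
    # hypothetical: run the former prestart step, then exec the real entrypoint,
    # so only a single task's memory is counted against the allocation
    args    = ["-c", "/local/init.sh && exec /usr/local/bin/app"]
  }

  resources {
    cpu    = 500
    memory = 1000
  }
}
```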

rangamani54 commented 9 months ago

As @tgross mentioned, I tried to reproduce the same thing.

Here is the node status before I ran any jobs. The node has 25 GiB of free memory.

nomad node status -self
ID              = cddb7076-acd3-4c5c-7d0f-8faa72dc427b
Name            = NB305032
Class           = <none>
DC              = dc1
Drain           = false
Eligibility     = eligible
Status          = ready
CSI Controllers = <none>
CSI Drivers     = <none>
Uptime          = 1h34m37s
Host Volumes    = <none>
Host Networks   = <none>
CSI Volumes     = <none>
Driver Status   = docker,exec,raw_exec

Node Events
Time                       Subsystem  Message
2023-09-27T11:36:21+05:30  Cluster    Node registered

Allocated Resources
CPU          Memory      Disk
0/39920 MHz  0 B/25 GiB  0 B/72 GiB

Allocation Resource Utilization
CPU          Memory
0/39920 MHz  0 B/25 GiB

Host Resource Utilization
CPU           Memory          Disk
48/39920 MHz  357 MiB/25 GiB  166 GiB/251 GiB

Allocations
No allocations placed

Now, I ran the example job mentioned above, consisting of a prestart task (sidecar = false) with 21 GB of memory and a main task with 1 GB of memory.

nomad job run example.nomad.hcl
==> 2023-09-27T11:47:12+05:30: Monitoring evaluation "6e216ee6"
    2023-09-27T11:47:12+05:30: Evaluation triggered by job "example"
    2023-09-27T11:47:12+05:30: Allocation "0c2ac837" created: node "cddb7076", group "group"
    2023-09-27T11:47:13+05:30: Evaluation within deployment: "989599ff"
    2023-09-27T11:47:13+05:30: Evaluation status changed: "pending" -> "complete"
==> 2023-09-27T11:47:13+05:30: Evaluation "6e216ee6" finished with status "complete"
==> 2023-09-27T11:47:13+05:30: Monitoring deployment "989599ff"
  ✓ Deployment "989599ff" successful

    2023-09-27T11:47:35+05:30
    ID          = 989599ff
    Job ID      = example
    Job Version = 0
    Status      = successful
    Description = Deployment completed successfully

    Deployed
    Task Group  Desired  Placed  Healthy  Unhealthy  Progress Deadline
    group       1        1       1        0          2023-09-27T11:57:34+05:30

It deployed successfully, and here is the allocation status:

nomad alloc status 0c2ac
ID                  = 0c2ac837-c833-f63e-ca6b-e4d24c33d765
Eval ID             = 6e216ee6
Name                = example.group[0]
Node ID             = cddb7076
Node Name           = NB305032
Job ID              = example
Job Version         = 0
Client Status       = running
Client Description  = Tasks are running
Desired Status      = run
Desired Description = <none>
Created             = 5m38s ago
Modified            = 5m16s ago
Deployment ID       = 989599ff
Deployment Health   = healthy

Task "init" (prestart) is "dead"
Task Resources:
CPU        Memory         Disk     Addresses
0/500 MHz  32 KiB/20 GiB  300 MiB

Task Events:
Started At     = 2023-09-27T06:17:18Z
Finished At    = 2023-09-27T06:17:24Z
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                       Type        Description
2023-09-27T11:47:24+05:30  Terminated  Exit Code: 0
2023-09-27T11:47:18+05:30  Started     Task started by client
2023-09-27T11:47:12+05:30  Driver      Downloading image
2023-09-27T11:47:12+05:30  Task Setup  Building Task Directory
2023-09-27T11:47:12+05:30  Received    Task received by client

Task "redis" is "running"
Task Resources:
CPU        Memory            Disk     Addresses
1/500 MHz  6.2 MiB/1000 MiB  300 MiB

Task Events:
Started At     = 2023-09-27T06:17:24Z
Finished At    = N/A
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                       Type        Description
2023-09-27T11:47:24+05:30  Started     Task started by client
2023-09-27T11:47:24+05:30  Task Setup  Building Task Directory
2023-09-27T11:47:12+05:30  Received    Task received by client

I checked the node status after the prestart task was dead.

nomad node status -self
ID              = cddb7076-acd3-4c5c-7d0f-8faa72dc427b
Name            = NB305032
Class           = <none>
DC              = dc1
Drain           = false
Eligibility     = eligible
Status          = ready
CSI Controllers = <none>
CSI Drivers     = <none>
Uptime          = 2h1m54s
Host Volumes    = <none>
Host Networks   = <none>
CSI Volumes     = <none>
Driver Status   = docker,exec,raw_exec

Node Events
Time                       Subsystem  Message
2023-09-27T11:36:21+05:30  Cluster    Node registered

Allocated Resources
CPU             Memory         Disk
1000/39920 MHz  22 GiB/25 GiB  300 MiB/72 GiB

Allocation Resource Utilization
CPU          Memory
3/39920 MHz  8.4 MiB/25 GiB

Host Resource Utilization
CPU            Memory          Disk
124/39920 MHz  462 MiB/25 GiB  166 GiB/251 GiB

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created     Modified
0c2ac837  cddb7076  group       0        run      running  18m28s ago  18m6s ago

It's showing 22 GB of allocated memory out of 25 GB. Since the prestart task was allocated 21 GB of memory and has finished, let's presume that its 21 GB is released; the scheduler should then see roughly 24 GB free, with only the 1 GB main task still counted.

So I tried to run the example2 job mentioned above with 5 GB of memory, but it failed to place due to memory exhaustion.

nomad job run example2.nomad.hcl
==> 2023-09-27T12:13:07+05:30: Monitoring evaluation "c4c39628"
    2023-09-27T12:13:07+05:30: Evaluation triggered by job "example2"
    2023-09-27T12:13:08+05:30: Evaluation within deployment: "4060348e"
    2023-09-27T12:13:08+05:30: Evaluation status changed: "pending" -> "complete"
==> 2023-09-27T12:13:08+05:30: Evaluation "c4c39628" finished with status "complete" but failed to place all allocations:
    2023-09-27T12:13:08+05:30: Task Group "group" (failed to place 1 allocation):
      * Resources exhausted on 1 nodes
      * Dimension "memory" exhausted on 1 nodes
    2023-09-27T12:13:08+05:30: Evaluation "6fe7e6c4" waiting for additional capacity to place remainder
==> 2023-09-27T12:13:08+05:30: Monitoring deployment "4060348e"
  ⠋ Deployment "4060348e" in progress...

    2023-09-27T12:13:08+05:30
    ID          = 4060348e
    Job ID      = example2
    Job Version = 0
    Status      = running
    Description = Deployment is running

    Deployed
    Task Group  Desired  Placed  Healthy  Unhealthy  Progress Deadline
    group       1        0       0        0          N/A

I am not able to understand why the scheduler couldn't place the 5 GB job.