hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Show max memory limit in the UI #10268

Open DingoEatingFuzz opened 3 years ago

DingoEatingFuzz commented 3 years ago

#10247 introduces the ability to describe memory as both a soft and a hard limit. The soft limit (memory) tells the scheduler how much memory needs to be set aside; the hard limit (memory_max) tells Nomad at what point a task should be OOM-killed.

This nuance also needs to be communicated in the UI. There are three pieces to this of varying scope.

  1. Show this metadata in the task group details ribbon
  2. Show both the soft and hard limit in the memory utilization graph for both allocations and tasks
  3. Show oversubscription at a client level on both the client detail page and the topology visualization

Show this metadata in the task group details ribbon

This one is straightforward. Mimic the language and data used in the CLI updates on the task group detail page. The numbers in this ribbon are already an aggregate of individual task requirements.

If a task group has no memory_max set, then this ribbon should be unchanged.

[Image: standard oversubscription]

Show both the soft and hard limit in the memory utilization graph for both allocations and tasks

First and foremost, this can be deferred. If we make no changes to this graph, it will naturally report utilization percentages above 100% and the y-axis will adjust, just like we do with CPU soft limits already. This is still pretty confusing though, since it's unclear if the percentage is based on the soft limit or the hard limit.

We can improve this by doing the following:

  1. Change the y-axis to be based on the hard limit, so utilization can never go over 100%
  2. Add the soft limit as a horizontal annotation, just like the reserved capacities on clients are presented now (currently unreleased on main)
  3. Potentially segment the point-in-time utilization progress bar to make it immediately clear when the soft limit threshold is surpassed

If an allocation has no memory_max set, this graph should have no annotation.
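The three improvements above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the UI's actual code: the `MemoryLimits` and `ChartScale` shapes and all function names are hypothetical.

```typescript
// Hypothetical shapes; the real UI models may look different.
interface MemoryLimits {
  memoryMB: number; // soft limit (`memory`)
  memoryMaxMB?: number; // hard limit (`memory_max`), if set
}

interface ChartScale {
  yMaxMB: number; // top of the y-axis
  softLimitAnnotationMB: number | null; // null means "draw no annotation"
}

// Base the y-axis on the hard limit when one is set above the soft limit,
// and expose the soft limit as a horizontal annotation in that case only.
function memoryChartScale(limits: MemoryLimits): ChartScale {
  const hardMB = limits.memoryMaxMB ?? limits.memoryMB;
  const oversubscribed = hardMB > limits.memoryMB;
  return {
    yMaxMB: oversubscribed ? hardMB : limits.memoryMB,
    softLimitAnnotationMB: oversubscribed ? limits.memoryMB : null,
  };
}

// Utilization relative to the top of the y-axis.
function utilizationPercent(usedMB: number, scale: ChartScale): number {
  return (usedMB / scale.yMaxMB) * 100;
}

// Split current usage into the part within the soft limit and the part
// above it, for a segmented progress bar.
function progressSegments(usedMB: number, limits: MemoryLimits) {
  return {
    withinSoftMB: Math.min(usedMB, limits.memoryMB),
    overSoftMB: Math.max(0, usedMB - limits.memoryMB),
  };
}
```

With a soft limit of 256 and a hard limit of 512, a task using 384 would chart at 75% with an annotation at 256. With no memory_max, the scale falls back to the soft limit and the annotation is null.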

[Image: alloc-detail-oversubscription]

Show oversubscription at a client level on both the client detail page and the topology visualization

There are no designs for this yet. Just wanted to mention it here to track the concept.

tgross commented 3 years ago

Closed by https://github.com/hashicorp/nomad/pull/10459

backspace commented 3 years ago

I’ve only accomplished item 1 from the first list here and am working on 2 at the moment:

image

I’ll reopen but let me know if there’s some better way to track this?

tgross commented 3 years ago

> I’ve only accomplished item 1 from the first list here and am working on 2 at the moment:

Oops, sorry!

backspace commented 3 years ago

I’m leaning toward #10459 being an incorrect implementation, now that I understand this better. Or at least subpar, as I’m not sure how else to accomplish it…

When I run a Nomad dev agent without memory oversubscription enabled, submitting a job with a task that sets memory_max produces a warning that the setting will be ignored because oversubscription isn’t enabled. But the API response for the job still returns the memory_max within the task’s Resources:

image

The task group details ribbon checks whether the sum of the memory_max values provided on its tasks is greater than the sum of their memory values, and shows the bracketed maximum if so. This shows regardless of whether oversubscription is actually working.
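A minimal sketch of that ribbon check, for illustration only. The `TaskResources` shape, the function name, and the label format are all hypothetical. It also assumes that a task with no memory_max contributes its memory value to the maximum total, which may not match the real implementation.

```typescript
// Hypothetical shape of the configured per-task resources.
interface TaskResources {
  memoryMB: number; // `memory`
  memoryMaxMB?: number; // `memory_max`, if configured
}

function ribbonMemoryLabel(tasks: TaskResources[]): string {
  const reserved = tasks.reduce((sum, t) => sum + t.memoryMB, 0);
  // Assumption: a task with no memory_max counts its soft limit toward the max.
  const max = tasks.reduce((sum, t) => sum + (t.memoryMaxMB ?? t.memoryMB), 0);
  // Show the bracketed maximum only when it exceeds the reserved total.
  return max > reserved
    ? `${reserved} MiB (${max} MiB max)`
    : `${reserved} MiB`;
}
```

Because this reads only the configured values from the job, it has no way to tell whether the cluster will actually honor memory_max.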

I’ve subsequently understood that the allocation response is a place to determine the true situation vs the configured one. In this screenshot, I have #10508 running against two different dev agents; the left has oversubscription enabled, the right does not. You can see that AllocatedResources and Resources in the allocation response reflect the true state of things. The primary metric chart only shows the oversubscription annotation on the left, as expected.

image
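As a rough sketch of reading the effective limits from an allocation rather than from the job configuration: the field names under `AllocatedResources.Tasks` (`Memory.MemoryMB`, `Memory.MemoryMaxMB`) are my assumption about the response shape and should be checked against the actual payload, and `effectiveTaskMemory` is a hypothetical helper.

```typescript
// Assumed subset of the allocation response; verify against the real API.
interface AllocatedMemory {
  MemoryMB: number;
  MemoryMaxMB?: number;
}

interface AllocationLike {
  AllocatedResources?: {
    Tasks?: Record<string, { Memory?: AllocatedMemory }>;
  };
}

// Returns the limits that actually apply to a task, or null when the
// allocation carries no data for it. hardMB is null when no hard limit
// above the soft limit is in effect (e.g. oversubscription is disabled).
function effectiveTaskMemory(
  alloc: AllocationLike,
  taskName: string
): { softMB: number; hardMB: number | null } | null {
  const mem = alloc.AllocatedResources?.Tasks?.[taskName]?.Memory;
  if (!mem) return null;
  const max = mem.MemoryMaxMB ?? 0;
  return { softMB: mem.MemoryMB, hardMB: max > mem.MemoryMB ? max : null };
}
```

A chart annotation would then be drawn only when `hardMB` is not null.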

So… I’m not sure what to do about the task group details ribbon, as it seems incorrect to me to present the configured memory_max even when it’s ignored, but it’s also not possible to know whether it’s been ignored from the information available to it 🤔

The allocation metric annotation is correct now, at least, but I’m struggling with accessing AllocatedResources.Tasks to properly determine the task metric annotation 😢 ETA: the answer is task states; the data is already there 😆