hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Add allocation resource utilization to the /client/stats endpoint #9899

Open DingoEatingFuzz opened 3 years ago

DingoEatingFuzz commented 3 years ago

Problem

The /client/stats endpoint currently returns a variety of utilization figures, including CPUTicksConsumed and Memory.Used, which can be used to determine percent utilization for the host. It cannot be used to determine how much of the host's resources are being consumed by allocations alone.

The output of nomad node status contains the following, which includes allocation resource utilization:

Allocated Resources
CPU             Memory          Disk
1000/38400 MHz  512 MiB/32 GiB  300 MiB/292 GiB

Allocation Resource Utilization
CPU           Memory
14/38400 MHz  1.9 MiB/32 GiB

Host Resource Utilization
CPU             Memory         Disk
2258/38400 MHz  22 GiB/32 GiB  156 GiB/466 GiB

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created  Modified
3029bd2a  e939dc3a  cache       0        run      running  6s ago   4s ago

The CLI achieves this by first getting the allocations on the client and then aggregating their individual resource utilization, fetched one allocation at a time from /client/allocation/:id/stats.
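The aggregation step described above can be sketched roughly as follows. This is a minimal illustration, not Nomad's actual CLI code: the AllocUsage type and the SumAllocUsage function are hypothetical names, and the two fields are assumed to correspond to CpuStats.TotalTicks and MemoryStats.RSS from each allocation's stats response.

```go
package main

import "fmt"

// AllocUsage is a hypothetical, trimmed-down view of one allocation's stats.
// It is not a type from the Nomad API package.
type AllocUsage struct {
	CPUTotalTicks float64 // assumed to mirror CpuStats.TotalTicks
	MemoryRSS     uint64  // assumed to mirror MemoryStats.RSS, in bytes
}

// SumAllocUsage adds up the usage of every allocation on a node.
func SumAllocUsage(allocs []AllocUsage) AllocUsage {
	var total AllocUsage
	for _, a := range allocs {
		total.CPUTotalTicks += a.CPUTotalTicks
		total.MemoryRSS += a.MemoryRSS
	}
	return total
}

func main() {
	// Illustrative figures only; in practice each entry would come from
	// one call to /client/allocation/:id/stats.
	perAlloc := []AllocUsage{
		{CPUTotalTicks: 3.5, MemoryRSS: 1486848},
		{CPUTotalTicks: 10.5, MemoryRSS: 512000},
	}
	total := SumAllocUsage(perAlloc)
	fmt.Printf("%.0f MHz, %d bytes\n", total.CPUTotalTicks, total.MemoryRSS)
}
```

The point of the sketch is that the summation itself is trivial; the cost lies in collecting one stats response per allocation before it can run.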

Proposal

Move this aggregating logic into the API layer.

This would allow the UI to also present this information without making an excessive number of API requests (especially considering the UI polls this endpoint on a 2s interval).
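To make the request volume concrete, here is a back-of-the-envelope sketch. It assumes that, without aggregation in the API layer, each poll costs one host stats request plus one stats request per allocation; the function name and the allocation count are hypothetical.

```go
package main

import "fmt"

// RequestsPerMinute estimates how many stats requests one polling client
// issues per minute. intervalSeconds must be positive.
//
// Assumption: with client-side aggregation each poll costs one host stats
// request plus one request per allocation; with aggregation in the API
// layer each poll costs a single request.
func RequestsPerMinute(allocCount, intervalSeconds int, aggregatedInAPI bool) int {
	pollsPerMinute := 60 / intervalSeconds
	if aggregatedInAPI {
		return pollsPerMinute
	}
	return pollsPerMinute * (1 + allocCount)
}

func main() {
	const allocs = 20 // hypothetical number of allocations on the node
	fmt.Println("client-side aggregation:", RequestsPerMinute(allocs, 2, false))
	fmt.Println("API-layer aggregation:  ", RequestsPerMinute(allocs, 2, true))
}
```

Under these assumptions the request count grows linearly with the number of allocations when the UI aggregates on its own, and stays constant when the endpoint returns the aggregate.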

Consideration: ACLs

Allocation stats are dictated by the namespace:read-job permission while client stats are dictated by node:read. As part of this proposal, we're acknowledging that allocation stats in aggregate are acceptable to read with the node:read permission.

Response Shape

The allocation stats response already aggregates the stats figures and returns the shape:

{
  "ResourceUsage": {
    "CpuStats": {
      "Measured": ["Throttled Periods", "Throttled Time", "Percent"],
      "Percent": 0.14159538847117795,
      "SystemMode": 0,
      "ThrottledPeriods": 0,
      "ThrottledTime": 0,
      "TotalTicks": 3.256693934837093,
      "UserMode": 0
    },
    "MemoryStats": {
      "Cache": 1744896,
      "KernelMaxUsage": 0,
      "KernelUsage": 0,
      "MaxUsage": 4710400,
      "Measured": ["RSS", "Cache", "Swap", "Max Usage"],
      "RSS": 1486848,
      "Swap": 0
    }
  }
}

The client stats response can take this same shape and further aggregate across all allocations. The property name should be something like AllocatedResourceUsage or AllocationResourceUsage.
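One way the extended payload might look in Go is sketched below. The struct definitions are hypothetical and heavily trimmed: only a few of the fields from the JSON above are reproduced, ClientStats stands in for the real client stats type, and AllocationResourceUsage is one of the two candidate property names rather than a settled choice.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// CpuStats and MemoryStats reproduce a subset of the fields shown in the
// allocation stats response; the remaining fields are omitted for brevity.
type CpuStats struct {
	Percent    float64
	TotalTicks float64
}

type MemoryStats struct {
	RSS   uint64
	Cache uint64
}

type ResourceUsage struct {
	CpuStats    CpuStats
	MemoryStats MemoryStats
}

// ClientStats is a hypothetical stand-in for the client stats payload.
// The pointer with omitempty keeps the new property optional, so the field
// is simply absent when no aggregate is available.
type ClientStats struct {
	CPUTicksConsumed        float64
	AllocationResourceUsage *ResourceUsage `json:",omitempty"`
}

// EncodeClientStats renders the payload as compact JSON.
func EncodeClientStats(s ClientStats) (string, error) {
	b, err := json.Marshal(s)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	out, err := EncodeClientStats(ClientStats{
		CPUTicksConsumed: 2258,
		AllocationResourceUsage: &ResourceUsage{
			CpuStats:    CpuStats{Percent: 0.5, TotalTicks: 14},
			MemoryStats: MemoryStats{RSS: 1998848},
		},
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```

Reusing the ResourceUsage shape means a consumer that already parses allocation stats could read the aggregate with the same code path.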

Hidden benefit

As @cgbaker pointed out, aggregating all alloc stats at once on the client saves us from round-tripping from the server to the client N times as is currently the case with the CLI implementation.

Related Issues

#6892

#8694

tgross commented 3 years ago

Also related is https://github.com/hashicorp/nomad/issues/9655