incorrect calculation of free storage after client restart

hashicorp / nomad


incorrect calculation of free storage after client restart #6172

Open ygersie opened 5 years ago

ygersie commented 5 years ago

Version

I've tested and reproduced this on two older versions of Nomad (0.5.6 and 0.7.1), and it still seems to be an issue in 0.9.4 as well.

Issue

Nomad refuses to place allocations on the false presumption that the node has run out of disk resources.

Reproduction steps

1. Start a Nomad agent.
2. Compare the Allocated Resources section of nomad node-status -self with the available space reported by df -h.
3. Try to schedule a job with an ephemeral disk size similar to what is available to Nomad. This should succeed.
4. Fill up the disk using some random data: dd if=/dev/zero of=/root/test.img bs=1M count=4096
5. Try a nomad plan again. This should still succeed.
6. Restart the Nomad agent and try a nomad plan again. It should now fail, stating it ran out of disk resources.

The behaviour w.r.t. disk resources is different than for example CPU and Memory:

Nomad's allocated resources report the "free" disk space fingerprinted during agent startup, instead of the "total" disk space available.
Nomad seems to deduct the "allocated" GiB from the "free" GiB reported by the fingerprint. This is not correct, because the reported "free" GiB changes whenever you restart the Nomad agent while running jobs are occupying disk space.

A real-world example: this node clearly should have plenty of disk resources available, but because of the deduction (free (566 GiB) - allocated (440 GiB)) we ran out:

ygersie@worker059:~$ nomad node-status -self
<snip>...
Allocated Resources
CPU              Memory         Disk             IOPS
12000/26388 MHz  24 GiB/62 GiB  440 GiB/566 GiB  0/0

Allocation Resource Utilization
CPU             Memory
1427/26388 MHz  20 GiB/62 GiB

Host Resource Utilization
CPU             Memory         Disk
1970/26388 MHz  28 GiB/63 GiB  354 GiB/984 GiB
<snip>....

ygersie@worker059:~$ df -h /
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda1       985G  355G  591G  38% /

And here is an example job which asks to schedule Redis on worker059:

ygersie@nomad001:~$ cat example.nomad
job "example" {
  region = "ams1"
  datacenters = ["zone2"]
  type = "service"
  constraint {
    attribute = "${node.unique.name}"
    value     = "worker059"
  }

  group "cache" {
    count = 1
    restart {
      attempts = 10
      interval = "5m"
      delay = "25s"
      mode = "delay"
    }
    ephemeral_disk {
      size = 150000
    }
    task "redis" {
      driver = "docker"
      config {
        image = "redis:3.2"
        port_map {
          db = 6379
        }
      }
      resources {
        cpu    = 500 # 500 MHz
        memory = 256 # 256MB
        network {
          mbits = 1
          port "db" {}
        }
      }
    }
  }
}
ygersie@nomad001:~$ nomad plan example.nomad
+ Job: "example"
+ Task Group: "cache" (1 create)
  + Task: "redis" (forces create)

Scheduler dry-run:
- WARNING: Failed to place all allocations.
  Task Group "cache" (failed to place 1 allocation):
    * Constraint "${node.unique.name} = worker059" filtered 63 nodes
    * Resources exhausted on 1 nodes
    * Dimension "disk exhausted" exhausted on 1 nodes

Job Modify Index: 0
To submit the job with version verification run:

nomad run -check-index 0 example.nomad

When running the job with the check-index flag, the job will only be run if the
server side version matches the job modify index returned. If the index has
changed, another user has modified the job and the plan's results are
potentially invalid.

Impact

We can not schedule any more jobs even though we have plenty of disk resources available. This is a pretty significant bug, as I do not know a way to work around it except lowering the resource ask dramatically. For CPU and Memory we can override what Nomad thinks it has available, but not for Disk.

cgbaker commented 5 years ago

Hi @ygersie,

There are two mechanisms at play here. The first is the initial fingerprinting that a Nomad client conducts on startup. As you noted, many of these are static fingerprints; included in these are total memory, total CPU, and available disk, in order to decide the quantity of resources available for scheduling. The available resources can be adjusted using the reserved stanza in the client config.

CPU and memory fingerprinting use total capacity (less any reserved amount); these numbers are constantly in flux, so the intention is to be able to use the reserved stanza to leave enough head-room for the Nomad client and other system processes. Disk, however, is not so dynamic, especially if the Nomad client is directed to use a dedicated volume. In the event that, for whatever odd reason, you free up a large amount of storage and want to make it available for scheduling by Nomad, the only available avenue for adjusting this is to restart the node to re-run the static fingerprinting.

The other piece at play here is the resource allocation that happens at scheduling. The resources listed in the resources stanza of a task indicate the amount of resources that will be allocated to the task at scheduling. These allocated resources are removed from consideration for scheduling other tasks, regardless of whether the task actually uses them or not. In some cases, there are mechanisms in place to actually prevent the tasks from exceeding their allocation (e.g., OOM killer). In other cases, there are not (ephemeral_disk usage). Once all schedulable resources are allocated to tasks, no more tasks can be placed. If tasks are allocated excessive resources, to the extent that new tasks cannot be placed even though there are plenty of unused resources, the solution is to not request excessive resources. This is the case for CPU and memory and network bandwidth, as well as disk.

ygersie commented 5 years ago

Hey @cgbaker, thanks for the quick reply here. I know how the resource allocation works. However, what I'm stating here is that there's only 3 * 150 GiB allocated on the worker mentioned in the example and Nomad still shows that we ran out of resources. The reason I suspect the stale fingerprint is at play here is that changing the disk utilization using a dummy image doesn't influence the scheduling of the job until I restart the Nomad agent.

ygersie commented 5 years ago

Maybe to clarify a little bit better: why can I not schedule a job asking for 150 GiB on the worker in the example, while Nomad reports that only 440GiB (see the output of nomad node-status) has been allocated? The total size of the disk is 985G, what happened to the remaining disk capacity?

cgbaker commented 5 years ago

You cannot schedule a job for 150 GiB, because there is only 566 GiB total disk for scheduling (i.e., the free space detected when fingerprinted at startup), and you have already allocated 440 GiB of that amount (leaving only 126 GiB available for scheduling new allocations). It doesn't matter whether you're actually using the space or not.

Consider this: if you tried to schedule a job using 20,000 MHz of CPU, it would fail to schedule, because you have allocated 12,000/26,388 MHz, leaving only 14,388 MHz unallocated. This is in spite of the fact that the host resources show that 24,418 MHz is actually available at the moment that you ran node status.

Nomad does treat disk differently from CPU/Memory; for disk, the amount available for scheduling is the free space during fingerprinting at startup. For CPU/Mem, the amount available for scheduling is the system total, less anything in the reserved block. As noted above, this is because we assume that there is a certain amount of disk utilization already, which will not be freed.
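
The arithmetic in this comment can be written out as a small sketch. The helper names below are illustrative only and are not Nomad functions; the numbers are the ones from the nomad node-status output posted earlier in the thread, with the 150000 ask treated as 150 GiB as the thread does.

```python
# Illustrative only: these helpers are assumptions, not Nomad APIs.

def schedulable_disk_gib(free_at_fingerprint_gib):
    # Disk: the schedulable amount is the free space seen at client startup.
    return free_at_fingerprint_gib

def schedulable_cpu_mhz(total_mhz, reserved_mhz=0):
    # CPU (and memory): the schedulable amount is the total, less reserved.
    return total_mhz - reserved_mhz

def fits(ask, schedulable, allocated):
    # An ask is placeable only if it fits in what is not yet allocated,
    # regardless of how much of the allocated amount is actually in use.
    return ask <= schedulable - allocated

disk_total = schedulable_disk_gib(566)
disk_left = disk_total - 440
cpu_total = schedulable_cpu_mhz(26388)
cpu_left = cpu_total - 12000

print(disk_left, fits(150, disk_total, 440))    # 126 False
print(cpu_left, fits(20000, cpu_total, 12000))  # 14388 False
```

Both asks fail for the same reason: they exceed the unallocated remainder, not the unused remainder.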

There is an argument to be made that disk should be treated the same as CPU/Mem: the full disk is available for scheduling, and that users are responsible for using the reserved stanza to account for disk space that is already used.

There is also an argument to be made that disk fingerprinting should not be static, but dynamic. This is a reasonable proposal; however, it is a significant change with many consequences. This is because the fingerprinting would need to account for the fact that allocated tasks may be using some-but-not-all of the storage that was allocated to them. Fingerprinting would therefore need to be allocation-aware. Also, this type of fingerprinting is very disruptive, because it continuously updates the node information (which invokes the scheduler for all allocations on the nodes, potentially causing existing allocations to be moved to other nodes).

I've brought this up with the team and we're discussing whether this should be changed.

ygersie commented 5 years ago

I'm still not entirely following. I get everything you're saying about disk being harder to account for in scheduling, but if you take into account the size of the storage occupied by Nomad allocations you have massive resource loss. The reason Nomad fingerprints the storage available and not the storage total (I assume) is that Nomad can not determine what other applications on the system are utilizing disk space. If Nomad were the only thing responsible for resource management you wouldn't have this issue in the first place, you could just use the total capacity. Just like with CPU and Memory, it can't be Nomad's task to see if external sources are causing over-utilization. So why not subtract allocated from total and a default sane "reserved" percentage for any non-Nomad managed resources?

This fingerprinting now leads to the situation where I can only allocate (allocate, not even actually use) 440GiB out of 985GB. Surely this is not something we want, and it should be considered a bug. The goal of a scheduler is to efficiently schedule resources, which is definitely not the case now.

Anyway, surely appreciated that you're following up on this issue!

cgbaker commented 5 years ago

> So why not subtract allocated from total and a default sane "reserved" percentage for any non-Nomad managed resources?

That is one approach, and it's probably necessary if we use the disk total size as the schedulable amount.

> This fingerprinting now leads to the situation where I can only allocate (allocate not even actually use) 440GiB out of 985GB.

I want to make sure we're on the same page... the output posted above indicates that your Nomad cluster has already allocated 440 GiB. That storage has been earmarked in the scheduler for the tasks in those allocations on this node. And in the case of this particular node, Nomad will schedule allocations with storage requirements up to a total of 566 GiB; the tasks running in those allocations should feel free to use the storage allocated to them. But we would not want Nomad to schedule workload for the whole 985G, because Nomad detected at startup that 355G of it was already used by external components.

Can you please explain where you think the "massive resource loss" is occurring?

ygersie commented 5 years ago

Ok, I know this is kind of confusing, I'm trying to explain the issue the best I can 😄 So in the situation shown in my shell output above, I can not schedule a job with a resource ask of only 150GiB. What I think is happening: during startup the Nomad agent uses the "Available" column from this command:

ygersie@worker059:~$ df -kP /var/lib/nomad
Filesystem     1024-blocks      Used Available Capacity Mounted on
/dev/vda1       1032089344 381532420 608584208      39% /

This is perfectly fine the moment you start Nomad for the first time and nothing is occupying disk space yet. Now everything is working like it should and we schedule resources on this worker (in the above example, 3 allocations each asking for 150GiB, for a total of 450GiB). Time passes and these allocations start utilizing the disk; in my example a total of around 365GB is actually used.

At this point I restart the Nomad agent and the fingerprint using df -kP /var/lib/nomad runs again, but now of course it reports that it only has a total available of 580GB, because the used space also includes the data of the running allocations.

As I've already allocated a total of 440GiB in running allocations, Nomad states: oh shit, I only have 580GB total available now and I've already allocated 440GiB, so sorry, your resource ask for an additional 150GiB is denied due to running out of resources. That is ridiculous of course, as there should be at least another 460GiB available. That's why I'm stating: because you're counting the disk utilization of running allocations against the grand total of "this is what is available on this node", you basically have "massive resource loss" if at any point you restart the agent while allocations have been filling up the disk.
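
The sequence described above can be reproduced with a toy model. This is a sketch, not Nomad code: the class, its methods, and the 40 GiB of non-Nomad data are assumptions chosen so the round numbers land near the ones in this thread (985 total, 3 x 150 allocated, about 365 written by allocations, about 580 free after restart).

```python
# Toy model of the reported behaviour; not Nomad code. Sizes are in GiB.

class Node:
    def __init__(self, disk_total, used_by_others):
        self.disk_total = disk_total
        self.used_by_others = used_by_others  # non-Nomad data (assumed value)
        self.used_by_allocs = 0               # data written by allocations
        self.allocated = 0                    # sum of ephemeral_disk asks
        self.fingerprint()

    def free(self):
        return self.disk_total - self.used_by_others - self.used_by_allocs

    def fingerprint(self):
        # Static fingerprint: record whatever is free at agent (re)start.
        self.schedulable = self.free()

    def can_place(self, ask):
        return ask <= self.schedulable - self.allocated

    def place(self, ask):
        if not self.can_place(ask):
            raise RuntimeError("disk exhausted")
        self.allocated += ask


node = Node(disk_total=985, used_by_others=40)
for _ in range(3):
    node.place(150)
before = node.can_place(150)    # 945 - 450 = 495 left, so this fits

node.used_by_allocs = 365       # allocations fill their disks over time
still_ok = node.can_place(150)  # fingerprint unchanged, so this still fits

node.fingerprint()              # agent restart re-runs the fingerprint
after = node.can_place(150)     # 580 - 450 = 130 left, so this is rejected

print(before, still_ok, after)  # True True False
```

The 365 GiB written by allocations is subtracted once through the new fingerprint and again through the 450 GiB already allocated, which is the double counting being reported.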

I hope this explains the issue better. If not, it may be easier to set up a call?

ygersie commented 5 years ago

fyi: for now I'm shipping a patched version which fingerprints the total available disk space and I'll use the reserved client option to reserve space, I don't have any other workaround. AFAICT scanning through the code there's no risk in doing so, if there is please let me know 😃

cgbaker commented 5 years ago

Yes, that's right. I noticed the bug where the storage that is actually used by allocations is not considered by the storage fingerprinter. I wasn't sure whether that was the problem you were reporting or whether it was something else.

Thanks for bringing this up and for being patient. 😄

As to your patch, it's one approach. It has the added benefit of treating storage the same as CPU and memory. Under that approach, because the amount fingerprinted is not dependent on the used storage, as long as the reserved amount is set appropriately and no tasks exceed their allocated storage, everything should be fine.

It's possible that a tighter bound (which doesn't require manually setting the reserved stanza) could be generated by instead having the storage fingerprinter consider the storage used in the allocation directories; that would be a bigger change to the code.
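
As a rough illustration of that idea (not how Nomad's fingerprinter is implemented), the sketch below adds the space occupied under an allocation directory back to the volume's free space. The function names are hypothetical, and it sums apparent file sizes, which only approximates the blocks actually used on disk.

```python
import os
import shutil
import tempfile

def dir_usage_bytes(path):
    # Sum the apparent size of regular files under path, skipping symlinks.
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if os.path.islink(full):
                continue
            try:
                total += os.lstat(full).st_size
            except OSError:
                pass  # the file disappeared while we were walking
    return total

def schedulable_bytes(alloc_dir):
    # Free space on the volume, plus what allocations already occupy there.
    free = shutil.disk_usage(alloc_dir).free
    return free + dir_usage_bytes(alloc_dir)

# Small demonstration against a temporary directory.
with tempfile.TemporaryDirectory() as demo_dir:
    with open(os.path.join(demo_dir, "data.bin"), "wb") as f:
        f.write(b"\0" * 4096)
    print(dir_usage_bytes(demo_dir))  # 4096
```

Walking a large allocation directory on every client start can be slow, which is the cost raised in the next comment.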

ygersie commented 5 years ago

You are most welcome and thanks for the follow up. As I see it there are a couple of options:

1. Calculate allocation disk usage during fingerprinting and add that back to the “available/free” total. This can be a costly operation depending on the number of files located in the alloc storage dir of Nomad.
2. For Unix fingerprinting use the (total)-(filesystem configured reserved root)-(reserved for external sources).
3. Use filesystem total with a default sane reserved percentage and leave it to the users to adjust if necessary.

I would opt for the simple solution (option #3) as it would also be similar to memory and cpu, but I do not have the context you guys have.
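
To compare the three options side by side, here is a minimal sketch of the arithmetic behind each. The function names, the 49 GiB root reserve, the 40 GiB external reserve, and the 10% default are all made-up values for illustration; none of them come from Nomad or from this thread.

```python
# Sizes in GiB. Names and default values are hypothetical.

def option_1(free_now, used_by_allocs):
    # Add the space occupied by allocation directories back to "free".
    return free_now + used_by_allocs

def option_2(total, fs_reserved_root, reserved_external):
    # Total, minus the filesystem's root-reserved space, minus a reserve
    # for non-Nomad data.
    return total - fs_reserved_root - reserved_external

def option_3(total, reserved_percent=10):
    # Total, minus a default reserved percentage the operator can adjust.
    return total * (100 - reserved_percent) // 100

print(option_1(free_now=580, used_by_allocs=365))                     # 945
print(option_2(total=985, fs_reserved_root=49, reserved_external=40)) # 896
print(option_3(total=985))                                            # 886
```

Only option 1 needs to inspect the allocation directories; options 2 and 3 depend on values that do not change when allocations write data, so a client restart cannot shrink the result.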

faerstice commented 4 years ago

We're running Nomad 0.10.5 and bumped against this issue recently after we moved to standardized workload "sizing" on our Nomad clusters to make it easier to do capacity planning. After we restarted Nomad clients to pick up a metadata change, users started to see puzzling behavior where they could no longer re-deploy successfully running allocations: even though they had made no changes to the amount of resources requested, they were getting errors about resource constraints. As operators, we were confused as to why a metadata change started tripping our monitoring around Nomad disk availability.

Reading through this issue was enlightening. We routinely restart the Nomad client in order to pick up metadata changes, new client SSL certs, etc. Thus, we'd find situations where Nomad's calculation of the amount of disk available would, in essence, count an allocation's disk twice (once for the currently running allocation and once for the desired allocation).

Fingerprinting using the filesystem total and reserving a percentage of that total to come up with a "disk available" number would fit well with our Nomad setup. As it is, we've had to remove the ephemeral disk attributes from task groups and have been relying on other, less immediate, means to control user job disk utilization.

NodeGuy commented 3 years ago

Oh man this is confusing! I just bought a bunch of drives unnecessarily because I mistakenly believed I didn't have enough ephemeral disk capacity as a result of this bug.

This is my new ritual for working around the bug when adding a meta key to a client:

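#!/bin/sh
# Drain this node, forcing its allocations off immediately.
nomad node drain -enable -force -self
service nomad stop
# Remount the allocation volume and delete leftover allocation data,
# so the next fingerprint sees the volume as empty.
umount -R /mnt/nomad_alloc
mount /mnt/nomad_alloc
rm -rf /mnt/nomad_alloc/*
service nomad start
# Give the agent time to start, then mark the node eligible again.
sleep 10
nomad node eligibility -self -enable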

douglasawh commented 2 years ago

Is this still a bug in the latest or has the underlying code changed so much that this is no longer an issue?

GloomyDay commented 2 years ago

Same on 1.2.6, tested today. Before update and restart:

Allocated Resources
CPU                Memory         Disk
120198/248000 MHz  28 GiB/29 GiB  63 GiB/138 GiB

Allocation Resource Utilization
CPU              Memory
3470/248000 MHz  17 GiB/29 GiB

Host Resource Utilization
CPU              Memory         Disk
9340/250000 MHz  26 GiB/31 GiB  90 GiB/150 GiB

after

Allocated Resources
CPU              Memory          Disk
1664/248000 MHz  399 MiB/29 GiB  1.2 GiB/59 GiB

Allocation Resource Utilization
CPU              Memory
1199/248000 MHz  171 MiB/29 GiB

Host Resource Utilization
CPU              Memory          Disk
7913/250000 MHz  7.2 GiB/31 GiB  90 GiB/150 GiB

tgross commented 1 year ago

See also https://github.com/hashicorp/nomad/issues/14871, which we closed as a duplicate of this one.

Note we don't have a fix in-progress though. One idea we've had some discussions around is to have the client mount a loopback filesystem that we could discard with the allocation. That would let us make firm determinations on disk quotas in the process.

kjschubert commented 1 year ago

@tgross Would it be possible as a short-term workaround to implement a field disk_total_mb in the client-stanza similar to cpu_total_compute and memory_total_mb? Or should I open another issue for this?

I stumbled across this issue because I have a cluster with three client agents where I mix workloads with both high and low disk space demands. Those clients are dedicated Nomad workers, so I know exactly how much disk space should be allocatable and how much should be reserved for system tasks. At the moment I have started decreasing the size in the ephemeral_disk stanza to keep Nomad scheduling allocations. I have plenty of disk space left, but Nomad does not want to allocate it as soon as some of those bigger allocations aren't garbage collected before a restart.

tgross commented 1 year ago

That seems like a reasonable approach.