It is definitely something we are aware of and will be doing with Nomad. However, there are many more pressing improvements, so it is not something we will be focusing on in the near term.
Relevant: http://csl.stanford.edu/~christos/publications/2015.heracles.isca.pdf
About memory bandwidth isolation:
CPU cache allocation and memory bandwidth monitoring:
Thanks for the link. There are some tricky aspects to the implementation, not least of which is that applications running on the platform need to be aware of resource reclamation (for example, by observing memory pressure). In practice this is a complicated thing to implement.
In light of that, there are probably a few different approaches to this, in no particular order (and I'm not sure how these work at all on non-Linux kernels):
Memory is probably the most complicated because it is a hard resource limit. For soft limits like CPU and network it's fairly easy to over-provision without crashing processes, but it's more difficult to provide QoS or consistent, knowable behavior across the cluster.
In general this problem is not easy to implement or reason about without a lot of extra data that is currently out of Nomad's scope. For example, we would need to look at historical resource usage for a particular application in order to resize it appropriately and differentiate real usage from memory leaks and the like, monitor the impact of resizing on application health (e.g. does resizing it cause it to crash more frequently?), etc.
So while this is something we'd like to do, and we're aware of the value it represents, this is likely not something we will be able to get to in the short / medium term.
@cbednarski, indeed, it is a complex feature. I believe it can be implemented to some extent, though. This is the list of rough tasks I made while going through the paper, it is not by any means complete:
Finally, some personal remarks:
My thoughts around oversubscription are -
a. We need QoS guarantees for a task. Tasks which are guaranteed certain resources should get them whenever they need them. Trying to add and take away resources from the same task doesn't work well in all cases, especially for memory. CPU and I/O can probably be tuned up and down unless the task has very bursty resource usage.
b. What works well, however, is the concept that certain jobs which are revocable are oversubscribed alongside jobs which are not revocable. We estimate how much we can oversubscribe, run some revocable jobs when capacity is available, and revoke them when the jobs which have been guaranteed the capacity need it.
You may be interested in these videos about Google's Borg, which include a discussion of how oversubscription is handled with SLOs:
Any ETA or implementation details on this one? :)
No, this feature is very far down the line.
I don't understand: if I just use Docker to run containers, it doesn't impose any restrictions or reservations on CPU or memory. This software, in contrast, requires the user to do so. If that's the case, then I guess I'll just use Docker. And I don't need the clumsy "driver" concept just to support all those "Vagrant" things floating around that no one really needs in modern microservice architectures. This is what's called a Procrustean bed.
@halt-hammerzeit There are very different requirements for a cluster scheduler and a local container runtime. Nomad imposes resource constraints so it can bin-pack nodes and provide a certain quality of service for tasks it places.
When you don't place any resource constraints using Docker locally, you get no guarantee that the container won't utilize all the CPU/memory on your system, and that is not suitable for a cluster scheduler! Hope that helps!
@dadgar Certainty isn't required in some cases. For example, we are currently running around 30 QA environments, and for that we need a lot of servers (each environment (job) needs around 2 GB of memory in order to cover memory spikes). Utilization of those servers is very low, and we can't work around the memory spikes (e.g. PHP Symfony2 apps require cache warm-up at startup, which consumes three times the memory actually needed at runtime). I should be able to schedule those QA environments on a single server, and I don't care if QoS is degraded since it's for testing purposes only. The scheduler should still make decisions based on the task memory limit provided, but we should be able to define a soft limit and disable the hard memory limit on Docker containers. Something like #2771 would be great. Other container platforms such as ECS and Kubernetes handle this just fine.
I think it is difficult to ask developers to define resource allocations for services, especially for new services or when a service runs in a wide range of environments. I understand that Nomad's approach greatly simplifies bin packing, but that does us little good if we aren't good at predicting the required resources. One reason this is particularly challenging for us is that we have hundreds of different production environments (multiple tiers of multi-tenancy plus lots of single tenants with a wide range of hardware and usage requirements). Even if we can generalize some of these configurations, I believe that explicit resource allocation could be an undesirable challenge for us to take on for each service.
Clearly there was a lot of thought put behind the current design. The relatively low priority on this issue also indicates to me a strong opinion that the current offering is at least acceptable, if not desirable, for a wide range of use cases. Maybe some guidance on configuring resource allocation, especially in cases where we lack a priori knowledge of the load requirements of a service, would be helpful.
Ultimately my goal is to provide developers a nearly "Cloud Foundry" like experience. "Here's my service, make it run, I don't care how". I really like Nomad's simplicity compared to other solutions like Kubernetes, but this particular issue could be an adoption blocker. I'm happy to discuss further or provide more detail about my particular scenario here or on Gitter.
@tshak I would recommend you over-allocate resources for new jobs and then monitor the actual resource usage and adjust the resource ask to be in line with your actual requirement.
@tshak Or look into Kubernetes, which claims to support such a feature.
Thanks @dadgar. Unfortunately this is a high burden, since we have many environments of varying sizes and workloads that these services run on. We've got a good discussion going on Gitter if you're interested in more. As @catamphetamine said, we may have to use Kubernetes. The concept of "Guaranteed" vs. "Burstable" vs. "BestEffort" seems to better fit our use case. I was hoping for Nomad to have (or plan for the near future) something similar, since Nomad is otherwise preferable to me!
I too was giving it a second thought yesterday, after abandoning Nomad in the summer due to it lacking this feature. Containers are meant to be "stateless" and "ephemeral", so if a container crashes due to an Out Of Memory error, ideally it would make no difference, as the code should automatically retry the API query. In the real world, though, there's no "auto retry" feature in most code, so if an API request fails, the whole transaction may be left in an inconsistent state, possibly corrupting the application's data.
I think what Kubernetes calls "burstable" is the most intractable use case here: many services that we run use a large heap during startup and then have relatively low memory usage afterwards.
One of the services I've been monitoring requires as much as 4 GB during startup to warm up its cache and then typically runs at around 750 MB of RAM during normal operation. With Nomad I must allocate 4 GB of RAM for each of these microservices, which is really expensive.
Is it possible with Nomad at this time to support this burstable feature?
I have a job which consumes a lot of CPU when launching and then falls back to almost no CPU.
However, Nomad (using the Docker driver) kills off this job before it can get over its initial peak.
I cannot allocate enough CPU to get this started, or it doesn't find an allocation target.
> However, Nomad (using the Docker driver) kills off this job before it can get over its initial peak. -- @CumpsD
Nomad should not be killing a job due to its CPU usage. Just so you know: by default, Docker/exec/java/rkt CPU limits are soft limits, meaning they allow bursting CPU usage today. If the count is >1 you may want to use the distinct_hosts constraint on the initial run to make sure multiple instances aren't contending for resources on the same host, but beyond the initial run the deployments feature can prevent instances from starting at the same time during their warmup period.
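For anyone landing here later, here is a rough sketch (not from the original reply) of what that distinct_hosts constraint looks like in a job file; the job, group, and task names, the image, and the resource figures are all placeholders:

```hcl
job "warmup-heavy" {
  datacenters = ["dc1"]

  group "api" {
    count = 3

    # Spread the instances across hosts so their startup CPU bursts
    # don't contend for resources on the same node.
    constraint {
      operator = "distinct_hosts"
      value    = "true"
    }

    task "app" {
      driver = "docker"

      config {
        image = "example/app:latest"   # placeholder image
      }

      resources {
        cpu    = 500   # MHz; a soft limit by default, so bursting above it is allowed
        memory = 256   # MiB
      }
    }
  }
}
```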
While we still plan on first class oversubscription someday, CPU bursting should be beneficial for your use case today. Please open a new issue if you think there's a bug.
@schmichael it was a wrong assumption; the CPU spike at start and the stopping were unrelated. After debugging for a day, it turned out the inner process stopped, causing the job to stop and restart.
Sorry for the confusion. After fixing it, it is obvious the limit is indeed a soft one :)
Can this feature be prioritized? I really (really, really) need memory over-subscription for Docker. I could do with a "-1" flag which basically means "don't care" and launches the Docker containers with no memory constraints.
This could be an agent config setting, and thus not get enabled for all agents.
I have a few Docker tasks which get allocated to a single node (based on the node name constraint at the job level), and I cannot assign small memory values to them, as then they don't start at all.
Ref: https://www.nomadproject.io/docs/drivers/docker.html#memory
I thus end up launching a larger machine than really needed to satisfy the resource requirements.
I get the SLA bit and the need for specifying the resource requirements upfront, but the "-1" flag, along with allowing swap, is what would really solve this in an elegant way.
_I have been contemplating launching the tasks on that single node using docker-compose (yech!) as a raw_exec task._
At this point, I am even okay with flags like i_really_need_swap_enabled and ignore_memory_constraints_during_launch.
With Nomad 0.9 we can do plugins - I certainly plan to add a docker driver with exactly those features :)
Can you tell us the status of this ticket?
@shantanugadgil @academiqnsu
I think we have a workaround for this. You can tell Nomad how much memory to expect on each node in the agent config: https://www.nomadproject.io/docs/configuration/client.html . If you set this some factor higher (e.g. 3x), you can then set all the memory limits for your jobs 3x higher as well. The placements will go through, because they are checked against the resources described in the config... but the tasks won't be killed, because the physical memory use won't run into the 3x higher limit you've set.
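To make that workaround concrete, a minimal agent-config sketch (the numbers are illustrative, not from the comment above):

```hcl
# Report roughly 3x the node's physical RAM so the scheduler places more
# tasks than would otherwise fit. Assumes a node with 16 GiB of real memory;
# adjust to your own hardware and risk tolerance.
client {
  enabled         = true
  memory_total_mb = 49152   # 48 GiB reported vs. 16 GiB physical
}
```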
I have to say, I think Nomad should probably rethink the policies around over-subscription, and maybe build an easier affordance for this.
Most programmers work with memory managed languages these days, which makes it almost impossible to stick to a memory budget. There are tonnes of situations where you'll have a short period where you're even 5x over the normal memory use. Predicting these short memory bursts is very difficult just from reading the code, so you'll get a lot of eventual job failures if there's no over-subscription.
Serialization is a particularly common cause of memory spikes. If you've got most of your memory in some data structure and then you write that data structure to disk, you might create a byte string for the whole object. That's at least 2x usage right there. While constructing the string, maybe the serialization library creates some intermediate representation --- now you're at 3x. Etc. Dying at serialization like this is particularly nasty, because it means you'll very often hit a pattern where you run for a while, do a bunch of work, and die just as you're trying to save it.
Those serialization issues could happen in any language, but in languages like Python the total size that the process is using is particularly opaque. Python doesn't really give you guarantees about when objects will be freed, and it's really normal not to worry much about the fact that two copies might be temporarily around if you write something like thing = func(thing).
@honnibal I had also reached a similar conclusion to fix the first part of the problem, i.e. to allow multiple tasks to actually start.
This works great for raw_exec, as memory constraints are considered only during allocation and not during the actual run.
But for Docker, using swap was disabled in some ancient version (0.4, I think).
My ask of having swap ensures that the Docker tasks won't die if all of them really do start allocating large amounts of memory.
Just saw this issue: https://github.com/hashicorp/nomad/issues/6085
It sounds like Docker does swap currently, but unintentionally? This might complicate the analysis... but in the meantime, if you install 0.9.4, it might work?
Awesome find @honnibal !!!
How have other people kept an eye on jobs running over their allocation? Do you use the HTTP API? I can't quite get the information I want from the telemetry data.
A little bump for this need: I have many jobs which need a lot of CPU at start... but almost nothing after. Having the possibility of over-allocating CPU/memory would be really useful. I will try the solution with the agent config, but it's not really a solution.
How would oversubscription help in this scenario?
(Sorry for the late answer.) In my case, I have this node, for example:
Allocated Resources
CPU Memory Disk
26980/27600 MHz 67 GiB/157 GiB 6.4 GiB/1.6 TiB
Allocation Resource Utilization
CPU Memory
674/27600 MHz 19 GiB/157 GiB
Host Resource Utilization
CPU Memory Disk
3752/27600 MHz 44 GiB/157 GiB 81 GiB/1.6 TiB
My host is full... but unused... due to several "too high" allocations:
for i in $(nomad node status -self -short | grep running | awk '{print $1}'); do nomad alloc status $i | grep MHz; done
17/4000 MHz 8.8 GiB/14 GiB 300 MiB http: 10.2.200.138:23314
24/4000 MHz 1.3 GiB/18 GiB 300 MiB http: 10.2.200.138:22872
27/4000 MHz 772 MiB/1.2 GiB 300 MiB http: 10.2.200.138:22120
19/1000 MHz 819 MiB/1000 MiB 300 MiB http: 10.2.200.138:23039
0/1000 MHz 1.1 MiB/600 MiB 300 MiB ssh: 10.2.200.138:24484
5/2000 MHz 1.5 GiB/2.4 GiB 300 MiB http: 10.2.200.138:30431
11/1000 MHz 791 MiB/1000 MiB 300 MiB http: 10.2.200.138:27430
30/1000 MHz 632 MiB/4.0 GiB 300 MiB http: 10.2.200.138:26262
10/1000 MHz 661 MiB/4.0 GiB 300 MiB http: 10.2.200.138:22422
99/1000 MHz 903 MiB/1.0 GiB 300 MiB http: 10.2.200.138:28412
0/200 MHz 10 MiB/1.5 GiB 300 MiB http: 10.2.200.138:29148
0/1000 MHz 6.8 MiB/4.0 GiB 300 MiB http: 10.2.200.138:25068
0/200 MHz 24 MiB/1.5 GiB 300 MiB http: 10.2.200.138:27136
125/80 MHz 2.4 MiB/500 MiB 300 MiB flexlm: 10.2.200.138:6200
26/1000 MHz 918 MiB/4.0 GiB 300 MiB http: 10.2.200.138:27596
4/1000 MHz 85 MiB/4.0 GiB 300 MiB http: 10.2.200.138:24475
11/1000 MHz 1.6 GiB/2.0 GiB 300 MiB http: 10.2.200.138:30787
0/1000 MHz 1.0 MiB/600 MiB 300 MiB ssh: 10.2.200.138:29025
0/200 MHz 1.4 MiB/1.5 GiB 300 MiB http: 10.2.200.138:21228
0/1000 MHz 1.0 MiB/600 MiB 300 MiB ssh: 10.2.200.138:37625
1/100 MHz 1.1 MiB/200 MiB 300 MiB
68/200 MHz 39 MiB/500 MiB 300 MiB
By default, I assign 1000 MHz... but with many jobs per node, I reach the limit. With oversubscription, I could always run new jobs... and keep an eye on real usage (all the 1000 MHz values above will be lowered in the future, but we always need time to watch the real usage of each job).
You can always lowball the CPU resource requirement on your tasks to jam more tasks onto a node. That's essentially oversubscription for your use case. AFAIK, the cpu value in the resources stanza does not explicitly place cgroup limitations on the tasks; you'd just have more processes fighting for CPU slices.
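As a rough illustration of that lowballing approach (the values below are made up, not from the comment), the task just declares a small cpu ask for scheduling purposes, while memory remains a hard cap for Docker tasks:

```hcl
task "web" {
  driver = "docker"

  config {
    image = "example/web:latest"   # placeholder image
  }

  resources {
    # Deliberately low CPU ask so more tasks fit per node; by default this
    # is only a relative share, not a hard cgroup cap.
    cpu    = 100   # MHz
    memory = 256   # MiB; still enforced as a hard limit for Docker tasks
  }
}
```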
Reducing cpu for Docker doesn't work well in the common case where the container's startup uses that value for further internal assignments and allocations. Example: a Java command line inside the Docker task using it to set the -X memory parameters.
Sorry, @shantanugadgil, therein lies my beef with Java, or anything JVM: such a pain in resource management. That aside, I agree it may not work for all scenarios, but it's also important in this discussion to point out these holes, so that anyone who does take up the cause will take these cases into account. It's probably just me, but I mistrust oversubscription, and I don't need another suspicion lurking around when I need to troubleshoot things.
In my opinion, the entire premise of 'no swap' and 'up-front reservation' of CPU/memory is more of a typical PROD requirement, whereas dev/validation testing (non-stress) can get away with slower speeds. In the spirit of keeping the job definition the same for dev/prod, tweaking the agent values to allow more tasks to be scheduled on a node makes sense to me. If it were up to me, I would even have a "factor" for how much to tweak it per environment: x1 for PROD (i.e. no tweaking), x2 for QA, and x3 for DEV, given that there would be some config management system to create the agent config files. (Detect memory and multiply by 3 😀)
What we're working towards as a workaround is to use the client -> memory_total_mb setting to just artificially increase the memory on the instances and fake "over-subscription" :)
> What we're working towards as a workaround is to use the client -> memory_total_mb setting to just artificially increase the memory on the instances and fake "over-subscription" :)
Yes, that is the "factor" I am talking about. Better to lie in the agent config than to starve the task of resources 😆
On the roadmap - coming soon.
Glad to hear this is on the roadmap. Any ETA available?
Memory oversubscription has shipped in Nomad 1.1.0-beta. See https://github.com/hashicorp/nomad/pull/10247
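For readers skimming this thread, a hedged sketch of how the shipped feature is used in a job file (based on the linked PR; memory_max is the new field, and oversubscription also has to be enabled in the scheduler configuration, so check the 1.1 docs for the exact knob):

```hcl
task "app" {
  driver = "docker"

  config {
    image = "example/app:latest"   # placeholder image
  }

  resources {
    cpu        = 500
    memory     = 512    # reserved amount, used for scheduling decisions
    memory_max = 2048   # hard cap the task may burst up to when the node has headroom
  }
}
```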
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.
I noticed scheduling decisions are being made based on what jobs report as their desired capacity. However, some of the tasks involved might not really use what they originally asked for, or will become idle, thus holding back resources that could be used by other tasks. Are there plans to improve this in the short to mid term?