kubernetes / kubernetes

Production-Grade Container Scheduling and Management
https://kubernetes.io
Apache License 2.0

Resource Advice in Pod Templates #2768

Closed: erictune closed this issue 8 years ago

erictune commented 10 years ago

Starting assumption: Kubernetes should default to hard enforcement of pod memory and CPU limits, and should require pods to make resource requests. That's a premise of this issue; if you disagree with it, let's discuss it in a separate issue.

If you accept that premise, then pods will have to have resource limits set on them. But who should set them?

I think that a lot of people will want to start with a pod spec written by someone else. I'll call that a template, but in this context I don't mean the pod template that a replication controller uses.

The person who writes the pod template is in the best position to know things like the application's resource needs and reasonable defaults.

The person who instantiates the pod template is in the best position to know if his usage scenario is small, medium, large, etc.

How to split those responsibilities, then?

Multiple templates

One approach would be to come up with several templates for different use cases, like this:

Filename some_java_app_1G.pod.template contains:

{ "kind": "Pod", "apiVersion": "v1beta1", "id": "some-java-app",
   "desiredState": {  "manifest": { "containers": [{
        "image": "some-java-app",
        "command": ["java", "-jar", "some-app.jar","-Xmx", "1G"],
        "memory": "1G",
      }]  }}

... and so on for 512MB, 2G, and various sizes. However, this doesn't take advantage of the continuously adjustable resource limits that containers provide.

Parameters to Templates

A template file could declare and document its parameters, perhaps inside comments. Something like this:

File some_java_app.pod.template contains:

# Params:
#    MAX_MEM:
#      Type: Bytes
#      Default: 1G
#      MinRecommended: 128MB
#      Description: Container memory limit.
{ "kind": "Pod", "apiVersion": "v1beta1", "id": "some-java-app",
   "desiredState": {  "manifest": { "containers": [{
        "image": "some-java-app",
        "command": ["java", "-jar", "some-app.jar","-Xmx", "$MAX_MEM"],
        "memory": "1G",
      }]  }}

then you could use a tool to expand the template like this:

ktemplate expand some_java_app.pod.template | kubectl createall

which would use the default value, or you could set your own value:

ktemplate expand --set MAX_MEM=2G some_java_app.pod.template | kubectl createall

and it could emit warnings:

$ ktemplate expand --set MAX_MEM=1M some_java_app.pod.template | kubectl createall
ktemplate: warning: MAX_MEM less than MinRecommended (128MB).
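
A rough sketch of what such an expander could look like, assuming it simply substitutes $NAME references in the template and takes defaults from the parameter header (hard-coded below for brevity; ktemplate itself is a hypothetical tool):

// ktemplate.go: minimal sketch of "ktemplate expand --set NAME=VALUE template-file".
package main

import (
  "fmt"
  "os"
  "strings"
)

func main() {
  // Defaults; a real tool would parse these from the template's "# Params:" header.
  values := map[string]string{"MAX_MEM": "1G"}

  args := os.Args[1:]
  var file string
  for i := 0; i < len(args); i++ {
    switch {
    case args[i] == "expand":
      // subcommand name; nothing else to do in this sketch
    case args[i] == "--set" && i+1 < len(args):
      i++
      if k, v, ok := strings.Cut(args[i], "="); ok {
        values[k] = v
      }
    default:
      file = args[i]
    }
  }

  tmpl, err := os.ReadFile(file)
  if err != nil {
    fmt.Fprintln(os.Stderr, "ktemplate:", err)
    os.Exit(1)
  }

  // Drop the parameter-documentation comments so the output is plain JSON.
  var lines []string
  for _, line := range strings.Split(string(tmpl), "\n") {
    if strings.HasPrefix(strings.TrimSpace(line), "#") {
      continue
    }
    lines = append(lines, line)
  }

  // Substitute $NAME references; a real tool would also check MinRecommended
  // and print warnings like the one shown above.
  out := os.Expand(strings.Join(lines, "\n"), func(name string) string { return values[name] })
  fmt.Print(out)
}

Run as, for example: go run ktemplate.go expand --set MAX_MEM=2G some_java_app.pod.template | kubectl createall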

However, it seems like it is a short step from just substituting variables to wanting to do computations (e.g. set -Xmx to 95% of the container limit).

Generator script

To allow for computations in templates, you could make up a DSL, or you could just let people use whatever language they want, like:

File some_java_app_podmaker.go contains:

package main

import (
  "flag"
  "fmt"
)

var mem = flag.Int("memory", 1024*1024*1024, "memory limit in bytes")

func main() {
  flag.Parse()
  // Give the JVM roughly 90% of the container's memory limit.
  heap := int(0.9 * float64(*mem))
  fmt.Printf(`
{ "kind": "Pod", "apiVersion": "v1beta1", "id": "some-java-app",
   "desiredState": {  "manifest": { "containers": [{
        "image": "some-java-app",
        "command": ["java", "-Xmx%d", "-jar", "some-app.jar"],
        "memory": %d
      }]  }}}
`, heap, *mem)
}

and run like this: go run some_java_app_podmaker.go --memory 2000000000 | kubectl createall

Complex systems

Real examples would have multiple pods and replication controllers, and services and such. How will people share knowledge about how to write more complex config? How would that integrate with horizontal scaling of pods? Automatic vertical scaling?

bgrant0607 commented 10 years ago

I disagree with your premise. I think vertical auto-sizing is the way to go for many users. But this doesn't invalidate your overall premise, which is that a non-negligible number of users will want to provide resource specifications (let's not call them limits, since that's only part of the specification) for pre-existing configurations.

I think you missed one reasonable option: The configuration creator documents in comments the likely resources required, and the configuration user simply copies and edits the configuration for their scenario. Simple, predictable, reproducible, version-able, diff-able when the user wants to "rebase" to an updated configuration.

Multiple copies isn't a bad idea, either. It has most of the same properties as copy-and-modify, potentially saving a step but degrading to the same result.

I see some issues with ktemplate expand --set.

The generator example is one way to generate application-specific configs in general. However, again, a domain-specific pass for this would be reasonable, IMO.

We could define a transformation pass that injected resource specifications. I'd even be ok with handling common language runtimes like Java in a first-class way, but I also think there are 2 better answers for setting Java heap size:

  1. Convey resources into the container via the downward API #386, and create an off-the-shelf wrapper for launching Java and other common languages/runtimes that used the downward API to get the resources and then set runtime-specific resource options (a rough sketch of such a wrapper follows this list). This would be pretty easy if we had container volumes #831 (similar to our internal bind-mountable packages). We could also just create an off-the-shelf Java image for Kubernetes that contained the wrapper.
  2. Create a domain-specific transformation pass to inject environment variables, which could be picked up by a wrapper again, or just by wrapping the command line with a shell. Injecting command-line arguments would be much trickier and, besides, I hate command-line flags and don't want to encourage them as a configuration mechanism. The user could also create their own wrapper which read the arguments from an arbitrary configuration source a la #1553.
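
As a rough sketch of the kind of wrapper described in option 1: assuming the container's memory limit is conveyed through an environment variable (hypothetically named MEMORY_LIMIT_BYTES here, since how the downward API #386 exposes resources was still being designed), a Go launcher could look like this:

package main

import (
  "fmt"
  "os"
  "os/exec"
  "strconv"
)

func main() {
  // MEMORY_LIMIT_BYTES is a hypothetical name for however the downward API
  // ends up exposing the container's memory limit.
  limit, err := strconv.ParseInt(os.Getenv("MEMORY_LIMIT_BYTES"), 10, 64)

  args := []string{}
  if err == nil && limit > 0 {
    // Leave ~10% of the limit for non-heap memory (stacks, JIT, etc.).
    heap := limit * 9 / 10
    args = append(args, fmt.Sprintf("-Xmx%d", heap))
  } else {
    fmt.Fprintln(os.Stderr, "wrapper: no usable memory limit; using JVM defaults")
  }

  // Pass through whatever the pod's command asked the JVM to run,
  // e.g. "-jar some-app.jar".
  args = append(args, os.Args[1:]...)

  cmd := exec.Command("java", args...)
  cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
  if err := cmd.Run(); err != nil {
    os.Exit(1)
  }
}

The pod's command would then invoke the wrapper instead of java directly, e.g. ["java-wrapper", "-jar", "some-app.jar"], and the same pattern extends to other runtimes.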

davidopp commented 10 years ago

I think @dchen1107 and @rjnagal discussed limits with @brendandburns and had some ideas that Dawn was going to write up. Maybe this issue would be a good place to put them. Brendan described it to me but I wasn't at the original discussion so it's better if one of them writes up the summary.

With respect to your proposal, one somewhat futuristic thing that occurred to me is that it would be nice if the Dockerfile format could be parameterized along the lines of your template generator, so a Docker image could be shipped with information about a set of <command line, resource requirements> tuples that you could choose from when you deploy the container. I'm basing this suggestion on the assumption that in many cases the person who creates the Docker container has the best understanding of the resource requirements, and that may be someone who isn't even at the same organization as the person who is deploying it in Kubernetes.

vmarmol commented 10 years ago

@dchen1107 @rjnagal @brendandburns and I spoke a bit about limits on Thursday. Our POV was similar to @bgrant0607's in that we believe that, in the long term, most users will use some auto-scaler to set limits (such that from the node's perspective there are limits), while a small subset will want to set their own limits. To get there, it was thought that enforcing limits on all containers was too big a hammer for current users, so we thought of a way to try to bridge that gap.

The idea is that users that set their own limits today know what their containers require and want those resources to be guaranteed. Users that don't set their limits don't know or don't care what their containers need. On the node, we will artificially create two classes of containers: those with limits and those without. The containers with limits will be guaranteed their resources, while those without will receive them on a best-effort basis. In out-of-resources scenarios we will throttle or kill the containers without limits in favor of those with limits. This encourages users to set limits on containers, but allows blank limits for the time being. The reasoning for doing this at the node level rather than at a higher level is to allow the future inclusion of things like the auto-scaler. Once that component exists, the system continues to work without any changes on the node.
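
As a purely illustrative sketch (not kubelet code) of that reclamation preference, ordering containers without limits ahead of containers with limits when the node has to reclaim a resource:

package main

import "fmt"

type Container struct {
  Name     string
  HasLimit bool // true if the pod specified a limit for the contended resource
}

// pickVictims returns containers in the order they should be reclaimed:
// best-effort (no limit) containers first, guaranteed (limited) ones last.
func pickVictims(containers []Container) []Container {
  var bestEffort, guaranteed []Container
  for _, c := range containers {
    if c.HasLimit {
      guaranteed = append(guaranteed, c)
    } else {
      bestEffort = append(bestEffort, c)
    }
  }
  return append(bestEffort, guaranteed...)
}

func main() {
  victims := pickVictims([]Container{
    {Name: "with-limit", HasLimit: true},
    {Name: "no-limit", HasLimit: false},
  })
  fmt.Println(victims) // the container without a limit is listed, and reclaimed, first
}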

Our focus will initially be with CPU and memory. I think we can have most of this complete in the coming week since the changes are not extensive.

I think that what @erictune brings up here is separate from the above, though, as I see these templates still being useful for any jobs that do set limits.

davidopp commented 10 years ago

With respect to your proposal, one somewhat futuristic thing that occurred to me is that it would be nice if the Dockerfile format could be parameterized along the lines of your template generator, so a Docker image could be shipped with information about a set of tuples that you could choose from when you deploy the container.

I guess markdown doesn't like angle brackets, as it ate part of my sentence. What I wrote was "a set of (command line flag, resource requirement) tuples"

smarterclayton commented 10 years ago

In terms of software consumers, there is a set of useful information that isn't captured in #168: the ability of a pod template author to convey minimum requirements. Most application authors or image creators are likely to be able (although they may not start out doing so) to define a minimum memory requirement for their app, or minimum disk space, or minimum network IO. The value is that it guards against guaranteed failure of pods packed onto nodes below that limit. In a world of people generating and reusing images and pod templates, giving authors the tools to define minimums also seems valuable.

Eric and I talked through this briefly, which is what triggered the parameterization discussion.

bgrant0607 commented 10 years ago

I agree with @rjnagal's proposal to use the specification of limits to set the effective QoS level (#147). I was thinking of putting all limitless containers into a single set of cgroups, which would be dynamically resized to reserve capacity for the containers that set limits. That's not possible to do through Docker at the moment, sadly. (Note that if we could do that, I'd like to do something similar with individual pods.) We're also discussing what we can do with oom adjust and other mechanisms.

As for minimum requirements, I agree we should have it; that's called request in resources.md. I could imagine auto-tuning request values, also, in order to influence at least placement of pods by the scheduler.

bgrant0607 commented 10 years ago

/cc @vishh

smarterclayton commented 8 years ago

Do we need this open still? I think we've defined most of the pieces of this, although we haven't captured the actual philosophy: use requests when writing your software, allow admins to enforce limits (hard or soft) via out-of-band processes, use auto-sizing to estimate in the absence of info, use rescheduling and cluster info to revise initial estimates, and try to avoid over-specifying as an end user (unless you know for sure).

erictune commented 8 years ago

Fine with closing this.

There will be multiple templating systems, and different ones may have different takes on how to default limits.

Admins may enforce upper limits on limits. Users should set request=limit if they need best QoS. Not sure if Guaranteed or Burstable is the best default. Right now there is not enough pressure to pick one or the other.