In the short term, I am thinking about using warm image on sidecars, builders, and base images (read: used by our configs, but our controllers wouldn't create these for users' containers).
This doesn't solve the problem, but hopefully takes a sizable bite out of it.
> base images (read: used by our configs, but our controllers wouldn't create these for users' containers)
If we use warm image for base images, doesn't that make pulling customer images quicker because the shared layers already exist locally?
Agreed on the problem statement. Of the several seconds Diego spends (or used to spend, I haven't kept up with the OCI layerifying work) launching a container, the actual business of launching the container occupies less than a second. Pretty much everything else is costly I/O.
Back during our first look at this problem we figured one possibility would be to weight container placement according to the availability of layers on the Cell (Diego), or in this case Node.
The downside is that it creates a feedback loop. Node A gets a Pod, so has layers 1 and 2. Now another Pod is requested which needs layers 1 and 3. It gets preferentially scheduled to Node A. Now Node A has layers 1, 2 and 3.
This process repeats until Node A runs out of room in other dimensions, then it's Node B's turn for a thumping. In general, I don't think the scheduler should know about layer availability.
The use of "stuff" -- bits, files, whatever -- is presumably going to follow some kind of power law, so a bog-ordinary LRU cache should be fine for layers, especially if it can be prewarmed by a clusterwide supervisor which has learned which layers are in highest demand. In a Java shop, the layer with the JVM should be everywhere. In other places it will be the layer with Node and so on.
The handwaved problem here is the creation of layers that behave well for this problem. I think FTL buys a lot, as does relying on builder systems with special insight (e.g., if I am always going to install node in ubuntu, I should cut my layers before and after node is installed).
@grantr that's exactly the idea. This complements FTL particularly well, since layers should be quite small.
@jchesterpivotal Yes, although there is the gotcha that we can only pull images, not layers (that is, of course, unless we synthesize a new faux-manifest for single layers).
Futzing with scheduling based on image residence is an interesting idea, but as you say has trade-offs.
> The handwaved problem here is the creation of layers that behave well for this problem
This! Generally I expect that whatever we do to improve build will complement this, but I think my ideal solution would enable us to make this exceptional even for BYO-containers built via methods outside of our control (e.g. Dockerfile with a mega-layer).
I've been writing up the scheme I described to @mattmoor a week or two ago, I'll link it once I'm done. I think it deals with the BYOC case.
@jonjohnsonjr pointed this out this morning, feels very relevant: https://github.com/AkihiroSuda/filegrain
In particular, this turned me onto this prior work that I wasn't aware of: https://www.usenix.org/conference/fast16/technical-sessions/presentation/harter
My impression is that content addressing is well suited to this problem (I was going to suggest it yesterday after a discussion with @sclevine about buildpacks stuff, but I see someone has done a much better job).
Microsoft's Git VFS work might also be relevant, given the CAS-plus-lazy-loading aspect of it.
Let us steal all the good things
I want to finally start making some headway on this, but I also expect that this will be a rich area of experimentation and investment for years to come. There are a variety of solutions that have various speed/cost tradeoffs, and so it makes sense for this to start from a position of pluggability.
So what I want to do in the immediate term is to spec the API. I hope this isn't particularly contentious as I've had basically the same API in mind for nearly a year, with only subtle changes as the pluggability plans have developed.
Here's what I'm thinking:
```yaml
apiVersion: caching.internal.knative.dev/v1alpha1
kind: Image
metadata:
  name: foo
  # Namespaced because pull secrets are relative to a namespace.
  namespace: default
  annotations:
    # Configure a specific kind of image caching.
    caching.knative.dev/image.class: <kind>
    # Pass configuration to a specific caching implementation.
    foo.bar.dev/baz: ...
  ownerReferences:
  # We will delete this when the parent Revision is deleted.
spec:
  # The image to attempt to cache.
  image: <image path>
  # optional, used to authenticate with the remote registry.
  imagePullSecrets: <registry secret name>
  # optional, another way of getting pull secrets (they can be attached to the K8s SA).
  serviceAccountName: <robot overlord name here>
# No Status
```
I believe this to be the necessary and sufficient surface to configure image caching. I've tried to elaborate "Why?" for most fields inline (e.g. "Why is this namespaced?", "Why might we need image pull secrets or a service account name?").
Why no status?
A few reasons:
Here are a handful of integrations that I'd like to pursue in the near term:

- knative/serving: Add caching of sidecar images to the knative-serving namespace.
- knative/serving: Have revisions start to instantiate these as sub-resources with the user's image.
- knative/build: Have BuildTemplate create these as sub-resources for step images (cc @ImJasonH).

I plan to port github.com/mattmoor/warm-image to this model (perhaps in a branch) as a PoC that things work. As I've stated in the past, I'd like to get performance automation in place before investing substantially in performance improvements, so being able to get a side-by-side would make me really happy. I can also consider moving this under knative/foo, if there is interest, but "no caching" would likely remain the default install option.
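For illustration, here is a minimal sketch of what a naive implementation might stamp out for each Image resource: a DaemonSet whose pods pull the target image onto every node and then simply idle, so the image sits in each kubelet's cache. The names, labels, and no-op entrypoint below are illustrative assumptions, not the actual warm-image code.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: warm-foo              # derived from the Image's name (illustrative)
  namespace: default          # same namespace, so the pull secrets resolve
spec:
  selector:
    matchLabels:
      caching.knative.dev/image: foo
  template:
    metadata:
      labels:
        caching.knative.dev/image: foo
    spec:
      # The init container's only job is to force the kubelet to pull the image;
      # the no-op command is an assumption and won't exist in every image.
      initContainers:
      - name: pull
        image: <image path>
        command: ["/bin/sh", "-c", "exit 0"]
      # Keep the pod alive cheaply so the DaemonSet doesn't crash-loop.
      containers:
      - name: idle
        image: k8s.gcr.io/pause:3.1
      imagePullSecrets:
      - name: <registry secret name>
      serviceAccountName: <robot overlord name here>
```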
cc @evankanderson @vaikas-google @ImJasonH
I think this is great. I do wonder if it would be useful to have a status regardless, not necessarily to be depended on by anything, but to communicate error conditions, debug information, etc. I don't feel strongly about that, however; it just seems like a good thing to do.
Sounds good. I will add a status conforming to our conventions, but assert that this is purely diagnostic and that cache clients shouldn't expect readiness before asserting their own readiness.
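For concreteness, a purely diagnostic status along those conventions might look roughly like this (a sketch; the condition type, reason, and message here are illustrative, not settled):

```yaml
status:
  conditions:
  - type: Ready
    status: "False"
    reason: PullFailed
    message: "unable to resolve <image path>: manifest unknown"
```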
Now that the caching API is in and image.caching.internal.knative.dev resources are being created for Knative Services, what else do we need to do for this issue for the 0.2 release? Is there a specific implementation (warm-image or another) we'd like to get in for 0.2?
From a security standpoint, caching the image for every Knative Service on every Node does make it easier to mount a class of attacks that focus on exploiting sensitive information stored inside images. Any user would be able to spin up any other user's image just by guessing the name. Normally, in a security-conscious cluster, the solution is something akin to enabling the AlwaysPullImages admission plugin to force ImagePullPolicy to Always. Does warm-image address anything in this area? Or can we assume that users in an environment where this kind of security is a concern will disable image caching?
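(For context, that plugin is enabled on the API server and rewrites every container's imagePullPolicy to Always, so the kubelet re-validates registry credentials on each pod start instead of reusing a locally cached image. A sketch, with the flag list abbreviated and values illustrative:)

```yaml
# Fragment of a kube-apiserver static pod manifest (illustrative).
spec:
  containers:
  - name: kube-apiserver
    image: k8s.gcr.io/kube-apiserver:v1.12.0
    command:
    - kube-apiserver
    - --enable-admission-plugins=NodeRestriction,AlwaysPullImages
    # ...other flags elided...
```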
@bbrowning I think what I had in mind for 0.2 was to have the extension point in place, so I'm moving this into "Done" for 0.2.
Related to this are:

- The WarmImage PoC, which implements this in a fairly naive (but effective) manner for small clusters.
- The Cachier PoC, which leverages K8s duck typing to decorate built-in K8s types with our caching resources.

Part of my reluctance to push this further in the 0.2 timeframe is that this whole area is a giant matrix of trade-offs, which will be subjective to operator cost/tenancy concerns (as you highlight above!). I'd also like to see us land performance automation, so we can quantify the impact of particular implementations under different scenarios.
I agree that just having this extension in place for 0.2 is a good start to explore further. And with Kubernetes 1.12's scheduler taking image locality into account by default (https://github.com/kubernetes/kubernetes/blob/9ba74cb5b5b9cfacbee98f61712603e0d973c8eb/CHANGELOG-1.12.md#sig-scheduling), that may influence the evolution for some of the caching implementations.
I'll be interested to watch the k8s locality-awareness evolving.
I have an instinct that precaching images falls loosely into the same problem category as autoscaling with automatic min/max selection: inventory with overage/underage costs.
Loosely, a cache hit is "in stock". A cache miss is a "stockout". The cost of a miss is "underage cost", the cost of keeping an item in the cache is "overage cost". Cache entries have an "inventory cost" representing the expense of RAM or disk space.
As with autoscaling, we can make it easier to map the technical knobs and dials back to a business question: how much are you prepared to pay for a particular probability of hit/miss?
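One standard way to make that trade-off concrete (an analogy only, not something the caching API needs to encode) is the newsvendor critical ratio: if a miss costs Cu (the cold-start penalty) and keeping an image resident costs Co (disk/RAM plus eviction churn), the hit probability worth paying for is roughly Cu / (Cu + Co). For example, if a miss is judged 20x as costly as caching one more image, that suggests targeting a hit rate of about 20/21, or roughly 95%.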
@jchesterpivotal which is why I expect we will see a variety of implementations that cater to operators who are comfortable with different cost/performance trade-offs :)
(so I think this is a GREAT place to see what the community builds around this extension point)
Issues go stale after 90 days of inactivity.
Mark the issue as fresh by adding the comment /remove-lifecycle stale.
Stale issues rot after an additional 30 days of inactivity and eventually close.
If this issue is safe to close now please do so by adding the comment /close.
Send feedback to Knative Productivity Slack channel or file an issue in knative/test-infra.
/lifecycle stale
Stale issues rot after 30 days of inactivity.
Mark the issue as fresh by adding the comment /remove-lifecycle rotten.
Rotten issues close after an additional 30 days of inactivity.
If this issue is safe to close now please do so by adding the comment /close.
Send feedback to Knative Productivity Slack channel or file an issue in knative/test-infra.
/lifecycle rotten
/remove-lifecycle rotten
The tentative plan here is to follow on from the subsetting and node-local scheduling work with something that does a trivial prefetch to ensure the node has the image.
Issues go stale after 90 days of inactivity.
Mark the issue as fresh by adding the comment /remove-lifecycle stale.
Stale issues rot after an additional 30 days of inactivity and eventually close.
If this issue is safe to close now please do so by adding the comment /close.
Send feedback to Knative Productivity Slack channel or file an issue in knative/test-infra.
/lifecycle stale
/remove-lifecycle stale
/lifecycle frozen
Hello, 3 year old issue!
I know this is an exciting and dear-to-heart issue, but I'm wondering whether we expect keeping this issue open to actively change the status quo, versus closing the issue and keeping a backlog of "things to explore to improve performance" in some other format, like a Markdown or Google document?
/triage needs-user-input
/area API
/area autoscale
/kind proposal
It seems like #5913 (document integration with estargz) might be a concrete conclusion.
Closing this unless someone decides to actively resolve the interesting kubernetes-related discussions above.
/close
@evankanderson: Closing this issue.
Problem Statement
Our binary atom in Elafros (like Kubernetes) is a container. A significant (and highly variable) contributor to the latency of adding pod capacity in Kubernetes is the time it takes to pull the container image down to the node. In my experience, 30s pulls aren't atypical.
The serverless space has come to expect "hyperscale", or the ability to scale from nothing to 1000s or 10000s of instances in seconds. I would posit that the single largest hurdle to implementing this on top of an adequately provisioned Kubernetes cluster is the latency to pull down a container image.
Consider this article on K8s scaling in 1.6. It talks about two things worth calling out (emphasis mine):
(Very naively) What this says to me is that if I were to ask the master for 150k pods, it could have 148500 (14x the expectation above) up within 5s if image pull latency were not a problem.
Discuss!