In the short term, I am thinking about using warm image on sidecars, builders, and base images (read: used by our configs, but our controllers wouldn't create these for users' containers).
This doesn't solve the problem, but hopefully takes a sizable bite out of it.
> base images (read: used by our configs, but our controllers wouldn't create these for users' containers)
If we use warm image for base images, doesn't that make pulling customer images quicker because the shared layers already exist locally?
Agreed on the problem statement. Of the several seconds Diego spends (or used to spend, I haven't kept up with the OCI layerifying work) launching a container, the actual business of launching the container occupies less than a second. Pretty much everything else is costly I/O.
Back during our first look at this problem we figured one possibility would be to weight container placement according to the availability of layers on the Cell (Diego), or in this case Node.
The downside is that it creates a feedback loop. Node A gets a Pod, so has layers 1 and 2. Now another Pod is requested which needs layers 1 and 3. It gets preferentially scheduled to Node A. Now Node A has layers 1, 2 and 3.
This process repeats until Node A runs out of room in other dimensions, then it's Node B's turn for a thumping. In general, I don't think the scheduler should know about layer availability.
The use of "stuff" -- bits, files, whatever -- is presumably going to follow some kind of power law, so a bog-ordinary LRU cache should be fine for layers, especially if it can be prewarmed by a clusterwide supervisor which has learned which layers are in highest demand. In a Java shop, the layer with the JVM should be everywhere. In other places it will be the layer with Node and so on.
The handwaved problem here is the creation of layers that behave well for this problem. I think FTL buys a lot, as does relying on builder systems with special insight (e.g., if I am always going to install node in ubuntu, I should cut my layers before and after node is installed).
@grantr that's exactly the idea. This complements FTL particularly well, since layers should be quite small.
@jchesterpivotal Yes, although there is the gotcha that we can only pull images, not layers (that is, of course, unless we synthesize a new faux-manifest for single layers).
Futzing with scheduling based on image residence is an interesting idea, but as you say has trade-offs.
> The handwaved problem here is the creation of layers that behave well for this problem
This! Generally I expect that whatever we do to improve build will complement this, but I think my ideal solution would enable us to make this exceptional even for BYO-containers built via methods outside of our control (e.g. Dockerfile with a mega-layer).
I've been writing up the scheme I described to @mattmoor a week or two ago, I'll link it once I'm done. I think it deals with the BYOC case.
@jonjohnsonjr pointed this out this morning, feels very relevant: https://github.com/AkihiroSuda/filegrain
In particular, this turned me onto this prior work that I wasn't aware of: https://www.usenix.org/conference/fast16/technical-sessions/presentation/harter
My impression is that content addressing is well suited to this problem (I was going to suggest it yesterday after a discussion with @sclevine about buildpacks stuff, but I see someone has done a much better job).
Microsoft's Git VFS work might also be relevant, given the CAS-plus-lazy-loading aspect of it.
Let us steal all the good things
I want to finally start making some headway on this, but I also expect that this will be a rich area of experimentation and investment for years to come. There are a variety of solutions that have various speed/cost tradeoffs, and so it makes sense for this to start from a position of pluggability.
So what I want to do in the immediate term is to spec the API. I hope this isn't particularly contentious as I've had basically the same API in mind for nearly a year, with only subtle changes as the pluggability plans have developed.
Here's what I'm thinking:
```yaml
apiVersion: caching.internal.knative.dev/v1alpha1
kind: Image
metadata:
  name: foo
  # Namespaced because pull secrets are relative to a namespace.
  namespace: default
  annotations:
    # Configure a specific kind of image caching.
    caching.knative.dev/image.class: <kind>
    # Pass configuration to a specific caching implementation.
    foo.bar.dev/baz: ...
  ownerReferences:
  # We will delete this when the parent Revision is deleted.
spec:
  # The image to attempt to cache.
  image: <image path>
  # optional, used to authenticate with the remote registry.
  imagePullSecrets: <registry secret name>
  # optional, another way of getting pull secrets (they can be attached to the K8s SA).
  serviceAccountName: <robot overlord name here>
# No Status
```
I believe this to be the necessary and sufficient surface to configure image caching. I've tried to elaborate "Why?" for most fields inline (e.g. "Why is this namespaced?", "Why might we need image pull secrets or a service account name?").
Why no status?
A few reasons:
Here are a handful of integrations that I'd like to pursue in the near term:

- knative/serving: Add caching of sidecar images to the knative-serving namespace.
- knative/serving: Have revisions start to instantiate these as sub-resources with the user's image.
- knative/build: Have BuildTemplate create these as sub-resources for step images (cc @ImJasonH).

I plan to port github.com/mattmoor/warm-image to this model (perhaps in a branch) as a PoC that things work. As I've stated in the past, I'd like to get performance automation in place before investing substantially in performance improvements, so being able to get a side-by-side would make me really happy. I can also consider moving this under knative/foo, if there is interest, but "no caching" would likely remain the default install option.
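For illustration, here is a minimal sketch of what a naive implementation might stamp out for each Image resource: a DaemonSet whose pods pull the target image onto every node and then simply idle, so the image sits in each kubelet's cache. The names, labels, and no-op entrypoint below are illustrative assumptions, not the actual warm-image code.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: warm-foo              # derived from the Image's name (illustrative)
  namespace: default          # same namespace, so the pull secrets resolve
spec:
  selector:
    matchLabels:
      caching.knative.dev/image: foo
  template:
    metadata:
      labels:
        caching.knative.dev/image: foo
    spec:
      # The init container's only job is to force the kubelet to pull the image;
      # the no-op command is an assumption and won't exist in every image.
      initContainers:
      - name: pull
        image: <image path>
        command: ["/bin/sh", "-c", "exit 0"]
      # Keep the pod alive cheaply so the DaemonSet doesn't crash-loop.
      containers:
      - name: idle
        image: k8s.gcr.io/pause:3.1
      imagePullSecrets:
      - name: <registry secret name>
      serviceAccountName: <robot overlord name here>
```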
cc @evankanderson @vaikas-google @ImJasonH
I think this is great. I do wonder if it would be useful to have a status regardless, not necessarily to be depended on by anything, but to communicate error conditions, debug information, etc. I don't feel strongly about that, however; it just seems like a good thing to do.
Sounds good. I will add a status conforming to our conventions, but assert that this is purely diagnostic and that cache clients shouldn't expect readiness before asserting their own readiness.
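For concreteness, a purely diagnostic status along those conventions might look roughly like this (a sketch; the condition type, reason, and message here are illustrative, not settled):

```yaml
status:
  conditions:
  - type: Ready
    status: "False"
    reason: PullFailed
    message: "unable to resolve <image path>: manifest unknown"
```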
Now that the caching API is in and image.caching.internal.knative.dev resources are being created for Knative Services, what else do we need to do for this issue for the 0.2 release? Is there a specific implementation (warm-image or another) we'd like to get in for 0.2?
From a security standpoint, caching the image for every Knative Service on every Node does make it easier to mount a class of attacks that focus on exploiting sensitive information stored inside images. Any user would be able to spin up any other user's image just by guessing the name. Normally, in a security-conscious cluster, the solution is something akin to enabling the AlwaysPullImages admission plugin to force ImagePullPolicy to Always. Does warm-image address anything in this area? Or can we assume that users in an environment where this kind of security is a concern will disable image caching?
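(For context, that plugin is enabled on the API server and rewrites every container's imagePullPolicy to Always, so the kubelet re-validates registry credentials on each pod start instead of reusing a locally cached image. A sketch, with the flag list abbreviated and values illustrative:)

```yaml
# Fragment of a kube-apiserver static pod manifest (illustrative).
spec:
  containers:
  - name: kube-apiserver
    image: k8s.gcr.io/kube-apiserver:v1.12.0
    command:
    - kube-apiserver
    - --enable-admission-plugins=NodeRestriction,AlwaysPullImages
    # ...other flags elided...
```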
@bbrowning I think what I had in mind for 0.2 was to have the extension point in place, so I'm moving this into "Done" for 0.2.
Related to this are:

- The WarmImage PoC, which implements this in a fairly naive (but effective) manner for small clusters.
- The Cachier PoC, which leverages K8s duck typing to decorate built-in K8s types with our caching resources.

Part of my reluctance to push this further in the 0.2 timeframe is that this whole area is a giant matrix of trade-offs, which will be subjective to operator cost/tenancy concerns (as you highlight above!). I'd also like to see us land performance automation, so we can quantify the impact of particular implementations under different scenarios.
I agree that just having this extension in place for 0.2 is a good start to explore further. And with Kubernetes 1.12's scheduler taking image locality into account by default (https://github.com/kubernetes/kubernetes/blob/9ba74cb5b5b9cfacbee98f61712603e0d973c8eb/CHANGELOG-1.12.md#sig-scheduling), that may influence the evolution for some of the caching implementations.
I'll be interested to watch the k8s locality-awareness evolving.
I have an instinct that precaching images falls loosely into the same problem category as autoscaling with automatic min/max selection: inventory with overage/underage costs.
Loosely, a cache hit is "in stock". A cache miss is a "stockout". The cost of a miss is "underage cost", the cost of keeping an item in the cache is "overage cost". Cache entries have an "inventory cost" representing the expense of RAM or disk space.
As with autoscaling, we can make it easier to map the technical knobs and dials back to a business question: how much are you prepared to pay for a particular probability of hit/miss?
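One standard way to make that trade-off concrete (an analogy only, not something the caching API needs to encode) is the newsvendor critical ratio: if a miss costs Cu (the cold-start penalty) and keeping an image resident costs Co (disk/RAM plus eviction churn), the hit probability worth paying for is roughly Cu / (Cu + Co). For example, if a miss is judged 20x as costly as caching one more image, that suggests targeting a hit rate of about 20/21, or roughly 95%.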
@jchesterpivotal which is why I expect we will see a variety of implementations that cater to operators who are comfortable with different cost/performance trade-offs :)
(so I think this is a GREAT place to see what the community builds around this extension point)
Issues go stale after 90 days of inactivity.
Mark the issue as fresh by adding the comment /remove-lifecycle stale.
Stale issues rot after an additional 30 days of inactivity and eventually close.
If this issue is safe to close now please do so by adding the comment /close.
Send feedback to Knative Productivity Slack channel or file an issue in knative/test-infra.
/lifecycle stale
Stale issues rot after 30 days of inactivity.
Mark the issue as fresh by adding the comment /remove-lifecycle rotten.
Rotten issues close after an additional 30 days of inactivity.
If this issue is safe to close now please do so by adding the comment /close.
Send feedback to Knative Productivity Slack channel or file an issue in knative/test-infra.
/lifecycle rotten
/remove-lifecycle rotten
The tentative plan here is to follow on from the subsetting and node-local scheduling work with something that does a trivial prefetch to ensure the node has the image.
Issues go stale after 90 days of inactivity.
Mark the issue as fresh by adding the comment /remove-lifecycle stale.
Stale issues rot after an additional 30 days of inactivity and eventually close.
If this issue is safe to close now please do so by adding the comment /close.
Send feedback to Knative Productivity Slack channel or file an issue in knative/test-infra.
/lifecycle stale
/remove-lifecycle stale
/lifecycle frozen
Hello, 3 year old issue!
I know this is an exciting and dear-to-heart issue, but I'm wondering whether we expect keeping this issue open to actively change the status quo, versus closing the issue and keeping a backlog of "things to explore to improve performance" in some other format, like a Markdown or Google document?
/triage needs-user-input
/area API
/area autoscale
/kind proposal
It seems like #5913 (document integration with estargz) might be a concrete conclusion.
Closing this unless someone decides to actively resolve the interesting kubernetes-related discussions above.
/close
@evankanderson: Closing this issue.
Problem Statement
Our binary atom in Elafros (like Kubernetes) is a container. A significant (and highly variable) contributor to the latency of adding pod capacity in Kubernetes is the time it takes to pull the container image down to the node. In my experience, 30s pulls aren't atypical.
The serverless space has come to expect "hyperscale", or the ability to scale from nothing to 1000s or 10000s of instances in seconds. I would posit that the single largest hurdle to implementing this on top of an adequately provisioned Kubernetes cluster is the latency to pull down a container image.
Consider this article on K8s scaling in 1.6. It talks about two things worth calling out (emphasis mine):
(Very naively) What this says to me is that if I were to ask the master for 150k pods, it could have 148500 (14x the expectation above) up within 5s if image pull latency were not a problem.
Discuss!