I'm taking a first stab at this. But instead of having Prometheus scrape the pods, I'm having the Autoscaler scrape them directly. This is because having Prometheus as a hard dependency and critical-path runtime component is a concern.
Scraping pods is hard when sidecar injection / mTLS is enabled, I believe. This is a limitation of Istio, and for Prometheus we may have to use the version that comes with Istio to overcome it. I'm not sure we can do the same for our autoscaler.
/cc @ZhiminXiang
Istio is working on a feature to support scraping pods from the Prometheus pre-installed by Istio when mTLS is enabled. Currently we use our own Prometheus, not the one installed by Istio, which could potentially block scraping pods after mTLS is enabled. /cc @mdemirhan
Scraping pods is hard when sidecar injection / mTLS is enabled
@tcnghia What makes it hard?
@josephburnett if we want to enable mTLS, then Prometheus needs to have the Istio sidecar injected. After injecting the sidecar into Prometheus, Prometheus is not able to scrape pods directly because the Istio sidecar does not allow accessing a Pod by its Pod IP.
the Istio sidecar does not allow accessing a Pod by its Pod IP
What technical reason prevents the Istio sidecar from allowing access by Pod IP? E.g. can we fix this?
One limitation is that the entire network graph is replicated in all the Envoy sidecars. Restricting the network nodes to K8s Services, VirtualServices, and similar helps them (a) reduce the graph size and, more importantly, (b) reduce how many updates they need to handle.
There is pending work on the Istio side to support incremental network updates, but it is still in the design phase.
What I want is to stop injecting knowledge about the autoscaler into the deployment from within the revision controller, which is a layering violation (@mattmoor). I just want pods to expose their metrics in the standard Prometheus format, and whatever autoscaling implementation is in charge of the PodAutoscaler will go get them.
The limitations I'm trying to work around are:
How about this as a way to bootstrap metrics into the mesh? Create a daemonset pusher pair, one within the mesh and one without. The pusher outside (A) scrapes pod metrics and the pusher inside (B) forwards them to the autoscaler over mTLS.
cc @mattmoor @evankanderson @vaikas-google @mdemirhan @tcnghia
I want to make sure I understand the comment in https://github.com/knative/serving/issues/2203#issuecomment-435511406. Right now, the only way to access a pod across the service mesh is when there is a VirtualService entry which includes the Pod. For various reasons (scalability, sanity, latency), we don't want to create a VirtualService for each Pod to enable individual addressability.
A few thoughts:

1. POST /scalemetrics ==> streaming body request of metric values (and/or a gRPC equivalent of the same)?

Item 1 seems particularly attractive, if we can use something like gRPC or a streaming response to collect items at a fine granularity.
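For illustration, a minimal sketch of what such a streaming push endpoint on the autoscaler could look like (the /scalemetrics path and the newline-delimited JSON wire format are just assumptions for the sketch, not an existing API):

```go
package main

import (
	"bufio"
	"encoding/json"
	"log"
	"net/http"
)

// StatMessage is an illustrative payload: one concurrency observation
// reported by a pod.
type StatMessage struct {
	PodName            string  `json:"podName"`
	AverageConcurrency float64 `json:"averageConcurrency"`
}

// scaleMetricsHandler reads a streaming request body of newline-delimited
// JSON stats and processes each one as it arrives.
func scaleMetricsHandler(w http.ResponseWriter, r *http.Request) {
	defer r.Body.Close()
	scanner := bufio.NewScanner(r.Body)
	for scanner.Scan() {
		var stat StatMessage
		if err := json.Unmarshal(scanner.Bytes(), &stat); err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		// A real implementation would feed the autoscaler's metric
		// aggregation here instead of logging.
		log.Printf("received stat from %s: %v", stat.PodName, stat.AverageConcurrency)
	}
	w.WriteHeader(http.StatusAccepted)
}

func main() {
	http.HandleFunc("/scalemetrics", scaleMetricsHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

A gRPC bidirectional stream would serve the same purpose without hand-rolled framing.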
@evankanderson we push today, which has some problems:
Switching to a pull model (theoretically) enables a trusted process to ask for request metrics from untrusted pods with a stronger sense of identity.
I'll pitch a third (strawman) alternative: expose the metrics endpoint via a Service on the mesh, and have the endpoint return the name of the Pod serving the request, then sample over the Service according to the scale of the Revision (assuming reasonable load balancing, this should get us pretty close).
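As a rough sketch of what the sampler side of that strawman could look like (the Service URL, JSON shape, and sample count are made up for illustration):

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// sampleStat is an illustrative response shape: the metrics endpoint
// reports which pod answered and that pod's current concurrency.
type sampleStat struct {
	PodName     string  `json:"podName"`
	Concurrency float64 `json:"concurrency"`
}

// sampleService makes n requests against the mesh-routable metrics Service
// and groups the observations by the pod that answered each request.
func sampleService(url string, n int) (map[string][]float64, error) {
	byPod := make(map[string][]float64)
	for i := 0; i < n; i++ {
		resp, err := http.Get(url)
		if err != nil {
			return nil, err
		}
		var s sampleStat
		if err := json.NewDecoder(resp.Body).Decode(&s); err != nil {
			resp.Body.Close()
			return nil, err
		}
		resp.Body.Close()
		byPod[s.PodName] = append(byPod[s.PodName], s.Concurrency)
	}
	return byPod, nil
}

func main() {
	samples, err := sampleService("http://my-revision-metrics.default.svc.cluster.local/metrics", 16)
	if err != nil {
		panic(err)
	}
	for pod, obs := range samples {
		fmt.Println(pod, obs)
	}
}
```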
@mattmoor, your third strawman proposal is not so bad. One thing we have going for us is that we don't have to have all the data points to determine what the likely average pod request concurrency is. We can sample the population.
Because Istio is load-balancing, I would expect population variance to be pretty low. E.g. if traffic spikes, I would expect to see a spike in traffic across most of the pods. I think that 16 samples would be enough. And we can collect those over the 6-second panic window so querying a revision metric service at 3 qps should be enough. @glyn and @jchesterpivotal, what do you think?
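As a back-of-the-envelope check, assuming roughly independent samples: if per-pod concurrency has standard deviation $\sigma$, the standard error of the mean estimated from $n$ samples is

$$\mathrm{SE} \approx \frac{\sigma}{\sqrt{n}},$$

so $n = 16$ cuts the noise to about $\sigma/4$. Load balancing makes the pods somewhat correlated, so treat this as a rough estimate rather than a guarantee.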
I would propose that the PodAutoscaler Reconciler own the metrics service. It can set it up for kpa-class PodAutoscalers and get the labels from the scaleRef target.
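A minimal sketch of what that could look like (the helper name, port, and package layout here are hypothetical, not the actual KPA reconciler code):

```go
package resources

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// MakeMetricsService builds a Service whose selector is copied from the
// scaleRef target's pod labels, exposing only a metrics port for sampling.
func MakeMetricsService(paName, namespace string, selector map[string]string) *corev1.Service {
	return &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{
			Name:      paName + "-metrics",
			Namespace: namespace,
		},
		Spec: corev1.ServiceSpec{
			Selector: selector,
			Ports: []corev1.ServicePort{{
				Name:       "metrics",
				Port:       9090,
				TargetPort: intstr.FromInt(9090),
			}},
		},
	}
}
```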
Random Internet calculator says ...
Also, sampling will dramatically reduce the load on the autoscaler for large revisions. E.g. for a revision with 1000 pods, the autoscaler will only need to process 3 data points per second instead of 1000.
When a sample turns out to be inaccurate, which it is bound to be some of the time, the autoscaler will either under- or over-scale. Under-scaling will presumably impact response time; over-scaling will increase costs unnecessarily. A later sample may correct the problem although the overall effect will be some measure of instability, possibly even under a fairly constant load.
Perhaps decreasing the sample size as load increases, as I think @mattmoor may have been suggesting, would help?
Yes, I think reducing sample rate as the total load increases is fine.
The overall behaviour of the system should be favourably non-linear with scale, since small populations are noisier and pooling sources of variability can lower average variability. As load goes up, service levels should become more stable for any given amount of change in load.
I haven't thought through how this would affect a PID controller. I assume that one or more of the PID coefficients would change according to the total load situation -- for example, do we make the controller more sensitive to short term fluctuations (P, D) if load is lower?
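(For concreteness, the standard PID control law is

$$u(t) = K_p\,e(t) + K_i \int_0^t e(\tau)\,d\tau + K_d\,\frac{de(t)}{dt},$$

where $e(t)$ is the error between the observed and target metric; making the controller "more sensitive to short-term fluctuations" roughly means raising $K_p$ and $K_d$ relative to $K_i$.)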
I haven't thought through how this would affect a PID controller. I assume that one or more of the PID coefficients would change according to the total load situation [...]
I suspect that a PID controller with variable coefficients opens up a whole realm of possibilities, some good and some not so good.
Touché.
A later sample may correct the problem although the overall effect will be some measure of instability, possibly even under a fairly constant load.
The effects of a bad sample will be more pronounced in the 6-second panic window (rather than the 60-second stable window). But the panic window is only used to scale when the autoscaler is panicking. And then it never decreases the number of pods, precisely to prevent this kind of flapping (instability).
With 3 qps, a single bad sample will be averaged into roughly 180 samples over the stable window, so its effect will be small.
Additionally, we can add some hysteresis to dampen the noise. This is something I wanted to do anyway because we do see a little bit of hopping between two values.
@glyn, @jchesterpivotal what do you think about the 3 qps value? That's 3 qps for the whole revision. So the 6-second sample size will always be 18. And the 60-second sample size will always be 180. Even at high scale.
I'm going to move forward with the proposal to sample a metrics service, owned by the KPA: https://github.com/knative/serving/issues/2203#issuecomment-436039743
We can continue to discuss what the appropriate rate per scale is.
I think 3 samples/sec is fine; we can tune it easily enough with further testing and simulation.
One interesting point @tcnghia brought up is the accuracy of our scale-to-zero signal when we are sampling metrics (instead of collecting them all). I think we are still okay because 1) requests will be routed through the KBuffer before we turn down the Deployment, and the KBuffer still pushes metrics to the Autoscaler; and 2) before scale-to-zero, the Deployment will be at low scale (likely 1 or close to it) and we will sample that one pod 3 times per second, so we're unlikely to miss a request. That kind of race condition already existed anyway: a request could come in after we decide to scale to zero and we would miss it. Hence the grace period.
@glyn, @jchesterpivotal what do you think about the 3 qps value?
I have no idea whether this will be sufficient. I think you'll need to try it, or at least model it fairly realistically, to have any confidence. (This feels very much like the problem of performance optimisation and the same dictum probably applies: "Measure, don't guess".)
/assign @yanweiguo as I will be out for the next 2 weeks.
/assign @yanweiguo
@yanweiguo you might wait until next week to start on this, when I'm back in Seattle (Dec 3rd). As part of adding the autoscaling.knative.dev/target annotation (https://github.com/knative/serving/issues/2205), I'm relayering the KPA metrics interface to allow KPA autoscaler targets to be updated. I want to avoid any massive code merge conflicts. 🙂
Also, it seems that Envoy ignores ports not mentioned in the K8s Service so you may be able to scrape metrics directly after all. It's worth reconsidering which approach to take (direct or service, sample or exhaustive).
I was discussing having Prometheus do the scraping for us with @k4leung4 yesterday and wanted to summarize a couple of cons of that approach in case it is being considered as the path forward:
Current behavior
Currently the autoscaler gets its metrics directly from the queue-proxy (the queue-proxy pushes to the autoscaler).
Desired behavior
Instead we want to collect pod metrics from a Prometheus endpoint on the pods (https://github.com/knative/serving/issues/2202), either by scraping with Prometheus or directly with the autoscaler.
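For reference, a minimal sketch of what such an endpoint on the pod could look like using the Prometheus Go client (the metric name and port are illustrative, not the actual queue-proxy ones):

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// requestConcurrency is an illustrative gauge a pod (e.g. the queue-proxy)
// could export; the metric name is made up for this sketch.
var requestConcurrency = prometheus.NewGauge(prometheus.GaugeOpts{
	Name: "queue_average_concurrent_requests",
	Help: "Average number of requests currently being handled by this pod.",
})

func main() {
	prometheus.MustRegister(requestConcurrency)
	// Whoever owns scaling (Prometheus or the autoscaler itself) can now
	// pull from :9090/metrics in the standard Prometheus text format.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9090", nil))
}
```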
Example: Prometheus scraping the activator
https://github.com/knative/serving/blob/c6d9dc9409114a831aacd6f59b727c90f5901064/config/monitoring/200-common/300-prometheus/100-scrape-config.yaml#L56