knative / serving

Kubernetes-based, scale-to-zero, request-driven compute
https://knative.dev/docs/serving/
Apache License 2.0

Switch Activator to DaemonSet #6550

Closed vagababov closed 3 years ago

vagababov commented 4 years ago

In what area(s)?

/area autoscale

Describe the feature

Currently we run the Activator as a Deployment with an HPA autoscaler. For Node Local Scheduling, we need to run the Activator as a DaemonSet. The work here consists of several steps:

  1. Find the appropriate instance requirements.
    • Currently we're requesting 300m of CPU and 400MB of memory for each instance. This presumes running just a few instances.
    • Our integration tests explicitly ask for 2 instances.
    • For benchmarks we are asking for 10-15 depending on the test, though they are strongly over-provisioned there. On my cluster I have reliably achieved the desired results with 4-5 instances per 1000 RPS and 150 target pods with CC=1 on a 15-node cluster.
    • If we're going to switch to a DaemonSet, we probably need to lower these numbers somewhat.
    • We need to measure against the benchmarks to make sure the current performance levels do not degrade.
  2. Provide an optional template that permits running the Activator as a DS in v0.13 (see the sketch after this list).
  3. Switch to the DaemonSet as the default and set the HPA min limits and the Deployment replicas to 0 in v0.14.
  4. [possibly] Remove the existing configuration in v0.15.
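As a rough illustration of what the optional template in step 2 might look like -- a minimal sketch only, assuming it mirrors the existing activator Deployment's labels and image in the knative-serving namespace, with placeholder resource requests pending the measurements in step 1:

```yaml
# Hypothetical sketch, not the shipped manifest. Names, labels and the image
# reference mirror the existing activator Deployment; the lowered requests
# are illustrative placeholders to be validated against the benchmarks.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: activator
  namespace: knative-serving
spec:
  selector:
    matchLabels:
      app: activator
  template:
    metadata:
      labels:
        app: activator
    spec:
      containers:
        - name: activator
          image: ko://knative.dev/serving/cmd/activator  # same image as the Deployment
          resources:
            requests:
              cpu: 100m      # lowered from 300m (placeholder)
              memory: 100Mi  # lowered from 400MB (placeholder)
```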

/assign @vagababov /cc mattmoor @markusthoemmes

mattmoor commented 4 years ago

We want the DaemonSet to exist for sure for an overlapping release with the Deployment before we scale it down, otherwise we are racing the creation of new pods against collecting old pods.

When we create the DS the deployment should naturally scale down because so much of its load should be offloaded.

I think the only other nit is that step 2 should set replicas to zero on the deployment, as I doubt that doing so on the HPA scales it to zero.
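For context, a general Kubernetes note rather than anything specific to this repo: an HPA won't scale its target below minReplicas (and minReplicas: 0 requires the alpha HPAScaleToZero feature gate), whereas the HPA leaves a workload alone once its own replicas field is 0. So the scale-down would roughly amount to:

```yaml
# Illustrative fragment only: park the old activator Deployment by setting
# its replicas to 0 explicitly; the HPA does not act on a target whose
# replica count is 0, so it won't fight this.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: activator
  namespace: knative-serving
spec:
  replicas: 0
```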

markusthoemmes commented 4 years ago

Is the decision to make the activator our scheduling unit final? I still think that the activator might need to scale differently than „per node“ for its dataplane duties 🤔

mattmoor commented 4 years ago

I still think that the activator might need to scale differently than „per node“ for its dataplane duties.

Can you elaborate?

markusthoemmes commented 4 years ago

Sure. With OpenWhisk we had a similar model of having the scheduling components distributed per node, each scheduling local containers. That was a mistake, but we're not 100% in the same situation here. However, for OpenWhisk it would've been very beneficial to separate the scheduling components (which can be per node for local state) from the data-plane components, which can in theory handle multiple nodes or, if the load is super high, share the load of one node between them.

With the activator handling more and more traffic in more and more situations, I wonder whether we want to tie its scale to the number of nodes. That, for example, prohibits us from moving them to some very large machines and running each on a massive CPU, which would currently be beneficial for our perfect loadbalancer. The fewer replicas the better.

I'm not saying a DaemonSet is a no-no. I'd just like us to think about the notion of tying this data-plane component's scale to the number of nodes and its consequences.

vagababov commented 4 years ago

I am actually fine (and I considered that as an option in the in-progress document) with keeping both per-node and HPA-scaled Activators, but significantly reducing the resources allocated to the free-roaming Activators.

vagababov commented 4 years ago

@mattmoor : updated step (3) and added numbering for more precise attribution.

mattmoor commented 4 years ago

Couple things:

JRBANCEL commented 4 years ago

Are we going to use .spec.template.spec.nodeSelector? Running Activator on every node is crazy. It is going to prevent anyone with a cluster with many nodes from using Knative. Do we really need Activator to be a DS for Local Scheduling?

mattmoor commented 4 years ago

Running Activator on every node is crazy. It is going to prevent anyone with a cluster with many nodes from using Knative.

Let's not deal in FUD here. What are the concrete concerns?

Do we really need Activator to be a DS for Local Scheduling?

Sure, it's a blunt instrument; we could probably write our own autoscaler and scheduler for the activator to do something a bit more curated, but again: we shouldn't just do that because of FUD.

markusthoemmes commented 4 years ago

So the concern (based on OpenWhisk experience) from my end is:

Conflating the data plane with the control plane. I think it's a valuable use-case to deploy the activators on a few beefy machines to a) make them quick and b) guard them from noisy-neighbor effects on the worker nodes themselves, where just about any other workload can run. It shields the data plane from an application going haywire and thus causing havoc with other apps' latencies.

Arguably, a lot of the OpenWhisk concerns don't apply here because

  1. The activator's overhead is quite small (OpenWhisk's Scala-based stack is not).
  2. We don't necessarily constrain routing in the activator to just the local containers, so load can still even itself out across all of the activators.

markusthoemmes commented 4 years ago

Folks can tweak what we release, and adding a NodeSelector or using a Deployment with anti-affinity for their own pods both seem like viable ways to not need a whole DaemonSet.

That's right, but if we build the scheduling capabilities (spawning containers locally) into the activator, we're binding these mechanisms to the deployment of the activator. If I want to limit where they get deployed, I now also limit the nodes where I can make use of NodeLocalScheduling for resiliency and potentially performance.

I guess the underlying question is: Do we want that behavior? Do we want to constrain the nodes we can schedule to by the nodes that have activators deployed? And if we want to do that: Why?

vagababov commented 4 years ago

What if we follow this strategy:

Now the second set can scale as needed and, when used for load balancing, even under load should not affect scale from 0. The first set is needed only for small loads and should be able to handle whatever we throw at it.

With that being said, we have no real idea how it is going to perform in the real world (re: the FUD remark from @mattmoor) -- so why don't we just test and see?

Part of the exercise I am trying with this issue is to verify that the benchmarks still succeed when Activators run as a DS.

mattmoor commented 4 years ago

Do we want to constrain the nodes we can schedule to by the nodes that have activators deployed? And if we want to do that: Why?

I could see an argument for doing this around limiting the places where the activator has privilege on nodes (if/when we can interact with the kubelet).

With that being said, we have no real idea how it is going to perform in the real world

Yeah, if we have concerns based on scenarios or data, great! I'd like benchmarks that demonstrate problems (IIRC today we simply overprovision the activator for all our benchmarks; let's stop). If the concern is resource overhead on large clusters with small ksvc footprints, then the simple answer is that adding a nodeSelector isn't particularly intrusive (e.g. the addon manager should respect it in the 3-way merge unless we add one ourselves, and the operator would need a similar provision or it would undo scaling of the activator anyway, since it would strip the added replicas: N), or folks can replace the DaemonSet with a Deployment configured with anti-affinity with itself and an HPA.

Right-sizing the activator for the amount of burst a cluster might receive is going to be an art form, and I'd rather bias toward overprovisioning it with a way to tamp it down (at which point the human Operator takes responsibility for scale failures) than have scale failures be our out-of-the-box experience because the HPA's minScale is only 1, which can only absorb so much. We have had an open issue for a "Knative provisioning guide" for some time; it'd be good to put some thought there.
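To make the two escape hatches mentioned above concrete -- illustrative fragments only, showing edits an operator might make to the shipped manifests; the node label used for the selector is hypothetical:

```yaml
# Option A (illustrative): constrain the activator DaemonSet to labelled nodes.
spec:
  template:
    spec:
      nodeSelector:
        example.com/run-activator: "true"  # hypothetical node label
---
# Option B (illustrative): replace the DaemonSet with a Deployment that
# spreads activator pods via self anti-affinity and is scaled by an HPA.
spec:
  template:
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                topologyKey: kubernetes.io/hostname
                labelSelector:
                  matchLabels:
                    app: activator
```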

mattmoor commented 4 years ago

so why don't we just test and see?

💯

markusthoemmes commented 4 years ago

Works for me. These were theoretical concerns and I don't have a specific workload I could provide a benchmark for to prove them.

Definitely not intending to block progression here. We can always iterate!

nimakaviani commented 4 years ago

With Activators running as a DS, one side benefit I am interested in seeing is having the autoscaler directly scrape the activators for metrics and move on from sampling, for hopefully improved scaling decisions.

@markusthoemmes can you elaborate on the following?

With OpenWhisk we had a similar model of having the scheduling components distributed per node and each scheduled local containers. That was a mistake ...

Do you mean the mix-up between data-plane vs. control-plane responsibilities? Anything specific?

Also, in the original proposal for Node Local Scheduling, the plan was to only use it when scaling from 0 -- basically to have the activator be directly responsible for creating pod definitions early on and fall back to using pods created by the ReplicaSet once they become ready. Is that still the plan?

vagababov commented 4 years ago

Activators currently push metrics to the Autoscaler. So there's no sampling right now.

nimakaviani commented 4 years ago

I mean the autoscaler scraping individual pod metrics endpoints. It is still a big part of making autoscaling decisions.

vagababov commented 4 years ago

Yeah, but making Activator a DS does not really help here, or I fail to see how.

nimakaviani commented 4 years ago

We do sampling partly because scale is a concern (i.e. the number of pods to scrape and the individual network round trips). With the activators running as a DS, we can potentially drop collecting per-pod metrics via scraping metric endpoints directly, and instead collect stats on all the pods running on the same node in one swoop from the activator. So the effort involved in collecting metrics is bound by the number of nodes as opposed to the number of pods.

markusthoemmes commented 4 years ago

Sure, a hierarchical approach to metric collection could improve our situation, but then again that's somewhat orthogonal, right? It doesn't need to be the activator either. We could contrive a component that runs on each node, scrapes all metrics, and forwards them to the autoscaler or is scraped by the autoscaler. I wouldn't want to conflate that with NLS necessarily.

nimakaviani commented 4 years ago

Yeah, totally. Prior to this discussion, I was in fact thinking of putting a proposal together to suggest exactly that, i.e. to have a DS that does metrics collection and reporting on a per-node basis. Since improvements in the activator have already enabled it to do most of that bookkeeping though, enabling the DS activator to relay that info could be quite trivial. But like you said, it can be done completely independently too.

MIBc commented 4 years ago

I want to ask a question. All packets will go through the activator if we use the KPA. Is that right? Is there a highly available solution for the activator?

markusthoemmes commented 4 years ago

@MIBc

I want to ask a question. All packets will go through the activator if we use the KPA. Is that right? Is there a highly available solution for the activator?

It's more nuanced than that, but the activator can be scaled out already. Can you please move that question to Slack and/or a separate GitHub issue so as not to derail this thread too much?

mattmoor commented 4 years ago

to have a DS that does metrics collection and reporting on a per node basis

My main concern with conflating this with the activator is that it requires the activator to be running everywhere we might run a serving pod, otherwise we lose metrics. Activator->DaemonSet would certainly be a step towards being able to do that, but we should weigh carefully whether we want to take the additional steps (I'm not for/against this, just trying to think through the implications to make an informed judgement).

Since we're mildly tangenting anyway, another way of doing this (that I just thought of) would be to piggyback on the ideal load-balancing work from @vagababov, which partitions the pod space across the activators and is less oriented around colocation.

markusthoemmes commented 4 years ago

Don't underestimate the power of meshes that don't allow podIP-connectivity! 👿

vagababov commented 4 years ago

The load test results are in. The spike is from when I had both the HA free-floating Activators and the DaemonSet running together. It seems like those two don't play nicely together :(

vagababov commented 4 years ago

So the actual implementation here is done, though it opened its own can of worms... So I guess we can close it for now, and reopen if we decide to roll back the change.

vagababov commented 4 years ago

/close

knative-prow-robot commented 4 years ago

@vagababov: Closing this issue.

In response to [this](https://github.com/knative/serving/issues/6550#issuecomment-583140723): > /close

vagababov commented 4 years ago

/reopen
rolling back

knative-prow-robot commented 4 years ago

@vagababov: Reopened this issue.

In response to [this](https://github.com/knative/serving/issues/6550#issuecomment-583516045): > /reopen > rolling back

markusthoemmes commented 4 years ago

Is there a design doc somewhere that describes why we need to do this? I'd like to challenge this once more, I think :sweat_smile:

markusthoemmes commented 4 years ago

Moved this to the ice-box for now as I don't think it's been actively worked on. Is this still relevant (see the comment above about a design doc describing the need)?

If not, could/should we close this and reopen once we're starting to get at it?

github-actions[bot] commented 3 years ago

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

vagababov commented 3 years ago

/remove-lifecycle stale

markusthoemmes commented 3 years ago

Discussed with peeps that this is not actionable today, so I'm closing this for now. We'll revisit the reason this was opened in the first place and reopen if appropriate.

zouyee commented 3 years ago

/cc