vagababov closed this issue 3 years ago.
We want the DaemonSet to exist for sure for an overlapping release with the Deployment before we scale it down, otherwise we are racing the creation of new pods against collecting old pods.
When we create the DS the deployment should naturally scale down because so much of its load should be offloaded.
I think the only other nit is that step 2 should set replicas to zero on the deployment, as I doubt that doing so on the HPA scales it to zero.
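To make that concrete: an HPA's minReplicas cannot normally be set below 1 (scale-to-zero requires the alpha HPAScaleToZero feature gate), so the scale-down in step 2 would plausibly be an explicit edit of the Deployment itself. A hedged sketch, assuming Knative's default names and namespace:

```yaml
# Sketch only: scale the old activator Deployment itself to zero once the
# DaemonSet is serving traffic. The HPA's minReplicas cannot be 0 by
# default, so setting replicas on the Deployment is the reliable lever.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: activator
  namespace: knative-serving
spec:
  replicas: 0  # step 2: explicit scale-down of the old Deployment
  # ...selector/template unchanged...
```

In practice the HPA targeting this Deployment would also need to be removed (or retargeted), since it would otherwise scale the Deployment back up.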
Is the decision to make the activator our scheduling unit final? I still think that the activator might need to scale differently than "per node" for its data-plane duties 🤔

I still think that the activator might need to scale differently than "per node" for its data-plane duties.
Can you elaborate?
Sure. With OpenWhisk we had a similar model of having the scheduling components distributed per node, each scheduling local containers. That was a mistake, but we're not 100% in the same situation here. However, for OpenWhisk it would've been very beneficial to separate the scheduling components (which can be per node for local state) from the data-plane components, which can in theory handle multiple nodes or, if the load is super high, share the load of one node between them.
With the activator handling more and more traffic in more and more situations, I wonder if we want to tie its scale to the number of nodes. That, for example, prohibits us from moving the activators onto a few very large machines and running each with massive CPU. That would currently be beneficial for our perfect load balancer, for example: the fewer replicas, the better.
I'm not saying a DaemonSet is a no-no. I'd just like us to think about the notion of tying this data-plane component's scale to the number of nodes, and its consequences.
I am actually fine with keeping both per-node and HPA-scaled Activators (I considered that as an option in the in-progress document), but with significantly reduced resources allocated to the free-roaming activators.
@mattmoor : updated step (3) and added numbering for more precise attribution.
Couple things:
Folks can tweak what we release, and adding a NodeSelector or using a Deployment with anti-affinity for their own pods both seem like viable ways to not need a whole DaemonSet, unless we're talking about a scenario where # activators > # nodes, which would be a use case I'd like to understand better.
To be pedantic, we're making the scheduling choice ahead-of-time and programming the options into the load-balancer's Endpoints, we aren't empowering the activator to spin up pods anywhere but their own node. Perhaps you are referring to spinning up the pod as "scheduling", in which case I'm just splitting hairs, but I want to make sure we're talking about the same thing.
Are we going to use .spec.template.spec.nodeSelector?
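For illustration, such a selector on the DaemonSet's pod template might look like the following; the node label and image reference are hypothetical, operators would pick their own:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: activator
  namespace: knative-serving
spec:
  selector:
    matchLabels:
      app: activator
  template:
    metadata:
      labels:
        app: activator
    spec:
      nodeSelector:
        knative.dev/activator-node: "true"  # illustrative label, not a real Knative convention
      containers:
      - name: activator
        image: <activator-image>  # placeholder
```

With a nodeSelector in place, the DaemonSet only schedules activator pods on nodes carrying the label, bounding the footprint on large clusters.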
Running Activator on every node is crazy. It is going to prevent anyone with a cluster with many nodes from using Knative.
Do we really need Activator to be a DS for Local Scheduling?
Running Activator on every node is crazy. It is going to prevent anyone with a cluster with many nodes from using Knative.
Let's not deal in FUD here. What are the concrete concerns?
Do we really need Activator to be a DS for Local Scheduling?
Sure it's a blunt instrument, we could probably write our own autoscaler and scheduler for the activator to do something a bit more curated, but again: we shouldn't just do that because of FUD.
So the concern (based on OpenWhisk experience) from my end is:
Conflating the data plane with the control plane. I think it's a valuable use case to deploy the activators on a few beefy machines to a) make them quick and b) guard them from noisy-neighbor effects on the worker nodes themselves, where just about any other workload can run. It shields the data plane from an application going haywire and thus wreaking havoc on other apps' latencies.
Arguably, a lot of the OpenWhisk concerns don't apply here because
Folks can tweak what we release, and adding a NodeSelector or using a Deployment with anti-affinity for their own pods both seem like viable ways to not need a whole DaemonSet.
That's right, but if we build the scheduling capabilities (spawning containers locally) into the activator, we're binding these mechanisms to the deployment of the activator. If I want to limit where activators get deployed, I now also limit the nodes where I can make use of NodeLocalScheduling for resiliency and potentially performance.
I guess the underlying question is: do we want that behavior? Do we want to constrain the nodes we can schedule to by the nodes that have activators deployed? And if we want to do that: why?
What if we follow the following strategy:
Now the second set can scale as needed and, even when used for load balancing under load, should not affect scale-from-zero. The first set is needed only for small loads and should be able to handle whatever we throw at it.
With that being said, we have no real idea how it is going to perform in the real world (re: the FUD remark from @mattmoor) -- so why don't we just test and see?
Part of the exercise I am trying with this issue is to verify that the benchmarks still succeed when Activators run as a DS.
Do we want to constrain the nodes we can schedule to by the nodes that have activators deployed? And if we want to do that: why?
I could see an argument for doing this around limiting the places where the activator has privilege on nodes (if/when we can interact with the kubelet).
With that being said, we have no real idea how it is going to perform in the real world
Yeah, if we have concerns based on scenarios or data, great! I'd like benchmarks that demonstrate problems (IIRC today we simply overprovision the activator for all our benchmarks; let's stop). If the concern is resource overhead on large clusters with small ksvc footprints, then the simple answer is that adding a nodeSelector isn't particularly intrusive (e.g. the addon manager should respect it in a 3-way merge unless we add one ourselves, and the operator would need a similar provision or it would undo scaling of the activator anyway, since it would strip the added replicas: N). Or folks can replace the DaemonSet with a Deployment configured with anti-affinity with itself and an HPA.
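A sketch of that replacement, with illustrative names: a Deployment spread across nodes via pod anti-affinity against itself, with scale left to an HPA rather than the node count:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: activator
  namespace: knative-serving
spec:
  replicas: 3  # in practice managed by an HPA, not hardcoded
  selector:
    matchLabels:
      app: activator
  template:
    metadata:
      labels:
        app: activator
    spec:
      affinity:
        podAntiAffinity:
          # "preferred" keeps scheduling possible if replicas > nodes;
          # "required..." would cap replicas at the node count.
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: activator
              topologyKey: kubernetes.io/hostname
      containers:
      - name: activator
        image: <activator-image>  # placeholder
```

This approximates the one-activator-per-node spread of a DaemonSet while decoupling replica count from node count.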
Right-sizing the activator for the amount of burst a cluster might receive is going to be an art form, and I'd rather bias toward overprovisioning it with a way to tamp it down (at which point the human operator takes responsibility for scale failures) than have scale failures be our out-of-the-box experience because the HPA's minScale is only 1, which can only absorb so much. We have had an open issue for a "Knative provisioning guide" for some time; it'd be good to put some thought there.
so why don't we just test and see?
💯
Works for me. These were theoretical concerns, and I don't have a specific workload I could provide a benchmark for to prove them.
Definitely not intending to block progression here. We can always iterate!
With activators running as a DS, one side benefit I am interested to see is having the autoscaler directly scrape the activators for metrics and move on from sampling, hopefully yielding improved scaling decisions.
@markusthoemmes can you elaborate on the following?
With OpenWhisk we had a similar model of having the scheduling components distributed per node and each scheduled local containers. That was a mistake ...
Do you mean the mix-up between data-plane and control-plane responsibilities? Anything specific?
Also, in the original proposal for Node Local Scheduling, the plan was to only use it when scaling from 0: basically, to have the activator be directly responsible for creating pod definitions early on, and fall back on pods created by the ReplicaSet when they become ready. Is that still the plan?
Activators currently push metrics to the Autoscaler. So there's no sampling right now.
I mean the autoscaler scraping individual pods' metrics endpoints. That is still a big part of making autoscaling decisions.
Yeah, but making Activator a DS does not really help here, or I fail to see how.
We do sampling partly because scale is a concern (i.e. the number of pods to scrape and the individual network round trips). With the activators running as a DS, we can potentially drop collecting per-pod metrics via scraping metrics endpoints directly, and instead collect stats on all the pods running on the same node in one swoop from the activator. So the effort involved in collecting metrics is bounded by the number of nodes as opposed to the number of pods.
Sure, a hierarchical approach to metric collection could improve our situation, but then again that's somewhat orthogonal, right? It doesn't need to be the activator either. We could contrive a component that runs on each node, scrapes all metrics, and forwards them to the autoscaler (or is scraped by the autoscaler). I wouldn't want to conflate that with NLS necessarily.
Yeah, totally. Prior to this discussion, I was in fact thinking of putting together a proposal suggesting exactly that, i.e. to have a DS that does metrics collection and reporting on a per-node basis. Since improvements in the activator have already enabled it to do most of that bookkeeping, though, enabling the DS activator to relay that info could be quite trivial. But like you said, it can be done completely independently too.
I want to ask a question: all packets will go through the activator if we use the KPA. Is that right? Is there a highly available solution for the activator?
@MIBc
I want to ask a question: all packets will go through the activator if we use the KPA. Is that right? Is there a highly available solution for the activator?
It's more nuanced than that, but the activator can already be scaled out. Can you please move that question to Slack and/or a separate GitHub issue so as not to derail this thread too much?
to have a DS that does metrics collection and reporting on a per node basis
My main concern with conflating this with the activator is that it requires the activator to be running everywhere we might run a serving pod, otherwise we lose metrics. Activator->DaemonSet would certainly be a step towards being able to do that, but we should weigh carefully whether we want to take the additional steps (I'm not for/against this, just trying to think through the implications to make an informed judgement).
Since we're on a mild tangent anyway, another way of doing this (that I just thought of) would be to piggyback on the ideal load-balancing work from @vagababov, which partitions the pod space across the activators and is less oriented around colocation.
Don't underestimate the power of meshes that don't allow podIP-connectivity! 👿
The load test results are in. The spike is when I had both HA free-floating and daemon-set together. It seems like those two don't play nicely together :(
So the actual implementation here is done, though it opened its own can of worms... so I guess we can close it for now, and reopen if we decide to roll back the change.
/close
@vagababov: Closing this issue.
/reopen rolling back
@vagababov: Reopened this issue.
Is there a design doc that describes why we need to do this somewhere? I'd like to challenge this once more I think :sweat_smile:
Moved this to the ice-box for now as I don't think it's been actively worked on. Is this still relevant (see comment about design doc describing the need).
If not, could/should we close this and reopen once we're starting to get at it?
This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.
/remove-lifecycle stale
Discussed with peeps that this is not actionable today, so I'm closing this for now. We'll revisit the reason this was opened in the first place and reopen if appropriate.
/cc
In what area(s)?
/area autoscale
Describe the feature
Currently we run the Activator as a Deployment with an HPA autoscaler. For Node Local Scheduling, we need to run the Activator as a DaemonSet. The work here consists of several steps:
300m of CPU and 400MB of memory for each instance. This presumes running just a few instances.
/assign @vagababov
/cc @mattmoor @markusthoemmes
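A minimal sketch of what the DaemonSet manifest implied above could look like, with the per-instance requests mentioned (300m CPU, 400MB memory); labels and the image reference are placeholders, not the actual release manifest:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: activator
  namespace: knative-serving
spec:
  selector:
    matchLabels:
      app: activator
  template:
    metadata:
      labels:
        app: activator
    spec:
      containers:
      - name: activator
        image: <activator-image>  # placeholder
        resources:
          requests:
            cpu: 300m
            memory: 400M  # "400MB" per the description above
```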