The default limit of 1 CPU for Stargate should be removed

k8ssandra / k8ssandra-operator

The Kubernetes operator for K8ssandra

https://k8ssandra.io/

Apache License 2.0


The default limit of 1 CPU for Stargate should be removed #470

Open adejanovski opened 2 years ago

adejanovski commented 2 years ago

If no resource request is made, we're limiting Stargate to a single CPU. Stargate being, like Cassandra, a highly multithreaded app, the default restriction will severely impact its performance. Since this is configurable, I think we should remove that default and let Stargate use as many cores as needed.

┆Issue is synchronized with this Jira Story by Unito ┆Issue Number: K8OP-144

bradfordcp commented 2 years ago

I'd like to see a little discussion here on the ticket with pros & cons from the team.

adutra commented 2 years ago

The limitation comes from K8ssandra 1.x charts:

  # -- Sets the CPU request for the Stargate pod in millicores.
  cpuReqMillicores: 200
  # -- Sets the CPU limit for the Stargate pod in millicores.
  cpuLimMillicores: 1000

Cassandra pods do not have this limitation by default. The only constraint imposed by cass-operator is that CPU and memory requests and limits must all be set if allowMultipleNodesPerWorker is true:

    if dc.Spec.AllowMultipleNodesPerWorker {
        if dc.Spec.Resources.Requests.Cpu().IsZero() ||
            dc.Spec.Resources.Limits.Cpu().IsZero() ||
            dc.Spec.Resources.Requests.Memory().IsZero() ||
            dc.Spec.Resources.Limits.Memory().IsZero() {

            return attemptedTo("use multiple nodes per worker without cpu and memory requests and limits")
        }
    }

Given that, I'd be inclined to follow @adejanovski's suggestion to relax this constraint. However, we might want to enforce that limits are set if SoftPodAntiAffinity is true in the Stargate spec.
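For illustration, a check analogous to the cass-operator one could look roughly like the sketch below. It is self-contained and uses simplified stand-in types rather than the real Stargate CRD or `corev1.ResourceRequirements`; the type names, the `validateStargateResources` function, and the error message are all hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// ResourceList is a simplified stand-in for corev1.ResourceList.
// A zero value means the resource was not set.
type ResourceList struct {
	CPUMillis   int64
	MemoryBytes int64
}

// Resources mirrors the requests/limits split of a container spec.
type Resources struct {
	Requests ResourceList
	Limits   ResourceList
}

// StargateSpec is a hypothetical, trimmed-down spec that only carries
// the fields needed for this sketch.
type StargateSpec struct {
	SoftPodAntiAffinity bool
	Resources           Resources
}

// validateStargateResources rejects specs that allow co-located Stargate
// pods (soft anti-affinity) without CPU and memory requests and limits.
func validateStargateResources(spec StargateSpec) error {
	if !spec.SoftPodAntiAffinity {
		return nil // no constraint when pods are kept on separate nodes
	}
	r := spec.Resources
	if r.Requests.CPUMillis == 0 || r.Limits.CPUMillis == 0 ||
		r.Requests.MemoryBytes == 0 || r.Limits.MemoryBytes == 0 {
		return errors.New("softPodAntiAffinity requires cpu and memory requests and limits")
	}
	return nil
}

func main() {
	spec := StargateSpec{SoftPodAntiAffinity: true}
	fmt.Println(validateStargateResources(spec))
}
```

With this shape, a spec that leaves SoftPodAntiAffinity off passes without any resources set, which matches the idea of dropping the default limit while still guarding the co-location case.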

jsanda commented 2 years ago

I see value in setting resource requests and limits for both Stargate and Cassandra, particularly if they will be co-located with other pods. It also helps the k8s scheduler do its job. With that said, I am not keen on specifying somewhat arbitrary settings. We won't need the same CPU settings for dev/testing as we would for an actual production deployment.

We could implement some notion of runtime profiles. It could be as simple as specifying some different defaults. This could be done in the operator configuration (see #63 for background).
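A minimal sketch of what such profiles might look like, assuming a simple name-to-defaults lookup. The profile names, the `CPUDefaults` type, and the `defaultsFor` function are hypothetical; the "dev" values reuse the 200m/1000m defaults from the K8ssandra 1.x chart quoted above, and the "production" values are placeholders, not recommendations.

```go
package main

import "fmt"

// CPUDefaults holds default CPU settings in millicores.
// A LimitMillis of 0 means "do not set a CPU limit".
type CPUDefaults struct {
	RequestMillis int64
	LimitMillis   int64
}

// profiles maps a hypothetical runtime profile name to its defaults.
var profiles = map[string]CPUDefaults{
	"dev":        {RequestMillis: 200, LimitMillis: 1000}, // values from the 1.x chart
	"production": {RequestMillis: 1000, LimitMillis: 0},   // placeholder: request only, no limit
}

// defaultsFor returns the defaults for a profile, or an error if the
// profile name is unknown.
func defaultsFor(profile string) (CPUDefaults, error) {
	d, ok := profiles[profile]
	if !ok {
		return CPUDefaults{}, fmt.Errorf("unknown profile %q", profile)
	}
	return d, nil
}

func main() {
	for _, name := range []string{"dev", "production"} {
		d, err := defaultsFor(name)
		if err != nil {
			fmt.Println(err)
			continue
		}
		fmt.Printf("%s: request=%dm limit=%dm\n", name, d.RequestMillis, d.LimitMillis)
	}
}
```

Explicit values in the Stargate spec would still take precedence; the profile would only supply what the user leaves unset.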

adejanovski commented 2 years ago

@jsanda, how do we move forward with this ticket? Should we take this into a meeting?

jsanda commented 2 years ago

Based on https://www.reddit.com/r/kubernetes/comments/all1vg/on_kubernetes_cpu_limits/ which @bradfordcp pointed out (thanks for that 😃), I am in favor of removing the limit.

Miles-Garnsey commented 2 years ago

I’d consider this more carefully. Why are we proposing to remove resource limits? Could we simply increase our resource limits to the Stargate recommended minimum?

Whatever happens by default doesn’t matter too much, but (as someone has pointed out in a GH issue) there are a lot of environments where pods won’t be scheduled without resource limits AND requests.

I actually see this as good practice, because it prevents the Kubelet from being starved of resources (which can cause a whole-node failure). This is particularly important for a pod like Stargate, where resource consumption will likely scale with the number of requests (imagine a Cyber Monday event and the traffic peaks that can be observed).

It seems that there are some complexities depending on which kernel version you’re on (as at least one has a bug) and what scheduler you’re using. But that doesn't diminish the fact that an overloaded node may completely fail and take down other pods, some of which may run at system-node-critical priority. That's bad behaviour.

jsanda commented 2 years ago

As discussed in detail in the Reddit thread, specifying limits without really understanding what you're doing could result in processes being starved of CPU. Specifying requests is probably a good rule of thumb, but I think we should be cautious about setting limits for CPU-intensive containers like Stargate and Cassandra.
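To make the starvation concern concrete, here is a rough back-of-the-envelope sketch of how a CPU limit translates into CFS quota. It assumes the default CFS period of 100ms and a node with enough cores for all busy threads to run in parallel; the function names are made up for the example.

```go
package main

import (
	"fmt"
	"time"
)

// cfsPeriod is the default CFS bandwidth period used for CPU limits.
const cfsPeriod = 100 * time.Millisecond

// quotaFor returns the CPU time a container may consume per period
// for a given limit in millicores.
func quotaFor(limitMillis int64) time.Duration {
	return cfsPeriod * time.Duration(limitMillis) / 1000
}

// runnableWindow returns how much wall-clock time per period a container
// can run before its quota is exhausted, assuming `threads` (> 0) busy
// threads all running in parallel on separate cores.
func runnableWindow(limitMillis int64, threads int) time.Duration {
	w := quotaFor(limitMillis) / time.Duration(threads)
	if w > cfsPeriod {
		w = cfsPeriod // never more than the whole period
	}
	return w
}

func main() {
	limit := int64(1000) // the 1 CPU default discussed in this issue
	threads := 8
	run := runnableWindow(limit, threads)
	fmt.Printf("quota per period: %v\n", quotaFor(limit))
	fmt.Printf("%d busy threads run for %v, then are throttled for %v\n",
		threads, run, cfsPeriod-run)
}
```

Under those assumptions, eight busy threads with a 1000m limit use up the 100ms quota in 12.5ms of wall-clock time and are then throttled for the remaining 87.5ms of each period, which shows up as added request latency rather than as an obvious CPU shortage.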