projectcalico / calico

Cloud native networking and network security
https://docs.tigera.io/calico/latest/about/
Apache License 2.0

CPU and memory recommendations for Calico components #5418

Open mattstam opened 2 years ago

mattstam commented 2 years ago

Related to this issue in the AKS repo: https://github.com/Azure/AKS/issues/2642#event-5668547875. See that issue for reproduction steps and more context.

Expected Behavior

Ideally, resource requests and limits should be set for all Calico components.

Current Behavior

No CPU or memory requests or limits are set, which can lead to service disruptions.

Context / Possible Solutions

AKS could set these in theory, but so far nobody has good answers for what the recommended amounts, or even general ranges, would be. @song-jiang and @lmm indicated that usage will vary wildly based on cluster size.

Solution 1

One issue is that the setting of these limits is tied to the Installation CR, which AKS needs to reconcile. A possible solution would be to break out the CPU/memory limit settings into their own CR and let users write their own values (though they would still need to figure out the appropriate values).

Solution 2

It might also be possible for https://github.com/tigera/operator to dynamically update these limits based on some heuristic like number of nodes or services.

Thoughts on these solutions or others would be appreciated, thanks.

caseydavenport commented 2 years ago

One issue is that the setting of these limits is tied to the Installation CR, which AKS needs to reconcile

I think this might be the way to go. I really hesitate to do any sort of heuristic-based approach, because it will be wrong for somebody. We could do both - heuristic + a manual override, but either way we need the manual override.

which AKS needs to reconcile.

Out of curiosity, why does AKS need to reconcile this? What are the settings in there today that would not be acceptable for users to modify?

We might want to consider a further breakdown of that resource into "things that must be set by the cluster provisioner and cannot be changed" and "important installation parameters that can be tweaked at runtime".

@tmjd will be interested in this as well.

mattstam commented 2 years ago

AKS reconciles this because the Installation CR needs specific settings when Calico is used in place of kubenet, or chained on top of Azure CNI to provide network policies.

Since AKS has switched to the operator's --manage-crds option, I believe the operator applies an overlay Installation spec. So even if we didn't reconcile the base one, the ComponentResource values would likely be overridden anyway.

Breaking this CRD down into things the cluster provider needs to set vs. runtime customization would definitely be good, if a specific CRD just for ComponentResource isn't an option.

mattstam commented 2 years ago

In the meantime, would it be possible for non-enterprise users to use this overlay Installation to override the ComponentResource values set by the regular Installation that the cloud provider reconciles?

It would be great if an example could be given, as this overlay functionality seems to be less documented.

cc @caseydavenport @tmjd

tmjd commented 2 years ago

Yes, it is possible to use the overlay to override the ComponentResource. To use an overlay, create a new Installation resource named overlay instead of default. I'd suggest including only the ComponentResource so that no other fields are overridden. It would be something like:

apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
  name: overlay
spec:
  componentResources:
  - componentName: Typha
    resourceRequirements:
      limits: ....
      requests: ....

IIRC the combined/merged Installation should be visible in the status field of the default Installation.
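
For illustration, a filled-in version of the overlay above might look like the sketch below. Every CPU and memory quantity here is a placeholder chosen only to show the shape of the resourceRequirements block, not a recommended value; as noted earlier in the thread, appropriate numbers vary with cluster size. The second entry assumes Node is also accepted as a componentName.

```yaml
# Sketch of an overlay Installation that only sets component resources.
# All quantities are placeholders, NOT recommendations; tune per cluster.
apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
  name: overlay            # "overlay" rather than "default"
spec:
  componentResources:
  - componentName: Typha
    resourceRequirements:
      requests:
        cpu: 100m          # placeholder
        memory: 128Mi      # placeholder
      limits:
        cpu: 500m          # placeholder
        memory: 256Mi      # placeholder
  - componentName: Node    # assumed to be a valid component name
    resourceRequirements:
      requests:
        cpu: 250m          # placeholder
        memory: 256Mi      # placeholder
```

Leaving every other spec field out keeps the overlay from overriding anything the cloud provider sets in the default Installation.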