openkruise / kruise

Automated management of large-scale applications on Kubernetes (incubating project under CNCF)
https://openkruise.io

[feature request] Sidecarset canary support #1443

Closed jzeng4 closed 9 months ago

jzeng4 commented 1 year ago

Context: We use SidecarSet to inject sidecar applications into main application pods. A single pod can therefore contain multiple main application containers and multiple sidecar containers.

Our goal is to canary sidecar containers in some pods where the main applications are also running before rolling out to all the pods. We found two features that we can leverage:

  1. Selector
  2. Select revision via custom version label

For 1, we identified a feature gap where a new canary pod is not created (by cloneset or deployment) if the original canary pod is removed (e.g. rescheduled to other hosts).

For 2, we ran in-depth experiments and found that the existing rebalance mechanisms change the number of canary replicas.

What would you like to be added: We want to have a feature of "stable canary". Specifically, it includes the following requirements:

Why is this needed: We have downstream services that monitor the canary pod status and run analysis on it. If the canary pod is disrupted, the analysis fails, which seriously impacts the whole deployment workflow.

zmberg commented 1 year ago

@jzeng4 Can Partition and Select revision via custom version label fulfill your request?

shfeng1152 commented 1 year ago

Hi @zmberg, thanks for the quick reply. We are from LinkedIn and are trying to use OpenKruise to build the deployment experience for global sidecars at LinkedIn.

Partition and Select revision via custom version label seem designed to control the number/percentage of old-version pods, but our use case requires controlling the number of canary sidecars (new version) rather than stable sidecars (old version). We did try Partition with the custom version label and ran into a few issues:

  1. If we use an absolute number for partition (old version), we can't guarantee an absolute number of canary pods during scale up/down. For example, if we have 10 replicas in total and want 1 canary pod, partition will be 9. When auto-resizing increases the replicas to 12, we will still have 9 old-version pods but 3 canary-version pods (instead of 1). In this case, we "over" canary. Similarly, if we decrease the replicas to 9, we end up with 9 old-version pods and 0 canary-version pods.

  2. If we use a percentage for partition, it works well when scaling up, since the controller can rebalance. However, this logic is not implemented for scale down (https://github.com/openkruise/kruise/blob/master/pkg/controller/sidecarset/sidecarset_pod_event_handler.go#L58).

  3. We also noticed that during a rolling update of the main app, terminating pods are counted as matching pods. To reproduce:

     - Set partition: 80% with 5 total replicas, so there are 4 old-version pods and 1 canary-version pod.
     - Change the main app version with maxSurge and maxUnavailable both set to 1 (the 5 existing pods terminate and 5 new pods are created).
     - The result is 3 old-version and 2 canary-version pods. (The SidecarSet reconcile sees 10 matching pods and recalculates: 2 canary and 8 old, of which 5 old pods eventually terminate.)
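
The arithmetic behind these issues can be sketched roughly as follows. This is only an illustration of the numbers quoted above, not the controller's actual code: the function names are hypothetical, and rounding percentages down is an assumption (the quoted cases divide evenly, so rounding does not affect them).

```python
from typing import Tuple


def split_by_absolute_partition(replicas: int, partition: int) -> Tuple[int, int]:
    """Return (old, canary) when `partition` pods are held on the old version."""
    old = min(partition, replicas)
    return old, replicas - old


def split_by_percent_partition(matching_pods: int, partition_percent: int) -> Tuple[int, int]:
    """Return (old, canary) for a percentage partition.

    Assumption: the percentage is rounded down; the real rounding rule may differ.
    """
    old = matching_pods * partition_percent // 100
    return old, matching_pods - old


def after_termination(old: int, canary: int, terminating_old: int) -> Tuple[int, int]:
    """Counts left once the terminating old-version pods are gone."""
    return old - terminating_old, canary


# Issue 1: an absolute partition of 9 pins the old count, so the canary count drifts.
print(split_by_absolute_partition(10, 9))  # 1 canary, as intended
print(split_by_absolute_partition(11, 9))  # one extra replica becomes an extra canary
print(split_by_absolute_partition(9, 9))   # scale down removes the canary entirely

# Issue 3: partition 80% with 5 replicas, then a rolling update that briefly
# leaves 10 matching pods (5 terminating + 5 new).
steady = split_by_percent_partition(5, 80)
during_rollout = split_by_percent_partition(10, 80)
final = after_termination(*during_rollout, terminating_old=5)
print(steady, during_rollout, final)
```

The last line shows how counting terminating pods inflates the canary target: the split computed over 10 pods survives after the pod count drops back to 5.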

For 1, could we add a new partition-style field that lets users control the number/percentage of new-version pods?
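
A minimal sketch of what such a canary-side field could mean, assuming a hypothetical `canary_replicas` setting (this is not an existing SidecarSet field, and the function name is illustrative only). The point is that the canary count stays fixed while the old-version count absorbs any scaling:

```python
from typing import Tuple


def split_by_canary_count(replicas: int, canary_replicas: int) -> Tuple[int, int]:
    """Return (old, canary) when the canary count is pinned instead of the old count.

    `canary_replicas` is a hypothetical field used only for illustration.
    """
    canary = min(canary_replicas, replicas)
    return replicas - canary, canary


# The canary count stays at 1 whether the workload scales up or down.
for replicas in (9, 10, 12):
    print(replicas, split_by_canary_count(replicas, canary_replicas=1))
```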

For 2, was the rebalance logic intentionally left out for pod deletion? Should we add similar rebalance logic for when pods are deleted?
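
As a rough sketch of what a rebalance on deletion might compute, assuming the percentage partition is rounded down and that only upgrades (never downgrades) are performed. The function name and both assumptions are illustrative and not taken from the Kruise code base:

```python
def pods_to_upgrade(old_pods: int, canary_pods: int, partition_percent: int) -> int:
    """Number of old-version pods to upgrade so the canary share matches the partition.

    Assumptions: the percentage is rounded down, and the result is never
    negative because surplus canary pods are not downgraded.
    """
    total = old_pods + canary_pods
    desired_old = total * partition_percent // 100
    desired_canary = total - desired_old
    return max(desired_canary - canary_pods, 0)


# 10 pods at partition 80% are balanced: nothing to do.
print(pods_to_upgrade(old_pods=8, canary_pods=2, partition_percent=80))

# Scale down to 5 pods where both canary pods happened to be deleted:
# one old-version pod would need upgrading to restore the canary.
print(pods_to_upgrade(old_pods=5, canary_pods=0, partition_percent=80))
```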

zmberg commented 1 year ago

@shfeng1152 Overall, this is still complicated. Would it be convenient to discuss it at the community meeting next Thursday (11.16) at 19:30, held as a dingding meeting?

shfeng1152 commented 1 year ago

Sure, we can attend. What's the timezone? Are there any instructions on how to join the dingding meeting?

stale[bot] commented 9 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.