openkruise / kruise

Automated management of large-scale applications on Kubernetes (incubating project under CNCF)
https://openkruise.io

Adaptive schedule strategy for UnitedDeployment #1720

Open AiRanthem opened 4 weeks ago

AiRanthem commented 4 weeks ago

Ⅰ. Describe what this PR does

Adds an adaptive scheduling strategy to UnitedDeployment. When scaling up, if some Pods in a subset remain unschedulable, those Pods are rescheduled to other subsets. When scaling down with elastic allocation (i.e., the subsets are configured with min/max replicas), each subset retains as many ready Pods as possible without exceeding its maximum capacity, rather than strictly scaling down in reverse order of the subset list.
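The scale-up behavior described above can be pictured with a small sketch. This is an illustration in Python, not the controller's Go code from this PR: the `Subset` class, its `unschedulable` flag, and the `allocate` function are hypothetical names, and the sketch assumes subsets are simply filled in list order while any subset currently marked unschedulable is skipped.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Subset:
    # Hypothetical model of a subset; not the type used in the PR.
    name: str
    max_replicas: Optional[int] = None  # None means "no upper bound"
    unschedulable: bool = False         # assumed flag maintained by the controller


def allocate(replicas: int, subsets: List[Subset]) -> Dict[str, int]:
    """Fill subsets in list order, skipping those marked unschedulable."""
    allocation = {s.name: 0 for s in subsets}
    remaining = replicas
    for s in subsets:
        if remaining == 0:
            break
        if s.unschedulable:
            continue  # adaptive part: do not place new Pods here
        take = remaining if s.max_replicas is None else min(remaining, s.max_replicas)
        allocation[s.name] = take
        remaining -= take
    return allocation


subsets = [Subset("subset-a", 2), Subset("subset-b", 2), Subset("subset-c")]
print(allocate(5, subsets))  # {'subset-a': 2, 'subset-b': 2, 'subset-c': 1}
subsets[1].unschedulable = True
print(allocate(5, subsets))  # {'subset-a': 2, 'subset-b': 0, 'subset-c': 3}
```

Note that in this sketch replicas are left unallocated if every schedulable subset has a maximum and all of them are full; how the real allocator handles that case is not covered here.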

Ⅱ. Does this pull request fix one issue?

fixes #1673

Ⅲ. Describe how to verify it

Use the YAML below to create a UnitedDeployment (UD) in which subset-b is unschedulable.

apiVersion: apps.kruise.io/v1alpha1
kind: UnitedDeployment
metadata:
  name: sample-ud
spec:
  replicas: 5
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: sample
  template:
    deploymentTemplate:
      metadata:
        labels:
          app: sample
      spec:
        selector:
          matchLabels:
            app: sample
        template:
          metadata:
            labels:
              app: sample
          spec:
            terminationGracePeriodSeconds: 0
            containers:
              - name: nginx
                image: curlimages/curl:8.8.0
                command: ["/bin/sleep", "infinity"]
  topology:
    scheduleStrategy:
      type: Adaptive
      adaptive:
        rescheduleCriticalSeconds: 10
        unschedulableLastSeconds: 20

    subsets:
      - name: subset-a
        maxReplicas: 2
        nodeSelectorTerm:
          matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - ci-testing-worker
      - name: subset-b
        maxReplicas: 2
        nodeSelectorTerm:
          matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - not-exist
      - name: subset-c
        nodeSelectorTerm:
          matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - ci-testing-worker3
  1. When the UD is created, the two Pods in subset-b stay pending.
  2. After 10s (rescheduleCriticalSeconds), the two pending Pods are rescheduled to subset-c.
  3. If you scale up immediately, new Pods are created in subset-c instead of subset-b, even though subset-b is not full.
  4. Wait 20s (unschedulableLastSeconds) until subset-b is considered recovered, then scale up again: 2 Pods are scheduled into subset-b again (and still stay pending).
  5. Whenever you scale down, Pods are removed in the order subset-c -> subset-b -> subset-a.
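Steps 3 and 4 depend on how long a subset stays marked as unschedulable. A minimal sketch of that window follows, in Python for illustration only. The function names are hypothetical, and it assumes the window is measured from the moment the subset was marked; the PR may track this differently.

```python
from typing import Optional


def subset_unschedulable(marked_at: Optional[float], now: float,
                         unschedulable_last_seconds: float) -> bool:
    """Return True while new scale-ups should still skip the subset."""
    if marked_at is None:  # the subset was never marked unschedulable
        return False
    return now - marked_at < unschedulable_last_seconds


def requeue_after(marked_at: Optional[float], now: float,
                  unschedulable_last_seconds: float) -> float:
    """Seconds until the mark expires; 0 if it has expired or was never set."""
    if marked_at is None:
        return 0.0
    return max(0.0, marked_at + unschedulable_last_seconds - now)


# subset-b is marked at t=10s (when its pending Pods are rescheduled),
# with unschedulableLastSeconds set to 20 as in the YAML above.
print(subset_unschedulable(10, 15, 20))  # True: a scale-up at t=15s skips subset-b
print(subset_unschedulable(10, 30, 20))  # False: subset-b is treated as recovered
```

A controller following this idea could requeue the object after `requeue_after(...)` seconds so that the mark is cleared without waiting for another event.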

Ⅳ. Special notes for reviews

  1. adapter.go: GetReplicaDetails returns the Pods in the subset
  2. xxx_adapter.go: implementations that return the Pods (for the change in item 1)
  3. allocator.go: changes related to safeReplica
  4. pod_condition_utils.go: PodUnscheduledTimeout function extracted from workloadspread
  5. reschedule.go: PodUnscheduledTimeout function extracted from here
  6. subset.go: add fields to the Subset object to carry related information
  7. subset_control.go: store the subset's Pods in the Subset object
  8. uniteddeployment_controller.go
    1. add a requeue feature to check failed Pods
    2. subset unschedulable status management
  9. uniteddeployment_types.go: API change
  10. uniteddeployment_update.go: sync the unschedulable status to the CR
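For reviewers unfamiliar with the extracted helper, the kind of check a function like PodUnscheduledTimeout performs can be sketched as follows. This is a Python illustration over a plain dict shaped like a Pod object, not the Go function from the PR. The `PodScheduled` condition type, the `Unschedulable` reason, and the `Pending` phase are standard Kubernetes values; measuring the timeout from the Pod's creation time and passing `now` explicitly are assumptions of this sketch.

```python
from datetime import datetime, timedelta, timezone


def pod_unscheduled_timeout(pod: dict, timeout: timedelta, now: datetime) -> bool:
    """Return True if a pending Pod has been unschedulable for longer than timeout."""
    meta = pod.get("metadata", {})
    status = pod.get("status", {})
    # Ignore Pods that are being deleted, are not pending, or already have a node.
    if meta.get("deletionTimestamp") is not None:
        return False
    if status.get("phase") != "Pending" or pod.get("spec", {}).get("nodeName"):
        return False
    for cond in status.get("conditions", []):
        if (cond.get("type") == "PodScheduled"
                and cond.get("status") == "False"
                and cond.get("reason") == "Unschedulable"):
            # Assumption: the timeout is counted from the Pod's creation time.
            return meta["creationTimestamp"] + timeout < now
    return False


created = datetime(2024, 1, 1, tzinfo=timezone.utc)
pending_pod = {
    "metadata": {"creationTimestamp": created},
    "spec": {},
    "status": {
        "phase": "Pending",
        "conditions": [
            {"type": "PodScheduled", "status": "False", "reason": "Unschedulable"}
        ],
    },
}
print(pod_unscheduled_timeout(pending_pod, timedelta(seconds=10),
                              created + timedelta(seconds=11)))  # True
```

With `rescheduleCriticalSeconds: 10` from the verification YAML, such a check would start returning True once the Pod has been pending and unschedulable for more than 10 seconds.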
kruise-bot commented 4 weeks ago

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:

Once this PR has been reviewed and has the lgtm label, please assign fei-guo for approval by writing `/assign @fei-guo` in a comment. For more information see: The Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

- **[OWNERS](https://github.com/openkruise/kruise/blob/master/OWNERS)**

Approvers can indicate their approval by writing `/approve` in a comment.
Approvers can cancel approval by writing `/approve cancel` in a comment.

codecov[bot] commented 4 weeks ago

Codecov Report

Attention: Patch coverage is 49.30876% with 110 lines in your changes missing coverage. Please review.

Project coverage is 49.39%. Comparing base (0d0031a) to head (378185c). Report is 94 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| .../util/expectations/resource_version_expectation.go | 0.00% | 23 Missing :warning: |
| ...er/uniteddeployment/uniteddeployment_controller.go | 80.00% | 10 Missing and 5 partials :warning: |
| ...deployment/adapter/advanced_statefulset_adapter.go | 0.00% | 13 Missing :warning: |
| ...oller/uniteddeployment/adapter/cloneset_adapter.go | 0.00% | 13 Missing :warning: |
| ...ler/uniteddeployment/adapter/deployment_adapter.go | 0.00% | 13 Missing :warning: |
| ...er/uniteddeployment/adapter/statefulset_adapter.go | 0.00% | 13 Missing :warning: |
| ...roller/uniteddeployment/uniteddeployment_update.go | 0.00% | 8 Missing and 1 partial :warning: |
| pkg/controller/uniteddeployment/allocator.go | 80.64% | 4 Missing and 2 partials :warning: |
| pkg/controller/uniteddeployment/subset_control.go | 76.92% | 2 Missing and 1 partial :warning: |
| pkg/controller/workloadspread/reschedule.go | 50.00% | 1 Missing and 1 partial :warning: |

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           master    #1720      +/-   ##
==========================================
+ Coverage   47.91%   49.39%   +1.48%
==========================================
  Files         162      191      +29
  Lines       23491    19728    -3763
==========================================
- Hits        11256     9745    -1511
+ Misses      11014     8719    -2295
- Partials     1221     1264      +43
```

| [Flag](https://app.codecov.io/gh/openkruise/kruise/pull/1720/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=openkruise) | Coverage Δ | |
|---|---|---|
| [unittests](https://app.codecov.io/gh/openkruise/kruise/pull/1720/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=openkruise) | `49.39% <49.30%> (+1.48%)` | :arrow_up: |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=openkruise#carryforward-flags-in-the-pull-request-comment) to find out more.
