karmada-io / karmada

Open, Multi-Cloud, Multi-Cluster Kubernetes Orchestration
https://karmada.io
Apache License 2.0

karmada-controller-manager: panic: runtime error: index out of range #3854

Closed: tedli closed this issue 6 months ago

tedli commented 9 months ago

What happened:

karmada-controller-manager crashed, caused by panic: runtime error: index out of range

What you expected to happen:

The index should be checked so that it does not panic and crash.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

The index may be -1. https://github.com/karmada-io/karmada/blob/2526de856ebe8b4e57cdf5ad9e544604aa9f0e43/pkg/modeling/modeling.go#L162-L168

https://github.com/karmada-io/karmada/blob/2526de856ebe8b4e57cdf5ad9e544604aa9f0e43/pkg/modeling/modeling.go#L105-L114

Environment:

liangyuanpeng commented 9 months ago

Any more information to help reproduce it? Also, what is the environment of the Kubernetes cluster and the Karmada cluster?

RainbowMango commented 9 months ago

The searchLastLessElement might return -1: https://github.com/karmada-io/karmada/blob/cf90ffc64fbfa16083c8371905b4789fb4f98a6f/pkg/modeling/modeling.go#L116-L138

@tedli Do you know how to reproduce it?
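For reference, here is a simplified sketch (my own illustration, not the actual modeling.go code) of a "last element less than the target" binary search; it shows how -1 comes back whenever nothing in the slice is smaller than the value being looked up:

```go
package main

import "fmt"

// Simplified illustration of searchLastLessElement's contract (not the real
// modeling.go code): return the index of the last element strictly less than
// target, or -1 when no element qualifies, which is the value that later gets
// used as a slice index without a bounds check.
func searchLastLessElement(nums []int64, target int64) int {
	low, high := 0, len(nums)-1
	result := -1 // stays -1 if no element is smaller than target
	for low <= high {
		mid := low + (high-low)/2
		if nums[mid] < target {
			result = mid // candidate; keep searching to the right
			low = mid + 1
		} else {
			high = mid - 1
		}
	}
	return result
}

func main() {
	fmt.Println(searchLastLessElement([]int64{0, 4, 16, 32}, 10)) // 1
	fmt.Println(searchLastLessElement([]int64{5, 8}, 3))          // -1: nothing qualifies
}
```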

tedli commented 9 months ago

Hi @RainbowMango @liangyuanpeng ,

My case, as I mentioned, was caused by lines 164 and 165: https://github.com/karmada-io/karmada/blob/2526de856ebe8b4e57cdf5ad9e544604aa9f0e43/pkg/modeling/modeling.go#L162-L168

I didn't dive deep enough to find out how to reproduce it. I just made a quick fix, adding an index check, to avoid the panic.

RainbowMango commented 9 months ago

/help

Let's call for help to figure out the root cause.

I'm not sure if the author(@halfrost) has time to take a look :)

karmada-bot commented 9 months ago

@RainbowMango: This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-help command.

In response to [this](https://github.com/karmada-io/karmada/issues/3854):

> /help
>
> Let's call for help to figure out the root cause.
>
> I'm not sure if the author(@halfrost) has time to take a look :)

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

halfrost commented 9 months ago

I can fix this issue. An index of negative one means that no matching resource model can be found. Let me write some test code to verify it.

halfrost commented 9 months ago

@tedli Hi, could you provide some parameters to initialize resource modeling? I want to reproduce this issue on my local machine.

tedli commented 9 months ago

Hi @halfrost ,

Sorry for the late reply; here is my Cluster CR.

Click to view

```yaml apiVersion: cluster.karmada.io/v1alpha1 kind: Cluster metadata: creationTimestamp: "2023-07-19T07:33:03Z" finalizers: - karmada.io/cluster-controller generation: 3 labels: k8s.qihoo.net/cluster-type: public karmada-cluster.k8s.qihoo.net/managed: "true" name: pub-shbt resourceVersion: "4113183" uid: 49cbf86d-8d65-4f03-bbad-bf78d53b6aa3 spec: apiEndpoint: https://pub-shbt.k8s.qihoo.net id: 447a3118-39a1-4bbd-bd41-539c3071a2e2 impersonatorSecretRef: name: pub-shbt-impersonator namespace: karmada-cluster resourceModels: - grade: 0 ranges: - max: "1" min: "0" name: cpu - max: 4Gi min: "0" name: memory - grade: 1 ranges: - max: "2" min: "1" name: cpu - max: 16Gi min: 4Gi name: memory - grade: 2 ranges: - max: "4" min: "2" name: cpu - max: 32Gi min: 16Gi name: memory - grade: 3 ranges: - max: "8" min: "4" name: cpu - max: 64Gi min: 32Gi name: memory - grade: 4 ranges: - max: "16" min: "8" name: cpu - max: 128Gi min: 64Gi name: memory - grade: 5 ranges: - max: "32" min: "16" name: cpu - max: 256Gi min: 128Gi name: memory - grade: 6 ranges: - max: "64" min: "32" name: cpu - max: 512Gi min: 256Gi name: memory - grade: 7 ranges: - max: "128" min: "64" name: cpu - max: 1Ti min: 512Gi name: memory - grade: 8 ranges: - max: "9223372036854775807" min: "128" name: cpu - max: "9223372036854775807" min: 1Ti name: memory secretRef: name: pub-shbt namespace: karmada-cluster syncMode: Push status: apiEnablements: - groupVersion: acme.cert-manager.io/v1 resources: - kind: Challenge name: challenges - kind: Order name: orders - groupVersion: admissionregistration.k8s.io/v1 resources: - kind: MutatingWebhookConfiguration name: mutatingwebhookconfigurations - kind: ValidatingWebhookConfiguration name: validatingwebhookconfigurations - groupVersion: apiextensions.k8s.io/v1 resources: - kind: CustomResourceDefinition name: customresourcedefinitions - groupVersion: apiregistration.k8s.io/v1 resources: - kind: APIService name: apiservices - groupVersion: apps/v1 resources: - kind: ControllerRevision name: controllerrevisions - kind: DaemonSet name: daemonsets - kind: Deployment name: deployments - kind: ReplicaSet name: replicasets - kind: StatefulSet name: statefulsets - groupVersion: authentication.k8s.io/v1 resources: - kind: TokenReview name: tokenreviews - groupVersion: authorization.k8s.io/v1 resources: - kind: LocalSubjectAccessReview name: localsubjectaccessreviews - kind: SelfSubjectAccessReview name: selfsubjectaccessreviews - kind: SelfSubjectRulesReview name: selfsubjectrulesreviews - kind: SubjectAccessReview name: subjectaccessreviews - groupVersion: autoscaling/v1 resources: - kind: HorizontalPodAutoscaler name: horizontalpodautoscalers - groupVersion: autoscaling/v2 resources: - kind: HorizontalPodAutoscaler name: horizontalpodautoscalers - groupVersion: autoscaling/v2beta1 resources: - kind: HorizontalPodAutoscaler name: horizontalpodautoscalers - groupVersion: autoscaling/v2beta2 resources: - kind: HorizontalPodAutoscaler name: horizontalpodautoscalers - groupVersion: batch/v1 resources: - kind: CronJob name: cronjobs - kind: Job name: jobs - groupVersion: batch/v1beta1 resources: - kind: CronJob name: cronjobs - groupVersion: catalog.cattle.io/v1 resources: - kind: App name: apps - kind: ClusterRepo name: clusterrepos - kind: Operation name: operations - groupVersion: cert-manager.io/v1 resources: - kind: CertificateRequest name: certificaterequests - kind: Certificate name: certificates - kind: ClusterIssuer name: clusterissuers - kind: Issuer name: issuers - groupVersion: 
certificates.k8s.io/v1 resources: - kind: CertificateSigningRequest name: certificatesigningrequests - groupVersion: cilium.io/v2 resources: - kind: CiliumClusterwideNetworkPolicy name: ciliumclusterwidenetworkpolicies - kind: CiliumEndpoint name: ciliumendpoints - kind: CiliumExternalWorkload name: ciliumexternalworkloads - kind: CiliumIdentity name: ciliumidentities - kind: CiliumNetworkPolicy name: ciliumnetworkpolicies - kind: CiliumNode name: ciliumnodes - groupVersion: cluster.cattle.io/v3 resources: - kind: ClusterAuthToken name: clusterauthtokens - kind: ClusterUserAttribute name: clusteruserattributes - groupVersion: config.gatekeeper.sh/v1alpha1 resources: - kind: Config name: configs - groupVersion: config.koordinator.sh/v1alpha1 resources: - kind: ClusterColocationProfile name: clustercolocationprofiles - groupVersion: constraints.gatekeeper.sh/v1alpha1 resources: - kind: K8sAllowedRegistries name: k8sallowedregistries - kind: K8sDisallowedCaps name: k8sdisallowedcaps - kind: K8sDisallowedHostIpc name: k8sdisallowedhostipc - kind: K8sDisallowedHostNetwork name: k8sdisallowedhostnetwork - kind: K8sDisallowedHostpathVolumes name: k8sdisallowedhostpathvolumes - kind: K8sDisallowedHostPid name: k8sdisallowedhostpid - kind: K8sDisallowedPrivilegedContainer name: k8sdisallowedprivilegedcontainer - kind: K8sDisallowedPrivilegeEscalation name: k8sdisallowedprivilegeescalation - groupVersion: constraints.gatekeeper.sh/v1beta1 resources: - kind: K8sAllowedRegistries name: k8sallowedregistries - kind: K8sDisallowedCaps name: k8sdisallowedcaps - kind: K8sDisallowedHostIpc name: k8sdisallowedhostipc - kind: K8sDisallowedHostNetwork name: k8sdisallowedhostnetwork - kind: K8sDisallowedHostpathVolumes name: k8sdisallowedhostpathvolumes - kind: K8sDisallowedHostPid name: k8sdisallowedhostpid - kind: K8sDisallowedPrivilegedContainer name: k8sdisallowedprivilegedcontainer - kind: K8sDisallowedPrivilegeEscalation name: k8sdisallowedprivilegeescalation - groupVersion: coordination.k8s.io/v1 resources: - kind: Lease name: leases - groupVersion: csi.aliyun.com/v1alpha1 resources: - kind: NodeLocalStorageInitConfig name: nodelocalstorageinitconfigs - kind: NodeLocalStorage name: nodelocalstorages - groupVersion: discovery.k8s.io/v1 resources: - kind: EndpointSlice name: endpointslices - groupVersion: discovery.k8s.io/v1beta1 resources: - kind: EndpointSlice name: endpointslices - groupVersion: events.k8s.io/v1 resources: - kind: Event name: events - groupVersion: events.k8s.io/v1beta1 resources: - kind: Event name: events - groupVersion: expansion.gatekeeper.sh/v1alpha1 resources: - kind: ExpansionTemplate name: expansiontemplate - groupVersion: externaldata.gatekeeper.sh/v1alpha1 resources: - kind: Provider name: providers - groupVersion: externaldata.gatekeeper.sh/v1beta1 resources: - kind: Provider name: providers - groupVersion: flowcontrol.apiserver.k8s.io/v1beta1 resources: - kind: FlowSchema name: flowschemas - kind: PriorityLevelConfiguration name: prioritylevelconfigurations - groupVersion: flowcontrol.apiserver.k8s.io/v1beta2 resources: - kind: FlowSchema name: flowschemas - kind: PriorityLevelConfiguration name: prioritylevelconfigurations - groupVersion: kubeadm.io.qihoo.com/v1 resources: - kind: Kubecluster name: kubeclusters - groupVersion: management.cattle.io/v3 resources: - kind: APIService name: apiservices - kind: AuthConfig name: authconfigs - kind: ClusterRegistrationToken name: clusterregistrationtokens - kind: Cluster name: clusters - kind: Feature name: features - kind: 
GroupMember name: groupmembers - kind: Group name: groups - kind: Preference name: preferences - kind: Setting name: settings - kind: Token name: tokens - kind: UserAttribute name: userattributes - kind: User name: users - groupVersion: metrics.k8s.io/v1beta1 resources: - kind: NodeMetrics name: nodes - kind: PodMetrics name: pods - groupVersion: monitoring.coreos.com/v1 resources: - kind: Alertmanager name: alertmanagers - kind: PodMonitor name: podmonitors - kind: Prometheus name: prometheuses - kind: PrometheusRule name: prometheusrules - kind: ServiceMonitor name: servicemonitors - kind: ThanosRuler name: thanosrulers - groupVersion: mutations.gatekeeper.sh/v1 resources: - kind: Assign name: assign - kind: AssignMetadata name: assignmetadata - kind: ModifySet name: modifyset - groupVersion: mutations.gatekeeper.sh/v1alpha1 resources: - kind: Assign name: assign - kind: AssignMetadata name: assignmetadata - kind: ModifySet name: modifyset - groupVersion: mutations.gatekeeper.sh/v1beta1 resources: - kind: Assign name: assign - kind: AssignMetadata name: assignmetadata - kind: ModifySet name: modifyset - groupVersion: network.hulk.qihoo.net/v1alpha1 resources: - kind: LoadBalancer name: loadbalancers - groupVersion: networking.k8s.io/v1 resources: - kind: IngressClass name: ingressclasses - kind: Ingress name: ingresses - kind: NetworkPolicy name: networkpolicies - groupVersion: node.k8s.io/v1 resources: - kind: RuntimeClass name: runtimeclasses - groupVersion: node.k8s.io/v1beta1 resources: - kind: RuntimeClass name: runtimeclasses - groupVersion: policy/v1 resources: - kind: PodDisruptionBudget name: poddisruptionbudgets - groupVersion: policy/v1beta1 resources: - kind: PodDisruptionBudget name: poddisruptionbudgets - kind: PodSecurityPolicy name: podsecuritypolicies - groupVersion: rbac.authorization.k8s.io/v1 resources: - kind: ClusterRoleBinding name: clusterrolebindings - kind: ClusterRole name: clusterroles - kind: RoleBinding name: rolebindings - kind: Role name: roles - groupVersion: scheduling.k8s.io/v1 resources: - kind: PriorityClass name: priorityclasses - groupVersion: scheduling.koordinator.sh/v1alpha1 resources: - kind: Device name: devices - kind: PodMigrationJob name: podmigrationjobs - kind: Reservation name: reservations - groupVersion: scheduling.sigs.k8s.io/v1alpha1 resources: - kind: ElasticQuota name: elasticquotas - kind: PodGroup name: podgroups - groupVersion: slo.koordinator.sh/v1alpha1 resources: - kind: NodeMetric name: nodemetrics - kind: NodeSLO name: nodeslos - groupVersion: snapshot.storage.k8s.io/v1 resources: - kind: VolumeSnapshotClass name: volumesnapshotclasses - kind: VolumeSnapshotContent name: volumesnapshotcontents - kind: VolumeSnapshot name: volumesnapshots - groupVersion: snapshot.storage.k8s.io/v1beta1 resources: - kind: VolumeSnapshotClass name: volumesnapshotclasses - kind: VolumeSnapshotContent name: volumesnapshotcontents - kind: VolumeSnapshot name: volumesnapshots - groupVersion: status.gatekeeper.sh/v1beta1 resources: - kind: ConstraintPodStatus name: constraintpodstatuses - kind: ConstraintTemplatePodStatus name: constrainttemplatepodstatuses - kind: MutatorPodStatus name: mutatorpodstatuses - groupVersion: storage.k8s.io/v1 resources: - kind: CSIDriver name: csidrivers - kind: CSINode name: csinodes - kind: CSIStorageCapacity name: csistoragecapacities - kind: StorageClass name: storageclasses - kind: VolumeAttachment name: volumeattachments - groupVersion: storage.k8s.io/v1beta1 resources: - kind: CSIStorageCapacity name: 
csistoragecapacities - groupVersion: templates.gatekeeper.sh/v1 resources: - kind: ConstraintTemplate name: constrainttemplates - groupVersion: templates.gatekeeper.sh/v1alpha1 resources: - kind: ConstraintTemplate name: constrainttemplates - groupVersion: templates.gatekeeper.sh/v1beta1 resources: - kind: ConstraintTemplate name: constrainttemplates - groupVersion: topology.node.k8s.io/v1alpha1 resources: - kind: NodeResourceTopology name: noderesourcetopologies - groupVersion: ui.cattle.io/v1 resources: - kind: NavLink name: navlinks - groupVersion: v1 resources: - kind: Binding name: bindings - kind: ComponentStatus name: componentstatuses - kind: ConfigMap name: configmaps - kind: Endpoints name: endpoints - kind: Event name: events - kind: LimitRange name: limitranges - kind: Namespace name: namespaces - kind: Node name: nodes - kind: PersistentVolumeClaim name: persistentvolumeclaims - kind: PersistentVolume name: persistentvolumes - kind: Pod name: pods - kind: PodTemplate name: podtemplates - kind: ReplicationController name: replicationcontrollers - kind: ResourceQuota name: resourcequotas - kind: Secret name: secrets - kind: ServiceAccount name: serviceaccounts - kind: Service name: services conditions: - lastTransitionTime: "2023-07-19T07:38:49Z" message: cluster is healthy and ready to accept workloads reason: ClusterReady status: "True" type: Ready kubernetesVersion: v1.24.13 nodeSummary: readyNum: 211 totalNum: 212 resourceSummary: allocatable: cpu: "7560" ephemeral-storage: "502308137393688" hugepages-1Gi: "0" hugepages-2Mi: "0" koordinator.sh/gpu: "12800" koordinator.sh/gpu-core: "12800" koordinator.sh/gpu-memory: 1891602Mi koordinator.sh/gpu-memory-ratio: "12800" kubernetes.io/batch-cpu: "470529" kubernetes.io/batch-memory: "3540083593912" memory: 44085786916Ki netEgress: 218375Mi netIngress: 218375Mi nvidia.com/gpu: "136" pods: "12720" allocatableModelings: - count: 5 grade: 0 - count: 46 grade: 1 - count: 56 grade: 2 - count: 17 grade: 3 - count: 7 grade: 4 - count: 16 grade: 5 - count: 12 grade: 6 - count: 0 grade: 7 - count: 0 grade: 8 allocated: cpu: 5567273m kubernetes.io/batch-cpu: 529k kubernetes.io/batch-memory: "3439597715456" memory: "28402599938109" nvidia.com/gpu: "113" pods: "11054" allocating: cpu: 10m kubernetes.io/batch-cpu: "126720" kubernetes.io/batch-memory: "824633720832" pods: "25" ```

halfrost commented 9 months ago

@tedli I have received your case. I'll fix this tonight. I'm sooooo sorry; I've been busy with work these two days.

RainbowMango commented 9 months ago

No worries, take your time @halfrost. I appreciate that you can help with it.

halfrost commented 8 months ago

@RainbowMango I am definitely responsible for my code to the end, this is my responsibility. I am also very sorry for the bug in my code.

I kind of forgot the conclusion of our discussion of requirements. For the scenario above, if there is a request with cpu = 1, memory = 1024, which model should I return? Is it model 7? Both cpu and memory requests need to be satisfied, right?

halfrost commented 8 months ago

@tedli Hi, if possible, could you tell me what the input parameter of the AddToResourceSummary(crn ClusterResourceNode) function is when the panic occurs?

tedli commented 8 months ago

Hi @halfrost ,

I don't know what the parameter is.

I can add some debug logging to dump this parameter, but I'm not sure when it will reproduce again.

halfrost commented 8 months ago

@tedli Returning -1 is a special case. Since model 8 goes up to positive infinity, an index should always be found. If you want to fix your panic quickly, the solution is to add a piece of code at line 165 of modeling.go:

    if index == -1 {
        klog.Info("ClusterResource can not add to resource summary: index is invalid.")
        return
    }

I will definitely need to add this code, but I am more curious about how to reproduce your panic; I want to find the root cause. Please help me by adding some debug logs to capture this parameter.
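For example, a rough sketch of the kind of dump I mean (the stand-in type below is only for illustration; the real ClusterResourceNode lives in pkg/modeling and its fields may differ):

```go
package main

import "k8s.io/klog/v2"

// clusterResourceNode is a stand-in for karmada's modeling.ClusterResourceNode,
// used here only to keep the sketch self-contained.
type clusterResourceNode struct {
	Quantity  int64
	Resources map[string]string
}

func addToResourceSummary(crn clusterResourceNode) {
	// Dump the full argument on entry so a panic-triggering input can be
	// pasted back into a unit test later.
	klog.Infof("AddToResourceSummary called with crn=%+v", crn)
	// ... existing modeling logic would follow here ...
}

func main() {
	addToResourceSummary(clusterResourceNode{
		Quantity:  1,
		Resources: map[string]string{"cpu": "2", "memory": "4Gi"},
	})
	klog.Flush()
}
```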

RainbowMango commented 8 months ago

I kind of forgot the conclusion of our discussion of requirements. For the scenario above, if there is a request with cpu = 1, memory = 1024, which model should I return? Is it model 7? Both cpu and memory requests need to be satisfied, right?

resourceModels:
  - grade: 0
    ranges:
    - max: "1"
      min: "0"
      name: cpu
    - max: 4Gi
      min: "0"
      name: memory
  - grade: 1
    ranges:
    - max: "2"
      min: "1"
      name: cpu
    - max: 16Gi
      min: 4Gi
      name: memory
  - grade: 2
    ranges:
    - max: "4"
      min: "2"
      name: cpu
    - max: 32Gi
      min: 16Gi
      name: memory

If a node has CPU=1, even with memory larger than 32Gi, it still belongs to grade 2.

halfrost commented 8 months ago

@tedli Hi, any updates?

tedli commented 8 months ago

Hi @halfrost , I did add the debug log, but until now nothing has shown up. I'm not sure how to reproduce this.

halfrost commented 8 months ago

@tedli Hi, any updates? Can you reproduce this panic again?

RainbowMango commented 8 months ago

Perhaps we can't rely on @tedli to reproduce this issue; it's not feasible to ask a user to spend time trying to reproduce it.

I remember we fixed a panic issue before (I'll try to find the PR). @tedli, which version of Karmada are you using?

tedli commented 8 months ago

Hi @halfrost ,

Thanks for keeping an eye on this issue. Sadly, it seems it hasn't reproduced again. I will add a Prometheus metric; when this reproduces again, the metric will alert me, and then I will let you know.

Hi @RainbowMango , I'm using 1.6.1.

RainbowMango commented 8 months ago

It is https://github.com/karmada-io/karmada/issues/3472; we fixed it in release-1.6.

halfrost commented 8 months ago

Perhaps we can't rely on @tedli to reproduce this issue; it's not feasible to ask a user to spend time trying to reproduce it.

I remember we fixed a panic issue before (I'll try to find the PR). @tedli, which version of Karmada are you using?

@RainbowMango So sorry about that. It's my fault. I'm trying to reproduce this problem again.

RainbowMango commented 8 months ago

I always appreciate your help; I'm definitely not blaming you. We just merged the defensive solution (#3950), so no rush.

RainbowMango commented 8 months ago

Thanks for keeping an eye on this issue. Sadly, it seems it hasn't reproduced again. I will add a Prometheus metric; when this reproduces again, the metric will alert me, and then I will let you know.

I remember you fixed it with defensive code, so this panic is unlikely to reproduce on your side, isn't it?

tedli commented 8 months ago

Hi all,

I remember you fixed it with defensive code

Yes; what's more, I added logic to dump the arguments so they can be logged for reproduction: https://github.com/karmada-io/karmada/issues/3854#issuecomment-1670710013

I will post the dumped arguments if it appears again.

tedli commented 8 months ago

UPDATE: There is another place that goes index out of range, and it's easier to debug than the previous one because it only relies on the Cluster CR. I unmarshaled the cluster to try to reproduce it locally, but it didn't reproduce.

(screenshots of the panic omitted)

Another place that caused an index out of range:

https://github.com/karmada-io/karmada/blob/90ccbcdfceb0c65cc0b49445bb5e54aa4819c7e6/pkg/modeling/modeling.go#L88-L93

The out-of-range index is on line 91.

The capture shows line 106 because I added the argument-dumping code mentioned above (screenshot omitted), in order to reproduce the previous panic (but it hasn't reproduced again).

halfrost commented 8 months ago

@tedli Got it, let me dig deeper.

NickYadance commented 7 months ago

One of my talented colleagues solved this by adding a mutex lock before reading or writing modelSortings (and of course unlocking after the read or write). The issue has never happened again since the fix.

+   mu.Lock()
+   defer mu.Unlock()
    // generate a sorted array by first priority of ResourceName
    modelSortings = make([][]resource.Quantity, len(rsName))
    for index := 0; index < len(rsList); index++ {
        for i, name := range rsName {
            modelSortings[i] = append(modelSortings[i], rsList[index][name])
        }
    }
func (rs *ResourceSummary) getIndex(crn ClusterResourceNode) int {
    index := math.MaxInt
+   mu.Lock()
+   defer mu.Unlock()

In our panic cases, this function can never return -1 under normal conditions because the first element of nums is 0, but it did. That's why we tried the mutex. https://github.com/karmada-io/karmada/blob/0b3e0d98c597124a82fdcf30522b57067296f645/pkg/modeling/modeling.go#L116
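Here is a minimal standalone sketch (toy code, not karmada's) of the suspected pattern: one goroutine rebuilds a shared package-level slice while another reads it. Running it with the -race flag makes the Go race detector report it immediately:

```go
package main

import "sync"

// shared mimics the package-level modelSortings: rebuilt by one goroutine
// while another reads it, with no synchronization.
var shared [][]int

func rebuild() {
	shared = make([][]int, 2) // a reader may see the table mid-rebuild
	for i := range shared {
		shared[i] = append(shared[i], 0, 10, 20)
	}
}

func search() int {
	found := 0
	for _, row := range shared {
		if len(row) > 0 && row[0] == 0 { // the "first element is 0" invariant
			found++
		}
	}
	return found
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(2)
		go func() { defer wg.Done(); rebuild() }()
		go func() { defer wg.Done(); _ = search() }()
	}
	wg.Wait()
}
```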

halfrost commented 7 months ago

@NickYadance

Your modification certainly makes sense. First, regarding the initialization of modelSortings, I believe this function only runs once during initialization, and scenarios where this function is called concurrently don't exist. Hence, I didn't add a lock().

The getIndex function looks like it needs a lock. I was so focused on ensuring there are no bugs in my algorithm logic that I overlooked the fact that this piece of code could be called concurrently in a multi-threaded environment. I tested in a local environment with only 2 machines and didn't encounter any concurrency issues.

Please extend my gratitude to your colleague. I'm truly grateful for his fix.

NickYadance commented 7 months ago

Hope this helps, though I failed to find the root cause of the concurrent situation. It just works ...

jwcesign commented 7 months ago

Hi, @halfrost @NickYadance

I reviewed the code and found that the function AddToResourceSummary will be called when there is a change in the status of the clusters.

The default concurrency for handling cluster status is 5: https://github.com/karmada-io/karmada/blob/5c77f4516eb9c27c0ce38ed25a389ab9e56d8122/cmd/controller-manager/app/options/options.go#L206

Based on https://github.com/karmada-io/karmada/issues/3854#issuecomment-1717412500, the root cause is concurrent reads and writes of the global variable modelSortings.

And I also checked the code here: https://github.com/karmada-io/karmada/blob/5c77f4516eb9c27c0ce38ed25a389ab9e56d8122/pkg/controllers/status/cluster_status_controller.go#L595C1-L620C1

It looks like modelSortings can be changed to a local variable; then this problem will not exist. What do you think? @halfrost

halfrost commented 6 months ago

@jwcesign Hi, I'm back, and so sorry for the late reply. I had Covid for the last two weeks and felt terrible, but I have recovered. I have 2 questions about your comment:

  1. The number 5: how did you come up with this number? Through lots of concurrent testing?
  2. For modelSortings: it cannot be a local variable because this state needs to be kept up to date in real time. In the binary search below, we need to search against the latest state. I prefer the above approach of adding a lock(). After I locked it locally, there was indeed no panic.

Regarding the concurrency investigation of function AddToResourceSummary, what you actually mean is that the getAllocatableModelings() function will be called concurrently in the controller, right? AddToResourceSummary will be called in the function getAllocatableModelings(). I just tested it locally and I can indeed see a lot of calls to the AddToResourceSummary function. I think this function caused the panic. I plan to add lock() every time I update the resource. Do you agree with me about this solution?

I haven't submitted the PR yet because I'm afraid that the change to add lock() will be too general and affect performance. In all update operations, resources are only added and not deleted.
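Concretely, the locking I tried locally looks roughly like this (a sketch with assumed names, not the final patch):

```go
package modeling

import (
	"sync"

	"k8s.io/apimachinery/pkg/api/resource"
)

// Sketch only: a package-level mutex guards every rebuild of and lookup into
// the shared modelSortings table. Function names here are illustrative.
var (
	mu            sync.Mutex
	modelSortings [][]resource.Quantity
)

// rebuildModelSortings replaces the shared table under the lock so that no
// concurrent reader can observe it half-built.
func rebuildModelSortings(boundaries [][]resource.Quantity) {
	mu.Lock()
	defer mu.Unlock()
	modelSortings = make([][]resource.Quantity, len(boundaries))
	copy(modelSortings, boundaries)
}

// quantityAt reads from the shared table under the same lock.
func quantityAt(i, j int) resource.Quantity {
	mu.Lock()
	defer mu.Unlock()
	return modelSortings[i][j]
}
```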

jwcesign commented 6 months ago

Hi, @halfrost

  1. By reviewing the code, getAllocatableModelings is called here: https://github.com/karmada-io/karmada/blob/5c77f4516eb9c27c0ce38ed25a389ab9e56d8122/pkg/controllers/status/cluster_status_controller.go#L108, and the concurrency of calling this function is 5 by default.
  2. The AddToResourceSummary function initializes defaultModelSorting and modelSortings every time it's called and then uses them, and these functions are only used in getAllocatableModelings. So I think defaultModelSorting and modelSortings can be local variables in the function getAllocatableModelings. And if they are local variables there, their values are still real-time.

Do I understand point 2 correctly? @halfrost

chaosi-zju commented 6 months ago

Hi, @halfrost

I generally agree with what @jwcesign said:

defaultModelSorting and modelSortings can be local variables in the function getAllocatableModelings

Actually, I think defaultModelSorting and modelSortings can be member variables of ResourceSummary; in this way, we avoid the concurrency issues caused by global variables and avoid the use of locks.
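Roughly in this direction (the field and function names below are just illustrations, not the exact code of the eventual change):

```go
package modeling

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// Illustrative sketch: the sort table becomes per-ResourceSummary state instead
// of a package-level global, so summaries built for different clusters no
// longer share mutable data. Field names and parameter types are assumptions.
type ResourceSummary struct {
	modelSortings [][]resource.Quantity // one sorted boundary list per resource name
	// ... existing fields ...
}

func (rs *ResourceSummary) generateModelSortings(rsName []corev1.ResourceName, rsList []map[corev1.ResourceName]resource.Quantity) {
	rs.modelSortings = make([][]resource.Quantity, len(rsName))
	for index := range rsList {
		for i, name := range rsName {
			rs.modelSortings[i] = append(rs.modelSortings[i], rsList[index][name])
		}
	}
}
```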

halfrost commented 6 months ago

@jwcesign Thanks for the advice, let me test it this weekend. I will find out the final solution.

@chaosi-zju OK, that makes sense. Let me revise this part of the code. Thanks.

halfrost commented 6 months ago

@jwcesign @chaosi-zju

I have already converted modelSortings into a member variable of ResourceSummary. Modifying defaultModelSorting is more challenging because it's used as a global variable within the clusterResourceNodeComparator comparator. Additionally, this comparator is used in other comparison functions.

🙏 Please help me review PR #4145.