gocrane / crane

Crane is a FinOps Platform for Cloud Resource Analytics and Economics in Kubernetes clusters. The goal is not only to help users to manage cloud cost easier but also ensure the quality of applications.
https://gocrane.io
Apache License 2.0
1.87k stars 382 forks source link

When updating the HPA in the EHPA Controller, if there is a conflict, the updated content will be lost #815

Closed mtdtdev closed 1 year ago

mtdtdev commented 1 year ago

Describe the bug In the new feature (improvement hpa and substitute), if the update of hpa fails due to concurrency conflicts, the update content will be lost. https://github.com/gocrane/crane/commit/e977b5e0918c1b535df85a75476ef9da589e1a4f

Specific code: https://github.com/gocrane/crane/blob/main/pkg/controller/ehpa/hpa.go#L165 The operator wants to update HPA A1, if the update conflicts, it will directly obtain the latest HPA object A2 from the cluster, and then use A2 to re-initiate the update. And this A2 is not the content A1 that is expected to be updated. At the same time, if A2 update is successful, the exception of update failure will also be lost, resulting in A1 not being updated, but operator thinks that HPA A1 has been updated

if needUpdate {
    hpaCopy := hpaExist.DeepCopy()
    err := retry.RetryOnConflict(retry.DefaultRetry, func() error {
        err := c.Update(ctx, hpaCopy)
        if err == nil {
            return nil
        }

        updated := &autoscalingv2.HorizontalPodAutoscaler{}
        errGet := c.Get(context.TODO(), types.NamespacedName{Namespace: hpaCopy.Namespace, Name: hpaCopy.Name}, updated)
        if errGet == nil {
            hpaCopy = updated
        }

        return err
    })

    if err != nil {
        c.Recorder.Event(ehpa, v1.EventTypeWarning, "FailedUpdateHPA", err.Error())
        klog.Errorf("Failed to update HorizontalPodAutoscaler %s error %v", klog.KObj(hpaExist), err)
        return nil, err
    }

    klog.Infof("Update HorizontalPodAutoscaler successful, HorizontalPodAutoscaler %s", klog.KObj(hpaExist))
}

Reproduce steps

Expected behavior

Screenshots

Environment (please complete the following information):

qmhu commented 1 year ago

I think it is a bug, would you like to fix it ? @mtdtdev