st-tech / gatling-operator

Automating distributed Gatling load testing using Kubernetes operator
MIT License

Fix duplicate slack message issue #11

Closed yokawasa closed 2 years ago

yokawasa commented 2 years ago

Description

This is a fix for issue #9.

I've made the following changes:

  1. Defer the cleanup of the Gatling job until at least one reconciliation loop has run and all the Gatling jobs have completed https://github.com/st-tech/gatling-operator/commit/2d81b9b1408205f4782d24f50bcfc031b2ef01a6
  2. Add one extra loop (requeue) after a reconciliation loop completes successfully https://github.com/st-tech/gatling-operator/commit/8eb0f1a85255986986d1efb7cd70b8577b560724

Change No. 2 above doesn't directly fix issue #9. It just adds a single loop before moving to the next stage, to avoid a timing issue.

Why did I make change No. 1 to fix the issue?

Whenever the duplicate message issue occurs, I see the following Gatling CR update error:

2021-11-10T12:23:24.509Z  ERROR controller-runtime.manager.controller.gatling.gatling.Reconcile Failed to update gatling status, and requeue  {"reconciler group": "gatling-operator.tech.zozo.com", "reconciler kind": "Gatling", "name": "zozo-aggregation-api", "namespace": "default", "error": "Operation cannot be fulfilled on gatlings.gatling-operator.tech.zozo.com \"zozo-aggregation-api\": the object has been modified; please apply your changes to the latest version and try again"}
github.com/go-logr/zapr.(*zapLogger).Error
  /go/pkg/mod/github.com/go-logr/zapr@v0.2.0/zapr.go:132
github.com/st-tech/gatling-operator/controllers.(*GatlingReconciler).gatlingNotificationReconcile
  /workspace/controllers/gatling_controller.go:392
github.com/st-tech/gatling-operator/controllers.(*GatlingReconciler).Reconcile
  /workspace/controllers/gatling_controller.go:113
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
  /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.8.3/pkg/internal/controller/controller.go:298
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
  /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.8.3/pkg/internal/controller/controller.go:253
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.2
  /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.8.3/pkg/internal/controller/controller.go:216
k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1
  /go/pkg/mod/k8s.io/apimachinery@v0.20.2/pkg/util/wait/wait.go:185
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
  /go/pkg/mod/k8s.io/apimachinery@v0.20.2/pkg/util/wait/wait.go:155
k8s.io/apimachinery/pkg/util/wait.BackoffUntil
  /go/pkg/mod/k8s.io/apimachinery@v0.20.2/pkg/util/wait/wait.go:156
k8s.io/apimachinery/pkg/util/wait.JitterUntil
  /go/pkg/mod/k8s.io/apimachinery@v0.20.2/pkg/util/wait/wait.go:133
k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext
  /go/pkg/mod/k8s.io/apimachinery@v0.20.2/pkg/util/wait/wait.go:185
k8s.io/apimachinery/pkg/util/wait.UntilWithContext
  /go/pkg/mod/k8s.io/apimachinery@v0.20.2/pkg/util/wait/wait.go:99

The relevant part of the operator source code is this:

https://github.com/st-tech/gatling-operator/blob/2be50da0642f21d66f9e4d766216e5a8d55c8bca/controllers/gatling_controller.go#L386-L389

// Implementation of reconciler logic for the notification
func (r *GatlingReconciler) gatlingNotificationReconcile(ctx context.Context, req ctrl.Request, gatling *gatlingv1alpha1.Gatling, log logr.Logger) (bool, error) {
    var reportURL = "none"
    // Get cloud storage info only if gatling.spec.generateReport is true
    if gatling.Spec.GenerateReport {
        _, url, err := r.getCloudStorageInfo(ctx, gatling)
        if err != nil {
            log.Error(err, "Failed to get gatling storage info, and requeue")
            return true, err
        }
        reportURL = url
    }
    if err := r.sendNotification(ctx, gatling, reportURL); err != nil {
        log.Error(err, "Failed to sendNotification, but and requeue")
        return true, err
    }
    // Update gatling status on notification
/////////////////////////////////////////////////////////////////////////////////////////////
    gatling.Status.NotificationCompleted = true
    if err := r.updateGatlingStatus(ctx, gatling); err != nil {
        log.Error(err, "Failed to update gatling status, and requeue")
        return true, err
    }
/////////////////////////////////////////////////////////////////////////////////////////////
    log.Info("Notification has successfully been sent!")
    return false, nil
}
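
To make the connection to duplicate messages concrete: in the function above the notification is sent before `NotificationCompleted` is persisted, so a failed status update followed by a requeue would send the notification again. The sketch below simulates only that ordering. It is my reading of the flow rather than the operator's real code, and `fakeReconciler`, `failuresLeft`, and `run` are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
)

// fakeReconciler is a hypothetical stand-in for the operator's reconciler.
type fakeReconciler struct {
	sent         int  // number of notifications sent so far
	completed    bool // persisted NotificationCompleted flag
	failuresLeft int  // how many status updates will still fail
}

// updateStatus fails while failuresLeft > 0, then persists the flag.
func (r *fakeReconciler) updateStatus() error {
	if r.failuresLeft > 0 {
		r.failuresLeft--
		return errors.New("conflict")
	}
	r.completed = true
	return nil
}

// notificationReconcile keeps the same order as the quoted function:
// send the notification first, then persist the status.
func (r *fakeReconciler) notificationReconcile() (bool, error) {
	r.sent++
	if err := r.updateStatus(); err != nil {
		return true, err // requeue
	}
	return false, nil
}

// run loops until the flag is persisted, as a requeueing controller would,
// and returns the number of notifications sent.
func run(r *fakeReconciler) int {
	for !r.completed {
		r.notificationReconcile()
	}
	return r.sent
}

func main() {
	fmt.Println(run(&fakeReconciler{failuresLeft: 0})) // 1
	fmt.Println(run(&fakeReconciler{failuresLeft: 1})) // 2
}
```

With one failed status update the simulated notification goes out twice, which matches the pattern of seeing the update error whenever a duplicate message appears.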

Right after this part, the Gatling operator cleans up the Gatling job resources. The relevant part: https://github.com/st-tech/gatling-operator/blob/2be50da0642f21d66f9e4d766216e5a8d55c8bca/controllers/gatling_controller.go#L118-L130

I moved the job cleanup so that it happens after at least one loop and after all the Gatling jobs have completed. This is based on the following assumptions:

Test

I actually made the same change to the operator on Nov 11th and deployed it to a testing environment. Since then, I haven't seen the issue in that environment. I'm not 100% sure, but after several days of observation in the testing environment, it looks like this update has fixed the issue.

yokawasa commented 2 years ago

@tmrekk121 thanks for the review. I'll go ahead and merge the PR.