hellofresh / kangal

Run performance tests in Kubernetes cluster with Kangal

Flow control throttling kangal and sysdig? #333

Open flah00 opened 8 months ago

flah00 commented 8 months ago

I recently installed Sysdig on a test cluster. As it happens, it's the same cluster I run load tests on. While Sysdig was running I started a load test. Initially, the Kangal controller timed out creating Kubernetes resources, so I increased the Kubernetes client timeout (a sketch of the change is below).
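For reference, this is roughly the change I made. I'm assuming the controller picks up its client timeout from an environment variable exposed through the Helm values; the key names here are placeholders from memory and may differ in your chart version.

controller:
  env:
    # Hypothetical key: raise the client-go request timeout from its
    # default so resource creation survives a slow API server.
    KUBE_CLIENT_TIMEOUT: 30s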

Even with the longer timeout, the Kangal controller was unable to create all of the Kubernetes resources on its first pass, though it succeeded on the second attempt. The error and stack trace are included below.

Feb 12 09:30:50.961 kangal-controller E0212 14:30:50.108353 1 loadtest.go:472] there is a conflict with loadtest 'loadtest-coiling-lightningbug' between datastore and cache. it might be because object has been removed or modified in the datastore
Feb 12 09:30:50.961 kangal-controller Created JMeter resources
Feb 12 09:30:40.866 kangal-controller Created pods with test data
Feb 12 09:30:10.769 kangal-controller Remote custom data enabled, creating PVC
Feb 12 09:29:55.762 kangal-controller E0212 14:29:54.895207 1 loadtest.go:309] error syncing 'loadtest-coiling-lightningbug': client rate limiter Wait returned an error: context deadline exceeded, requeuing
Feb 12 09:29:55.762 kangal-controller error syncing loadtest, re-queuing
Feb 12 09:29:55.762 kangal-controller Error on creating new JMeter service
Feb 12 09:29:55.762 kangal-controller Created pods with test data
Feb 12 09:29:15.659 kangal-controller Remote custom data enabled, creating PVC
Feb 12 09:29:00.590 kangal-controller Created new namespace

Stack trace

github.com/hellofresh/kangal/pkg/controller.(*Controller).processNextWorkItem.func1
    /home/runner/work/kangal/kangal/pkg/controller/loadtest.go:299
github.com/hellofresh/kangal/pkg/controller.(*Controller).processNextWorkItem
    /home/runner/work/kangal/kangal/pkg/controller/loadtest.go:307
github.com/hellofresh/kangal/pkg/controller.(*Controller).runWorker
    /home/runner/work/kangal/kangal/pkg/controller/loadtest.go:240
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
    /home/runner/go/pkg/mod/k8s.io/apimachinery@v0.25.1/pkg/util/wait/wait.go:157
k8s.io/apimachinery/pkg/util/wait.BackoffUntil
    /home/runner/go/pkg/mod/k8s.io/apimachinery@v0.25.1/pkg/util/wait/wait.go:158
k8s.io/apimachinery/pkg/util/wait.JitterUntil
    /home/runner/go/pkg/mod/k8s.io/apimachinery@v0.25.1/pkg/util/wait/wait.go:135
k8s.io/apimachinery/pkg/util/wait.Until
    /home/runner/go/pkg/mod/k8s.io/apimachinery@v0.25.1/pkg/util/wait/wait.go:92

Workaround

I uninstalled Sysdig and Kubernetes API response times were much peppier; the Kangal controller now also succeeds on its first pass. I'm already in touch with Sysdig support regarding the problem. Clearly they have some work to do, but maybe Kangal does as well?

Solution?

I'm not really sure where the responsibility for flow control is expected to sit... Should it be the exclusive province of cluster admins? Should charts offer some guidance for their apps? Should Kangal ship a PriorityLevelConfiguration and FlowSchema for its service account? A rough sketch of what that could look like is below.
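Something along these lines, using the Kubernetes API Priority and Fairness objects. This is only a sketch: the service account name and namespace are placeholders for wherever Kangal is installed, the concurrency and queuing numbers are arbitrary, and the flowcontrol apiVersion depends on your cluster version (v1beta3 for 1.26-1.28, v1 from 1.29).

apiVersion: flowcontrol.apiserver.k8s.io/v1beta3
kind: PriorityLevelConfiguration
metadata:
  name: kangal-controller
spec:
  type: Limited
  limited:
    # Arbitrary share; tune to taste.
    nominalConcurrencyShares: 20
    limitResponse:
      type: Queue
      queuing:
        queues: 16
        handSize: 4
        queueLengthLimit: 50
---
apiVersion: flowcontrol.apiserver.k8s.io/v1beta3
kind: FlowSchema
metadata:
  name: kangal-controller
spec:
  priorityLevelConfiguration:
    name: kangal-controller
  matchingPrecedence: 1000
  distinguisherMethod:
    type: ByUser
  rules:
    - subjects:
        # Placeholder: match whatever service account the chart creates.
        - kind: ServiceAccount
          serviceAccount:
            name: kangal-controller
            namespace: kangal
      resourceRules:
        - verbs: ["*"]
          apiGroups: ["*"]
          resources: ["*"]
          clusterScope: true
          namespaces: ["*"]

In principle, with something like this installed the controller's requests get their own queue and concurrency share at the API server instead of competing in the default workload bucket, so a noisy agent like Sysdig is less likely to starve it.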

What do folks think?

stale[bot] commented 7 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.