teamhephy / jenkins-jobs

MIT License

add workflow stress test #23

Open Cryptophobia opened 6 years ago

Cryptophobia commented 6 years ago

From @bacongobbler on June 9, 2016 22:43

cross-post of https://github.com/deis/deis/issues/4037

@jchauncey has found some interesting problems when running significant load through deis. I'd like to see an automated version of this test (or similar) so that we can watch deis' performance over the course of future releases. We could even run this on the various providers and compare performance. :smile:

Copied from original issue: deis/jenkins-jobs#100

Cryptophobia commented 6 years ago

From @arschles on June 9, 2016 22:48

Related: https://github.com/deis/router/issues/198

Cryptophobia commented 6 years ago

From @jchauncey on June 10, 2016 16:20

So, as it stands right now, I can push a significant number of requests through Deis and not see any real degradation in performance. That being said, we need to do a few other things besides just sending a lot of requests to the router and ultimately to a simple Go app.

My thoughts on this are still kind of cloudy, but here is what I had in mind:

Get the data

Have telegraf send all metrics for e2e runs to a hosted influx system where we can collect long-term, meaningful metrics. This will allow us to spot trends and new problems more efficiently.
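A minimal sketch of what that telegraf config could look like; the InfluxDB URL, database name, and tag values below are placeholders, not real endpoints, and the actual input plugins would depend on what the e2e jobs emit:

```toml
# Hypothetical telegraf.conf fragment: ship e2e-run metrics to a hosted InfluxDB.
[global_tags]
  job = "e2e-nightly"        # tag every point so runs can be compared over time

[[outputs.influxdb]]
  urls = ["https://metrics.example.com:8086"]  # hosted influx system (placeholder)
  database = "e2e_metrics"

[[inputs.cpu]]
  percpu = true
  totalcpu = true

[[inputs.statsd]]
  service_address = ":8125"  # e2e jobs could push custom timings via statsd
```

Tagging each point with the job name is what makes long-term trend queries across releases straightforward.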

Regular e2e runs

Use the regular e2e runs to make sure we stay within certain performance bounds. We should eventually hook up kapacitor scripts to alert us when an e2e run falls outside of those parameters.
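As a sketch, a kapacitor TICKscript for this could look like the following; the measurement name (`e2e_run`), field (`duration_seconds`), and thresholds are all made up for illustration and would come from whatever the e2e jobs actually report:

```
// Hypothetical TICKscript: alert when an e2e run exceeds a duration bound.
stream
    |from()
        .measurement('e2e_run')
    |alert()
        .warn(lambda: "duration_seconds" > 300)
        .crit(lambda: "duration_seconds" > 600)
        .log('/var/log/kapacitor/e2e_alerts.log')
```

The same pattern could cover error rates or latency percentiles once those are in influx.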

The load test

Set up a nightly job that runs on a normal-sized cluster (5 or so nodes) and deploys apps that can simulate failures (return non-200 response codes), generate arbitrarily large response bodies, and maybe make calls to other dependent services. We would then use the CLI to arbitrarily scale those apps up and down while also doing simultaneous deploys and generating traffic. This would allow us to see how the system performs while apps are under load and the operator is using the system to respond.

My main concern is making sure that, during a high-load event, the controller can still receive requests to scale up/down to meet demand.