kyma-project / kyma

Kyma is an opinionated set of Kubernetes-based modular building blocks, including all necessary capabilities to develop and run enterprise-grade cloud-native applications.
https://kyma-project.io
Apache License 2.0
1.51k stars 404 forks source link

Investigate the underlying Cause for the Failure of GKE- Upgrade Tests #9059

Closed anishj0shi closed 4 years ago

anishj0shi commented 4 years ago

Current Upgrade jobs are really flaky. and to ensure that the right change goes into our master branch, we need to stabilise these upgrade jobs .

AC

sayanh commented 4 years ago

Following are different reasons of failure:

dekiel commented 4 years ago

Today failure https://status.build.kyma-project.io/view/gcs/kyma-prow-logs/logs/kyma-upgrade-gardener-azure/1285379168627855360 Yesterday failure https://status.build.kyma-project.io/view/gcs/kyma-prow-logs/logs/kyma-upgrade-gardener-azure/1285016778316976128

dekiel commented 4 years ago

Another one on azure. https://status.build.kyma-project.io/view/gcs/kyma-prow-logs/logs/kyma-integration-gardener-azure/1286511648592367618

sayanh commented 4 years ago

Instability in the prow job for the upgrade scenario in SKR was due to various reasons. Most significant ones are as follows:

We did the following:

It was observed, istio-pilot scaled up and needed more CPU during certain tests. Later, we started using highcpu-16 machines and obviously the stability of the job was much better.

Screenshot 2020-08-03 at 09 37 33
Now to answer the question why some of the tests made istio-pilot ask for more CPU, I executed each test separately to find out the following: Test Comment
dex-integration Harmless
kiali Harmless
service-catalog Harmless
connector-service Harmless
rafter Harmless
serverless Harmless
connection-token-handler Harmless
console-backend Harmless(mostly as there was small CPU spike for a short time)
core-test-external-solution Demands CPU confirmed, causes Istio pilot to scale up
logging Harmless
monitoring Harmless
serverless-long Harmless
application-connector Demands CPU
application-operator Make Istio pilot scale up to 5 replicas from 3 and 3/5 istio-pilot pods go CPU hungry.
application-registry Demands CPU for a short period of time
api-gateway Demands CPU
apiserver-proxy Mostly harmless with a real short spike of CPU in istio pilot
cluster-users Does not cause istio pilot CPU hungry but the test pod has a CPU spike and is a long-running one
dex-connection Harmless
test-e2e-upgrade-execute-tests Harmless

By the way, Kyma installer makes Istio pilot demand for more CPU as well and makes it scale up during Kyma upgrade process(during the upgrade of monitoring chart).

IMO following are the action items:

sayanh commented 4 years ago

Action Items: