The following are the different reasons for failure:
Fix `test-e2e-upgrade-test-prepare-data` (link)
Instability in the prow job for the upgrade scenario in SKR was due to various reasons. The most significant ones are as follows:
`HelmBrokerConflictUpgradeTest` in the upgrade test failed a couple of times. The investigation will be done in #9121. Although this was hard to reproduce, the pre-submit upgrade succeeded almost all the time: https://status.build.kyma-project.io/?pull=9114&job=pre-master-kyma-gke-upgrade
The `Document view` test in the Console web test-suite was often failing. https://github.com/kyma-project/kyma/pull/9133 disabled the test, which improved stability significantly. The cause is still unknown.
There were innumerable occurrences of `Connection closed early` in the GCloud Stackdriver logs: https://console.cloud.google.com/logs/query;query=resource.labels.cluster_name%3D%22gke-upgrade-rel-dfi4zjazi2%22%0AprotoPayload.serviceName%3D%22k8s.io%22%0AprotoPayload.status.message%3D%22Connection%20closed%20early%22?authuser=1&project=sap-kyma-prow-workloads&query=%0A The cause is still unknown, but the errors also appeared in cases where the cluster was healthy. Chasing this issue did not bring much insight into the problem.
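For anyone chasing this further, the same query can be run from the CLI. This is a sketch using `gcloud logging read`, with the filter fields taken from the link above:

```bash
# Read the log entries matched by the Stackdriver link above.
gcloud logging read \
  'resource.labels.cluster_name="gke-upgrade-rel-dfi4zjazi2"
   protoPayload.serviceName="k8s.io"
   protoPayload.status.message="Connection closed early"' \
  --project=sap-kyma-prow-workloads \
  --limit=50
```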
One of the reasons for the upgrade test failure was that the prepare-data stage failed in `eventmeshupgradetest`: `test-e2e-upgrade-test-prepare-data` fails in #8435. This could not be reproduced at all (see here).
We did the following:
Switching to regional GKE clusters helped restore stability, as regional clusters come with guarantees on the availability of the kube API servers.
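For illustration, this is roughly what the switch amounts to (the cluster name, region, and node count are placeholders, not the actual job configuration):

```bash
# A regional cluster replicates the control plane across the region's zones,
# which is where the kube API server availability guarantee comes from.
gcloud container clusters create gke-upgrade-example \
  --region europe-west1 \
  --num-nodes 1   # per zone, so three nodes in total for a three-zone region
```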
With n1-standard-4 machines in GKE, the available CPU was clearly not enough for the job's demand in some situations. One snapshot of `kubectl top nodes` from the past showed:
```
NAME                                                 CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
gke-gke-upgrade-rel-dfi4-default-pool-107e4c2d-3km1  3349m        85%    4902Mi          39%
gke-gke-upgrade-rel-dfi4-default-pool-3e8fd7cc-xz09  3959m        100%   4239Mi          34%
gke-gke-upgrade-rel-dfi4-default-pool-60df5466-62w6  3039m        77%    3689Mi          29%
```
It was observed that istio-pilot scaled up and needed more CPU during certain tests. Later, we started using highcpu-16 machines, and the stability of the job was much better.
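A minimal sketch of how this can be observed while a test runs (assuming Istio's default `istio-system` namespace, an HPA named `istio-pilot`, and the `app=pilot` pod label, none of which are stated above):

```bash
# Watch the pilot HPA scale up, and sample the pilot pods' CPU usage.
kubectl -n istio-system get hpa istio-pilot --watch
kubectl -n istio-system top pods -l app=pilot
```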
Now, to answer the question of why some of the tests made istio-pilot ask for more CPU, I executed each test separately and found the following:

| Test | Comment |
|---|---|
| dex-integration | Harmless |
| kiali | Harmless |
| service-catalog | Harmless |
| connector-service | Harmless |
| rafter | Harmless |
| serverless | Harmless |
| connection-token-handler | Harmless |
| console-backend | Harmless (mostly; there was a small CPU spike for a short time) |
| core-test-external-solution | Confirmed CPU demand; causes istio-pilot to scale up |
| logging | Harmless |
| monitoring | Harmless |
| serverless-long | Harmless |
| application-connector | Demands CPU |
| application-operator | Makes istio-pilot scale up from 3 to 5 replicas, and 3/5 istio-pilot pods become CPU hungry |
| application-registry | Demands CPU for a short period of time |
| api-gateway | Demands CPU |
| apiserver-proxy | Mostly harmless, with a very short CPU spike in istio-pilot |
| cluster-users | Does not make istio-pilot CPU hungry, but the test pod has a CPU spike and is long-running |
| dex-connection | Harmless |
| test-e2e-upgrade-execute-tests | Harmless |
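As a sketch, a single test from the table can be run in isolation with the Kyma CLI (the test name here is one of the CPU-hungry ones; exact subcommands may differ between CLI versions):

```bash
# List the test definitions installed on the cluster, then run one of them.
kyma test definitions
kyma test run application-operator
```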
By the way, the Kyma installer also makes istio-pilot demand more CPU and scale up during the Kyma upgrade process (during the upgrade of the monitoring chart).
IMO, the following are the action items:
The CPU-hungry tests can be executed solo if needed (see the sketch after this list). We need to keep in mind that this makes the dev workflow slower.
Each test needs to be profiled to understand the following:
This is a dev pipeline job that is used to run the tests in parallel. IMO, giving it high CPUs to make the tests fast and stable is totally reasonable. What is not reasonable is not having a stress-testing job for an out-of-the-box Kyma cluster with SLOs around it. Such a job would actually test Kyma on the SKR infrastructure specifications with a load that we have SLOs on.
Improve the testing mechanism (`kyma test`).
Break `test-e2e-upgrade-execute-tests` into separate test definitions. Debugging gets easier. Does it bring value ATM?
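One possible shape for running selected tests as their own suite, using Kyma's Octopus `ClusterTestSuite` CRD (the suite name, test names, and namespace are illustrative assumptions, and the field names should be checked against the Octopus schema):

```bash
# Create a suite that runs only two of the CPU-hungry tests, one at a time.
kubectl apply -f - <<EOF
apiVersion: testing.kyma-project.io/v1alpha1
kind: ClusterTestSuite
metadata:
  name: cpu-hungry-tests             # illustrative name
spec:
  concurrency: 1                     # run the tests solo, one after another
  maxRetries: 1
  selectors:
    matchNames:
      - name: application-operator   # assumed TestDefinition names
        namespace: kyma-system
      - name: api-gateway
        namespace: kyma-system
EOF
```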
Action Items:
Issues are created for the tests which demand more CPU, to get more insight (and fix things if needed).
https://github.com/kyma-project/kyma/issues/9177 is created to collect metrics for tests in an automated fashion and report them continuously (maybe in testgrid); a rough sketch of such collection is shown after this list.
The current upgrade jobs are really flaky, and to ensure that the right changes go into our master branch, we need to stabilise these upgrade jobs.
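A minimal sketch of the kind of automated collection #9177 asks for (the interval, output file, and sampled namespace are arbitrary choices, not from the issue):

```bash
# Sample node and istio-pilot CPU/memory usage every 30s while a suite runs.
while true; do
  date -u +%FT%TZ                  >> usage.log
  kubectl top nodes                >> usage.log
  kubectl -n istio-system top pods >> usage.log
  sleep 30
done
```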
AC