The following are the different reasons for failure:
Fix `test-e2e-upgrade-test-prepare-data` (link)
Instability in the prow job for the upgrade scenario in SKR was due to various reasons. The most significant ones are as follows:
`HelmBrokerConflictUpgradeTest` in the upgrade test failed a couple of times. The investigation will be done in #9121. Although this was hard to reproduce, the pre-submit upgrade succeeded almost all the time: https://status.build.kyma-project.io/?pull=9114&job=pre-master-kyma-gke-upgrade
The `Document view` test in the Console web test-suite was often failing. https://github.com/kyma-project/kyma/pull/9133 disabled the test, which improved stability significantly. The cause is still unknown.
There were innumerable occurrences of `Connection closed early` in the GCloud Stackdriver logs: https://console.cloud.google.com/logs/query;query=resource.labels.cluster_name%3D%22gke-upgrade-rel-dfi4zjazi2%22%0AprotoPayload.serviceName%3D%22k8s.io%22%0AprotoPayload.status.message%3D%22Connection%20closed%20early%22?authuser=1&project=sap-kyma-prow-workloads&query=%0A The cause is still unknown, but the errors also appeared in cases where the cluster was healthy. Chasing this issue did not bring much insight into the problem.
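For anyone chasing this further, the same query can be run from the CLI. This is a sketch using `gcloud logging read`, with the filter fields taken from the link above:

```bash
# Read the log entries matched by the Stackdriver link above.
gcloud logging read \
  'resource.labels.cluster_name="gke-upgrade-rel-dfi4zjazi2"
   protoPayload.serviceName="k8s.io"
   protoPayload.status.message="Connection closed early"' \
  --project=sap-kyma-prow-workloads \
  --limit=50
```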
One of the reasons for the upgrade test failure was that the prepare-data stage failed in `eventmeshupgradetest`: `test-e2e-upgrade-test-prepare-data` fails in #8435. This could not be reproduced at all (see here).
We did the following:
Switching to regional GKE clusters helped restore stability, as regional clusters come with guarantees on the availability of the kube API servers.
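For illustration, this is roughly what the switch amounts to (the cluster name, region, and node count are placeholders, not the actual job configuration):

```bash
# A regional cluster replicates the control plane across the region's zones,
# which is where the kube API server availability guarantee comes from.
gcloud container clusters create gke-upgrade-example \
  --region europe-west1 \
  --num-nodes 1   # per zone, so three nodes in total for a three-zone region
```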
With n1-standard-4 machines in GKE, the available CPU was clearly not enough for the job's demand in some situations. One snapshot of `kubectl top nodes` from the past showed:
```
NAME                                                 CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
gke-gke-upgrade-rel-dfi4-default-pool-107e4c2d-3km1  3349m        85%    4902Mi          39%
gke-gke-upgrade-rel-dfi4-default-pool-3e8fd7cc-xz09  3959m        100%   4239Mi          34%
gke-gke-upgrade-rel-dfi4-default-pool-60df5466-62w6  3039m        77%    3689Mi          29%
```
It was observed that istio-pilot scaled up and needed more CPU during certain tests. Later, we started using highcpu-16 machines, and the stability of the job was much better.
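A minimal sketch of how this can be observed while a test runs (assuming Istio's default `istio-system` namespace, an HPA named `istio-pilot`, and the `app=pilot` pod label, none of which are stated above):

```bash
# Watch the pilot HPA scale up, and sample the pilot pods' CPU usage.
kubectl -n istio-system get hpa istio-pilot --watch
kubectl -n istio-system top pods -l app=pilot
```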
Now, to answer the question of why some of the tests made istio-pilot ask for more CPU, I executed each test separately and found the following:

| Test | Comment |
|---|---|
| dex-integration | Harmless |
| kiali | Harmless |
| service-catalog | Harmless |
| connector-service | Harmless |
| rafter | Harmless |
| serverless | Harmless |
| connection-token-handler | Harmless |
| console-backend | Harmless (mostly; there was a small CPU spike for a short time) |
| core-test-external-solution | Confirmed CPU demand; causes istio-pilot to scale up |
| logging | Harmless |
| monitoring | Harmless |
| serverless-long | Harmless |
| application-connector | Demands CPU |
| application-operator | Makes istio-pilot scale up from 3 to 5 replicas, and 3/5 istio-pilot pods become CPU hungry |
| application-registry | Demands CPU for a short period of time |
| api-gateway | Demands CPU |
| apiserver-proxy | Mostly harmless, with a very short CPU spike in istio-pilot |
| cluster-users | Does not make istio-pilot CPU hungry, but the test pod has a CPU spike and is long-running |
| dex-connection | Harmless |
| test-e2e-upgrade-execute-tests | Harmless |
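As a sketch, a single test from the table can be run in isolation with the Kyma CLI (the test name here is one of the CPU-hungry ones; exact subcommands may differ between CLI versions):

```bash
# List the test definitions installed on the cluster, then run one of them.
kyma test definitions
kyma test run application-operator
```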
By the way, the Kyma installer also makes istio-pilot demand more CPU and scale up during the Kyma upgrade process (during the upgrade of the monitoring chart).
IMO, the following are the action items:
The CPU-hungry tests can be executed solo if needed (see the sketch after this list). We need to keep in mind that this makes the dev workflow slower.
Each test needs to be profiled to understand the following:
This is a dev pipeline job that is used to run the tests in parallel. IMO, giving it high CPUs to make the tests fast and stable is totally reasonable. What is not reasonable is not having a stress-testing job for an out-of-the-box Kyma cluster with SLOs around it. Such a job would actually test Kyma on the SKR infrastructure specifications with a load that we have SLOs on.
Improve the testing mechanism (`kyma test`).
Break `test-e2e-upgrade-execute-tests` into separate test definitions. Debugging gets easier. Does it bring value ATM?
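One possible shape for running selected tests as their own suite, using Kyma's Octopus `ClusterTestSuite` CRD (the suite name, test names, and namespace are illustrative assumptions, and the field names should be checked against the Octopus schema):

```bash
# Create a suite that runs only two of the CPU-hungry tests, one at a time.
kubectl apply -f - <<EOF
apiVersion: testing.kyma-project.io/v1alpha1
kind: ClusterTestSuite
metadata:
  name: cpu-hungry-tests             # illustrative name
spec:
  concurrency: 1                     # run the tests solo, one after another
  maxRetries: 1
  selectors:
    matchNames:
      - name: application-operator   # assumed TestDefinition names
        namespace: kyma-system
      - name: api-gateway
        namespace: kyma-system
EOF
```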
Action Items:
Issues are created for the tests which demand more CPU, to get more insight (and fix things if needed).
https://github.com/kyma-project/kyma/issues/9177 is created to collect metrics for tests in an automated fashion and report them continuously (maybe in testgrid); a rough sketch of such collection is shown after this list.
The current upgrade jobs are really flaky, and to ensure that the right changes go into our master branch, we need to stabilise these upgrade jobs.
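A minimal sketch of the kind of automated collection #9177 asks for (the interval, output file, and sampled namespace are arbitrary choices, not from the issue):

```bash
# Sample node and istio-pilot CPU/memory usage every 30s while a suite runs.
while true; do
  date -u +%FT%TZ                  >> usage.log
  kubectl top nodes                >> usage.log
  kubectl -n istio-system top pods >> usage.log
  sleep 30
done
```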
AC