scylladb / scylla-cluster-tests

Tests for Scylla Clusters
GNU Affero General Public License v3.0

[GKE] dynamic loader is failing after ~10 hours #5646

Open · fruch opened this issue 1 year ago

Issue description

Loader pods are getting stopped after a long run duration:

< t:2023-01-05 00:49:24,614 f:kubernetes_cmd_runner.py l:405  c:sdcm.remote.kubernetes_cmd_runner p:WARNING > 'process_is_finished': stopping 'sct-loaders-0-pod-2' pod because stream to it cannot be established having alive pod.

It seems like the `POD_COUNTER_TO_LIVE = 300` limit is too low when we are running such a long test.

@vponomaryov until we can figure out why our log API streams are getting stopped that often in GKE, maybe we should raise this number higher? Maybe base it on the test duration? Maybe only for GKE? (See the sketch below.)
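For illustration, here is a minimal sketch of the duration-based idea. `POD_COUNTER_TO_LIVE` is the real constant referenced above, but the helper function and its parameters are hypothetical, not actual SCT code:

```python
# Hypothetical sketch only: POD_COUNTER_TO_LIVE is the constant named in the
# warning above; the function name and parameters are made up to illustrate
# deriving the limit from test duration instead of hard-coding 300.

POD_COUNTER_TO_LIVE = 300  # current hard-coded limit


def pod_counter_to_live(test_duration_minutes: int) -> int:
    """Allow one reconnect attempt per minute of test time, floored at the
    current default, so long runs get a proportionally larger budget."""
    return max(POD_COUNTER_TO_LIVE, test_duration_minutes)


# e.g. for the 12h longevity run: pod_counter_to_live(12 * 60) -> 720
```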

Installation details

Kernel Version: 5.15.0-1020-gke
Scylla version (or git commit hash): 2022.1.3-20220922.539a55e35 with build-id d1fb2faafd95058a04aad30b675ff7d2b930278d
Relocatable Package: http://downloads.scylladb.com/unstable/scylla-enterprise/enterprise-2022.1/relocatable/2022-09-22T13:36:03Z/scylla-enterprise-x86_64-package.tar.gz
Operator Image: scylladb/scylla-operator:1.8.0-rc.0
Operator Helm Version: 1.8.0-rc.0
Operator Helm Repository: https://storage.googleapis.com/scylla-operator-charts/latest
Cluster size: 4 nodes (n1-highmem-16)

Scylla Nodes used in this run: No resources left at the end of the run

OS / Image: N/A (k8s-gke: us-east1)

Test: longevity-scylla-operator-basic-12h-gke
Test id: 7a41565f-b96a-45f4-b0be-6aa3191808fd
Test name: scylla-operator/operator-1.8/gke/longevity-scylla-operator-basic-12h-gke
Test config file(s):

Logs and commands

- Restore Monitor Stack command: `$ hydra investigate show-monitor 7a41565f-b96a-45f4-b0be-6aa3191808fd`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=7a41565f-b96a-45f4-b0be-6aa3191808fd)
- Show all stored logs command: `$ hydra investigate show-logs 7a41565f-b96a-45f4-b0be-6aa3191808fd`

## Logs:

- **db-cluster-7a41565f.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/7a41565f-b96a-45f4-b0be-6aa3191808fd/20230105_013700/db-cluster-7a41565f.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/7a41565f-b96a-45f4-b0be-6aa3191808fd/20230105_013700/db-cluster-7a41565f.tar.gz)
- **sct-runner-7a41565f.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/7a41565f-b96a-45f4-b0be-6aa3191808fd/20230105_013700/sct-runner-7a41565f.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/7a41565f-b96a-45f4-b0be-6aa3191808fd/20230105_013700/sct-runner-7a41565f.tar.gz)
- **monitor-set-7a41565f.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/7a41565f-b96a-45f4-b0be-6aa3191808fd/20230105_013700/monitor-set-7a41565f.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/7a41565f-b96a-45f4-b0be-6aa3191808fd/20230105_013700/monitor-set-7a41565f.tar.gz)
- **loader-set-7a41565f.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/7a41565f-b96a-45f4-b0be-6aa3191808fd/20230105_013700/loader-set-7a41565f.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/7a41565f-b96a-45f4-b0be-6aa3191808fd/20230105_013700/loader-set-7a41565f.tar.gz)
- **kubernetes-7a41565f.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/7a41565f-b96a-45f4-b0be-6aa3191808fd/20230105_013700/kubernetes-7a41565f.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/7a41565f-b96a-45f4-b0be-6aa3191808fd/20230105_013700/kubernetes-7a41565f.tar.gz)

[Jenkins job URL](https://jenkins.scylladb.com/job/scylla-operator/job/operator-1.8/job/gke/job/longevity-scylla-operator-basic-12h-gke/2/)
fruch commented 1 year ago

One more thought:

We are using kubernetes==18.20.0; maybe we should update to a newer release? (Pure guessing, but the code that is failing again and again is using this package, with `Handshake status 400 Bad Request`.)
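
For context on that error: the kubernetes package opens pod exec streams over a websocket (via websocket-client), and a failed upgrade surfaces as `Handshake status 400 Bad Request`. Below is a minimal, hypothetical sketch of retrying around that failure; the wrapper, its parameters, and the namespace are assumptions, not the actual kubernetes_cmd_runner code:

```python
# Illustrative sketch: kubernetes==18.20.0 runs exec streams over a
# websocket, and a failed upgrade raises WebSocketBadStatusException
# ("Handshake status 400 Bad Request"). The retry wrapper and its
# parameters below are made up for illustration.
import time

from kubernetes import client, config
from kubernetes.stream import stream
from websocket import WebSocketBadStatusException


def exec_in_pod(name: str, namespace: str, command: list,
                retries: int = 5, delay: float = 2.0) -> str:
    """Run `command` in a pod, retrying if the websocket handshake fails."""
    core_v1 = client.CoreV1Api()
    for attempt in range(1, retries + 1):
        try:
            return stream(core_v1.connect_get_namespaced_pod_exec,
                          name, namespace, command=command,
                          stderr=True, stdin=False, stdout=True, tty=False)
        except WebSocketBadStatusException:
            if attempt == retries:
                raise
            time.sleep(delay * attempt)


config.load_kube_config()
# pod name taken from the log above; the namespace is an assumption
print(exec_in_pod("sct-loaders-0-pod-2", "default", ["uptime"]))
```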

vponomaryov commented 1 year ago

> One more thought:
>
> We are using kubernetes==18.20.0; maybe we should update to a newer release? (Pure guessing, but the code that is failing again and again is using this package, with `Handshake status 400 Bad Request`.)

When I was implementing this feature, the log streams in GKE were hanging pretty often, at an interval of about 5 minutes. So, if we multiply 5 minutes by the 300 coded attempts, we get much more than 12 hours (300 × 5 min = 1500 min = 25 h). So I think something else went wrong there that led to exceeding the attempt limit; for example, it could be an API rate limit. So, better to make it use a static loader after the s-b running possibility is fixed.
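
If the rate-limit hypothesis is right, reconnecting at a fixed interval would burn through the attempt counter as fast as the server refuses streams; an exponential backoff between attempts would both slow the counter and give the limit time to reset. A hypothetical sketch of that idea (all names here are illustrative, not SCT code):

```python
# Illustrative only: exponential backoff with jitter between stream
# reconnect attempts, so a temporary API rate limit cannot exhaust the
# attempt budget in a quick burst. Not the actual SCT implementation.
import random
import time


def reconnect_with_backoff(connect, max_attempts: int = 300,
                           base: float = 5.0, cap: float = 300.0):
    """Call `connect()` until it succeeds, sleeping 5s, 10s, 20s, ...
    (plus jitter, capped at 5 min) after each failure."""
    delay = base
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except Exception:  # e.g. handshake failure or rate-limit response
            if attempt == max_attempts:
                raise
            time.sleep(delay + random.uniform(0, delay / 2))
            delay = min(delay * 2, cap)
```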

We can update the K8S lib version, but I don't think that is the reason.