vponomaryov opened this issue 1 year ago
Faced an overly long loss of SNI connectivity in one more CI job, during the `drain_kubernetes_node_then_replace_scylla_node`
scenario in the 2-tenant setup.
Applied nemesis:
haproxy pod-1:
2023/08/17 14:44:53 TRACE service/endpoints.go:110 backend scylla_sct-cluster-us-east1-b-us-east1-1_cql-ssl: number of slots 1
2023/08/17 14:44:53 TRACE controller.go:171 HAProxy config sync ended
[WARNING] (433) : soft-stop running for too long, performing a hard-stop.
[WARNING] (433) : Proxy ssl hard-stopped (31 remaining conns will be closed).
[WARNING] (433) : Some tasks resisted to hard-stop, exiting now.
[WARNING] (267) : Former worker (433) exited with code 0 (Exit)
2023/08/17 14:49:58 TRACE controller.go:94 HAProxy config sync started
haproxy pod-2:
2023/08/17 14:44:53 TRACE controller.go:171 HAProxy config sync ended
[WARNING] (431) : soft-stop running for too long, performing a hard-stop.
[WARNING] (431) : Proxy ssl hard-stopped (27 remaining conns will be closed).
[WARNING] (431) : Some tasks resisted to hard-stop, exiting now.
[NOTICE] (267) : haproxy version is 2.6.6-274d1a4
[WARNING] (267) : Former worker (431) exited with code 0 (Exit)
2023/08/17 14:49:58 TRACE controller.go:94 HAProxy config sync started
SCT.log:
2023-08-17 14:48:33,581 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:CRITICAL > java.io.IOException: Operation x10 on key(s) [50393330353137333530]: Error executing: (NoHostAvailableException): All host(s) tried for query failed (no host was tried)
2023-08-17 14:48:33,589 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO > 2023-08-17 14:48:33.588: (InfoEvent Severity.NORMAL) period_type=not-set event_id=451c00f7-8533-42d3-90f5-ce69fb8cbba2: message=TEST_END
2023-08-17 14:48:33,591 f:tester.py l:2830 c:LongevityOperatorMultiTenantTest p:INFO > TearDown is starting...
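Note the controller-side gap visible in both haproxy pods: "HAProxy config sync ended" at 14:44:53 is followed by the next "HAProxy config sync started" only at 14:49:58, i.e. roughly 5 minutes with no sync, and the cql failure at 14:48:33 falls inside that window. Below is a minimal sketch for measuring such gaps from a dumped controller log (Python; the log file name is hypothetical, and it assumes the `YYYY/MM/DD HH:MM:SS TRACE ...` prefix shown above):

```python
import re
from datetime import datetime

# Matches the controller's trace lines, e.g.
# "2023/08/17 14:44:53 TRACE controller.go:171 HAProxy config sync ended"
TS_RE = re.compile(r"^(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) TRACE (.+)$")

def sync_gaps(path):
    """Yield (gap_seconds, ended_at, started_at) for every
    'config sync ended' -> 'config sync started' pair in the log."""
    ended_at = None
    with open(path) as log:
        for line in log:
            m = TS_RE.match(line)
            if not m:
                continue  # e.g. the [WARNING]/[NOTICE] worker lines
            ts = datetime.strptime(m.group(1), "%Y/%m/%d %H:%M:%S")
            if "HAProxy config sync ended" in m.group(2):
                ended_at = ts
            elif "HAProxy config sync started" in m.group(2) and ended_at:
                yield (ts - ended_at).total_seconds(), ended_at, ts
                ended_at = None

# Flag any sync pause longer than one minute ("haproxy-pod-1.log" is a
# hypothetical dump of the controller log shown above).
for gap, ended, started in sync_gaps("haproxy-pod-1.log"):
    if gap > 60:
        print(f"config sync paused for {gap:.0f}s: {ended} -> {started}")
```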
Kernel Version: 5.10.184-175.749.amzn2.x86_64
Scylla version (or git commit hash): 2023.1.0~rc8-20230731.b6f7c5a6910c
with build-id f6e718548e76ccf3564ed2387b6582ba8d37793c
Operator Image: scylladb/scylla-operator:1.10.0-rc.0
Operator Helm Version: 1.10.0-rc.0
Operator Helm Repository: https://storage.googleapis.com/scylla-operator-charts/latest
Cluster size: 4 nodes (i4i.4xlarge)
Scylla Nodes used in this run: No resources left at the end of the run
OS / Image: `` (k8s-eks: undefined_region)
Test: longevity-scylla-operator-3h-multitenant-eks
Test id: b6c6963a-a019-44dd-b8cc-97aea5bdc31f
Test name: scylla-operator/operator-1.10/eks/longevity-scylla-operator-3h-multitenant-eks
Test config file(s):
Faced an overly long loss of SNI connectivity in one more CI job, during the `disrupt_grow_shrink_cluster`
scenario.
Applied nemesis:
haproxy pod-1:
2023/08/17 19:10:03 TRACE service/endpoints.go:110 backend scylla_sct-cluster-us-east1-b-us-east1-3_cql-ssl: number of slots 1
2023/08/17 19:10:03 TRACE controller.go:171 HAProxy config sync ended
2023/08/17 19:15:37 TRACE store/events.go:98 Treating endpoints event {SliceName: Namespace:scylla Service:sct-cluster-client Ports:map[agent-api:0xc000ae82b0 agent-prometheus:0xc000ae8280 cql:0xc000ae8220 cql-shard-aware:0xc000ae8290 cql-ssl:0xc000ae8270 cql-ssl-shard-aware:0xc000ae8230 inter-node-communication:0xc000ae8250 jmx-monitoring:0xc000ae82a0 node-exporter:0xc000ae82d0 prometheus:0xc000ae8240 ssl-inter-node-communication:0xc000ae8260 thrift:0xc000ae82c0] Status:MODIFIED}
2023/08/17 19:15:37 TRACE store/events.go:102 service sct-cluster-client : endpoints list map[agent-api:{Port:10001 Addresses:map[10.0.4.71:{} 10.0.5.47:{} 10.0.6.129:{} 10.0.6.153:{}]} agent-prometheus:{Port:5090 Addresses:map[10.0.4.71:{} 10.0.5.47:{} 10.0.6.129:{} 10.0.6.153:{}]} cql:{Port:9042 Addresses:map[10.0.4.71:{} 10.0.5.47:{} 10.0.6.129:{} 10.0.6.153:{}]} cql-shard-aware:{Port:19042 Addresses:map[10.0.4.71:{} 10.0.5.47:{} 10.0.6.129:{} 10.0.6.153:{}]} cql-ssl:{Port:9142 Addresses:map[10.0.4.71:{} 10.0.5.47:{} 10.0.6.129:{} 10.0.6.153:{}]} cql-ssl-shard-aware:{Port:19142 Addresses:map[10.0.4.71:{} 10.0.5.47:{} 10.0.6.129:{} 10.0.6.153:{}]} inter-node-communication:{Port:7000 Addresses:map[10.0.4.71:{} 10.0.5.47:{} 10.0.6.129:{} 10.0.6.153:{}]} jmx-monitoring:{Port:7199 Addresses:map[10.0.4.71:{} 10.0.5.47:{} 10.0.6.129:{} 10.0.6.153:{}]} node-exporter:{Port:9100 Addresses:map[10.0.4.71:{} 10.0.5.47:{} 10.0.6.129:{} 10.0.6.153:{}]} prometheus:{Port:9180 Addresses:map[10.0.4.71:{} 10.0.5.47:{} 10.0.6.129:{} 10.0.6.153:{}]} ssl-inter-node-communication:{Port:7001 Addresses:map[10.0.4.71:{} 10.0.5.47:{} 10.0.6.129:{} 10.0.6.153:{}]} thrift:{Port:9160 Addresses:map[10.0.4.71:{} 10.0.5.47:{} 10.0.6.129:{} 10.0.6.153:{}]}]
2023/08/17 19:15:37 TRACE store/events.go:107 service sct-cluster-client : number of already existing backend(s) in this transaction for this endpoint: 12
haproxy pod-2:
2023/08/17 19:10:03 TRACE service/endpoints.go:110 backend scylla_sct-cluster-us-east1-b-us-east1-3_cql-ssl: number of slots 1
2023/08/17 19:10:03 TRACE controller.go:171 HAProxy config sync ended
2023/08/17 19:15:37 TRACE store/events.go:98 Treating endpoints event {SliceName:sct-cluster-client-lt8jf Namespace:scylla Service:sct-cluster-client Ports:map[agent-api:0xc000683000 agent-prometheus:0xc000683010 cql:0xc000683020 cql-shard-aware:0xc000683080 cql-ssl:0xc0006830a0 cql-ssl-shard-aware:0xc000683060 inter-node-communication:0xc000683090 jmx-monitoring:0xc000683030 node-exporter:0xc000683050 prometheus:0xc0006830b0 ssl-inter-node-communication:0xc000683040 thrift:0xc000683070] Status:MODIFIED}
2023/08/17 19:15:37 TRACE store/events.go:102 service sct-cluster-client : endpoints list map[agent-api:{Port:10001 Addresses:map[10.0.4.71:{} 10.0.5.47:{} 10.0.6.129:{} 10.0.6.153:{}]} agent-prometheus:{Port:5090 Addresses:map[10.0.4.71:{} 10.0.5.47:{} 10.0.6.129:{} 10.0.6.153:{}]} cql:{Port:9042 Addresses:map[10.0.4.71:{} 10.0.5.47:{} 10.0.6.129:{} 10.0.6.153:{}]} cql-shard-aware:{Port:19042 Addresses:map[10.0.4.71:{} 10.0.5.47:{} 10.0.6.129:{} 10.0.6.153:{}]} cql-ssl:{Port:9142 Addresses:map[10.0.4.71:{} 10.0.5.47:{} 10.0.6.129:{} 10.0.6.153:{}]} cql-ssl-shard-aware:{Port:19142 Addresses:map[10.0.4.71:{} 10.0.5.47:{} 10.0.6.129:{} 10.0.6.153:{}]} inter-node-communication:{Port:7000 Addresses:map[10.0.4.71:{} 10.0.5.47:{} 10.0.6.129:{} 10.0.6.153:{}]} jmx-monitoring:{Port:7199 Addresses:map[10.0.4.71:{} 10.0.5.47:{} 10.0.6.129:{} 10.0.6.153:{}]} node-exporter:{Port:9100 Addresses:map[10.0.4.71:{} 10.0.5.47:{} 10.0.6.129:{} 10.0.6.153:{}]} prometheus:{Port:9180 Addresses:map[10.0.4.71:{} 10.0.5.47:{} 10.0.6.129:{} 10.0.6.153:{}]} ssl-inter-node-communication:{Port:7001 Addresses:map[10.0.4.71:{} 10.0.5.47:{} 10.0.6.129:{} 10.0.6.153:{}]} thrift:{Port:9160 Addresses:map[10.0.4.71:{} 10.0.5.47:{} 10.0.6.129:{} 10.0.6.153:{}]}]
2023/08/17 19:15:37 TRACE store/events.go:107 service sct-cluster-client : number of already existing backend(s) in this transaction for this endpoint: 12
SCT.log:
2023-08-17 19:16:33,601 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:CRITICAL > java.io.IOException: Operation x10 on key(s) [38504c4c333950343131]: Error executing: (NoHostAvailableException): All host(s) tried for query failed (no host was tried)
2023-08-17 19:16:33,606 f:tester.py l:2830 c:LongevityTest p:INFO > TearDown is starting...
Kernel Version: 5.10.184-175.749.amzn2.x86_64
Scylla version (or git commit hash): 2023.1.0-20230813.68e9cef1baf7
with build-id c7f9855620b984af24957d7ab0bd8054306d182e
Operator Image: scylladb/scylla-operator:1.10.0-rc.0
Operator Helm Version: 1.10.0-rc.0
Operator Helm Repository: https://storage.googleapis.com/scylla-operator-charts/latest
Cluster size: 4 nodes (i4i.4xlarge)
Scylla Nodes used in this run: No resources left at the end of the run
OS / Image: `` (k8s-eks: undefined_region)
Test: longevity-scylla-operator-3h-eks-grow-shrink
Test id: 9dec2665-fd0e-4cba-8bcc-60d7b23ecd00
Test name: scylla-operator/operator-1.10/eks/longevity-scylla-operator-3h-eks-grow-shrink
Test config file(s):
Reported an issue against the HAProxy kubernetes-ingress controller: https://github.com/haproxytech/kubernetes-ingress/issues/564
The Scylla Operator project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- lifecycle/stale is applied
- after lifecycle/stale was applied, lifecycle/rotten is applied
- after lifecycle/rotten was applied, the issue is closed

You can:
- /remove-lifecycle stale
- /close

/lifecycle stale
The Scylla Operator project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- lifecycle/stale is applied
- after lifecycle/stale was applied, lifecycle/rotten is applied
- after lifecycle/rotten was applied, the issue is closed

You can:
- /remove-lifecycle rotten
- /close

/lifecycle rotten
Issue description
We have a test for upgrading the K8S platform in several steps, which involves the `haproxy` and `scylla-operator` pods. Each of these services has 2 pods, provisioned on different nodes.
So, during step 6, where we upgrade the `auxiliary` node pool which hosts the haproxy pods, our loaders lose connectivity for a long time, long enough to fail the load.
Notes
Not reproduced with scylla-operator v1.9.0 and everything else the same. Proof: Argus, CI
Impact
Loss of network connectivity to the Scylla pods via SNI/haproxy for a significant amount of time.
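To put a number on "significant amount of time", a tiny SNI probe can be run from the loader side against the haproxy ingress while the `auxiliary` node pool is being upgraded. A minimal sketch (Python, standard library only); the ingress hostname and SNI server name below are hypothetical placeholders, only the 9142 `cql-ssl` port is taken from the endpoints dump above:

```python
import socket
import ssl
import time
from datetime import datetime

# Hypothetical placeholders: the haproxy ingress LB hostname and the SNI name
# that the controller routes to one node's cql-ssl backend.
INGRESS_HOST = "haproxy-ingress.example.com"
SNI_NAME = "sct-cluster-us-east1-b-us-east1-1.scylla.example.com"
PORT = 9142  # cql-ssl, per the endpoints dump above

ctx = ssl.create_default_context()
ctx.check_hostname = False      # reachability probe only, not a trust check
ctx.verify_mode = ssl.CERT_NONE

outage_started = None
while True:
    try:
        with socket.create_connection((INGRESS_HOST, PORT), timeout=5) as sock:
            # A successful TLS handshake means haproxy still resolves this SNI
            # name to a live backend.
            with ctx.wrap_socket(sock, server_hostname=SNI_NAME):
                pass
        if outage_started is not None:
            print(f"{datetime.now()} SNI route restored after "
                  f"{time.time() - outage_started:.0f}s")
            outage_started = None
    except OSError as exc:  # covers socket timeouts and TLS failures
        if outage_started is None:
            outage_started = time.time()
            print(f"{datetime.now()} SNI route down: {exc}")
    time.sleep(1)
```

The probe deliberately skips certificate verification and the CQL handshake; it only answers whether the SNI route through haproxy is reachable at all, which is enough to bound the outage window the loaders experience.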
How frequently does it reproduce?
100% using scylla-operator `1.10.0-rc.0`
Installation details
Kernel Version: 5.10.186-179.751.amzn2.x86_64
Scylla version (or git commit hash): `2023.1.0~rc8-20230731.b6f7c5a6910c` with build-id `f6e718548e76ccf3564ed2387b6582ba8d37793c`
Operator Image: scylladb/scylla-operator:1.10.0-rc.0
Operator Helm Version: 1.10.0-rc.0
Operator Helm Repository: https://storage.googleapis.com/scylla-operator-charts/latest
Cluster size: 3 pods (i4i.4xlarge)
OS / Image: `` (k8s-eks: `eu-north-1`)
Test: `upgrade-platform-k8s-eks`
Test id: `379e7cd3-3b74-4f39-bb7f-b561a8251126`
Test name: `scylla-operator/operator-1.10/upgrade/upgrade-platform-k8s-eks`
Test config file(s):
Logs and commands
- Restore Monitor Stack command: `$ hydra investigate show-monitor 379e7cd3-3b74-4f39-bb7f-b561a8251126`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=379e7cd3-3b74-4f39-bb7f-b561a8251126)
- Show all stored logs command: `$ hydra investigate show-logs 379e7cd3-3b74-4f39-bb7f-b561a8251126`

## Logs:
- **kubernetes-379e7cd3.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/379e7cd3-3b74-4f39-bb7f-b561a8251126/20230818_093720/kubernetes-379e7cd3.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/379e7cd3-3b74-4f39-bb7f-b561a8251126/20230818_093720/kubernetes-379e7cd3.tar.gz)
- **db-cluster-379e7cd3.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/379e7cd3-3b74-4f39-bb7f-b561a8251126/20230818_093720/db-cluster-379e7cd3.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/379e7cd3-3b74-4f39-bb7f-b561a8251126/20230818_093720/db-cluster-379e7cd3.tar.gz)
- **sct-runner-events-379e7cd3.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/379e7cd3-3b74-4f39-bb7f-b561a8251126/20230818_093720/sct-runner-events-379e7cd3.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/379e7cd3-3b74-4f39-bb7f-b561a8251126/20230818_093720/sct-runner-events-379e7cd3.tar.gz)
- **sct-379e7cd3.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/379e7cd3-3b74-4f39-bb7f-b561a8251126/20230818_093720/sct-379e7cd3.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/379e7cd3-3b74-4f39-bb7f-b561a8251126/20230818_093720/sct-379e7cd3.log.tar.gz)
- **loader-set-379e7cd3.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/379e7cd3-3b74-4f39-bb7f-b561a8251126/20230818_093720/loader-set-379e7cd3.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/379e7cd3-3b74-4f39-bb7f-b561a8251126/20230818_093720/loader-set-379e7cd3.tar.gz)
- **monitor-set-379e7cd3.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/379e7cd3-3b74-4f39-bb7f-b561a8251126/20230818_093720/monitor-set-379e7cd3.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/379e7cd3-3b74-4f39-bb7f-b561a8251126/20230818_093720/monitor-set-379e7cd3.tar.gz)

[Jenkins job URL](https://jenkins.scylladb.com/job/scylla-operator/job/operator-1.10/job/upgrade/job/upgrade-platform-k8s-eks/3/)
[Argus](https://argus.scylladb.com/test/547d14e4-9c22-4e1e-a765-bda60dc40a2e/runs?additionalRuns[]=379e7cd3-3b74-4f39-bb7f-b561a8251126)