scylladb / scylla-operator

The Kubernetes Operator for ScyllaDB
https://operator.docs.scylladb.com/
Apache License 2.0

sni_proxy closes all CQL connections at the same time #1171

Closed fruch closed 1 year ago

fruch commented 1 year ago

Issue description

While running a test with 14 tenants using sni_proxy, after ~40 min we run into a case where all of the CQL connections (of all tenants) get closed at the same time.

From the haproxy log we see this:

[WARNING]  (820) : soft-stop running for too long, performing a hard-stop.
[WARNING]  (820) : Proxy ssl hard-stopped (146 remaining conns will be closed).
[WARNING]  (820) : Some tasks resisted to hard-stop, exiting now.
[NOTICE]   (267) : haproxy version is 2.6.6-274d1a4
[WARNING]  (267) : Former worker (820) exited with code 0 (Exit)

Impact

This renders all of the tenants useless. A CQL driver can handle 1-2 connections getting closed, but none of the tools expect all of the connections to be closed at the same time.

How frequently does it reproduce?

This happens on every run we do.

Installation details

Kernel Version: 5.4.228-131.415.amzn2.x86_64
Scylla version (or git commit hash): 5.3.0~dev-20230214.2653865b34d8 with build-id eb6fb0dc2a97faec591d4020d9c3671de48b2436

Operator Image: scylladb/scylla-operator:latest
Operator Helm Version: v1.9.0-alpha.1-13-gc6a6e05
Operator Helm Repository: https://storage.googleapis.com/scylla-operator-charts/latest
Cluster size: 4 nodes (i4i.4xlarge)

Scylla Nodes used in this run: No resources left at the end of the run

OS / Image: `` (k8s-eks: eu-central-1)

Test: longevity-scylla-operator-12h-multitenant-eks
Test id: 3720410a-7757-4d41-89af-320eae9656b1
Test name: scylla-staging/fruch/longevity-scylla-operator-12h-multitenant-eks
Test config file(s):

Logs and commands

- Restore Monitor Stack command: `$ hydra investigate show-monitor 3720410a-7757-4d41-89af-320eae9656b1`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=3720410a-7757-4d41-89af-320eae9656b1)
- Show all stored logs command: `$ hydra investigate show-logs 3720410a-7757-4d41-89af-320eae9656b1`

Logs:

- **db-cluster-3720410a.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/3720410a-7757-4d41-89af-320eae9656b1/20230215_081051/db-cluster-3720410a.tar.gz
- **scylla-10_cluster_events-3720410a.log.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/3720410a-7757-4d41-89af-320eae9656b1/20230215_081051/scylla-10_cluster_events-3720410a.log.tar.gz
- **scylla-5_cluster_events-3720410a.log.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/3720410a-7757-4d41-89af-320eae9656b1/20230215_081051/scylla-5_cluster_events-3720410a.log.tar.gz
- **output-3720410a.log.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/3720410a-7757-4d41-89af-320eae9656b1/20230215_081051/output-3720410a.log.tar.gz
- **scylla_cluster_events-3720410a.log.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/3720410a-7757-4d41-89af-320eae9656b1/20230215_081051/scylla_cluster_events-3720410a.log.tar.gz
- **debug-3720410a.log.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/3720410a-7757-4d41-89af-320eae9656b1/20230215_081051/debug-3720410a.log.tar.gz
- **events-3720410a.log.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/3720410a-7757-4d41-89af-320eae9656b1/20230215_081051/events-3720410a.log.tar.gz
- **scylla-12_cluster_events-3720410a.log.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/3720410a-7757-4d41-89af-320eae9656b1/20230215_081051/scylla-12_cluster_events-3720410a.log.tar.gz
- **scylla-7_cluster_events-3720410a.log.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/3720410a-7757-4d41-89af-320eae9656b1/20230215_081051/scylla-7_cluster_events-3720410a.log.tar.gz
- **sct-3720410a.log.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/3720410a-7757-4d41-89af-320eae9656b1/20230215_081051/sct-3720410a.log.tar.gz
- **scylla-9_cluster_events-3720410a.log.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/3720410a-7757-4d41-89af-320eae9656b1/20230215_081051/scylla-9_cluster_events-3720410a.log.tar.gz
- **normal-3720410a.log.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/3720410a-7757-4d41-89af-320eae9656b1/20230215_081051/normal-3720410a.log.tar.gz
- **argus-3720410a.log.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/3720410a-7757-4d41-89af-320eae9656b1/20230215_081051/argus-3720410a.log.tar.gz
- **raw_events-3720410a.log.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/3720410a-7757-4d41-89af-320eae9656b1/20230215_081051/raw_events-3720410a.log.tar.gz
- **scylla-2_cluster_events-3720410a.log.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/3720410a-7757-4d41-89af-320eae9656b1/20230215_081051/scylla-2_cluster_events-3720410a.log.tar.gz
- **scylla-13_cluster_events-3720410a.log.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/3720410a-7757-4d41-89af-320eae9656b1/20230215_081051/scylla-13_cluster_events-3720410a.log.tar.gz
- **critical-3720410a.log.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/3720410a-7757-4d41-89af-320eae9656b1/20230215_081051/critical-3720410a.log.tar.gz
- **scylla-11_cluster_events-3720410a.log.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/3720410a-7757-4d41-89af-320eae9656b1/20230215_081051/scylla-11_cluster_events-3720410a.log.tar.gz
- **scylla-4_cluster_events-3720410a.log.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/3720410a-7757-4d41-89af-320eae9656b1/20230215_081051/scylla-4_cluster_events-3720410a.log.tar.gz
- **scylla-3_cluster_events-3720410a.log.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/3720410a-7757-4d41-89af-320eae9656b1/20230215_081051/scylla-3_cluster_events-3720410a.log.tar.gz
- **warning-3720410a.log.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/3720410a-7757-4d41-89af-320eae9656b1/20230215_081051/warning-3720410a.log.tar.gz
- **summary-3720410a.log.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/3720410a-7757-4d41-89af-320eae9656b1/20230215_081051/summary-3720410a.log.tar.gz
- **scylla-8_cluster_events-3720410a.log.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/3720410a-7757-4d41-89af-320eae9656b1/20230215_081051/scylla-8_cluster_events-3720410a.log.tar.gz
- **scylla-14_cluster_events-3720410a.log.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/3720410a-7757-4d41-89af-320eae9656b1/20230215_081051/scylla-14_cluster_events-3720410a.log.tar.gz
- **error-3720410a.log.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/3720410a-7757-4d41-89af-320eae9656b1/20230215_081051/error-3720410a.log.tar.gz
- **scylla-6_cluster_events-3720410a.log.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/3720410a-7757-4d41-89af-320eae9656b1/20230215_081051/scylla-6_cluster_events-3720410a.log.tar.gz
- **monitor-set-3720410a.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/3720410a-7757-4d41-89af-320eae9656b1/20230215_081051/monitor-set-3720410a.tar.gz
- **loader-set-3720410a.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/3720410a-7757-4d41-89af-320eae9656b1/20230215_081051/loader-set-3720410a.tar.gz
- **kubernetes-3720410a.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/3720410a-7757-4d41-89af-320eae9656b1/20230215_081051/kubernetes-3720410a.tar.gz

[Jenkins job URL](https://jenkins.scylladb.com/job/scylla-staging/job/fruch/job/longevity-scylla-operator-12h-multitenant-eks/14/)
fruch commented 1 year ago

I'm trying this https://github.com/scylladb/scylla-cluster-tests/pull/5783/commits/b3078860de48ceadc5463a9bb132a0709dc64ea7 to configure hard-stop-after to a longer period.

I can report that with 5 min those disconnections went away, and it was stable for 12h: https://jenkins.scylladb.com/job/scylla-staging/job/fruch/job/longevity-scylla-operator-12h-multitenant-eks/30/

fruch commented 1 year ago

@zimnx and I were trying to figure out where the default of that value comes from.

One suggestion was https://github.com/haproxytech/kubernetes-ingress/blob/c38aa87d9d03d3f8522749ca49363ff90b81eda0/pkg/haproxy/env/defaults.go#L68, which says it's 30m, but it's not clear whether it is actually used.

We need to confirm against a running instance what the configuration actually says.

zimnx commented 1 year ago

The default value is indeed 30mins.

I found one weird thing in your configuration while browsing through the logs and the haproxy controller logs. ~30 mins (the default hard-stop-after) before the hard-stop which caused the connection drop, there are the following logs:

2023/03/22 17:46:07 TRACE   service/endpoints.go:110 backend scylla_sct-cluster-us-east1-b-us-east1-0_cql-ssl: number of slots 1
2023/03/22 17:46:07 DEBUG   service/endpoints.go:123 Server slots in backend 'scylla_sct-cluster-us-east1-b-us-east1-0_cql-ssl' scaled to match scale-server-slots value: 1, reload required

Then the haproxy process gets a restart command, and after 30 mins it is hard-stopped, dropping the connections. The default "scale-server-slots" is 42. You set it explicitly to 1 on the Ingress objects via annotations in the ScyllaCluster:

  exposeOptions:
    cql:
      ingress:
        annotations:
          haproxy.org/scale-server-slots: "1"
          haproxy.org/ssl-passthrough: "true"

I think it might be causing the restart, which gets stuck for some reason. If you didn't set it, haproxy wouldn't be reloaded.

fruch commented 1 year ago

Seems like we introduced it in SCT based on some unfinished work in https://github.com/scylladb/scylla-operator/pull/1076

I'm removing it and trying it again: https://jenkins.scylladb.com/job/scylla-staging/job/fruch/job/longevity-scylla-operator-12h-multitenant-eks/31/

fruch commented 1 year ago

Seems like we introduced it in SCT based on some unfinished work in #1076

I'm removing it and trying it again: https://jenkins.scylladb.com/job/scylla-staging/job/fruch/job/longevity-scylla-operator-12h-multitenant-eks/31/

@zimnx, removing only the scale-server-slots didn't seem to work.

Installation details

Kernel Version: 5.10.173-154.642.amzn2.x86_64
Scylla version (or git commit hash): 5.3.0~dev-20230329.6525209983d1 with build-id da8cde2a3d8c048a3a15dfd19fd14dd535fec6d1

Operator Image: scylladb/scylla-operator:latest
Operator Helm Version: v1.9.0-alpha.2-8-gee48da7
Operator Helm Repository: https://storage.googleapis.com/scylla-operator-charts/latest
Cluster size: 4 nodes (i4i.4xlarge)

Scylla Nodes used in this run: No resources left at the end of the run

OS / Image: `` (k8s-eks: eu-central-1)

Test: longevity-scylla-operator-12h-multitenant-eks
Test id: de543eec-e26a-47b8-9d1f-6212afce6e85
Test name: scylla-staging/fruch/longevity-scylla-operator-12h-multitenant-eks
Test config file(s):

Logs and commands

- Restore Monitor Stack command: `$ hydra investigate show-monitor de543eec-e26a-47b8-9d1f-6212afce6e85`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=de543eec-e26a-47b8-9d1f-6212afce6e85)
- Show all stored logs command: `$ hydra investigate show-logs de543eec-e26a-47b8-9d1f-6212afce6e85`

Logs:

- **db-cluster-de543eec.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/de543eec-e26a-47b8-9d1f-6212afce6e85/20230403_142237/db-cluster-de543eec.tar.gz
- **sct-runner-events-de543eec.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/de543eec-e26a-47b8-9d1f-6212afce6e85/20230403_142237/sct-runner-events-de543eec.tar.gz
- **sct-de543eec.log.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/de543eec-e26a-47b8-9d1f-6212afce6e85/20230403_142237/sct-de543eec.log.tar.gz
- **monitor-set-de543eec.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/de543eec-e26a-47b8-9d1f-6212afce6e85/20230403_142237/monitor-set-de543eec.tar.gz
- **loader-set-de543eec.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/de543eec-e26a-47b8-9d1f-6212afce6e85/20230403_142237/loader-set-de543eec.tar.gz
- **kubernetes-de543eec.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/de543eec-e26a-47b8-9d1f-6212afce6e85/20230403_142237/kubernetes-de543eec.tar.gz

[Jenkins job URL](https://jenkins.scylladb.com/job/scylla-staging/job/fruch/job/longevity-scylla-operator-12h-multitenant-eks/31/)
fruch commented 1 year ago

@zimnx @tnozicka

I've added a log, so now we have the full haproxy log, and removed scale-server-slots: "1".

We are still seeing this issue during a rolling restart of one of the clusters (1 out of 14):

2023/05/15 00:22:52 TRACE   controller.go:171 HAProxy config sync ended
[WARNING]  (686) : soft-stop running for too long, performing a hard-stop.
[WARNING]  (686) : Proxy ssl hard-stopped (105 remaining conns will be closed).
[WARNING]  (679) : soft-stop running for too long, performing a hard-stop.
[WARNING]  (679) : Proxy ssl hard-stopped (119 remaining conns will be closed).
[WARNING]  (686) : Some tasks resisted to hard-stop, exiting now.
[NOTICE]   (266) : haproxy version is 2.6.6-274d1a4
[WARNING]  (266) : Former worker (686) exited with code 0 (Exit)
[WARNING]  (679) : Some tasks resisted to hard-stop, exiting now.
[NOTICE]   (268) : haproxy version is 2.6.6-274d1a4
[WARNING]  (268) : Former worker (679) exited with code 0 (Exit)
[WARNING]  (813) : Server scylla-4_sct-cluster-4-us-east1-b-us-east1-2_cql-ssl/SRV_1 is UP, reason: Layer4 check passed, check duration: 0ms. 1 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
[WARNING]  (801) : Server scylla-4_sct-cluster-4-us-east1-b-us-east1-2_cql-ssl/SRV_1 is UP, reason: Layer4 check passed, check duration: 0ms. 1 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.

Installation details

Kernel Version: 5.10.178-162.673.amzn2.x86_64
Scylla version (or git commit hash): 5.3.0~dev-20230512.7fcc4031229b with build-id d6f9b433d295cf0420d28abedc89ff756eb0b75e

Operator Image: scylladb/scylla-operator:latest
Operator Helm Version: v1.9.0-alpha.3-5-g34369da
Operator Helm Repository: https://storage.googleapis.com/scylla-operator-charts/latest
Cluster size: 4 nodes (i4i.4xlarge)

Scylla Nodes used in this run: No resources left at the end of the run

OS / Image: `` (k8s-eks: eu-north-1)

Test: longevity-scylla-operator-12h-multitenant-eks
Test id: 75d73bb9-e15f-4955-a016-c36272dd91f1
Test name: scylla-staging/fruch/longevity-scylla-operator-12h-multitenant-eks
Test config file(s):

Logs and commands

- Restore Monitor Stack command: `$ hydra investigate show-monitor 75d73bb9-e15f-4955-a016-c36272dd91f1`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=75d73bb9-e15f-4955-a016-c36272dd91f1)
- Show all stored logs command: `$ hydra investigate show-logs 75d73bb9-e15f-4955-a016-c36272dd91f1`

Logs:

- **db-cluster-75d73bb9.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/75d73bb9-e15f-4955-a016-c36272dd91f1/20230515_014759/db-cluster-75d73bb9.tar.gz
- **sct-runner-events-75d73bb9.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/75d73bb9-e15f-4955-a016-c36272dd91f1/20230515_014759/sct-runner-events-75d73bb9.tar.gz
- **sct-75d73bb9.log.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/75d73bb9-e15f-4955-a016-c36272dd91f1/20230515_014759/sct-75d73bb9.log.tar.gz
- **monitor-set-75d73bb9.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/75d73bb9-e15f-4955-a016-c36272dd91f1/20230515_014759/monitor-set-75d73bb9.tar.gz
- **loader-set-75d73bb9.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/75d73bb9-e15f-4955-a016-c36272dd91f1/20230515_014759/loader-set-75d73bb9.tar.gz
- **kubernetes-75d73bb9.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/75d73bb9-e15f-4955-a016-c36272dd91f1/20230515_014759/kubernetes-75d73bb9.tar.gz

[Jenkins job URL](https://jenkins.scylladb.com/job/scylla-staging/job/fruch/job/longevity-scylla-operator-12h-multitenant-eks/49/)
tnozicka commented 1 year ago

fyi, haproxy bump has landed https://github.com/scylladb/scylla-operator/pull/1235 - I wonder whether it changes something or not

fruch commented 1 year ago

fyi, haproxy bump has landed #1235 - I wonder whether it changes something or not

I could try it out, but it sounds like a longshot.

zimnx commented 1 year ago

I talked about our issue with a haproxy dev, and apparently haproxy doesn't support moving existing connections to a new process. They only support moving socket listeners. The process that is being replaced on reload is killed either when all existing connections are done or after the configurable hard-stop-after timeout (30 min by default). Driver connections are long-lived; I'm not even sure whether idle connections are closed at all.

We could extend the timeout to some high value; then haproxy would keep adding more processes when a reload is required, and new connections would connect to the newest one. Eventually 'stale' processes would be killed. But having lots of these processes would increase memory consumption; by how much, we would need to measure.

Unfortunately a haproxy reload is required whenever an Ingress is added/removed, as a new backend is added to the configuration file, and this requires a reload.

fruch commented 1 year ago

I talked about our issue with a haproxy dev, and apparently haproxy doesn't support moving existing connections to a new process. They only support moving socket listeners. The process that is being replaced on reload is killed either when all existing connections are done or after the configurable hard-stop-after timeout (30 min by default). Driver connections are long-lived; I'm not even sure whether idle connections are closed at all.

We could extend the timeout to some high value; then haproxy would keep adding more processes when a reload is required, and new connections would connect to the newest one. Eventually 'stale' processes would be killed. But having lots of these processes would increase memory consumption; by how much, we would need to measure.

Unfortunately a haproxy reload is required whenever an Ingress is added/removed, as a new backend is added to the configuration file, and this requires a reload.

I think the major issue, regardless of which timeout we set, is that all the connections would be lost at once, and applications using a CQL driver can't manage this situation (i.e. none of our tooling does: cassandra-stress, scylla-bench, ycsb and gemini).

Do we have a way to control how the reload happens? I.e. if we have 10 instances of haproxy, does the reload happen on all of them at the same time?

We need to find some way to avoid closing all of a client's connections at the same time.

Do we know how Cassandra does it?

tnozicka commented 1 year ago

Long-lived connections have to account for being terminated at some point, even with rolling restarts (pod by pod). So there is usually a graceful timeout of 60 seconds or so, and then the connection is closed and the client needs to reestablish it. I don't think it makes sense to try to stretch that time artificially high; we don't want to wait 30 minutes to restart a single pod in a set. I'd assume that when the client connection is closed, a new one should be reestablished, and drivers should handle new calls with a new connection.

fruch commented 1 year ago

Long-lived connections have to account for being terminated at some point, even with rolling restarts (pod by pod). So there is usually a graceful timeout of 60 seconds or so, and then the connection is closed and the client needs to reestablish it. I don't think it makes sense to try to stretch that time artificially high; we don't want to wait 30 minutes to restart a single pod in a set. I'd assume that when the client connection is closed, a new one should be reestablished, and drivers should handle new calls with a new connection.

We are talking about restarts of haproxy, and that currently affects all the nodes and all the clusters at the exact same time, because of one pod being restarted/deleted.

no current CQL driver can handle it gracefully.

mykaul commented 1 year ago

In future deployments, it may make sense to have more than one SNI proxy - even per AZ, and for the driver to go through both, for load balancing and high availability.

tnozicka commented 1 year ago

In production there will be a haproxy setup for each AZ, but I don't think it makes a difference, as the reload happens on all proxies at once because of Ingress changes. Even without that, for rolling restarts you may be unlucky enough that all your connections hit the same proxy with round-robin or random balancing policies - with scale the likelihood is lower.

no current CQL driver can handle it gracefully.

should it? (if you don't have an active connection, create one on demand)

mykaul commented 1 year ago

In production there will be a haproxy setup for each AZ, but I don't think it makes a difference, as the reload happens on all proxies at once because of Ingress changes. Even without that, for rolling restarts you may be unlucky enough that all your connections hit the same proxy with round-robin or random balancing policies - with scale the likelihood is lower.

no current CQL driver can handle it gracefully.

should it? (if you don't have an active connection, create one on demand)

I thought so too...

tnozicka commented 1 year ago

Reload at once kinda ruins that.

Does graceful reload ruin it as well?

Graceful reload:

  1. Create a new pod/worker
  2. Old pod/worker stops accepting connections, but keeps serving the old ones
  3. shutdown timeout, say 90s, for the LB to notice it (slightly bigger than the LB probe cycle)
  4. Old pod/worker closes connections and terminates
  5. Only new worker serves connections

In case you are reloading "in process" (same IP+port), step 3 can be skipped. But at any point, a new connection (retry or not) should always succeed.
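
For illustration, a minimal Go sketch of the drain idea behind steps 2-4, assuming a plain TCP proxy worker (hypothetical `worker` type; haproxy's actual reload machinery is more involved):

```go
// Minimal drain sketch for a hypothetical TCP proxy worker; an illustration
// of the graceful-reload steps above, not haproxy's implementation.
package proxy

import (
	"net"
	"sync"
	"time"
)

type worker struct {
	ln    net.Listener
	mu    sync.Mutex
	conns map[net.Conn]struct{}
}

// drain implements steps 2-4: stop accepting, keep serving existing
// connections for a grace period, then close whatever is left.
func (w *worker) drain(grace time.Duration) {
	// Step 2: stop accepting new connections; a freshly started worker
	// already listens on the same address and takes the new ones.
	w.ln.Close()

	// Step 3: grace period so the load balancer and clients notice the old
	// worker going away and reestablish against the new one.
	time.Sleep(grace)

	// Step 4: forcibly close the remaining long-lived connections and exit.
	w.mu.Lock()
	defer w.mu.Unlock()
	for c := range w.conns {
		c.Close()
	}
}
```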

fruch commented 1 year ago

Reload at once kinda ruins that.

Does graceful reload ruin it as well?

Graceful reload:

  1. Create a new pod/worker
  2. Old pod/worker stops accepting connections, but keeps serving the old ones
  3. shutdown timeout, say 90s, for the LB to notice it (slightly bigger than the LB probe cycle)
  4. Old pod/worker closes connections and terminates
  5. Only new worker serves connections

In case you are reloading "in process" (same IP+port), step 3 can be skipped. But at any point, a new connection (retry or not) should always succeed.

Still we have a time when all connections would be broken at once, so from client POV it's not so graceful.

I don't see how the situation is going to be better than the current state.

tnozicka commented 1 year ago

Still we have a time when all connections would be broken at once, so from client POV it's not so graceful.

I don't get it - at any point in time a new connection will succeed and long lived connections are meant to be reestablished gracefully (for the client).

fruch commented 1 year ago

Still we have a time when all connections would be broken at once, so from client POV it's not so graceful.

I don't get it - at any point in time a new connection will succeed and long lived connections are meant to be reestablished gracefully (for the client).

How is the client going to know it needs to reestablish a given connection?

Clients today keep connections until they are closed, so if one node stops and its connections break it's fine, but severing all open connections at the same time breaks the service from client POV.

I think this is a showstopper for our current SNI approach.

tnozicka commented 1 year ago

How is the client going to know it needs to reestablish a given connection?

it's gonna get closed, usually when idle (in keep alive mode)

but severing all open connections at the same time breaks the service from client POV.

maybe today - but you'd have to explain to me why a client can't open a new (on-demand) connection when all the pre-cached ones get closed

I think this is a showstopper for our current SNI approach.

I suppose even if that would be an issue (I don't think it is at this point), it would be an issue with a particular Ingress controller implementation (haproxy), not SNI approach in general. We use haproxy because it was cheap to start with but we can switch it or introduce our own golang proxy which we may need for cluster pausing anyways.

fruch commented 1 year ago

How is the client going to know it needs to reestablish a given connection?

it's gonna get closed, usually when idle (in keep alive mode)

but severing all open connections at the same time breaks the service from client POV.

maybe today - but you'd have to explain to me why a client can't open a new (on-demand) connection when all the pre-cached ones get closed

The driver can open new connections if a connection breaks and move to the next one, but when it doesn't have any connections at all, it returns a failure to the caller.

In the case we are talking about, the caller (i.e. cassandra-stress) retries the same request x10 times and still fails.

I'm not sure it's OK to tell users they'll need to retry x times more than they do with a regular Scylla/Cassandra deployment, just because we reload haproxy once in a while...

I think this is a showstopper for our current SNI approach.

I suppose even if that would be an issue (I don't think it is at this point), it would be an issue with a particular Ingress controller implementation (haproxy), not SNI approach in general. We use haproxy because it was cheap to start with but we can switch it or introduce our own golang proxy which we may need for cluster pausing anyways.

tnozicka commented 1 year ago

but when it doesn't have any connections at all, it returns a failure to the caller.

this is where I propose it opens a new connection on-demand and gives it to the caller instead of an error

In the case we are talking about, the caller (i.e. cassandra-stress) retries the same request x10 times and still fails.

what is the reason?

I'm not sure it's OK to tell users they'll need to retry x times more than they do with a regular Scylla/Cassandra deployment, just because we reload haproxy once in a while...

definitely not - this should never get to the user; it should be handled by the library without an error, by just opening an on-demand connection if the pool is empty
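
To make that concrete, a rough sketch of the idea with hypothetical pool types (not any driver's actual internals): when every cached connection is gone, dial synchronously instead of handing the caller an error.

```go
// Hypothetical pool sketch: fall back to an on-demand dial when the pool is
// empty, rather than surfacing an error to the caller.
package pool

import (
	"errors"
	"net"
	"sync"
)

// Pool is a hypothetical connection pool; it is not gocql's implementation.
type Pool struct {
	mu    sync.Mutex
	addr  string
	conns []net.Conn
}

var errEmptyPool = errors.New("pool: no live connections")

// Pick returns a pooled connection, or dials a new one on demand when the
// pool is empty (e.g. after a proxy hard-stop closed every connection at once).
func (p *Pool) Pick() (net.Conn, error) {
	p.mu.Lock()
	if n := len(p.conns); n > 0 {
		c := p.conns[n-1]
		p.conns = p.conns[:n-1]
		p.mu.Unlock()
		return c, nil
	}
	p.mu.Unlock()

	// Instead of returning errEmptyPool straight to the caller, try a
	// synchronous dial; background pool refilling can continue in parallel.
	c, err := net.Dial("tcp", p.addr)
	if err != nil {
		return nil, errEmptyPool
	}
	return c, nil
}

// Put returns a healthy connection to the pool for reuse.
func (p *Pool) Put(c net.Conn) {
	p.mu.Lock()
	p.conns = append(p.conns, c)
	p.mu.Unlock()
}
```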

fruch commented 1 year ago

but when it doesn't have any connections at all, it returns a failure to the caller.

this is where I propose it opens a new connection on-demand and gives it to the caller instead of an error

In the case we are talking about, the caller (i.e. cassandra-stress) retries the same request x10 times and still fails.

what is the reason?

  • pool gives back a closed connection 10 times?
  • connection get fails because the pool is empty?
  • pool tries to establish a new connection on demand (10 times) and fails to create it? (it should work but something may be off)

I'm not sure it's OK to tell users they'll need to retry x times more than they do with a regular Scylla/Cassandra deployment, just because we reload haproxy once in a while...

definitely not - this should never get to the user; it should be handled by the library without an error, by just opening an on-demand connection if the pool is empty

I can rerun it with extra driver logs, and then you can go over it with someone from the drivers team.

After that we'll somehow need to fix/validate it across all drivers.

fruch commented 1 year ago

@zimnx @tnozicka, here's a re-run with c-s and driver logs enabled

Here's one example of the failure:

java.io.IOException: Operation x10 on key(s) [34324b4f303439323930]: Error executing: (NoHostAvailableException): All host(s) tried for query failed (tried: a25ad45e9a6b741809da4bdef37b9ded-1349193356.eu-north-1.elb.amazonaws.com:9142:ad4261b0-5c29-4106-9ba4-56fe0b2d23c0.cql.sct-cluster-10.sct.scylladb.com (com.datastax.driver.core.exceptions.ConnectionException: [a25ad45e9a6b741809da4bdef37b9ded-1349193356.eu-north-1.elb.amazonaws.com:9142:ad4261b0-5c29-4106-9ba4-56fe0b2d23c0.cql.sct-cluster-10.sct.scylladb.com] Write attempt on defunct connection), a25ad45e9a6b741809da4bdef37b9ded-1349193356.eu-north-1.elb.amazonaws.com:9142:ade6bc03-a9e7-435f-ad8f-51a1fa690f02.cql.sct-cluster-10.sct.scylladb.com (com.datastax.driver.core.exceptions.ConnectionException: [a25ad45e9a6b741809da4bdef37b9ded-1349193356.eu-north-1.elb.amazonaws.com:9142:ade6bc03-a9e7-435f-ad8f-51a1fa690f02.cql.sct-cluster-10.sct.scylladb.com] Write attempt on defunct connection))

    at org.apache.cassandra.stress.Operation.error(Operation.java:141)
    at org.apache.cassandra.stress.Operation.timeWithRetry(Operation.java:119)
    at org.apache.cassandra.stress.operations.predefined.CqlOperation.run(CqlOperation.java:101)
    at org.apache.cassandra.stress.operations.predefined.CqlOperation.run(CqlOperation.java:109)
    at org.apache.cassandra.stress.operations.predefined.CqlOperation.run(CqlOperation.java:264)
    at org.apache.cassandra.stress.StressAction$Consumer.run(StressAction.java:473)

Installation details

Kernel Version: 5.10.179-166.674.amzn2.x86_64
Scylla version (or git commit hash): 5.4.0~dev-20230602.8be69fc3a087 with build-id 824da7c9ac7baeb719819cc56991aebe48371426

Operator Image: scylladb/scylla-operator:latest
Operator Helm Version: v1.9.0-alpha.4-14-gdb443d0
Operator Helm Repository: https://storage.googleapis.com/scylla-operator-charts/latest
Cluster size: 4 nodes (i4i.4xlarge)

Scylla Nodes used in this run: No resources left at the end of the run

OS / Image: `` (k8s-eks: undefined_region)

Test: longevity-scylla-operator-12h-multitenant-eks
Test id: 4a28dcd1-7085-4bea-b472-11d6f2959aff
Test name: scylla-staging/fruch/longevity-scylla-operator-12h-multitenant-eks
Test config file(s):

Logs and commands

- Restore Monitor Stack command: `$ hydra investigate show-monitor 4a28dcd1-7085-4bea-b472-11d6f2959aff`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=4a28dcd1-7085-4bea-b472-11d6f2959aff)
- Show all stored logs command: `$ hydra investigate show-logs 4a28dcd1-7085-4bea-b472-11d6f2959aff`

Logs:

- **db-cluster-4a28dcd1.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/4a28dcd1-7085-4bea-b472-11d6f2959aff/20230602_160525/db-cluster-4a28dcd1.tar.gz
- **sct-runner-events-4a28dcd1.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/4a28dcd1-7085-4bea-b472-11d6f2959aff/20230602_160525/sct-runner-events-4a28dcd1.tar.gz
- **sct-4a28dcd1.log.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/4a28dcd1-7085-4bea-b472-11d6f2959aff/20230602_160525/sct-4a28dcd1.log.tar.gz
- **monitor-set-4a28dcd1.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/4a28dcd1-7085-4bea-b472-11d6f2959aff/20230602_160525/monitor-set-4a28dcd1.tar.gz
- **loader-set-4a28dcd1.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/4a28dcd1-7085-4bea-b472-11d6f2959aff/20230602_160525/loader-set-4a28dcd1.tar.gz
- **kubernetes-4a28dcd1.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/4a28dcd1-7085-4bea-b472-11d6f2959aff/20230602_160525/kubernetes-4a28dcd1.tar.gz

[Jenkins job URL](https://jenkins.scylladb.com/job/scylla-staging/job/fruch/job/longevity-scylla-operator-12h-multitenant-eks/58/) [Argus](https://argus.scylladb.com/test/37ffa9a1-870c-4858-8404-4cea2c335766/runs?additionalRuns[]=4a28dcd1-7085-4bea-b472-11d6f2959aff)
zimnx commented 1 year ago

I skimmed through both the gocql and Java driver code around connection pools, and they seem to have poor support for losing all the connections.

gocql starts asynchronous pool filling when it detects that it doesn't fully cover all shards of a particular node, and returns nil - which causes an error to be returned to the user - when the pool is empty: https://github.com/scylladb/gocql/blob/v1.7.3/connectionpool.go#L313-L333

The Java driver behaves the same: an error is returned to the user without reconnecting.

Instead, drivers should at least try to reconnect and return nil only if that is not successful. Otherwise a rolling restart isn't fully supported.

Operation x10 [...] All host(s) tried for query failed [...] Write attempt on defunct connection

Sounds like a closed connection is returned to the driver by the pool.
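
Until the drivers handle this internally, one possible application-side mitigation is to retry with a small backoff while the driver refills its pool in the background. A minimal sketch, assuming the upstream gocql API and a hypothetical contact point:

```go
package main

import (
	"log"
	"time"

	"github.com/gocql/gocql"
)

// execWithRetry retries a statement so that a short window with an empty
// connection pool (all connections dropped at once) does not surface to the
// caller while the driver refills the pool in the background.
func execWithRetry(session *gocql.Session, stmt string, attempts int) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = session.Query(stmt).Exec(); err == nil {
			return nil
		}
		// Back off a little before the next attempt.
		time.Sleep(time.Duration(i+1) * 500 * time.Millisecond)
	}
	return err
}

func main() {
	// Hypothetical contact point; in the runs above this would be the SNI
	// ingress address of one of the tenants.
	cluster := gocql.NewCluster("node-1.example.com")
	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	if err := execWithRetry(session, "SELECT now() FROM system.local", 10); err != nil {
		log.Fatalf("query failed after retries: %v", err)
	}
}
```

This only papers over the window; the more robust fix is for the pool itself to dial on demand, as discussed above.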

zimnx commented 1 year ago

Issues reported in drivers: https://github.com/scylladb/gocql/issues/140 https://github.com/scylladb/java-driver/issues/236

tnozicka commented 1 year ago

Closing this in favour of the specific driver issues, thanks!

fruch commented 12 months ago

@tnozicka

isn't this the same issue as in https://github.com/scylladb/scylla-operator/issues/1341

Also, were those driver issues handed over to anyone in the driver team? It doesn't seem anyone is aware of them. @roydahan, @avelanarius FYI

tnozicka commented 12 months ago

@tnozicka isn't this the same issue as in https://github.com/scylladb/scylla-operator/issues/1341

I don't think so.

Also, were those driver issues handed over to anyone in the driver team?

Two comments above, @zimnx referenced https://github.com/scylladb/gocql/issues/140 and https://github.com/scylladb/java-driver/issues/236, the driver issues that have been filed. Are you saying the team that manages those repos is not aware of them?

avelanarius commented 12 months ago

I was aware of the issues when they were filed and I read them. I didn't like the proposed solution (it would negatively impact non-proxy workloads) and I couldn't immediately find a better solution. Combined with our current priority of serverless, those issues therefore weren't prioritized by me.