Closed by fruch 1 year ago
Seems like a run without EBS managed to run the whole duration of the test without issues. Running it again with EBS, and with hinted handoff disabled: since it takes longer to spin a node back up with EBS, we accumulate many more hints, and writing them out slows the disk down again.
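For reference, a minimal sketch of disabling hinted handoff on a node (assumptions: direct access to the node's `scylla.yaml`; in the operator setup this would normally be driven through the cluster spec rather than a manual edit):

```bash
# Hedged sketch: turn off hinted handoff for the EBS run.
# "hinted_handoff_enabled" is a standard scylla.yaml option; the path and the
# direct-edit approach below are assumptions for illustration only.
echo "hinted_handoff_enabled: false" | sudo tee -a /etc/scylla/scylla.yaml
sudo systemctl restart scylla-server   # restart so the new setting takes effect
```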
Kernel Version: 5.4.219-126.411.amzn2.x86_64
Scylla version (or git commit hash): 5.2.0~dev-20221207.47a8fad2a2bd
with build-id aa015a1ce31da9ba79f718e2b2ef472e1eb3e835
Operator Image: scylladb/scylla-operator:latest
Operator Helm Version: v1.8.0-alpha.0-162-g7be1034
Operator Helm Repository: https://storage.googleapis.com/scylla-operator-charts/latest
Cluster size: 4 nodes (i3.4xlarge)
OS / Image: `` (k8s-eks: eu-north-1)
Test: longevity-scylla-operator-3h-multitenant-eks
Test id: 5ca9d911-afca-4b41-a527-561fd2322b7a
Test name: scylla-staging/fruch/longevity-scylla-operator-3h-multitenant-eks
Test config file(s):
Restore Monitor Stack command: $ hydra investigate show-monitor 5ca9d911-afca-4b41-a527-561fd2322b7a
Restore monitor on AWS instance using Jenkins job
Show all stored logs command: $ hydra investigate show-logs 5ca9d911-afca-4b41-a527-561fd2322b7a
Seems like it's a known issue with EBS (slow disks): https://github.com/scylladb/scylladb/issues/9906
and nothing to do with the sni-proxy...
You could try RAID0... and I'm not sure we optimize for the EBS IO size (which is 16K to 32K or so...)
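A hedged sketch of what the RAID0 option could look like (device names, volume count, and mount point are assumptions; on Nitro instances attached EBS volumes show up as NVMe devices):

```bash
# Stripe two attached EBS volumes into a single RAID0 array with mdadm,
# then format it and mount it as the Scylla data directory.
sudo mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/nvme1n1 /dev/nvme2n1
sudo mkfs.xfs /dev/md0
sudo mount /dev/md0 /var/lib/scylla
```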
We could do lots of things around the area of EBS; the question is whether we should do it now... (and I was under the impression the answer was no)
No, we should not invest in that.
Issue description
While running with 2 tenants with ingress configured, and running the SoftRebootNode nemesis (which basically just does `kubectl --namespace=scylla-2 delete pod sct-cluster-2-us-east1-b-us-east1-1 --grace-period=1800`), all of the running cassandra-stress commands get their connections closed.
haproxy seems to reload its configuration at the time all of the connections were closed.
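One way to correlate the disconnects with ingress reloads would be to grep the controller logs around the nemesis window (the namespace and deployment name below are hypothetical; adjust to the haproxy ingress actually installed in this cluster):

```bash
# Hedged sketch: look for configuration reload messages in the haproxy
# ingress controller logs while the Scylla pod is being deleted.
kubectl --namespace=haproxy-controller logs deploy/haproxy-ingress --timestamps | grep -i reload
```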
Installation details
Kernel Version: 5.4.219-126.411.amzn2.x86_64
Scylla version (or git commit hash): 5.2.0~dev-20221207.47a8fad2a2bd
with build-id aa015a1ce31da9ba79f718e2b2ef472e1eb3e835
Operator Image: scylladb/scylla-operator:latest
Operator Helm Version: v1.8.0-alpha.0-162-g7be1034
Operator Helm Repository: https://storage.googleapis.com/scylla-operator-charts/latest
Cluster size: 4 nodes (i3.4xlarge)
Scylla Nodes used in this run: No resources left at the end of the run
OS / Image: `` (k8s-eks: eu-north-1)
Test: longevity-scylla-operator-3h-multitenant-eks
Test id: 7f2241a8-2156-4345-8506-e2ca8f00be5c
Test name: scylla-staging/fruch/longevity-scylla-operator-3h-multitenant-eks
Test config file(s): longevity-scylla-operator-3h-multitenant.yaml
Restore Monitor Stack command: $ hydra investigate show-monitor 7f2241a8-2156-4345-8506-e2ca8f00be5c
Restore monitor on AWS instance using Jenkins job
Show all stored logs command: $ hydra investigate show-logs 7f2241a8-2156-4345-8506-e2ca8f00be5c
Logs:
Jenkins job URL