Closed by fruch 1 year ago
Seems like a run without EBS managed to run the whole duration of the test without issues. Running it again with EBS, and with hinted handoff disabled: since it takes longer to spin a node back up with EBS, we accumulate many more hints, and writing them out slows the disk down again.
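For reference, a minimal sketch of disabling hinted handoff on a node (assumptions: direct access to the node's `scylla.yaml`; in the operator setup this would normally be driven through the cluster spec rather than a manual edit):

```bash
# Hedged sketch: turn off hinted handoff for the EBS run.
# "hinted_handoff_enabled" is a standard scylla.yaml option; the path and the
# direct-edit approach below are assumptions for illustration only.
echo "hinted_handoff_enabled: false" | sudo tee -a /etc/scylla/scylla.yaml
sudo systemctl restart scylla-server   # restart so the new setting takes effect
```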
Kernel Version: 5.4.219-126.411.amzn2.x86_64
Scylla version (or git commit hash): 5.2.0~dev-20221207.47a8fad2a2bd
with build-id aa015a1ce31da9ba79f718e2b2ef472e1eb3e835
Operator Image: scylladb/scylla-operator:latest
Operator Helm Version: v1.8.0-alpha.0-162-g7be1034
Operator Helm Repository: https://storage.googleapis.com/scylla-operator-charts/latest
Cluster size: 4 nodes (i3.4xlarge)
OS / Image: `` (k8s-eks: eu-north-1)
Test: longevity-scylla-operator-3h-multitenant-eks
Test id: 5ca9d911-afca-4b41-a527-561fd2322b7a
Test name: scylla-staging/fruch/longevity-scylla-operator-3h-multitenant-eks
Test config file(s):
Restore Monitor Stack command: $ hydra investigate show-monitor 5ca9d911-afca-4b41-a527-561fd2322b7a
Restore monitor on AWS instance using Jenkins job
Show all stored logs command: $ hydra investigate show-logs 5ca9d911-afca-4b41-a527-561fd2322b7a
Seems like it's a known issue with EBS (slow disks): https://github.com/scylladb/scylladb/issues/9906
and nothing to do with the sni-proxy...
You could try RAID0... and I'm not sure we optimize for the EBS IO size (which is 16K to 32K or so...)
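A hedged sketch of what the RAID0 option could look like (device names, volume count, and mount point are assumptions; on Nitro instances attached EBS volumes show up as NVMe devices):

```bash
# Stripe two attached EBS volumes into a single RAID0 array with mdadm,
# then format it and mount it as the Scylla data directory.
sudo mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/nvme1n1 /dev/nvme2n1
sudo mkfs.xfs /dev/md0
sudo mount /dev/md0 /var/lib/scylla
```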
We could do lots of things around the area of EBS; the question is whether we should do it now... (and I was under the impression the answer was no)
No, we should not invest in that.
Issue description
While running with 2 tenants with ingress configured, and running the SoftRebootNode nemesis (which basically just does `kubectl --namespace=scylla-2 delete pod sct-cluster-2-us-east1-b-us-east1-1 --grace-period=1800`), all of the running cassandra-stress commands get their connections closed.
haproxy seems to reload its configuration at the time all of the connections were closed.
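One way to correlate the disconnects with ingress reloads would be to grep the controller logs around the nemesis window (the namespace and deployment name below are hypothetical; adjust to the haproxy ingress actually installed in this cluster):

```bash
# Hedged sketch: look for configuration reload messages in the haproxy
# ingress controller logs while the Scylla pod is being deleted.
kubectl --namespace=haproxy-controller logs deploy/haproxy-ingress --timestamps | grep -i reload
```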
Installation details
Kernel Version: 5.4.219-126.411.amzn2.x86_64
Scylla version (or git commit hash): 5.2.0~dev-20221207.47a8fad2a2bd
with build-id aa015a1ce31da9ba79f718e2b2ef472e1eb3e835
Operator Image: scylladb/scylla-operator:latest
Operator Helm Version: v1.8.0-alpha.0-162-g7be1034
Operator Helm Repository: https://storage.googleapis.com/scylla-operator-charts/latest
Cluster size: 4 nodes (i3.4xlarge)
Scylla Nodes used in this run: No resources left at the end of the run
OS / Image: `` (k8s-eks: eu-north-1)
Test: longevity-scylla-operator-3h-multitenant-eks
Test id: 7f2241a8-2156-4345-8506-e2ca8f00be5c
Test name: scylla-staging/fruch/longevity-scylla-operator-3h-multitenant-eks
Test config file(s): longevity-scylla-operator-3h-multitenant.yaml
Restore Monitor Stack command: $ hydra investigate show-monitor 7f2241a8-2156-4345-8506-e2ca8f00be5c
Restore monitor on AWS instance using Jenkins job
Show all stored logs command: $ hydra investigate show-logs 7f2241a8-2156-4345-8506-e2ca8f00be5c
Logs:
Jenkins job URL