dkropachev opened 2 years ago
Reproduced using scylla-bench 0.1.8 (and scylla 5.0.0):
2022-08-02 12:51:31.875: (ScyllaBenchEvent Severity.CRITICAL) period_type=end event_id=f54a7bdb-6ce7-463d-ab2c-031d100d1ff5 duration=52m49s: node=Node longevity-counters-multidc-sb018-loader-node-3ef98c00-3 [34.245.137.69 | 10.4.1.81] (seed: False)
stress_cmd=scylla-bench -workload=uniform -mode=counter_update -replication-factor=3 -partition-count=1000 -clustering-row-count=10 -concurrency 1024 -duration 360m -validate-data -error-at-row-limit 1000
errors:
Stress command completed with bad status 2: 2022/08/02 12:44:43 EOF
2022/08/02 12:44:43 EOF
2022/08/02 12:44:43 EOF
2022/08/02 12:44:43 EOF
2022
Kernel Version: 5.13.0-1031-aws
Scylla version (or git commit hash): 5.0.0-20220628.c9e79cb4a
with build-id daef393310f38c086a6e5070c8f2013684973a48
Cluster size: 9 nodes (i3.2xlarge)
Scylla Nodes used in this run:
OS / Image: ami-050b81d683f061fbd ami-06d77742d3622e943 ami-0ae3b0658c86b30cd (aws: eu-west-1)
Test: longevity-counters-multidc-test-sb018
Test id: 3ef98c00-c2f5-4853-8c81-60ec0b7c5de8
Test name: scylla-staging/Longevity_yaron/longevity-counters-multidc-test-sb018
Test config file(s):
$ hydra investigate show-monitor 3ef98c00-c2f5-4853-8c81-60ec0b7c5de8
$ hydra investigate show-logs 3ef98c00-c2f5-4853-8c81-60ec0b7c5de8
@yarongilor not enough information.
Got this for 2022.2.0~rc1
Kernel Version: 5.15.0-1019-aws
Scylla version (or git commit hash): 2022.2.0~rc1-20220902.a9bc6d191031
with build-id 074a0cb9e6a5ab36ba5e7f81385e68079ab6eeda
Relocatable Package: http://downloads.scylladb.com/downloads/scylla-enterprise/relocatable/scylladb-2022.2/scylla-enterprise-x86_64-package-2022.2.0-rc1.0.20220902.a9bc6d191031.tar.gz
Cluster size: 9 nodes (i3.2xlarge)
Scylla Nodes used in this run:
OS / Image: ami-0492fd8e302e5af45 ami-0c027d54190721a1f ami-07f3ede3d0d2961d0 (aws: eu-west-1)
Test: longevity-counters-multidc-test
Test id: 033f0cad-0d98-46da-8205-37f179134fa8
Test name: enterprise-2022.2/longevity/longevity-counters-multidc-test
Test config file(s):
At the beginning of disrupt_rebuild_streaming_err (19:01:02), node 4 (longevity-counters-multidc-2022-2-db-node-033f0cad-4) was stopped. At approximately the same moment, s-b started returning EOF:
< t:2022-09-13 19:01:02,217 f:remote_base.py l:520 c:RemoteLibSSH2CmdRunner p:DEBUG > Running command "sudo systemctl stop scylla-server.service"...
< t:2022-09-13 19:01:02,239 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG > 42m11.2s 7700 7700 0 235ms 227ms 214ms 204ms 199ms 143ms 144ms
< t:2022-09-13 19:01:02,392 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG > 41m22.5s 3219 3219 0 508ms 398ms 282ms 254ms 249ms 146ms 174ms
< t:2022-09-13 19:01:02,503 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG > 41m23s 3243 3243 0 411ms 392ms 284ms 254ms 248ms 146ms 173ms
< t:2022-09-13 19:01:02,522 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG > 41m26.8s 3273 3273 0 393ms 388ms 268ms 253ms 248ms 145ms 171ms
< t:2022-09-13 19:01:02,751 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG > 42m10s 7672 7672 0 249ms 227ms 215ms 205ms 200ms 143ms 144ms
< t:2022-09-13 19:01:02,893 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG > 42m8.3s 7717 7717 0 240ms 226ms 214ms 204ms 199ms 142ms 143ms
< t:2022-09-13 19:01:03,003 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG > 2022/09/13 19:01:02 EOF
< t:2022-09-13 19:01:03,022 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG > 2022/09/13 19:01:02 EOF
< t:2022-09-13 19:01:03,051 f:db_log_reader.py l:113 c:sdcm.db_log_reader p:DEBUG > 2022-09-13T19:01:02+00:00 longevity-counters-multidc-2022-2-db-node-033f0cad-7 !NOTICE | sudo[11747]: scyllaadm : TTY=unknown ; PWD=/home/scyllaadm ; USER=root ; COMMAND=/usr/bin/coredumpctl --no-pager --no-legend
< t:2022-09-13 19:01:03,051 f:db_log_reader.py l:113 c:sdcm.db_log_reader p:DEBUG > 2022-09-13T19:01:02+00:00 longevity-counters-multidc-2022-2-db-node-033f0cad-7 !INFO | sudo[11747]: pam_unix(sudo:session): session opened for user root by (uid=0)
< t:2022-09-13 19:01:03,052 f:db_log_reader.py l:113 c:sdcm.db_log_reader p:DEBUG > 2022-09-13T19:01:02+00:00 longevity-counters-multidc-2022-2-db-node-033f0cad-7 !INFO | sudo[11747]: pam_unix(sudo:session): session closed for user root
< t:2022-09-13 19:01:03,052 f:db_log_reader.py l:113 c:sdcm.db_log_reader p:DEBUG > 2022-09-13T19:01:02+00:00 longevity-counters-multidc-2022-2-db-node-033f0cad-7 !WARNING | scylla[8204]: [shard 6] storage_proxy - Exception when communicating with 10.15.2.105, to read from scylla_bench.test_counters: abort requested
< t:2022-09-13 19:01:03,052 f:db_log_reader.py l:113 c:sdcm.db_log_reader p:DEBUG > 2022-09-13T19:01:03+00:00 longevity-counters-multidc-2022-2-db-node-033f0cad-4 !NOTICE | sudo[11600]: scyllaadm : TTY=unknown ; PWD=/home/scyllaadm ; USER=root ; COMMAND=/usr/bin/systemctl stop scylla-server.service
< t:2022-09-13 19:01:03,052 f:db_log_reader.py l:113 c:sdcm.db_log_reader p:DEBUG > 2022-09-13T19:01:03+00:00 longevity-counters-multidc-2022-2-db-node-033f0cad-4 !INFO | sudo[11600]: pam_unix(sudo:session): session opened for user root by (uid=0)
< t:2022-09-13 19:01:03,052 f:db_log_reader.py l:113 c:sdcm.db_log_reader p:DEBUG > 2022-09-13T19:01:03+00:00 longevity-counters-multidc-2022-2-db-node-033f0cad-4 !INFO | systemd[1]: scylla-housekeeping-daily.timer: Succeeded.
< t:2022-09-13 19:01:03,052 f:db_log_reader.py l:113 c:sdcm.db_log_reader p:DEBUG > 2022-09-13T19:01:02+00:00 longevity-counters-multidc-2022-2-db-node-033f0cad-7 !WARNING | scylla[8204]: [shard 6] storage_proxy - Exception when communicating with 10.15.2.105, to read from scylla_bench.test_counters: abort requested
< t:2022-09-13 19:01:03,052 f:db_log_reader.py l:113 c:sdcm.db_log_reader p:DEBUG > 2022-09-13T19:01:03+00:00 longevity-counters-multidc-2022-2-db-node-033f0cad-4 !INFO | systemd[1]: Stopped Run Scylla Housekeeping daily mode.
< t:2022-09-13 19:01:03,053 f:db_log_reader.py l:113 c:sdcm.db_log_reader p:DEBUG > 2022-09-13T19:01:03+00:00 longevity-counters-multidc-2022-2-db-node-033f0cad-4 !INFO | systemd[1]: scylla-housekeeping-restart.timer: Succeeded.
< t:2022-09-13 19:01:03,053 f:events_processes.py l:146 c:sdcm.sct_events.events_processes p:DEBUG > Get process `MainDevice' from EventsProcessesRegistry[lod_dir=/home/ubuntu/sct-results/20220913-171133-816342,id=0x7fcedb286f80,default=True]
< t:2022-09-13 19:01:03,053 f:db_log_reader.py l:113 c:sdcm.db_log_reader p:DEBUG > 2022-09-13T19:01:03+00:00 longevity-counters-multidc-2022-2-db-node-033f0cad-4 !INFO | systemd[1]: Stopped Run Scylla Housekeeping restart mode.
< t:2022-09-13 19:01:03,053 f:db_log_reader.py l:113 c:sdcm.db_log_reader p:DEBUG > 2022-09-13T19:01:03+00:00 longevity-counters-multidc-2022-2-db-node-033f0cad-4 !INFO | systemd[1]: Stopping Scylla JMX...
< t:2022-09-13 19:01:03,053 f:events_processes.py l:146 c:sdcm.sct_events.events_processes p:DEBUG > Get process `MainDevice' from EventsProcessesRegistry[lod_dir=/home/ubuntu/sct-results/20220913-171133-816342,id=0x7fcedb286f80,default=True]
< t:2022-09-13 19:01:03,054 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:DEBUG > 2022-09-13 19:01:03.052 <2022-09-13 19:01:02.000>: (DatabaseLogEvent Severity.WARNING) period_type=one-time event_id=39024300-0f1c-4a81-8f31-9038f98efd49: type=WARNING regex=(^WARNING|!\s*?WARNING).*\[shard.*\] line_number=7925 node=longevity-counters-multidc-2022-2-db-node-033f0cad-7
< t:2022-09-13 19:01:03,054 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:DEBUG > 2022-09-13T19:01:02+00:00 longevity-counters-multidc-2022-2-db-node-033f0cad-7 !WARNING | scylla[8204]: [shard 6] storage_proxy - Exception when communicating with 10.15.2.105, to read from scylla_bench.test_counters: abort requested
< t:2022-09-13 19:01:03,056 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:DEBUG > 2022-09-13 19:01:03.053 <2022-09-13 19:01:02.000>: (DatabaseLogEvent Severity.WARNING) period_type=one-time event_id=39024300-0f1c-4a81-8f31-9038f98efd49: type=WARNING regex=(^WARNING|!\s*?WARNING).*\[shard.*\] line_number=7926 node=longevity-counters-multidc-2022-2-db-node-033f0cad-7
< t:2022-09-13 19:01:03,056 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:DEBUG > 2022-09-13T19:01:02+00:00 longevity-counters-multidc-2022-2-db-node-033f0cad-7 !WARNING | scylla[8204]: [shard 6] storage_proxy - Exception when communicating with 10.15.2.105, to read from scylla_bench.test_counters: abort requested
< t:2022-09-13 19:01:03,153 f:db_log_reader.py l:113 c:sdcm.db_log_reader p:DEBUG > 2022-09-13T19:01:03+00:00 longevity-counters-multidc-2022-2-db-node-033f0cad-4 !NOTICE | systemd[1]: scylla-jmx.service: Main process exited, code=exited, status=143/n/a
< t:2022-09-13 19:01:03,154 f:db_log_reader.py l:113 c:sdcm.db_log_reader p:DEBUG > 2022-09-13T19:01:03+00:00 longevity-counters-multidc-2022-2-db-node-033f0cad-4 !WARNING | systemd[1]: scylla-jmx.service: Failed with result 'exit-code'.
< t:2022-09-13 19:01:03,154 f:db_log_reader.py l:113 c:sdcm.db_log_reader p:DEBUG > 2022-09-13T19:01:03+00:00 longevity-counters-multidc-2022-2-db-node-033f0cad-4 !INFO | systemd[1]: Stopped Scylla JMX.
< t:2022-09-13 19:01:03,154 f:events_processes.py l:146 c:sdcm.sct_events.events_processes p:DEBUG > Get process `MainDevice' from EventsProcessesRegistry[lod_dir=/home/ubuntu/sct-results/20220913-171133-816342,id=0x7fcedb286f80,default=True]
< t:2022-09-13 19:01:03,154 f:db_log_reader.py l:113 c:sdcm.db_log_reader p:DEBUG > 2022-09-13T19:01:03+00:00 longevity-counters-multidc-2022-2-db-node-033f0cad-4 !INFO | scylla[7925]: [shard 0] compaction_manager - Asked to stop
< t:2022-09-13 19:01:03,154 f:db_log_reader.py l:113 c:sdcm.db_log_reader p:DEBUG > 2022-09-13T19:01:03+00:00 longevity-counters-multidc-2022-2-db-node-033f0cad-4 !INFO | systemd[1]: Stopping Scylla Server...
Between ~19:01 and ~19:03 the same picture appears in the logs of every loader node:
42m4.6s 7680 7680 0 238ms 231ms 215ms 205ms 200ms 143ms 144ms
42m5.6s 7703 7703 0 246ms 228ms 215ms 205ms 200ms 143ms 144ms
42m6.7s 7728 7728 0 234ms 226ms 215ms 204ms 200ms 142ms 143ms
42m7.8s 7679 7679 0 237ms 226ms 215ms 204ms 200ms 143ms 144ms
42m8.9s 7702 7702 0 238ms 228ms 214ms 204ms 200ms 143ms 144ms
42m10s 7672 7672 0 249ms 227ms 215ms 205ms 200ms 143ms 144ms
2022/09/13 19:01:02 EOF
2022/09/13 19:01:02 EOF
42m11.2s 7704 7704 0 236ms 225ms 214ms 204ms 200ms 143ms 144ms
2022/09/13 19:01:02 EOF
2022/09/13 19:01:02 EOF
2022/09/13 19:01:02 EOF
42m12.2s 7693 7693 0 237ms 226ms 214ms 204ms 200ms 143ms 144ms
2022/09/13 19:01:02 EOF
2022/09/13 19:01:02 EOF
2022/09/13 19:01:02 EOF
42m13.3s 7674 7674 0 240ms 228ms 213ms 204ms 200ms 143ms 144ms
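For context on the bare EOF lines: when the coordinator's TCP connection is closed by `systemctl stop scylla-server`, the Go driver hands io.EOF back to every in-flight request, and with retries disabled each worker goroutine simply logs it, which is why all of them print EOF in the same second. A minimal sketch of that failure path (a hypothetical update loop, not scylla-bench's actual code; the counter statement and schema are assumed, only the `scylla_bench.test_counters` table name comes from the node logs above):

```go
package main

import (
	"errors"
	"io"
	"log"

	"github.com/gocql/gocql"
)

func main() {
	cluster := gocql.NewCluster("10.15.2.105") // a node address taken from the logs above
	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	// Each worker goroutine runs a loop conceptually like this one.
	for {
		err := session.Query(
			`UPDATE scylla_bench.test_counters SET c1 = c1 + 1 WHERE pk = ? AND ck = ?`,
			1, 1, // hypothetical counter schema, assumed for illustration
		).Exec()
		if errors.Is(err, io.EOF) {
			// The stopped node closed the connection mid-request; with zero
			// retries configured, the raw io.EOF reaches the worker and
			// logging it produces exactly the bare "EOF" lines above.
			log.Print(err)
		}
	}
}
```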
From the Grafana screenshots it looks like the load actually stopped completely.
In the events.log:
2022-09-13 20:25:28.564: (ScyllaBenchEvent Severity.CRITICAL) period_type=end event_id=76b94711-9e0b-4a0e-8ba6-3e840ec6a160 duration=2h6m43s: node=Node longevity-counters-multidc-2022-2-loader-node-033f0cad-3 [54.246.230.9 | 10.4.3.2] (seed: False)
stress_cmd=scylla-bench -workload=uniform -mode=counter_update -replication-factor=3 -partition-count=1000 -clustering-row-count=10 -concurrency 1024 -duration 360m -validate-data -error-at-row-limit 1000
errors:
Stress command completed with bad status 137: 2022/09/13 19:01:02 EOF
2022/09/13 19:01:02 EOF
2022/09/13 19:01:02 EOF
2022/09/13 19:01:02 EOF
2022
$ hydra investigate show-monitor 033f0cad-0d98-46da-8205-37f179134fa8
$ hydra investigate show-logs 033f0cad-0d98-46da-8205-37f179134fa8
We'll have to build the old version v0.1.3 for SCT (https://github.com/scylladb/scylla-cluster-tests/commit/214326a9d47e7b751489638d5dfaa781f68a7700) and try to pin down what's broken in this one.
It seems that in the counter_update command we are not setting timeouts. Besides the EOFs, we see timeouts continuously from a certain point onward:
scylla-bench -workload=uniform -mode=counter_update -replication-factor=3 -partition-count=1000 -clustering-row-count=10 -concurrency 1024 -duration 360m -validate-data -error-at-row-limit 1000
Timeout: 5s
Retries:
number: 0
min interval: 1s
max interval: 1s
@fruch I don't understand your comment about counter_update.
We got a reproduction with s-b 0.1.3 executed from Docker. I also see from the commit history that on 14 January 2022 we bumped the Go version.
Maybe this is related to the Go version rather than to a specific s-b version?
Kernel Version: 5.15.0-1019-aws
Scylla version (or git commit hash): 2022.2.0~rc2-20220919.75d087a2b75a
with build-id 463f1a57b82041a6c6b6441f0cbc26c8ad93091e
Relocatable Package: http://downloads.scylladb.com/downloads/scylla-enterprise/relocatable/scylladb-2022.2/scylla-enterprise-x86_64-package-2022.2.0-rc2.0.20220919.75d087a2b75a.tar.gz
Cluster size: 5 nodes (i3en.3xlarge)
Scylla Nodes used in this run:
OS / Image: ami-00bd31f22bcf5ae1a (aws: eu-west-1)
Test: longevity-large-partition-200k-pks-4days-test
Test id: c729edd1-2896-49d0-9520-057d3ed9df22
Test name: enterprise-2022.2/longevity/longevity-large-partition-200k-pks-4days-test
Test config file(s):
$ hydra investigate show-monitor c729edd1-2896-49d0-9520-057d3ed9df22
$ hydra investigate show-logs c729edd1-2896-49d0-9520-057d3ed9df22
@fruch shall we try building the scylla-bench image with base Go image version 1.16 and see if it reproduces?
If anyone is investing time in reproducing this one, I think we should try the latest s-b version and figure out what the issues are with that, rather than going back to older s-b and Go versions.
Another reproduced instance:
Installation details
Kernel version: 5.11.0-1028-aws
Scylla version (or git commit hash): 5.1.dev-0.20220217.69fcc053b
with build-id b8415b1ebbffff2b4183734680f4afab3bfed86d
Cluster size: 4 nodes (i3en.2xlarge)
Scylla running with shards number (live nodes):
longevity-twcs-48h-master-db-node-d3f0b0ff-1 (3.250.216.253 | 10.0.3.124): 8 shards
longevity-twcs-48h-master-db-node-d3f0b0ff-2 (34.242.112.161 | 10.0.2.63): 8 shards
longevity-twcs-48h-master-db-node-d3f0b0ff-4 (52.18.71.120 | 10.0.0.189): 8 shards
longevity-twcs-48h-master-db-node-d3f0b0ff-5 (54.194.249.252 | 10.0.0.178): 8 shards
Scylla running with shards number (terminated nodes):
longevity-twcs-48h-master-db-node-d3f0b0ff-3 (54.154.245.131 | 10.0.2.210): 8 shards
OS (RHEL/CentOS/Ubuntu/AWS AMI): ami-041d8500e7cf30167 (aws: eu-west-1)
Test: longevity-twcs-48h-test
Test name: longevity_twcs_test.TWCSLongevityTest.test_custom_time
Test config file(s): longevity-twcs-48h.yaml
Issue description
====================================
2022-02-18 16:16:29.292: (ScyllaBenchEvent Severity.CRITICAL) period_type=end event_id=506e1e7b-88aa-4cab-a6a4-45168f5bb513 duration=11h31m49s: node=Node longevity-twcs-48h-master-loader-node-d3f0b0ff-2 [18.202.227.20 | 10.0.1.29] (seed: False)
stress_cmd=scylla-bench -workload=timeseries -mode=read -partition-count=20000 -concurrency=100 -replication-factor=3 -clustering-row-count=10000000 -clustering-row-size=200 -rows-per-request=100 -start-timestamp=GET_WRITE_TIMESTAMP -write-rate 125 -distribution hnormal --connection-count 100 -duration=2880m -error-at-row-limit 1000
errors:
Stress command completed with bad status 1: 2022/02/18 04:44:50 EOF
2022/02/18 04:44:50 EOF
2022/02/18 04:44:50 EOF
2022/02/18 04:44:50 EOF
2022
Restore Monitor Stack command: $ hydra investigate show-monitor d3f0b0ff-3989-41f4-9eaf-a0f510ea5895
Restore monitor on AWS instance using Jenkins job
Show all stored logs command: $ hydra investigate show-logs d3f0b0ff-3989-41f4-9eaf-a0f510ea5895
Test id: d3f0b0ff-3989-41f4-9eaf-a0f510ea5895
Logs:
grafana - https://cloudius-jenkins-test.s3.amazonaws.com/d3f0b0ff-3989-41f4-9eaf-a0f510ea5895/20220218_162212/grafana-screenshot-longevity-twcs-48h-test-scylla-per-server-metrics-nemesis-20220218_162212-longevity-twcs-48h-master-monitor-node-d3f0b0ff-1.png
db-cluster - https://cloudius-jenkins-test.s3.amazonaws.com/d3f0b0ff-3989-41f4-9eaf-a0f510ea5895/20220218_163047/db-cluster-d3f0b0ff.tar.gz
loader-set - https://cloudius-jenkins-test.s3.amazonaws.com/d3f0b0ff-3989-41f4-9eaf-a0f510ea5895/20220218_163047/loader-set-d3f0b0ff.tar.gz
monitor-set - https://cloudius-jenkins-test.s3.amazonaws.com/d3f0b0ff-3989-41f4-9eaf-a0f510ea5895/20220218_163047/monitor-set-d3f0b0ff.tar.gz
sct - https://cloudius-jenkins-test.s3.amazonaws.com/d3f0b0ff-3989-41f4-9eaf-a0f510ea5895/20220218_163047/sct-runner-d3f0b0ff.tar.gz
Jenkins job URL