scylladb / scylla-bench


sb reports EOF and exits the process #92

Open dkropachev opened 2 years ago

dkropachev commented 2 years ago

Another reproduced instance:

Installation details

Kernel version: 5.11.0-1028-aws
Scylla version (or git commit hash): 5.1.dev-0.20220217.69fcc053b with build-id b8415b1ebbffff2b4183734680f4afab3bfed86d
Cluster size: 4 nodes (i3en.2xlarge)

Scylla running with shards number (live nodes):
longevity-twcs-48h-master-db-node-d3f0b0ff-1 (3.250.216.253 | 10.0.3.124): 8 shards
longevity-twcs-48h-master-db-node-d3f0b0ff-2 (34.242.112.161 | 10.0.2.63): 8 shards
longevity-twcs-48h-master-db-node-d3f0b0ff-4 (52.18.71.120 | 10.0.0.189): 8 shards
longevity-twcs-48h-master-db-node-d3f0b0ff-5 (54.194.249.252 | 10.0.0.178): 8 shards

Scylla running with shards number (terminated nodes):
longevity-twcs-48h-master-db-node-d3f0b0ff-3 (54.154.245.131 | 10.0.2.210): 8 shards

OS (RHEL/CentOS/Ubuntu/AWS AMI): ami-041d8500e7cf30167 (aws: eu-west-1)

Test: longevity-twcs-48h-test
Test name: longevity_twcs_test.TWCSLongevityTest.test_custom_time
Test config file(s): longevity-twcs-48h.yaml

Issue description

====================================

2022-02-18 16:16:29.292: (ScyllaBenchEvent Severity.CRITICAL) period_type=end event_id=506e1e7b-88aa-4cab-a6a4-45168f5bb513 duration=11h31m49s: node=Node longevity-twcs-48h-master-loader-node-d3f0b0ff-2 [18.202.227.20 | 10.0.1.29] (seed: False)
stress_cmd=scylla-bench -workload=timeseries -mode=read -partition-count=20000 -concurrency=100 -replication-factor=3 -clustering-row-count=10000000 -clustering-row-size=200 -rows-per-request=100 -start-timestamp=GET_WRITE_TIMESTAMP -write-rate 125 -distribution hnormal --connection-count 100 -duration=2880m -error-at-row-limit 1000
errors:
Stress command completed with bad status 1: 2022/02/18 04:44:50 EOF
2022/02/18 04:44:50 EOF
2022/02/18 04:44:50 EOF
2022/02/18 04:44:50 EOF
2022

Restore Monitor Stack command: $ hydra investigate show-monitor d3f0b0ff-3989-41f4-9eaf-a0f510ea5895
Restore monitor on AWS instance using Jenkins job
Show all stored logs command: $ hydra investigate show-logs d3f0b0ff-3989-41f4-9eaf-a0f510ea5895

Test id: d3f0b0ff-3989-41f4-9eaf-a0f510ea5895

Logs:
grafana - https://cloudius-jenkins-test.s3.amazonaws.com/d3f0b0ff-3989-41f4-9eaf-a0f510ea5895/20220218_162212/grafana-screenshot-longevity-twcs-48h-test-scylla-per-server-metrics-nemesis-20220218_162212-longevity-twcs-48h-master-monitor-node-d3f0b0ff-1.png
db-cluster - https://cloudius-jenkins-test.s3.amazonaws.com/d3f0b0ff-3989-41f4-9eaf-a0f510ea5895/20220218_163047/db-cluster-d3f0b0ff.tar.gz
loader-set - https://cloudius-jenkins-test.s3.amazonaws.com/d3f0b0ff-3989-41f4-9eaf-a0f510ea5895/20220218_163047/loader-set-d3f0b0ff.tar.gz
monitor-set - https://cloudius-jenkins-test.s3.amazonaws.com/d3f0b0ff-3989-41f4-9eaf-a0f510ea5895/20220218_163047/monitor-set-d3f0b0ff.tar.gz
sct - https://cloudius-jenkins-test.s3.amazonaws.com/d3f0b0ff-3989-41f4-9eaf-a0f510ea5895/20220218_163047/sct-runner-d3f0b0ff.tar.gz

Jenkins job URL

yarongilor commented 2 years ago

Reproduced using scylla-bench 0.1.8 (and scylla 5.0.0):

2022-08-02 12:51:31.875: (ScyllaBenchEvent Severity.CRITICAL) period_type=end event_id=f54a7bdb-6ce7-463d-ab2c-031d100d1ff5 duration=52m49s: node=Node longevity-counters-multidc-sb018-loader-node-3ef98c00-3 [34.245.137.69 | 10.4.1.81] (seed: False)
stress_cmd=scylla-bench -workload=uniform -mode=counter_update -replication-factor=3 -partition-count=1000 -clustering-row-count=10 -concurrency 1024 -duration 360m -validate-data -error-at-row-limit 1000
errors:
Stress command completed with bad status 2: 2022/08/02 12:44:43 EOF
2022/08/02 12:44:43 EOF
2022/08/02 12:44:43 EOF
2022/08/02 12:44:43 EOF
2022

Installation details

Kernel Version: 5.13.0-1031-aws
Scylla version (or git commit hash): 5.0.0-20220628.c9e79cb4a with build-id daef393310f38c086a6e5070c8f2013684973a48
Cluster size: 9 nodes (i3.2xlarge)

Scylla Nodes used in this run:

OS / Image: ami-050b81d683f061fbd ami-06d77742d3622e943 ami-0ae3b0658c86b30cd (aws: eu-west-1)

Test: longevity-counters-multidc-test-sb018
Test id: 3ef98c00-c2f5-4853-8c81-60ec0b7c5de8
Test name: scylla-staging/Longevity_yaron/longevity-counters-multidc-test-sb018
Test config file(s):

Issue description


Logs:

Jenkins job URL

roydahan commented 2 years ago

@yarongilor not enough information.

ilya-rarov commented 2 years ago

Got this for 2022.2.0~rc1

Installation details

Kernel Version: 5.15.0-1019-aws
Scylla version (or git commit hash): 2022.2.0~rc1-20220902.a9bc6d191031 with build-id 074a0cb9e6a5ab36ba5e7f81385e68079ab6eeda
Relocatable Package: http://downloads.scylladb.com/downloads/scylla-enterprise/relocatable/scylladb-2022.2/scylla-enterprise-x86_64-package-2022.2.0-rc1.0.20220902.a9bc6d191031.tar.gz
Cluster size: 9 nodes (i3.2xlarge)

Scylla Nodes used in this run:

OS / Image: ami-0492fd8e302e5af45 ami-0c027d54190721a1f ami-07f3ede3d0d2961d0 (aws: eu-west-1)

Test: longevity-counters-multidc-test
Test id: 033f0cad-0d98-46da-8205-37f179134fa8
Test name: enterprise-2022.2/longevity/longevity-counters-multidc-test
Test config file(s):

Issue description

At the beginning of disrupt_rebuild_streaming_err (19:01:02), node 4 (longevity-counters-multidc-2022-2-db-node-033f0cad-4) was stopped. At approximately the same moment, s-b started returning EOF.

< t:2022-09-13 19:01:02,217 f:remote_base.py  l:520  c:RemoteLibSSH2CmdRunner p:DEBUG > Running command "sudo systemctl stop scylla-server.service"...
< t:2022-09-13 19:01:02,239 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > 42m11.2s    7700    7700      0 235ms  227ms  214ms  204ms  199ms  143ms  144ms  
< t:2022-09-13 19:01:02,392 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > 41m22.5s    3219    3219      0 508ms  398ms  282ms  254ms  249ms  146ms  174ms  
< t:2022-09-13 19:01:02,503 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > 41m23s    3243    3243      0 411ms  392ms  284ms  254ms  248ms  146ms  173ms  
< t:2022-09-13 19:01:02,522 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > 41m26.8s    3273    3273      0 393ms  388ms  268ms  253ms  248ms  145ms  171ms  
< t:2022-09-13 19:01:02,751 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > 42m10s    7672    7672      0 249ms  227ms  215ms  205ms  200ms  143ms  144ms  
< t:2022-09-13 19:01:02,893 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > 42m8.3s    7717    7717      0 240ms  226ms  214ms  204ms  199ms  142ms  143ms  
< t:2022-09-13 19:01:03,003 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > 2022/09/13 19:01:02 EOF
< t:2022-09-13 19:01:03,022 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > 2022/09/13 19:01:02 EOF
< t:2022-09-13 19:01:03,051 f:db_log_reader.py l:113  c:sdcm.db_log_reader   p:DEBUG > 2022-09-13T19:01:02+00:00 longevity-counters-multidc-2022-2-db-node-033f0cad-7   !NOTICE | sudo[11747]: scyllaadm : TTY=unknown ; PWD=/home/scyllaadm ; USER=root ; COMMAND=/usr/bin/coredumpctl --no-pager --no-legend
< t:2022-09-13 19:01:03,051 f:db_log_reader.py l:113  c:sdcm.db_log_reader   p:DEBUG > 2022-09-13T19:01:02+00:00 longevity-counters-multidc-2022-2-db-node-033f0cad-7     !INFO | sudo[11747]: pam_unix(sudo:session): session opened for user root by (uid=0)
< t:2022-09-13 19:01:03,052 f:db_log_reader.py l:113  c:sdcm.db_log_reader   p:DEBUG > 2022-09-13T19:01:02+00:00 longevity-counters-multidc-2022-2-db-node-033f0cad-7     !INFO | sudo[11747]: pam_unix(sudo:session): session closed for user root
< t:2022-09-13 19:01:03,052 f:db_log_reader.py l:113  c:sdcm.db_log_reader   p:DEBUG > 2022-09-13T19:01:02+00:00 longevity-counters-multidc-2022-2-db-node-033f0cad-7  !WARNING | scylla[8204]:  [shard 6] storage_proxy - Exception when communicating with 10.15.2.105, to read from scylla_bench.test_counters: abort requested
< t:2022-09-13 19:01:03,052 f:db_log_reader.py l:113  c:sdcm.db_log_reader   p:DEBUG > 2022-09-13T19:01:03+00:00 longevity-counters-multidc-2022-2-db-node-033f0cad-4   !NOTICE | sudo[11600]: scyllaadm : TTY=unknown ; PWD=/home/scyllaadm ; USER=root ; COMMAND=/usr/bin/systemctl stop scylla-server.service
< t:2022-09-13 19:01:03,052 f:db_log_reader.py l:113  c:sdcm.db_log_reader   p:DEBUG > 2022-09-13T19:01:03+00:00 longevity-counters-multidc-2022-2-db-node-033f0cad-4     !INFO | sudo[11600]: pam_unix(sudo:session): session opened for user root by (uid=0)
< t:2022-09-13 19:01:03,052 f:db_log_reader.py l:113  c:sdcm.db_log_reader   p:DEBUG > 2022-09-13T19:01:03+00:00 longevity-counters-multidc-2022-2-db-node-033f0cad-4     !INFO | systemd[1]: scylla-housekeeping-daily.timer: Succeeded.
< t:2022-09-13 19:01:03,052 f:db_log_reader.py l:113  c:sdcm.db_log_reader   p:DEBUG > 2022-09-13T19:01:02+00:00 longevity-counters-multidc-2022-2-db-node-033f0cad-7  !WARNING | scylla[8204]:  [shard 6] storage_proxy - Exception when communicating with 10.15.2.105, to read from scylla_bench.test_counters: abort requested
< t:2022-09-13 19:01:03,052 f:db_log_reader.py l:113  c:sdcm.db_log_reader   p:DEBUG > 2022-09-13T19:01:03+00:00 longevity-counters-multidc-2022-2-db-node-033f0cad-4     !INFO | systemd[1]: Stopped Run Scylla Housekeeping daily mode.
< t:2022-09-13 19:01:03,053 f:db_log_reader.py l:113  c:sdcm.db_log_reader   p:DEBUG > 2022-09-13T19:01:03+00:00 longevity-counters-multidc-2022-2-db-node-033f0cad-4     !INFO | systemd[1]: scylla-housekeeping-restart.timer: Succeeded.
< t:2022-09-13 19:01:03,053 f:events_processes.py l:146  c:sdcm.sct_events.events_processes p:DEBUG > Get process `MainDevice' from EventsProcessesRegistry[lod_dir=/home/ubuntu/sct-results/20220913-171133-816342,id=0x7fcedb286f80,default=True]
< t:2022-09-13 19:01:03,053 f:db_log_reader.py l:113  c:sdcm.db_log_reader   p:DEBUG > 2022-09-13T19:01:03+00:00 longevity-counters-multidc-2022-2-db-node-033f0cad-4     !INFO | systemd[1]: Stopped Run Scylla Housekeeping restart mode.
< t:2022-09-13 19:01:03,053 f:db_log_reader.py l:113  c:sdcm.db_log_reader   p:DEBUG > 2022-09-13T19:01:03+00:00 longevity-counters-multidc-2022-2-db-node-033f0cad-4     !INFO | systemd[1]: Stopping Scylla JMX...
< t:2022-09-13 19:01:03,053 f:events_processes.py l:146  c:sdcm.sct_events.events_processes p:DEBUG > Get process `MainDevice' from EventsProcessesRegistry[lod_dir=/home/ubuntu/sct-results/20220913-171133-816342,id=0x7fcedb286f80,default=True]
< t:2022-09-13 19:01:03,054 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:DEBUG > 2022-09-13 19:01:03.052 <2022-09-13 19:01:02.000>: (DatabaseLogEvent Severity.WARNING) period_type=one-time event_id=39024300-0f1c-4a81-8f31-9038f98efd49: type=WARNING regex=(^WARNING|!\s*?WARNING).*\[shard.*\] line_number=7925 node=longevity-counters-multidc-2022-2-db-node-033f0cad-7
< t:2022-09-13 19:01:03,054 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:DEBUG > 2022-09-13T19:01:02+00:00 longevity-counters-multidc-2022-2-db-node-033f0cad-7  !WARNING | scylla[8204]:  [shard 6] storage_proxy - Exception when communicating with 10.15.2.105, to read from scylla_bench.test_counters: abort requested
< t:2022-09-13 19:01:03,056 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:DEBUG > 2022-09-13 19:01:03.053 <2022-09-13 19:01:02.000>: (DatabaseLogEvent Severity.WARNING) period_type=one-time event_id=39024300-0f1c-4a81-8f31-9038f98efd49: type=WARNING regex=(^WARNING|!\s*?WARNING).*\[shard.*\] line_number=7926 node=longevity-counters-multidc-2022-2-db-node-033f0cad-7
< t:2022-09-13 19:01:03,056 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:DEBUG > 2022-09-13T19:01:02+00:00 longevity-counters-multidc-2022-2-db-node-033f0cad-7  !WARNING | scylla[8204]:  [shard 6] storage_proxy - Exception when communicating with 10.15.2.105, to read from scylla_bench.test_counters: abort requested
< t:2022-09-13 19:01:03,153 f:db_log_reader.py l:113  c:sdcm.db_log_reader   p:DEBUG > 2022-09-13T19:01:03+00:00 longevity-counters-multidc-2022-2-db-node-033f0cad-4   !NOTICE | systemd[1]: scylla-jmx.service: Main process exited, code=exited, status=143/n/a
< t:2022-09-13 19:01:03,154 f:db_log_reader.py l:113  c:sdcm.db_log_reader   p:DEBUG > 2022-09-13T19:01:03+00:00 longevity-counters-multidc-2022-2-db-node-033f0cad-4  !WARNING | systemd[1]: scylla-jmx.service: Failed with result 'exit-code'.
< t:2022-09-13 19:01:03,154 f:db_log_reader.py l:113  c:sdcm.db_log_reader   p:DEBUG > 2022-09-13T19:01:03+00:00 longevity-counters-multidc-2022-2-db-node-033f0cad-4     !INFO | systemd[1]: Stopped Scylla JMX.
< t:2022-09-13 19:01:03,154 f:events_processes.py l:146  c:sdcm.sct_events.events_processes p:DEBUG > Get process `MainDevice' from EventsProcessesRegistry[lod_dir=/home/ubuntu/sct-results/20220913-171133-816342,id=0x7fcedb286f80,default=True]
< t:2022-09-13 19:01:03,154 f:db_log_reader.py l:113  c:sdcm.db_log_reader   p:DEBUG > 2022-09-13T19:01:03+00:00 longevity-counters-multidc-2022-2-db-node-033f0cad-4     !INFO | scylla[7925]:  [shard 0] compaction_manager - Asked to stop
< t:2022-09-13 19:01:03,154 f:db_log_reader.py l:113  c:sdcm.db_log_reader   p:DEBUG > 2022-09-13T19:01:03+00:00 longevity-counters-multidc-2022-2-db-node-033f0cad-4     !INFO | systemd[1]: Stopping Scylla Server...

At ~19:01-19:03 the logs of every loader node show the same picture:

42m4.6s    7680    7680      0 238ms  231ms  215ms  205ms  200ms  143ms  144ms  
42m5.6s    7703    7703      0 246ms  228ms  215ms  205ms  200ms  143ms  144ms  
42m6.7s    7728    7728      0 234ms  226ms  215ms  204ms  200ms  142ms  143ms  
42m7.8s    7679    7679      0 237ms  226ms  215ms  204ms  200ms  143ms  144ms  
42m8.9s    7702    7702      0 238ms  228ms  214ms  204ms  200ms  143ms  144ms  
42m10s    7672    7672      0 249ms  227ms  215ms  205ms  200ms  143ms  144ms  
2022/09/13 19:01:02 EOF
2022/09/13 19:01:02 EOF
42m11.2s    7704    7704      0 236ms  225ms  214ms  204ms  200ms  143ms  144ms  
2022/09/13 19:01:02 EOF
2022/09/13 19:01:02 EOF
2022/09/13 19:01:02 EOF
42m12.2s    7693    7693      0 237ms  226ms  214ms  204ms  200ms  143ms  144ms  
2022/09/13 19:01:02 EOF
2022/09/13 19:01:02 EOF
2022/09/13 19:01:02 EOF
42m13.3s    7674    7674      0 240ms  228ms  213ms  204ms  200ms  143ms  144ms  

From the Grafana screenshots it looks like the load actually stopped completely. In events.log:

2022-09-13 20:25:28.564: (ScyllaBenchEvent Severity.CRITICAL) period_type=end event_id=76b94711-9e0b-4a0e-8ba6-3e840ec6a160 duration=2h6m43s: node=Node longevity-counters-multidc-2022-2-loader-node-033f0cad-3 [54.246.230.9 | 10.4.3.2] (seed: False)
stress_cmd=scylla-bench -workload=uniform -mode=counter_update -replication-factor=3 -partition-count=1000 -clustering-row-count=10 -concurrency 1024 -duration 360m -validate-data -error-at-row-limit 1000
errors:
Stress command completed with bad status 137: 2022/09/13 19:01:02 EOF
2022/09/13 19:01:02 EOF
2022/09/13 19:01:02 EOF
2022/09/13 19:01:02 EOF
2022


Logs:

Jenkins job URL

fruch commented 2 years ago

We'll have to build the old version v0.1.3 for SCT (https://github.com/scylladb/scylla-cluster-tests/commit/214326a9d47e7b751489638d5dfaa781f68a7700)

and try to pin down what's broken in this one.

fruch commented 2 years ago

It seems that in the counter_update command we are not setting timeouts. Besides the EOFs, we see timeouts all the way from a certain point:

scylla-bench -workload=uniform -mode=counter_update -replication-factor=3 -partition-count=1000 -clustering-row-count=10 -concurrency 1024 -duration 360m -validate-data -error-at-row-limit 1000
Timeout:         5s
Retries:        
  number:        0
  min interval:      1s
  max interval:      1s
roydahan commented 2 years ago

@fruch I don't understand your comment about counter_update.

soyacz commented 2 years ago

We got a reproduction with s-b 0.1.3 executed from Docker. Also, I see from the commit history that on 14 January 2022 we bumped the Go version. Maybe this is related to the Go version rather than to a specific s-b version?

Installation details

Kernel Version: 5.15.0-1019-aws
Scylla version (or git commit hash): 2022.2.0~rc2-20220919.75d087a2b75a with build-id 463f1a57b82041a6c6b6441f0cbc26c8ad93091e
Relocatable Package: http://downloads.scylladb.com/downloads/scylla-enterprise/relocatable/scylladb-2022.2/scylla-enterprise-x86_64-package-2022.2.0-rc2.0.20220919.75d087a2b75a.tar.gz
Cluster size: 5 nodes (i3en.3xlarge)

Scylla Nodes used in this run:

OS / Image: ami-00bd31f22bcf5ae1a (aws: eu-west-1)

Test: longevity-large-partition-200k-pks-4days-test
Test id: c729edd1-2896-49d0-9520-057d3ed9df22
Test name: enterprise-2022.2/longevity/longevity-large-partition-200k-pks-4days-test
Test config file(s):

Issue description

Logs:

Jenkins job URL

soyacz commented 2 years ago

@fruch shall we try building scylla-bench image with base go image version 1.16 and see if it reproduces?

fruch commented 2 years ago

> @fruch shall we try building scylla-bench image with base go image version 1.16 and see if it reproduces?

If anyone is investing time in reproducing this one, I think we should try the latest s-b version and figure out what the issues are with it, rather than going back to older versions of s-b and Go.