yarongilor opened 2 years ago
Issue again reproduced with job:
Installation details
Kernel version: 5.11.0-1028-aws
Scylla version (or git commit hash): 5.1.dev-0.20220209.5099b1e27 with build-id b0986550af32b8da96b50442a53423047ed91696
Cluster size: 4 nodes (i3en.2xlarge)
Scylla running with shards number (live nodes):
longevity-twcs-48h-master-db-node-1a3d093d-1 (34.248.68.217 | 10.0.2.75): 8 shards
longevity-twcs-48h-master-db-node-1a3d093d-4 (34.247.155.5 | 10.0.2.44): 8 shards
longevity-twcs-48h-master-db-node-1a3d093d-5 (3.250.79.246 | 10.0.3.237): 8 shards
longevity-twcs-48h-master-db-node-1a3d093d-6 (34.244.37.56 | 10.0.0.57): 8 shards
Scylla running with shards number (terminated nodes):
longevity-twcs-48h-master-db-node-1a3d093d-2 (54.229.206.88 | 10.0.3.42): 8 shards
longevity-twcs-48h-master-db-node-1a3d093d-3 (34.253.105.243 | 10.0.2.98): 8 shards
OS (RHEL/CentOS/Ubuntu/AWS AMI): ami-0f894d9df1a4e76fc
(aws: eu-west-1)
Test: longevity-twcs-48h-test
Test name: longevity_twcs_test.TWCSLongevityTest.test_custom_time
Test config file(s):
Issue description
====================================
2022-02-11 10:30:08.885: (CoreDumpEvent Severity.ERROR) period_type=one-time event_id=c9442f32-bde5-489c-991b-0a0de71e893d node=Node longevity-twcs-48h-master-loader-node-1a3d093d-3 [54.170.161.43 | 10.0.2.226] (seed: False)
2022-02-11 10:42:56.479: (ScyllaBenchEvent Severity.CRITICAL) period_type=end event_id=b1610c44-c0de-44cf-ae95-7915a0ffbf67 duration=6h10m36s: node=Node longevity-twcs-48h-master-loader-node-1a3d093d-1 [54.217.57.220 | 10.0.3.227] (seed: False) stress_cmd=scylla-bench -workload=timeseries -mode=write -replication-factor=3 -partition-count=400 -clustering-row-count=10000000 -clustering-row-size=200 -concurrency=100 -rows-per-request=100 -start-timestamp=SET_WRITE_TIMESTAMP -connection-count 100 -max-rate 50000 --timeout 120s -duration=2880m -error-at-row-limit 1000 errors:
Stress command completed with bad status -1: 2022/02/11 04:37:47 EOF 2022/02/11 04:37:47 EOF 2022/02/11 04:37:47 EOF 2022/02/11 04:37:47 EOF 2022
====================================
Restore Monitor Stack command: $ hydra investigate show-monitor 1a3d093d-697d-40cc-8909-8949d48797b8
Restore monitor on AWS instance using Jenkins job
Show all stored logs command: $ hydra investigate show-logs 1a3d093d-697d-40cc-8909-8949d48797b8
Test id: 1a3d093d-697d-40cc-8909-8949d48797b8
Logs: grafana - [https://cloudius-jenkins-test.s3.amazonaws.com/1a3d093d-697d-40cc-8909-8949d48797b8/20220211_105743/grafana-screenshot-longevity-twcs-48h-test-scylla-per-server-metrics-nemesis-20220211_105743-longevity-twcs-48h-master-monitor-node-1a3d093d-1.png](https://cloudius-jenkins-test.s3.amazonaws.com/1a3d093d-697d-40cc-8909-8949d48797b8/20220211_105743/grafana-screenshot-longevity-twcs-48h-test-scylla-per-server-metrics-nemesis-20220211_105743-longevity-twcs-48h-master-monitor-node-1a3d093d-1.png) db-cluster - [https://cloudius-jenkins-test.s3.amazonaws.com/1a3d093d-697d-40cc-8909-8949d48797b8/20220211_110702/db-cluster-1a3d093d.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/1a3d093d-697d-40cc-8909-8949d48797b8/20220211_110702/db-cluster-1a3d093d.tar.gz) loader-set - [https://cloudius-jenkins-test.s3.amazonaws.com/1a3d093d-697d-40cc-8909-8949d48797b8/20220211_110702/loader-set-1a3d093d.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/1a3d093d-697d-40cc-8909-8949d48797b8/20220211_110702/loader-set-1a3d093d.tar.gz) monitor-set - [https://cloudius-jenkins-test.s3.amazonaws.com/1a3d093d-697d-40cc-8909-8949d48797b8/20220211_110702/monitor-set-1a3d093d.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/1a3d093d-697d-40cc-8909-8949d48797b8/20220211_110702/monitor-set-1a3d093d.tar.gz) sct - [https://cloudius-jenkins-test.s3.amazonaws.com/1a3d093d-697d-40cc-8909-8949d48797b8/20220211_110702/sct-runner-1a3d093d.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/1a3d093d-697d-40cc-8909-8949d48797b8/20220211_110702/sct-runner-1a3d093d.tar.gz)
Another reproduced instance:
Installation details
Kernel version: 5.11.0-1028-aws
Scylla version (or git commit hash): 5.1.dev-0.20220217.69fcc053b with build-id b8415b1ebbffff2b4183734680f4afab3bfed86d
Cluster size: 4 nodes (i3en.2xlarge)
Scylla running with shards number (live nodes):
longevity-twcs-48h-master-db-node-d3f0b0ff-1 (3.250.216.253 | 10.0.3.124): 8 shards
longevity-twcs-48h-master-db-node-d3f0b0ff-2 (34.242.112.161 | 10.0.2.63): 8 shards
longevity-twcs-48h-master-db-node-d3f0b0ff-4 (52.18.71.120 | 10.0.0.189): 8 shards
longevity-twcs-48h-master-db-node-d3f0b0ff-5 (54.194.249.252 | 10.0.0.178): 8 shards
Scylla running with shards number (terminated nodes):
longevity-twcs-48h-master-db-node-d3f0b0ff-3 (54.154.245.131 | 10.0.2.210): 8 shards
OS (RHEL/CentOS/Ubuntu/AWS AMI): ami-041d8500e7cf30167
(aws: eu-west-1)
Test: longevity-twcs-48h-test
Test name: longevity_twcs_test.TWCSLongevityTest.test_custom_time
Test config file(s):
Issue description
====================================
2022-02-18 16:16:29.292: (ScyllaBenchEvent Severity.CRITICAL) period_type=end event_id=506e1e7b-88aa-4cab-a6a4-45168f5bb513 duration=11h31m49s: node=Node longevity-twcs-48h-master-loader-node-d3f0b0ff-2 [18.202.227.20 | 10.0.1.29] (seed: False)
stress_cmd=scylla-bench -workload=timeseries -mode=read -partition-count=20000 -concurrency=100 -replication-factor=3 -clustering-row-count=10000000 -clustering-row-size=200 -rows-per-request=100 -start-timestamp=GET_WRITE_TIMESTAMP -write-rate 125 -distribution hnormal --connection-count 100 -duration=2880m -error-at-row-limit 1000
errors:
Stress command completed with bad status 1: 2022/02/18 04:44:50 EOF
2022/02/18 04:44:50 EOF
2022/02/18 04:44:50 EOF
2022/02/18 04:44:50 EOF
2022
====================================
Restore Monitor Stack command: $ hydra investigate show-monitor d3f0b0ff-3989-41f4-9eaf-a0f510ea5895
Restore monitor on AWS instance using Jenkins job
Show all stored logs command: $ hydra investigate show-logs d3f0b0ff-3989-41f4-9eaf-a0f510ea5895
Test id: d3f0b0ff-3989-41f4-9eaf-a0f510ea5895
Logs: grafana - [https://cloudius-jenkins-test.s3.amazonaws.com/d3f0b0ff-3989-41f4-9eaf-a0f510ea5895/20220218_162212/grafana-screenshot-longevity-twcs-48h-test-scylla-per-server-metrics-nemesis-20220218_162212-longevity-twcs-48h-master-monitor-node-d3f0b0ff-1.png](https://cloudius-jenkins-test.s3.amazonaws.com/d3f0b0ff-3989-41f4-9eaf-a0f510ea5895/20220218_162212/grafana-screenshot-longevity-twcs-48h-test-scylla-per-server-metrics-nemesis-20220218_162212-longevity-twcs-48h-master-monitor-node-d3f0b0ff-1.png) db-cluster - [https://cloudius-jenkins-test.s3.amazonaws.com/d3f0b0ff-3989-41f4-9eaf-a0f510ea5895/20220218_163047/db-cluster-d3f0b0ff.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/d3f0b0ff-3989-41f4-9eaf-a0f510ea5895/20220218_163047/db-cluster-d3f0b0ff.tar.gz) loader-set - [https://cloudius-jenkins-test.s3.amazonaws.com/d3f0b0ff-3989-41f4-9eaf-a0f510ea5895/20220218_163047/loader-set-d3f0b0ff.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/d3f0b0ff-3989-41f4-9eaf-a0f510ea5895/20220218_163047/loader-set-d3f0b0ff.tar.gz) monitor-set - [https://cloudius-jenkins-test.s3.amazonaws.com/d3f0b0ff-3989-41f4-9eaf-a0f510ea5895/20220218_163047/monitor-set-d3f0b0ff.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/d3f0b0ff-3989-41f4-9eaf-a0f510ea5895/20220218_163047/monitor-set-d3f0b0ff.tar.gz) sct - [https://cloudius-jenkins-test.s3.amazonaws.com/d3f0b0ff-3989-41f4-9eaf-a0f510ea5895/20220218_163047/sct-runner-d3f0b0ff.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/d3f0b0ff-3989-41f4-9eaf-a0f510ea5895/20220218_163047/sct-runner-d3f0b0ff.tar.gz)
The reported issue is wrong: Scylla dropped a core, and scylla-bench quit after reaching its error limit:
< t:2022-02-04 21:00:23,310 f:base.py l:146 c:RemoteCmdRunner p:ERROR > Error executing command: "/$HOME/go/bin/scylla-bench -workload=uniform -mode=read -replication-factor=3 -partition-count=60 -clustering-row-count=10000000 -clustering-row-size=2048 -rows-per-request=2000 -timeout=180s -concurrency=700 -max-rate=64000 -duration=5760m -connection-count 500 -error-at-row-limit 1000 -nodes 10.0.3.180"; Exit status: 1
< t:2022-02-04 21:00:23,321 f:base.py l:148 c:RemoteCmdRunner p:DEBUG > STDOUT: : 9m0.092137471s
< t:2022-02-04 21:00:23,321 f:base.py l:148 c:RemoteCmdRunner p:DEBUG > 95th: 9m0.092137471s
< t:2022-02-04 21:00:23,321 f:base.py l:148 c:RemoteCmdRunner p:DEBUG > 90th: 9m0.092137471s
< t:2022-02-04 21:00:23,321 f:base.py l:148 c:RemoteCmdRunner p:DEBUG > median: 9m0.092137471s
< t:2022-02-04 21:00:23,321 f:base.py l:148 c:RemoteCmdRunner p:DEBUG > mean: 8m51.248701013s
< t:2022-02-04 21:00:23,321 f:base.py l:148 c:RemoteCmdRunner p:DEBUG >
< t:2022-02-04 21:00:23,321 f:base.py l:148 c:RemoteCmdRunner p:DEBUG > Following critical errors where caught during the run:
< t:2022-02-04 21:00:23,321 f:base.py l:148 c:RemoteCmdRunner p:DEBUG > Error limit (maxErrorsAtRow) of 1000 errors is reached
The reported core is not related to scylla-bench.
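For context, scylla-bench was launched with `-error-at-row-limit 1000`, and the run aborted once that many errors accumulated ("Error limit (maxErrorsAtRow) of 1000 errors is reached"). A minimal sketch of that abort semantics (a hypothetical helper for illustration, not scylla-bench's actual implementation):

```python
def run_with_error_limit(results, limit):
    """Process per-row results; abort once `limit` errors accumulate.

    `results` is an iterable of booleans (True = row succeeded).
    Returns True if the run completed, False if it hit the error
    limit and aborted, like scylla-bench's -error-at-row-limit.
    (Hypothetical sketch; the real tool counts errors per row/retry.)
    """
    errors = 0
    for ok in results:
        if not ok:
            errors += 1
            if errors >= limit:
                return False  # "Error limit ... is reached" -> nonzero exit
    return True
```

With this semantics, a long run with fewer than the limit of errors still completes, which is why the run above only quit after Scylla itself went down and errors piled up.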
Installation details
Kernel version: 5.11.0-1022-aws
Scylla version (or git commit hash): 4.6.rc5-0.20220203.5694ec189 with build-id f5d85bf5abe6d2f9fd3487e2469ce1c34304cc14
Cluster size: 4 nodes (i3en.3xlarge)
Scylla running with shards number (live nodes):
longevity-large-partitions-4d-4-6-db-node-e2adc2e9-1 (16.170.220.3 | 10.0.3.180): 12 shards
longevity-large-partitions-4d-4-6-db-node-e2adc2e9-2 (13.48.106.98 | 10.0.1.75): 12 shards
longevity-large-partitions-4d-4-6-db-node-e2adc2e9-4 (13.51.193.35 | 10.0.3.6): 12 shards
longevity-large-partitions-4d-4-6-db-node-e2adc2e9-5 (16.171.64.136 | 10.0.0.210): 12 shards
Scylla running with shards number (terminated nodes):
longevity-large-partitions-4d-4-6-db-node-e2adc2e9-3 (16.170.157.129 | 10.0.3.67): 12 shards
OS (RHEL/CentOS/Ubuntu/AWS AMI): ami-099a011bd5f16a168
(aws: eu-north-1)
Test: longevity-large-partition-4days-test
Test name: longevity_large_partition_test.LargePartitionLongevityTest.test_large_partition_longevity
Test config file(s):
Issue description
====================================
Two loader nodes running scylla-bench v0.1.8 got 3 core dumps:
It looks like SCT encountered a problem uploading the coredump to s3:
====================================
Restore Monitor Stack command: $ hydra investigate show-monitor e2adc2e9-28de-4aab-8dd3-5420deabc259
Restore monitor on AWS instance using Jenkins job
Show all stored logs command: $ hydra investigate show-logs e2adc2e9-28de-4aab-8dd3-5420deabc259
Test id: e2adc2e9-28de-4aab-8dd3-5420deabc259
Logs: grafana - [https://cloudius-jenkins-test.s3.amazonaws.com/e2adc2e9-28de-4aab-8dd3-5420deabc259/20220204_211619/grafana-screenshot-longevity-large-partition-4days-test-scylla-per-server-metrics-nemesis-20220204_211840-longevity-large-partitions-4d-4-6-monitor-node-e2adc2e9-1.png](https://cloudius-jenkins-test.s3.amazonaws.com/e2adc2e9-28de-4aab-8dd3-5420deabc259/20220204_211619/grafana-screenshot-longevity-large-partition-4days-test-scylla-per-server-metrics-nemesis-20220204_211840-longevity-large-partitions-4d-4-6-monitor-node-e2adc2e9-1.png) grafana - [https://cloudius-jenkins-test.s3.amazonaws.com/e2adc2e9-28de-4aab-8dd3-5420deabc259/20220204_211619/grafana-screenshot-overview-20220204_211619-longevity-large-partitions-4d-4-6-monitor-node-e2adc2e9-1.png](https://cloudius-jenkins-test.s3.amazonaws.com/e2adc2e9-28de-4aab-8dd3-5420deabc259/20220204_211619/grafana-screenshot-overview-20220204_211619-longevity-large-partitions-4d-4-6-monitor-node-e2adc2e9-1.png) critical - [https://cloudius-jenkins-test.s3.amazonaws.com/e2adc2e9-28de-4aab-8dd3-5420deabc259/20220204_213006/critical-e2adc2e9.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/e2adc2e9-28de-4aab-8dd3-5420deabc259/20220204_213006/critical-e2adc2e9.log.tar.gz) db-cluster - [https://cloudius-jenkins-test.s3.amazonaws.com/e2adc2e9-28de-4aab-8dd3-5420deabc259/20220204_213006/db-cluster-e2adc2e9.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/e2adc2e9-28de-4aab-8dd3-5420deabc259/20220204_213006/db-cluster-e2adc2e9.tar.gz) debug - [https://cloudius-jenkins-test.s3.amazonaws.com/e2adc2e9-28de-4aab-8dd3-5420deabc259/20220204_213006/debug-e2adc2e9.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/e2adc2e9-28de-4aab-8dd3-5420deabc259/20220204_213006/debug-e2adc2e9.log.tar.gz) email_data - 
[https://cloudius-jenkins-test.s3.amazonaws.com/e2adc2e9-28de-4aab-8dd3-5420deabc259/20220204_213006/email_data-e2adc2e9.json.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/e2adc2e9-28de-4aab-8dd3-5420deabc259/20220204_213006/email_data-e2adc2e9.json.tar.gz) error - [https://cloudius-jenkins-test.s3.amazonaws.com/e2adc2e9-28de-4aab-8dd3-5420deabc259/20220204_213006/error-e2adc2e9.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/e2adc2e9-28de-4aab-8dd3-5420deabc259/20220204_213006/error-e2adc2e9.log.tar.gz) event - [https://cloudius-jenkins-test.s3.amazonaws.com/e2adc2e9-28de-4aab-8dd3-5420deabc259/20220204_213006/events-e2adc2e9.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/e2adc2e9-28de-4aab-8dd3-5420deabc259/20220204_213006/events-e2adc2e9.log.tar.gz) left_processes - [https://cloudius-jenkins-test.s3.amazonaws.com/e2adc2e9-28de-4aab-8dd3-5420deabc259/20220204_213006/left_processes-e2adc2e9.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/e2adc2e9-28de-4aab-8dd3-5420deabc259/20220204_213006/left_processes-e2adc2e9.log.tar.gz) loader-set - [https://cloudius-jenkins-test.s3.amazonaws.com/e2adc2e9-28de-4aab-8dd3-5420deabc259/20220204_213006/loader-set-e2adc2e9.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/e2adc2e9-28de-4aab-8dd3-5420deabc259/20220204_213006/loader-set-e2adc2e9.tar.gz) monitor-set - [https://cloudius-jenkins-test.s3.amazonaws.com/e2adc2e9-28de-4aab-8dd3-5420deabc259/20220204_213006/monitor-set-e2adc2e9.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/e2adc2e9-28de-4aab-8dd3-5420deabc259/20220204_213006/monitor-set-e2adc2e9.tar.gz) normal - [https://cloudius-jenkins-test.s3.amazonaws.com/e2adc2e9-28de-4aab-8dd3-5420deabc259/20220204_213006/normal-e2adc2e9.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/e2adc2e9-28de-4aab-8dd3-5420deabc259/20220204_213006/normal-e2adc2e9.log.tar.gz) output - 
[https://cloudius-jenkins-test.s3.amazonaws.com/e2adc2e9-28de-4aab-8dd3-5420deabc259/20220204_213006/output-e2adc2e9.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/e2adc2e9-28de-4aab-8dd3-5420deabc259/20220204_213006/output-e2adc2e9.log.tar.gz) event - [https://cloudius-jenkins-test.s3.amazonaws.com/e2adc2e9-28de-4aab-8dd3-5420deabc259/20220204_213006/raw_events-e2adc2e9.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/e2adc2e9-28de-4aab-8dd3-5420deabc259/20220204_213006/raw_events-e2adc2e9.log.tar.gz) sct - [https://cloudius-jenkins-test.s3.amazonaws.com/e2adc2e9-28de-4aab-8dd3-5420deabc259/20220204_213006/sct-e2adc2e9.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/e2adc2e9-28de-4aab-8dd3-5420deabc259/20220204_213006/sct-e2adc2e9.log.tar.gz) summary - [https://cloudius-jenkins-test.s3.amazonaws.com/e2adc2e9-28de-4aab-8dd3-5420deabc259/20220204_213006/summary-e2adc2e9.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/e2adc2e9-28de-4aab-8dd3-5420deabc259/20220204_213006/summary-e2adc2e9.log.tar.gz) warning - [https://cloudius-jenkins-test.s3.amazonaws.com/e2adc2e9-28de-4aab-8dd3-5420deabc259/20220204_213006/warning-e2adc2e9.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/e2adc2e9-28de-4aab-8dd3-5420deabc259/20220204_213006/warning-e2adc2e9.log.tar.gz)
Jenkins job URL