yarongilor opened 2 years ago
Issue again reproduced with job:
Installation details
Kernel version: 5.11.0-1028-aws
Scylla version (or git commit hash): 5.1.dev-0.20220209.5099b1e27 with build-id b0986550af32b8da96b50442a53423047ed91696
Cluster size: 4 nodes (i3en.2xlarge)
Scylla running with shards number (live nodes):
longevity-twcs-48h-master-db-node-1a3d093d-1 (34.248.68.217 | 10.0.2.75): 8 shards
longevity-twcs-48h-master-db-node-1a3d093d-4 (34.247.155.5 | 10.0.2.44): 8 shards
longevity-twcs-48h-master-db-node-1a3d093d-5 (3.250.79.246 | 10.0.3.237): 8 shards
longevity-twcs-48h-master-db-node-1a3d093d-6 (34.244.37.56 | 10.0.0.57): 8 shards
Scylla running with shards number (terminated nodes):
longevity-twcs-48h-master-db-node-1a3d093d-2 (54.229.206.88 | 10.0.3.42): 8 shards
longevity-twcs-48h-master-db-node-1a3d093d-3 (34.253.105.243 | 10.0.2.98): 8 shards
OS (RHEL/CentOS/Ubuntu/AWS AMI): ami-0f894d9df1a4e76fc
(aws: eu-west-1)
Test: longevity-twcs-48h-test
Test name: longevity_twcs_test.TWCSLongevityTest.test_custom_time
Test config file(s):
Issue description
====================================
2022-02-11 10:30:08.885: (CoreDumpEvent Severity.ERROR) period_type=one-time event_id=c9442f32-bde5-489c-991b-0a0de71e893d node=Node longevity-twcs-48h-master-loader-node-1a3d093d-3 [54.170.161.43 | 10.0.2.226] (seed: False)
2022-02-11 10:42:56.479: (ScyllaBenchEvent Severity.CRITICAL) period_type=end event_id=b1610c44-c0de-44cf-ae95-7915a0ffbf67 duration=6h10m36s: node=Node longevity-twcs-48h-master-loader-node-1a3d093d-1 [54.217.57.220 | 10.0.3.227] (seed: False) stress_cmd=scylla-bench -workload=timeseries -mode=write -replication-factor=3 -partition-count=400 -clustering-row-count=10000000 -clustering-row-size=200 -concurrency=100 -rows-per-request=100 -start-timestamp=SET_WRITE_TIMESTAMP -connection-count 100 -max-rate 50000 --timeout 120s -duration=2880m -error-at-row-limit 1000 errors:
Stress command completed with bad status -1: 2022/02/11 04:37:47 EOF 2022/02/11 04:37:47 EOF 2022/02/11 04:37:47 EOF 2022/02/11 04:37:47 EOF 2022
====================================
Restore Monitor Stack command: $ hydra investigate show-monitor 1a3d093d-697d-40cc-8909-8949d48797b8
Restore monitor on AWS instance using Jenkins job
Show all stored logs command: $ hydra investigate show-logs 1a3d093d-697d-40cc-8909-8949d48797b8
Test id: 1a3d093d-697d-40cc-8909-8949d48797b8
Logs: grafana - [https://cloudius-jenkins-test.s3.amazonaws.com/1a3d093d-697d-40cc-8909-8949d48797b8/20220211_105743/grafana-screenshot-longevity-twcs-48h-test-scylla-per-server-metrics-nemesis-20220211_105743-longevity-twcs-48h-master-monitor-node-1a3d093d-1.png](https://cloudius-jenkins-test.s3.amazonaws.com/1a3d093d-697d-40cc-8909-8949d48797b8/20220211_105743/grafana-screenshot-longevity-twcs-48h-test-scylla-per-server-metrics-nemesis-20220211_105743-longevity-twcs-48h-master-monitor-node-1a3d093d-1.png) db-cluster - [https://cloudius-jenkins-test.s3.amazonaws.com/1a3d093d-697d-40cc-8909-8949d48797b8/20220211_110702/db-cluster-1a3d093d.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/1a3d093d-697d-40cc-8909-8949d48797b8/20220211_110702/db-cluster-1a3d093d.tar.gz) loader-set - [https://cloudius-jenkins-test.s3.amazonaws.com/1a3d093d-697d-40cc-8909-8949d48797b8/20220211_110702/loader-set-1a3d093d.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/1a3d093d-697d-40cc-8909-8949d48797b8/20220211_110702/loader-set-1a3d093d.tar.gz) monitor-set - [https://cloudius-jenkins-test.s3.amazonaws.com/1a3d093d-697d-40cc-8909-8949d48797b8/20220211_110702/monitor-set-1a3d093d.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/1a3d093d-697d-40cc-8909-8949d48797b8/20220211_110702/monitor-set-1a3d093d.tar.gz) sct - [https://cloudius-jenkins-test.s3.amazonaws.com/1a3d093d-697d-40cc-8909-8949d48797b8/20220211_110702/sct-runner-1a3d093d.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/1a3d093d-697d-40cc-8909-8949d48797b8/20220211_110702/sct-runner-1a3d093d.tar.gz)
Another reproduced instance:
Installation details
Kernel version: 5.11.0-1028-aws
Scylla version (or git commit hash): 5.1.dev-0.20220217.69fcc053b with build-id b8415b1ebbffff2b4183734680f4afab3bfed86d
Cluster size: 4 nodes (i3en.2xlarge)
Scylla running with shards number (live nodes):
longevity-twcs-48h-master-db-node-d3f0b0ff-1 (3.250.216.253 | 10.0.3.124): 8 shards
longevity-twcs-48h-master-db-node-d3f0b0ff-2 (34.242.112.161 | 10.0.2.63): 8 shards
longevity-twcs-48h-master-db-node-d3f0b0ff-4 (52.18.71.120 | 10.0.0.189): 8 shards
longevity-twcs-48h-master-db-node-d3f0b0ff-5 (54.194.249.252 | 10.0.0.178): 8 shards
Scylla running with shards number (terminated nodes):
longevity-twcs-48h-master-db-node-d3f0b0ff-3 (54.154.245.131 | 10.0.2.210): 8 shards
OS (RHEL/CentOS/Ubuntu/AWS AMI): ami-041d8500e7cf30167
(aws: eu-west-1)
Test: longevity-twcs-48h-test
Test name: longevity_twcs_test.TWCSLongevityTest.test_custom_time
Test config file(s):
Issue description
====================================
2022-02-18 16:16:29.292: (ScyllaBenchEvent Severity.CRITICAL) period_type=end event_id=506e1e7b-88aa-4cab-a6a4-45168f5bb513 duration=11h31m49s: node=Node longevity-twcs-48h-master-loader-node-d3f0b0ff-2 [18.202.227.20 | 10.0.1.29] (seed: False)
stress_cmd=scylla-bench -workload=timeseries -mode=read -partition-count=20000 -concurrency=100 -replication-factor=3 -clustering-row-count=10000000 -clustering-row-size=200 -rows-per-request=100 -start-timestamp=GET_WRITE_TIMESTAMP -write-rate 125 -distribution hnormal --connection-count 100 -duration=2880m -error-at-row-limit 1000
errors:
Stress command completed with bad status 1: 2022/02/18 04:44:50 EOF
2022/02/18 04:44:50 EOF
2022/02/18 04:44:50 EOF
2022/02/18 04:44:50 EOF
2022
====================================
Restore Monitor Stack command: $ hydra investigate show-monitor d3f0b0ff-3989-41f4-9eaf-a0f510ea5895
Restore monitor on AWS instance using Jenkins job
Show all stored logs command: $ hydra investigate show-logs d3f0b0ff-3989-41f4-9eaf-a0f510ea5895
Test id: d3f0b0ff-3989-41f4-9eaf-a0f510ea5895
Logs: grafana - [https://cloudius-jenkins-test.s3.amazonaws.com/d3f0b0ff-3989-41f4-9eaf-a0f510ea5895/20220218_162212/grafana-screenshot-longevity-twcs-48h-test-scylla-per-server-metrics-nemesis-20220218_162212-longevity-twcs-48h-master-monitor-node-d3f0b0ff-1.png](https://cloudius-jenkins-test.s3.amazonaws.com/d3f0b0ff-3989-41f4-9eaf-a0f510ea5895/20220218_162212/grafana-screenshot-longevity-twcs-48h-test-scylla-per-server-metrics-nemesis-20220218_162212-longevity-twcs-48h-master-monitor-node-d3f0b0ff-1.png) db-cluster - [https://cloudius-jenkins-test.s3.amazonaws.com/d3f0b0ff-3989-41f4-9eaf-a0f510ea5895/20220218_163047/db-cluster-d3f0b0ff.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/d3f0b0ff-3989-41f4-9eaf-a0f510ea5895/20220218_163047/db-cluster-d3f0b0ff.tar.gz) loader-set - [https://cloudius-jenkins-test.s3.amazonaws.com/d3f0b0ff-3989-41f4-9eaf-a0f510ea5895/20220218_163047/loader-set-d3f0b0ff.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/d3f0b0ff-3989-41f4-9eaf-a0f510ea5895/20220218_163047/loader-set-d3f0b0ff.tar.gz) monitor-set - [https://cloudius-jenkins-test.s3.amazonaws.com/d3f0b0ff-3989-41f4-9eaf-a0f510ea5895/20220218_163047/monitor-set-d3f0b0ff.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/d3f0b0ff-3989-41f4-9eaf-a0f510ea5895/20220218_163047/monitor-set-d3f0b0ff.tar.gz) sct - [https://cloudius-jenkins-test.s3.amazonaws.com/d3f0b0ff-3989-41f4-9eaf-a0f510ea5895/20220218_163047/sct-runner-d3f0b0ff.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/d3f0b0ff-3989-41f4-9eaf-a0f510ea5895/20220218_163047/sct-runner-d3f0b0ff.tar.gz)
The reported issue is wrong: Scylla dropped a core, and scylla-bench quit after reaching its error limit:
< t:2022-02-04 21:00:23,310 f:base.py l:146 c:RemoteCmdRunner p:ERROR > Error executing command: "/$HOME/go/bin/scylla-bench -workload=uniform -mode=read -replication-factor=3 -partition-count=60 -clustering-row-count=10000000 -clustering-row-size=2048 -rows-per-request=2000 -timeout=180s -concurrency=700 -max-rate=64000 -duration=5760m -connection-count 500 -error-at-row-limit 1000 -nodes 10.0.3.180"; Exit status: 1
< t:2022-02-04 21:00:23,321 f:base.py l:148 c:RemoteCmdRunner p:DEBUG > STDOUT: : 9m0.092137471s
< t:2022-02-04 21:00:23,321 f:base.py l:148 c:RemoteCmdRunner p:DEBUG > 95th: 9m0.092137471s
< t:2022-02-04 21:00:23,321 f:base.py l:148 c:RemoteCmdRunner p:DEBUG > 90th: 9m0.092137471s
< t:2022-02-04 21:00:23,321 f:base.py l:148 c:RemoteCmdRunner p:DEBUG > median: 9m0.092137471s
< t:2022-02-04 21:00:23,321 f:base.py l:148 c:RemoteCmdRunner p:DEBUG > mean: 8m51.248701013s
< t:2022-02-04 21:00:23,321 f:base.py l:148 c:RemoteCmdRunner p:DEBUG >
< t:2022-02-04 21:00:23,321 f:base.py l:148 c:RemoteCmdRunner p:DEBUG > Following critical errors where caught during the run:
< t:2022-02-04 21:00:23,321 f:base.py l:148 c:RemoteCmdRunner p:DEBUG > Error limit (maxErrorsAtRow) of 1000 errors is reached
The reported core is not related to scylla-bench.
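For context, scylla-bench was launched with `-error-at-row-limit 1000`, and the run aborted once that many errors accumulated ("Error limit (maxErrorsAtRow) of 1000 errors is reached"). A minimal sketch of that abort semantics (a hypothetical helper for illustration, not scylla-bench's actual implementation):

```python
def run_with_error_limit(results, limit):
    """Process per-row results; abort once `limit` errors accumulate.

    `results` is an iterable of booleans (True = row succeeded).
    Returns True if the run completed, False if it hit the error
    limit and aborted, like scylla-bench's -error-at-row-limit.
    (Hypothetical sketch; the real tool counts errors per row/retry.)
    """
    errors = 0
    for ok in results:
        if not ok:
            errors += 1
            if errors >= limit:
                return False  # "Error limit ... is reached" -> nonzero exit
    return True
```

With this semantics, a long run with fewer than the limit of errors still completes, which is why the run above only quit after Scylla itself went down and errors piled up.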
Installation details
Kernel version: 5.11.0-1022-aws
Scylla version (or git commit hash): 4.6.rc5-0.20220203.5694ec189 with build-id f5d85bf5abe6d2f9fd3487e2469ce1c34304cc14
Cluster size: 4 nodes (i3en.3xlarge)
Scylla running with shards number (live nodes):
longevity-large-partitions-4d-4-6-db-node-e2adc2e9-1 (16.170.220.3 | 10.0.3.180): 12 shards
longevity-large-partitions-4d-4-6-db-node-e2adc2e9-2 (13.48.106.98 | 10.0.1.75): 12 shards
longevity-large-partitions-4d-4-6-db-node-e2adc2e9-4 (13.51.193.35 | 10.0.3.6): 12 shards
longevity-large-partitions-4d-4-6-db-node-e2adc2e9-5 (16.171.64.136 | 10.0.0.210): 12 shards
Scylla running with shards number (terminated nodes):
longevity-large-partitions-4d-4-6-db-node-e2adc2e9-3 (16.170.157.129 | 10.0.3.67): 12 shards
OS (RHEL/CentOS/Ubuntu/AWS AMI): ami-099a011bd5f16a168
(aws: eu-north-1)
Test: longevity-large-partition-4days-test
Test name: longevity_large_partition_test.LargePartitionLongevityTest.test_large_partition_longevity
Test config file(s):
Issue description
====================================
Two loader nodes running scylla-bench v0.1.8 got 3 core dumps:
It looks like SCT encountered a problem uploading the coredump to s3:
====================================
Restore Monitor Stack command: $ hydra investigate show-monitor e2adc2e9-28de-4aab-8dd3-5420deabc259
Restore monitor on AWS instance using Jenkins job
Show all stored logs command: $ hydra investigate show-logs e2adc2e9-28de-4aab-8dd3-5420deabc259
Test id: e2adc2e9-28de-4aab-8dd3-5420deabc259
Logs: grafana - [https://cloudius-jenkins-test.s3.amazonaws.com/e2adc2e9-28de-4aab-8dd3-5420deabc259/20220204_211619/grafana-screenshot-longevity-large-partition-4days-test-scylla-per-server-metrics-nemesis-20220204_211840-longevity-large-partitions-4d-4-6-monitor-node-e2adc2e9-1.png](https://cloudius-jenkins-test.s3.amazonaws.com/e2adc2e9-28de-4aab-8dd3-5420deabc259/20220204_211619/grafana-screenshot-longevity-large-partition-4days-test-scylla-per-server-metrics-nemesis-20220204_211840-longevity-large-partitions-4d-4-6-monitor-node-e2adc2e9-1.png) grafana - [https://cloudius-jenkins-test.s3.amazonaws.com/e2adc2e9-28de-4aab-8dd3-5420deabc259/20220204_211619/grafana-screenshot-overview-20220204_211619-longevity-large-partitions-4d-4-6-monitor-node-e2adc2e9-1.png](https://cloudius-jenkins-test.s3.amazonaws.com/e2adc2e9-28de-4aab-8dd3-5420deabc259/20220204_211619/grafana-screenshot-overview-20220204_211619-longevity-large-partitions-4d-4-6-monitor-node-e2adc2e9-1.png) critical - [https://cloudius-jenkins-test.s3.amazonaws.com/e2adc2e9-28de-4aab-8dd3-5420deabc259/20220204_213006/critical-e2adc2e9.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/e2adc2e9-28de-4aab-8dd3-5420deabc259/20220204_213006/critical-e2adc2e9.log.tar.gz) db-cluster - [https://cloudius-jenkins-test.s3.amazonaws.com/e2adc2e9-28de-4aab-8dd3-5420deabc259/20220204_213006/db-cluster-e2adc2e9.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/e2adc2e9-28de-4aab-8dd3-5420deabc259/20220204_213006/db-cluster-e2adc2e9.tar.gz) debug - [https://cloudius-jenkins-test.s3.amazonaws.com/e2adc2e9-28de-4aab-8dd3-5420deabc259/20220204_213006/debug-e2adc2e9.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/e2adc2e9-28de-4aab-8dd3-5420deabc259/20220204_213006/debug-e2adc2e9.log.tar.gz) email_data - 
[https://cloudius-jenkins-test.s3.amazonaws.com/e2adc2e9-28de-4aab-8dd3-5420deabc259/20220204_213006/email_data-e2adc2e9.json.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/e2adc2e9-28de-4aab-8dd3-5420deabc259/20220204_213006/email_data-e2adc2e9.json.tar.gz) error - [https://cloudius-jenkins-test.s3.amazonaws.com/e2adc2e9-28de-4aab-8dd3-5420deabc259/20220204_213006/error-e2adc2e9.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/e2adc2e9-28de-4aab-8dd3-5420deabc259/20220204_213006/error-e2adc2e9.log.tar.gz) event - [https://cloudius-jenkins-test.s3.amazonaws.com/e2adc2e9-28de-4aab-8dd3-5420deabc259/20220204_213006/events-e2adc2e9.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/e2adc2e9-28de-4aab-8dd3-5420deabc259/20220204_213006/events-e2adc2e9.log.tar.gz) left_processes - [https://cloudius-jenkins-test.s3.amazonaws.com/e2adc2e9-28de-4aab-8dd3-5420deabc259/20220204_213006/left_processes-e2adc2e9.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/e2adc2e9-28de-4aab-8dd3-5420deabc259/20220204_213006/left_processes-e2adc2e9.log.tar.gz) loader-set - [https://cloudius-jenkins-test.s3.amazonaws.com/e2adc2e9-28de-4aab-8dd3-5420deabc259/20220204_213006/loader-set-e2adc2e9.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/e2adc2e9-28de-4aab-8dd3-5420deabc259/20220204_213006/loader-set-e2adc2e9.tar.gz) monitor-set - [https://cloudius-jenkins-test.s3.amazonaws.com/e2adc2e9-28de-4aab-8dd3-5420deabc259/20220204_213006/monitor-set-e2adc2e9.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/e2adc2e9-28de-4aab-8dd3-5420deabc259/20220204_213006/monitor-set-e2adc2e9.tar.gz) normal - [https://cloudius-jenkins-test.s3.amazonaws.com/e2adc2e9-28de-4aab-8dd3-5420deabc259/20220204_213006/normal-e2adc2e9.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/e2adc2e9-28de-4aab-8dd3-5420deabc259/20220204_213006/normal-e2adc2e9.log.tar.gz) output - 
[https://cloudius-jenkins-test.s3.amazonaws.com/e2adc2e9-28de-4aab-8dd3-5420deabc259/20220204_213006/output-e2adc2e9.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/e2adc2e9-28de-4aab-8dd3-5420deabc259/20220204_213006/output-e2adc2e9.log.tar.gz) event - [https://cloudius-jenkins-test.s3.amazonaws.com/e2adc2e9-28de-4aab-8dd3-5420deabc259/20220204_213006/raw_events-e2adc2e9.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/e2adc2e9-28de-4aab-8dd3-5420deabc259/20220204_213006/raw_events-e2adc2e9.log.tar.gz) sct - [https://cloudius-jenkins-test.s3.amazonaws.com/e2adc2e9-28de-4aab-8dd3-5420deabc259/20220204_213006/sct-e2adc2e9.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/e2adc2e9-28de-4aab-8dd3-5420deabc259/20220204_213006/sct-e2adc2e9.log.tar.gz) summary - [https://cloudius-jenkins-test.s3.amazonaws.com/e2adc2e9-28de-4aab-8dd3-5420deabc259/20220204_213006/summary-e2adc2e9.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/e2adc2e9-28de-4aab-8dd3-5420deabc259/20220204_213006/summary-e2adc2e9.log.tar.gz) warning - [https://cloudius-jenkins-test.s3.amazonaws.com/e2adc2e9-28de-4aab-8dd3-5420deabc259/20220204_213006/warning-e2adc2e9.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/e2adc2e9-28de-4aab-8dd3-5420deabc259/20220204_213006/warning-e2adc2e9.log.tar.gz)
Jenkins job URL