scylladb / scylladb

NoSQL data store using the seastar framework, compatible with Apache Cassandra
http://scylladb.com
GNU Affero General Public License v3.0
13.54k stars 1.29k forks source link

Node bootstrap time passes 23 minutes #9890

Open ShlomiBalalis opened 2 years ago

ShlomiBalalis commented 2 years ago

Installation details Scylla version (or git commit hash):4.7.dev-0.20211230.12fa68fe6 with build-id 938d162cf1be1483853bf5b5417f5daf11e5ac24 Cluster size:4 OS (RHEL/CentOS/Ubuntu/AWS AMI):ami-0668e251eb4f0f394 (us-east-1) Scenario: https://github.com/scylladb/scylla-cluster-tests/blob/master/test-cases/longevity/longevity-twcs-48h.yaml

The first 4 nodes in the cluster has started within 2 minutes at most: node 1 (3.250.78.230 | 10.0.3.58):

2021-12-31T04:12:12+00:00 longevity-twcs-48h-master-db-node-274b80e1-1 !    INFO | Scylla version 4.7.dev-0.20211230.12fa68fe6 with build-id 938d162cf1be1483853bf5b5417f5daf11e5ac24 starting ...
2021-12-31T04:12:38+00:00 longevity-twcs-48h-master-db-node-274b80e1-1 !    INFO |  [shard 0] init - Scylla version 4.7.dev-0.20211230.12fa68fe6 initialization completed.

node 2 (54.246.205.54 | 10.0.1.13):

2021-12-31T04:14:42+00:00 longevity-twcs-48h-master-db-node-274b80e1-2 !    INFO | Scylla version 4.7.dev-0.20211230.12fa68fe6 with build-id 938d162cf1be1483853bf5b5417f5daf11e5ac24 starting ...
2021-12-31T04:16:23+00:00 longevity-twcs-48h-master-db-node-274b80e1-2 !    INFO |  [shard 0] init - Scylla version 4.7.dev-0.20211230.12fa68fe6 initialization completed.

node 3 (52.214.78.106 | 10.0.3.183):

2021-12-31T04:18:16+00:00 longevity-twcs-48h-master-db-node-274b80e1-3 !    INFO | Scylla version 4.7.dev-0.20211230.12fa68fe6 with build-id 938d162cf1be1483853bf5b5417f5daf11e5ac24 starting ...
2021-12-31T04:19:59+00:00 longevity-twcs-48h-master-db-node-274b80e1-3 !    INFO |  [shard 0] init - Scylla version 4.7.dev-0.20211230.12fa68fe6 initialization completed.

node 4 (63.35.248.245 | 10.0.2.74):

2021-12-31T04:21:55+00:00 longevity-twcs-48h-master-db-node-274b80e1-4 !    INFO | Scylla version 4.7.dev-0.20211230.12fa68fe6 with build-id 938d162cf1be1483853bf5b5417f5daf11e5ac24 starting ...
2021-12-31T04:23:37+00:00 longevity-twcs-48h-master-db-node-274b80e1-4 !    INFO |  [shard 0] init - Scylla version 4.7.dev-0.20211230.12fa68fe6 initialization completed.

However, ever since the next node, the init time of each node grew longer:

node 5 (3.249.249.101 | 10.0.2.141):

2021-12-31T07:38:47+00:00 longevity-twcs-48h-master-db-node-274b80e1-5 !    INFO | Scylla version 4.7.dev-0.20211230.12fa68fe6 with build-id 938d162cf1be1483853bf5b5417f5daf11e5ac24 starting ...
2021-12-31T07:55:06+00:00 longevity-twcs-48h-master-db-node-274b80e1-5 !    INFO |  [shard 0] init - Scylla version 4.7.dev-0.20211230.12fa68fe6 initialization completed.

node 6 (54.154.183.144 | 10.0.1.207):

2021-12-31T08:10:38+00:00 longevity-twcs-48h-master-db-node-274b80e1-6 !    INFO | Scylla version 4.7.dev-0.20211230.12fa68fe6 with build-id 938d162cf1be1483853bf5b5417f5daf11e5ac24 starting ...
2021-12-31T08:24:54+00:00 longevity-twcs-48h-master-db-node-274b80e1-6 !    INFO |  [shard 0] init - Scylla version 4.7.dev-0.20211230.12fa68fe6 initialization completed.

node 7 (18.203.127.246 | 10.0.2.142):

2021-12-31T15:56:02+00:00 longevity-twcs-48h-master-db-node-274b80e1-7 !    INFO | Scylla version 4.7.dev-0.20211230.12fa68fe6 with build-id 938d162cf1be1483853bf5b5417f5daf11e5ac24 starting ...
2021-12-31T16:19:01+00:00 longevity-twcs-48h-master-db-node-274b80e1-7 !    INFO |  [shard 0] init - Scylla version 4.7.dev-0.20211230.12fa68fe6 initialization completed.

It seems that the part of the bootstrap process that caused the increase in initialization time is the range_streamer:

node 2 (54.246.205.54 | 10.0.1.13):

< t:2021-12-31 04:16:11,405 f:db_log_reader.py l:113  c:sdcm.db_log_reader   p:DEBUG > 2021-12-31T04:16:11+00:00 longevity-twcs-48h-master-db-node-274b80e1-2 !    INFO |  [shard 0] range_streamer - Bootstrap starts, nr_ranges_remaining=1796
< t:2021-12-31 04:16:11,422 f:db_log_reader.py l:113  c:sdcm.db_log_reader   p:DEBUG > 2021-12-31T04:16:11+00:00 longevity-twcs-48h-master-db-node-274b80e1-2 !    INFO |  [shard 0] range_streamer - Bootstrap succeeded, took 0 seconds, nr_ranges_remaining=0

node 3 (52.214.78.106 | 10.0.3.183):

< t:2021-12-31 04:19:47,498 f:db_log_reader.py l:113  c:sdcm.db_log_reader   p:DEBUG > 2021-12-31T04:19:47+00:00 longevity-twcs-48h-master-db-node-274b80e1-3 !    INFO |  [shard 0] range_streamer - Bootstrap starts, nr_ranges_remaining=2297
< t:2021-12-31 04:19:47,525 f:db_log_reader.py l:113  c:sdcm.db_log_reader   p:DEBUG > 2021-12-31T04:19:47+00:00 longevity-twcs-48h-master-db-node-274b80e1-3 !    INFO |  [shard 0] range_streamer - Bootstrap succeeded, took 0 seconds, nr_ranges_remaining=0

node 4 (63.35.248.245 | 10.0.2.74):

< t:2021-12-31 04:23:25,169 f:db_log_reader.py l:113  c:sdcm.db_log_reader   p:DEBUG > 2021-12-31T04:23:24+00:00 longevity-twcs-48h-master-db-node-274b80e1-4 !    INFO |  [shard 0] range_streamer - Bootstrap starts, nr_ranges_remaining=2524
< t:2021-12-31 04:23:25,512 f:db_log_reader.py l:113  c:sdcm.db_log_reader   p:DEBUG > 2021-12-31T04:23:25+00:00 longevity-twcs-48h-master-db-node-274b80e1-4 !    INFO |  [shard 0] range_streamer - Bootstrap succeeded, took 0 seconds, nr_ranges_remaining=0

node 5 (3.249.249.101 | 10.0.2.141):

< t:2021-12-31 08:12:08,525 f:db_log_reader.py l:113  c:sdcm.db_log_reader   p:DEBUG > 021-12-31T08:12:08+00:00 longevity-twcs-48h-master-db-node-274b80e1-6 !    INFO |  [shard 0] range_streamer - Bootstrap starts, nr_ranges_remaining=15695
< t:2021-12-31 08:24:36,570 f:db_log_reader.py l:113  c:sdcm.db_log_reader   p:DEBUG > 2021-12-31T08:24:36+00:00 longevity-twcs-48h-master-db-node-274b80e1-6 !    INFO |  [shard 0] range_streamer - Bootstrap succeeded, took 748 seconds, nr_ranges_remaining=0

node 6 (54.154.183.144 | 10.0.1.207):

< t:2021-12-31 08:41:12,659 f:db_log_reader.py l:113  c:sdcm.db_log_reader   p:DEBUG > 021-12-31T08:41:12+00:00 longevity-twcs-48h-master-db-node-274b80e1-5 !    INFO |  [shard 0] range_streamer - Unbootstrap starts, nr_ranges_remaining=13339
< t:2021-12-31 08:56:09,316 f:db_log_reader.py l:113  c:sdcm.db_log_reader   p:DEBUG > 2021-12-31T08:56:09+00:00 longevity-twcs-48h-master-db-node-274b80e1-5 !    INFO |  [shard 0] range_streamer - Unbootstrap succeeded, took 896 seconds, nr_ranges_remaining=0

node 7 (18.203.127.246 | 10.0.2.142):

< t:2021-12-31 15:57:33,875 f:db_log_reader.py l:113  c:sdcm.db_log_reader   p:DEBUG > 021-12-31T15:57:33+00:00 longevity-twcs-48h-master-db-node-274b80e1-7 !    INFO |  [shard 0] range_streamer - Bootstrap starts, nr_ranges_remaining=14098
< t:2021-12-31 16:18:32,094 f:db_log_reader.py l:113  c:sdcm.db_log_reader   p:DEBUG > 2021-12-31T16:18:31+00:00 longevity-twcs-48h-master-db-node-274b80e1-7 !    INFO |  [shard 0] range_streamer - Bootstrap succeeded, took 1258 seconds, nr_ranges_remaining=0

While it makes sense that nodes 5, 6 and 7 would take longer since the size of the data in the cluster is larger after the first 4 node have started, it should not reach this kind of time span.

Logs:

+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|                                                                                                        Log links for testrun with test id 274b80e1-21b2-456d-8602-4f3aac4f5df7                                                                                                         |
+-----------------+-------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Date            | Log type    | Link                                                                                                                                                                                                                                                   |
+-----------------+-------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| 20211231_162004 | grafana     | https://cloudius-jenkins-test.s3.amazonaws.com/274b80e1-21b2-456d-8602-4f3aac4f5df7/20211231_162004/grafana-screenshot-longevity-twcs-48h-test-scylla-per-server-metrics-nemesis-20211231_162231-longevity-twcs-48h-master-monitor-node-274b80e1-1.png |
| 20211231_162004 | grafana     | https://cloudius-jenkins-test.s3.amazonaws.com/274b80e1-21b2-456d-8602-4f3aac4f5df7/20211231_162004/grafana-screenshot-overview-20211231_162004-longevity-twcs-48h-master-monitor-node-274b80e1-1.png                                                  |
| 20211231_163324 | db-cluster  | https://cloudius-jenkins-test.s3.amazonaws.com/274b80e1-21b2-456d-8602-4f3aac4f5df7/20211231_163324/db-cluster-274b80e1.tar.gz                                                                                                                         |
| 20211231_163324 | loader-set  | https://cloudius-jenkins-test.s3.amazonaws.com/274b80e1-21b2-456d-8602-4f3aac4f5df7/20211231_163324/loader-set-274b80e1.tar.gz                                                                                                                         |
| 20211231_163324 | monitor-set | https://cloudius-jenkins-test.s3.amazonaws.com/274b80e1-21b2-456d-8602-4f3aac4f5df7/20211231_163324/monitor-set-274b80e1.tar.gz                                                                                                                        |
| 20211231_163324 | sct         | https://cloudius-jenkins-test.s3.amazonaws.com/274b80e1-21b2-456d-8602-4f3aac4f5df7/20211231_163324/sct-runner-274b80e1.tar.gz                                                                                                                         |
+-----------------+-------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
roydahan commented 2 years ago

Please edit the description above so someone can understand the issue if there is actually an issue.

roydahan commented 2 years ago

Closing for now, we will reopen if this reproduces.