scylladb / scylla-jmx

Scylla JMX proxy
GNU Affero General Public License v3.0

Scylla jmx takes longer than normal (over a minute) to start #195

Open ShlomiBalalis opened 1 year ago

ShlomiBalalis commented 1 year ago

Issue description

In this run, we ran the rebuild_streaming_err nemesis several times. In this nemesis, we do the following (a rough sketch of the procedure follows the list):

  1. stop scylla
  2. delete a few sstable files (of non-system keyspaces)
  3. start scylla
  4. execute a nodetool rebuild
  5. interrupt it by rebooting the node instance
  6. execute a nodetool rebuild, this time to completion
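For illustration, here is a minimal sketch of the procedure, assuming plain `systemctl`/`nodetool` commands and an illustrative sstable path; the actual SCT nemesis code is more involved and drives the node remotely from the test runner:

```python
import glob
import os
import random
import subprocess


def run(cmd):
    # Run a shell command and raise if it fails.
    subprocess.run(cmd, shell=True, check=True)


def rebuild_streaming_err_once():
    # 1. stop scylla
    run("sudo systemctl stop scylla-server")

    # 2. delete a few sstable files of a non-system keyspace
    #    (the keyspace name and data path here are illustrative only)
    sstables = glob.glob("/var/lib/scylla/data/keyspace1/*/*-Data.db")
    for path in random.sample(sstables, min(3, len(sstables))):
        os.remove(path)

    # 3. start scylla
    run("sudo systemctl start scylla-server")

    # 4. execute a nodetool rebuild in the background ...
    subprocess.Popen(["nodetool", "rebuild"])

    # 5. ... and interrupt it by rebooting the node instance.
    #    In SCT the reboot is issued from the test runner / cloud API,
    #    so the driving process survives it; shown here as a local call.
    run("sudo reboot")

    # 6. after the node is back up, execute a nodetool rebuild to completion
    run("nodetool rebuild")
```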

The issue is that the JMX server takes an abnormally long time to start: it is only ready to receive requests more than a minute after Scylla starts, whereas previously it took ~10 seconds.

2023-03-09T14:38:04+00:00 longevity-5gb-1h-RebuildStreamingEr-db-node-1011ef9b-2     !INFO | scylla[589]: Scylla version 5.3.0~dev-0.20230305.95ce2e898016 with build-id 82d6475d1db2cda61d9de9fd3a060e821b356d36 starting ...
2023-03-09T14:38:04+00:00 longevity-5gb-1h-RebuildStreamingEr-db-node-1011ef9b-2     !INFO | scylla[589]:  [shard 0] init - Scylla version 5.3.0~dev-0.20230305.95ce2e898016 with build-id 82d6475d1db2cda61d9de9fd3a060e821b356d36 starting ...
...
2023-03-09T14:38:05+00:00 longevity-5gb-1h-RebuildStreamingEr-db-node-1011ef9b-2     !INFO | systemd[1]: Started Scylla JMX.
...
2023-03-09T14:38:30+00:00 longevity-5gb-1h-RebuildStreamingEr-db-node-1011ef9b-2     !INFO | scylla[589]:  [shard 0] init - Scylla version 5.3.0~dev-0.20230305.95ce2e898016 initialization completed.
...
2023-03-09T14:39:00+00:00 longevity-5gb-1h-RebuildStreamingEr-db-node-1011ef9b-2     !INFO | scylla-jmx[601]: Connecting to http://127.0.0.1:10000
2023-03-09T14:39:00+00:00 longevity-5gb-1h-RebuildStreamingEr-db-node-1011ef9b-2     !INFO | scylla-jmx[601]: Starting the JMX server
...
2023-03-09T14:39:47+00:00 longevity-5gb-1h-RebuildStreamingEr-db-node-1011ef9b-2     !INFO | scylla-jmx[601]: JMX is enabled to receive remote connections on port: 7199
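Reading the delay straight off the timestamps above (a quick standard-library calculation; the values are copied from the journal excerpt):

```python
from datetime import datetime

fmt = "%Y-%m-%dT%H:%M:%S%z"
jmx_unit_started = datetime.strptime("2023-03-09T14:38:05+00:00", fmt)  # systemd: Started Scylla JMX
scylla_init_done = datetime.strptime("2023-03-09T14:38:30+00:00", fmt)  # init - initialization completed
jmx_listening    = datetime.strptime("2023-03-09T14:39:47+00:00", fmt)  # JMX enabled on port 7199

print(jmx_listening - jmx_unit_started)  # 0:01:42 - ~102s from the JMX unit starting to a usable JMX
print(jmx_listening - scylla_init_done)  # 0:01:17 - still ~77s after Scylla finished initializing
```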

Impact


How frequently does it reproduce?

The issue reproduced with 100% consistency, and caused all six nemesis runs in this build to fail.

Installation details

Kernel Version: 5.15.0-1031-aws
Scylla version (or git commit hash): 5.3.0~dev-20230305.95ce2e898016 with build-id 82d6475d1db2cda61d9de9fd3a060e821b356d36

Cluster size: 3 nodes (i4i.large)

Scylla Nodes used in this run:

OS / Image: ami-0122c861a035f0d11 (aws: eu-west-1)

Test: longevity-5gb-1h-RebuildStreamingErrMonkey-aws-test
Test id: 1011ef9b-bc10-44d9-9c6a-f08ff1e5d293
Test name: scylla-master/nemesis/longevity-5gb-1h-RebuildStreamingErrMonkey-aws-test
Test config file(s):

Logs and commands

- Restore Monitor Stack command: `$ hydra investigate show-monitor 1011ef9b-bc10-44d9-9c6a-f08ff1e5d293`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=1011ef9b-bc10-44d9-9c6a-f08ff1e5d293)
- Show all stored logs command: `$ hydra investigate show-logs 1011ef9b-bc10-44d9-9c6a-f08ff1e5d293`

Logs:

- **db-cluster-1011ef9b.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/1011ef9b-bc10-44d9-9c6a-f08ff1e5d293/20230309_155203/db-cluster-1011ef9b.tar.gz
- **sct-runner-1011ef9b.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/1011ef9b-bc10-44d9-9c6a-f08ff1e5d293/20230309_155203/sct-runner-1011ef9b.tar.gz
- **monitor-set-1011ef9b.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/1011ef9b-bc10-44d9-9c6a-f08ff1e5d293/20230309_155203/monitor-set-1011ef9b.tar.gz
- **loader-set-1011ef9b.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/1011ef9b-bc10-44d9-9c6a-f08ff1e5d293/20230309_155203/loader-set-1011ef9b.tar.gz
- **parallel-timelines-report-1011ef9b.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/1011ef9b-bc10-44d9-9c6a-f08ff1e5d293/20230309_155203/parallel-timelines-report-1011ef9b.tar.gz

[Jenkins job URL](https://jenkins.scylladb.com/job/scylla-master/job/nemesis/job/longevity-5gb-1h-RebuildStreamingErrMonkey-aws-test/2/)
mykaul commented 1 year ago

DNS... (resolving localhost, etc.) ?
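One quick way to check that hypothesis on the affected node is to time the lookups directly; a minimal sketch (plain Python socket calls, the hostnames are just the obvious candidates):

```python
import socket
import time

# Time the kind of lookups a JMX/RMI stack may perform at startup.
for name in ("localhost", socket.gethostname()):
    start = time.monotonic()
    try:
        result = socket.gethostbyname(name)
    except OSError as err:
        result = f"failed: {err}"
    print(f"{name!r:<30} -> {result}  ({time.monotonic() - start:.2f}s)")
```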

ShlomiBalalis commented 1 year ago

> DNS... (resolving localhost, etc.) ?

If it is, it's weirdly consistent

mykaul commented 1 year ago

> DNS... (resolving localhost, etc.) ?
>
> If it is, it's weirdly consistent

On different runs, different clusters, etc.?

avikivity commented 1 year ago

> The issue is that the JMX server takes an abnormally long time to start: it is only ready to receive requests more than a minute after Scylla starts, whereas previously it took ~10 seconds.

What version is "previously"?

ShlomiBalalis commented 1 year ago

> DNS... (resolving localhost, etc.) ?
>
> If it is, it's weirdly consistent
>
> On different runs, different clusters, etc.?

In this run, it happened every time the node was restarted.

> The issue is that the JMX server takes an abnormally long time to start: it is only ready to receive requests more than a minute after Scylla starts, whereas previously it took ~10 seconds.
>
> What version is "previously"?

Here's a run that uses 5.2.0~rc1-0.20230207.8ff4717fd010 with build-id 78fbb2c25e9244a62f57988313388a0260084528 for example:

2023-02-11T05:30:29+00:00 longevity-tls-50gb-3d-5-2-db-node-0530dbf7-4     !INFO | scylla[758]:  [shard  0] init - Scylla version 5.2.0~rc1-0.20230207.8ff4717fd010 with build-id 78fbb2c25e9244a62f57988313388a0260084528 starting ...
...
2023-02-11T05:30:33+00:00 longevity-tls-50gb-3d-5-2-db-node-0530dbf7-4     !INFO | systemd[1]: Started Scylla JMX.
...
2023-02-11T05:30:34+00:00 longevity-tls-50gb-3d-5-2-db-node-0530dbf7-4     !INFO | scylla-jmx[869]: Starting the JMX server
...
2023-02-11T05:30:35+00:00 longevity-tls-50gb-3d-5-2-db-node-0530dbf7-4     !INFO | scylla-jmx[869]: JMX is enabled to receive remote connections on port: 7199
...
2023-02-11T05:31:01+00:00 longevity-tls-50gb-3d-5-2-db-node-0530dbf7-4     !INFO | scylla[758]:  [shard  0] init - Scylla version 5.2.0~rc1-0.20230207.8ff4717fd010 initialization completed.

Here it took only ~2 seconds for the JMX server to come up, and it was ready before Scylla finished initializing. This is the standard behaviour.

avikivity commented 1 year ago

But on other nodes, it starts quickly?

fgelcer commented 1 year ago

> But on other nodes, it starts quickly?

@ShlomiBalalis, can you please confirm it?