Open ShlomiBalalis opened 1 year ago
DNS... (resolving localhost, etc.) ?
DNS... (resolving localhost, etc.) ?
If it is, it's weirdly consistent
DNS... (resolving localhost, etc.) ?
If it is, it's weirdly consistent
On different runs, different clusters, etc.?
The issue is that the jmx server takes an abnormal amount of time to start, such that it's only ready to receive requests over a minute after scylla starts, while previously it was a matter of ~10 seconds.
What version is "previously"?
DNS... (resolving localhost, etc.) ?
If it is, it's weirdly consistent
On different runs, different clusters, etc.?
In this run, every time the node was restarted.
The issue is that the jmx server takes an abnormal amount of time to start, such that it's only ready to receive requests over a minute after scylla starts, while previously it was a matter of ~10 seconds.
What version is "previously"?
Here's a run that uses 5.2.0~rc1-0.20230207.8ff4717fd010
with build-id 78fbb2c25e9244a62f57988313388a0260084528
for example:
2023-02-11T05:30:29+00:00 longevity-tls-50gb-3d-5-2-db-node-0530dbf7-4 !INFO | scylla[758]: [shard 0] init - Scylla version 5.2.0~rc1-0.20230207.8ff4717fd010 with build-id 78fbb2c25e9244a62f57988313388a0260084528 starting ...
...
2023-02-11T05:30:33+00:00 longevity-tls-50gb-3d-5-2-db-node-0530dbf7-4 !INFO | systemd[1]: Started Scylla JMX.
...
2023-02-11T05:30:34+00:00 longevity-tls-50gb-3d-5-2-db-node-0530dbf7-4 !INFO | scylla-jmx[869]: Starting the JMX server
...
2023-02-11T05:30:35+00:00 longevity-tls-50gb-3d-5-2-db-node-0530dbf7-4 !INFO | scylla-jmx[869]: JMX is enabled to receive remote connections on port: 7199
...
2023-02-11T05:31:01+00:00 longevity-tls-50gb-3d-5-2-db-node-0530dbf7-4 !INFO | scylla[758]: [shard 0] init - Scylla version 5.2.0~rc1-0.20230207.8ff4717fd010 initialization completed.
Here it took only 2 seconds for the JMX server to be up, and it started before scylla was successfully initialized. This is the standard behaviour.
But on other nodes, it starts quickly?
But on other nodes, it starts quickly?
@ShlomiBalalis , can you please confirm it?
Issue description
In this run, we ran the
rebuild_streaming_err
for several times. In this nemesis, we do the following:nodetool rebuild
nodetool rebuild
, this time to completionThe issue is that the jmx server takes an abnormal amount of time to start, such that it's only ready to receive requests over a minute after scylla starts, while previously it was a matter of ~10 seconds.
Impact
Describe the impact this issue causes to the user.
How frequently does it reproduce?
The issue was 100% consistent, and caused all of the 6 nemesis runs in this build to fail.
Installation details
Kernel Version: 5.15.0-1031-aws Scylla version (or git commit hash):
5.3.0~dev-20230305.95ce2e898016
with build-id82d6475d1db2cda61d9de9fd3a060e821b356d36
Cluster size: 3 nodes (i4i.large)
Scylla Nodes used in this run:
OS / Image:
ami-0122c861a035f0d11
(aws: eu-west-1)Test:
longevity-5gb-1h-RebuildStreamingErrMonkey-aws-test
Test id:1011ef9b-bc10-44d9-9c6a-f08ff1e5d293
Test name:scylla-master/nemesis/longevity-5gb-1h-RebuildStreamingErrMonkey-aws-test
Test config file(s):Logs and commands
- Restore Monitor Stack command: `$ hydra investigate show-monitor 1011ef9b-bc10-44d9-9c6a-f08ff1e5d293` - Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=1011ef9b-bc10-44d9-9c6a-f08ff1e5d293) - Show all stored logs command: `$ hydra investigate show-logs 1011ef9b-bc10-44d9-9c6a-f08ff1e5d293` ## Logs: - **db-cluster-1011ef9b.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/1011ef9b-bc10-44d9-9c6a-f08ff1e5d293/20230309_155203/db-cluster-1011ef9b.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/1011ef9b-bc10-44d9-9c6a-f08ff1e5d293/20230309_155203/db-cluster-1011ef9b.tar.gz) - **sct-runner-1011ef9b.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/1011ef9b-bc10-44d9-9c6a-f08ff1e5d293/20230309_155203/sct-runner-1011ef9b.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/1011ef9b-bc10-44d9-9c6a-f08ff1e5d293/20230309_155203/sct-runner-1011ef9b.tar.gz) - **monitor-set-1011ef9b.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/1011ef9b-bc10-44d9-9c6a-f08ff1e5d293/20230309_155203/monitor-set-1011ef9b.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/1011ef9b-bc10-44d9-9c6a-f08ff1e5d293/20230309_155203/monitor-set-1011ef9b.tar.gz) - **loader-set-1011ef9b.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/1011ef9b-bc10-44d9-9c6a-f08ff1e5d293/20230309_155203/loader-set-1011ef9b.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/1011ef9b-bc10-44d9-9c6a-f08ff1e5d293/20230309_155203/loader-set-1011ef9b.tar.gz) - **parallel-timelines-report-1011ef9b.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/1011ef9b-bc10-44d9-9c6a-f08ff1e5d293/20230309_155203/parallel-timelines-report-1011ef9b.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/1011ef9b-bc10-44d9-9c6a-f08ff1e5d293/20230309_155203/parallel-timelines-report-1011ef9b.tar.gz) [Jenkins job URL](https://jenkins.scylladb.com/job/scylla-master/job/nemesis/job/longevity-5gb-1h-RebuildStreamingErrMonkey-aws-test/2/)