Open bentsi opened 4 years ago
Can we correlate the messages form scylla-jmx to the node's logs? Was the node bootstrapping for a long time?
starting scylla
2020-04-22T11:12:00+00:00 gemini-with-nemesis-3h-normal-4-0-db-node-38f7fe64-3 !INFO | scylla: [shard 0] init - Scylla version 4.0.rc2-0.20200421.89e79023aeb with build-id efad5bfb73c2ce2191c69bfbf882efbba4e3d0ac starting ...
Scylla started after 27 seconds
2020-04-22T11:12:27+00:00 gemini-with-nemesis-3h-normal-4-0-db-node-38f7fe64-3 !INFO | scylla: [shard 0] storage_service - Starting listening for CQL clients on 10.0.67.95:9042 (unencrypted)
2020-04-22T11:12:27+00:00 gemini-with-nemesis-3h-normal-4-0-db-node-38f7fe64-3 !INFO | scylla: [shard 0] storage_service - Thrift server listening on 10.0.67.95:9160 ...
2020-04-22T11:12:27+00:00 gemini-with-nemesis-3h-normal-4-0-db-node-38f7fe64-3 !INFO | scylla: [shard 0] init - serving
2020-04-22T11:12:27+00:00 gemini-with-nemesis-3h-normal-4-0-db-node-38f7fe64-3 !INFO | scylla: [shard 0] init - Scylla version 4.0.rc2-0.20200421.89e79023aeb initialization completed.
only then JMX tried to connect to the port 10000 (but the service started earlier 2020-04-22T11:12:01+00:00) that is a bit strange.
2020-04-22T11:13:05+00:00 gemini-with-nemesis-3h-normal-4-0-db-node-38f7fe64-3 !INFO | scylla-jmx: Connecting to http://127.0.0.1:10000
2020-04-22T11:13:05+00:00 gemini-with-nemesis-3h-normal-4-0-db-node-38f7fe64-3 !INFO | scylla-jmx: Starting the JMX server
I investigated it a bit and that's what I found:
Full DB log here: https://cloudius-jenkins-test.s3.amazonaws.com/38f7fe64-67d0-43e1-b6e2-e16ef723e314/20200422_125722/db-cluster-38f7fe64.zip
There are no changes in scylla-jmx
between 3.3 and 4.0. We have the following changes between 3.2 and 3.3 that might have some impact on startup:
The issue could be in scylla.git
, of course. @slivne asked about systems slices, but those were introduced in Scylla 3.2 already.
@bentsi You wouldn't happen to have systemd
logs stashed anywhere for the slow JMX startup case? To narrow down the issue, we need to first understand if the problem is jmx not starting, or jmx starting but stalling.
@penberg
Full DB log here: https://cloudius-jenkins-test.s3.amazonaws.com/38f7fe64-67d0-43e1-b6e2-e16ef723e314/20200422_125722/db-cluster-38f7fe64.zip
Look for log of node gemini-with-nemesis-3h-normal-4-0-db-node-38f7fe64-3
and check timestamps that I mentioned in the issue description
If I looked up the timestamps, correctly, we have CQL ready to serve requests at 11:12:27:
2020-04-22T11:12:27+00:00 gemini-with-nemesis-3h-normal-4-0-db-node-38f7fe64-3 !INFO | scylla: [shard 0] storage_service - Starting listening for CQL clients on 10.0.67.95:9042 (unencrypted)
JMX service starts 14 seconds later:
2020-04-22T11:12:41+00:00 gemini-with-nemesis-3h-normal-4-0-db-node-38f7fe64-3 !INFO | scylla-jmx: Using config file: /etc/scylla/scylla.yaml
However, as you say, it takes another 24 seconds before JMX is able to connect to Scylla REST API and start serving JMX requests:
2020-04-22T11:13:05+00:00 gemini-with-nemesis-3h-normal-4-0-db-node-38f7fe64-3 !INFO | scylla-jmx: Connecting to http://127.0.0.1:10000
2020-04-22T11:13:05+00:00 gemini-with-nemesis-3h-normal-4-0-db-node-38f7fe64-3 !INFO | scylla-jmx: Starting the JMX server
It looks like connecting to the API server takes 24 seconds. If this is a regression, then the first thing I can think of is commit d8c47603d9ad042e18cea2f1fbaf0b6bf2c298b4. However, this commit is already part of Scylla 3.3, so it's not that recent.
One way to debug this further is to increase the logging of Jersey for Scylla JXM. It's been too long for me to remember how to configure that, but I think Jersey is using java.util
logging (let's CC @elcallio and @tarzanek who usually remember this stuff).
those 24s are spent in static inits for API as Pekka correctly pointed out
private static final APIConfig config = new APIConfig();
public static final APIClient client = new APIClient(config);
public static void main(String[] args) throws Exception {
System.out.println("Connecting to " + config.getBaseUrl());
I checked release notes of jersey and they fixed bugs around this so it would make sense to update at least to 2.29.1 and retest
that said, I just tested with old 2.22.1 and with master from scylla-jmx and scylla and for me the jersey connection to API was instant so I think the problem might be in the environment too
getent hosts localhost
(same for IP), or nslookup worst case)(btw. I tested 2.29.1 which compiles well and works well for me too, perhaps it might make sense to run this test with 2.29.1 - how do I run just this test Bentsi? )
I tried to reproduce too and failed. @bentsi how can we check the issues @tarzanek pointed out? IOW, is there a problem with address resolution. One possible culprit could be Scylla's HTTP server too.
affected versions: master and 4.0 It takes about 90 seconds JMX to start receiving connections.
Possibly related to #98