scylladb / scylla-jmx

Scylla JMX proxy
GNU Affero General Public License v3.0
28 stars 50 forks source link

jmx takes ~90 seconds to start receiving connections #100

Open bentsi opened 4 years ago

bentsi commented 4 years ago

affected versions: master and 4.0 It takes about 90 seconds JMX to start receiving connections.

2020-04-22T11:12:01+00:00  gemini-with-nemesis-3h-normal-4-0-db-node-38f7fe64-3 !INFO    | systemd: Started Scylla Server.

2020-04-22T11:12:01+00:00  gemini-with-nemesis-3h-normal-4-0-db-node-38f7fe64-3 !INFO    | systemd: Started Scylla JMX.

2020-04-22T11:12:04+00:00  gemini-with-nemesis-3h-normal-4-0-db-node-38f7fe64-3 !INFO    | scylla-jmx: Picked up JAVA_TOOL_OPTIONS:

2020-04-22T11:12:41+00:00  gemini-with-nemesis-3h-normal-4-0-db-node-38f7fe64-3 !INFO    | scylla-jmx: Using config file: /etc/scylla/scylla.yaml

2020-04-22T11:13:05+00:00  gemini-with-nemesis-3h-normal-4-0-db-node-38f7fe64-3 !INFO    | scylla-jmx: Connecting to http://127.0.0.1:10000

2020-04-22T11:13:05+00:00  gemini-with-nemesis-3h-normal-4-0-db-node-38f7fe64-3 !INFO    | scylla-jmx: Starting the JMX server

2020-04-22T11:13:39+00:00  gemini-with-nemesis-3h-normal-4-0-db-node-38f7fe64-3 !INFO    | scylla-jmx: JMX is enabled to receive remote connections on port: 7199 

Possibly related to #98

bhalevy commented 4 years ago

Can we correlate the messages form scylla-jmx to the node's logs? Was the node bootstrapping for a long time?

bentsi commented 4 years ago

starting scylla

2020-04-22T11:12:00+00:00  gemini-with-nemesis-3h-normal-4-0-db-node-38f7fe64-3 !INFO    | scylla: [shard 0] init - Scylla version 4.0.rc2-0.20200421.89e79023aeb with build-id efad5bfb73c2ce2191c69bfbf882efbba4e3d0ac starting ...

Scylla started after 27 seconds

2020-04-22T11:12:27+00:00  gemini-with-nemesis-3h-normal-4-0-db-node-38f7fe64-3 !INFO    | scylla: [shard 0] storage_service - Starting listening for CQL clients on 10.0.67.95:9042 (unencrypted)
2020-04-22T11:12:27+00:00  gemini-with-nemesis-3h-normal-4-0-db-node-38f7fe64-3 !INFO    | scylla: [shard 0] storage_service - Thrift server listening on 10.0.67.95:9160 ...
2020-04-22T11:12:27+00:00  gemini-with-nemesis-3h-normal-4-0-db-node-38f7fe64-3 !INFO    | scylla: [shard 0] init - serving
2020-04-22T11:12:27+00:00  gemini-with-nemesis-3h-normal-4-0-db-node-38f7fe64-3 !INFO    | scylla: [shard 0] init - Scylla version 4.0.rc2-0.20200421.89e79023aeb initialization completed.

only then JMX tried to connect to the port 10000 (but the service started earlier 2020-04-22T11:12:01+00:00) that is a bit strange.

2020-04-22T11:13:05+00:00  gemini-with-nemesis-3h-normal-4-0-db-node-38f7fe64-3 !INFO    | scylla-jmx: Connecting to http://127.0.0.1:10000
2020-04-22T11:13:05+00:00  gemini-with-nemesis-3h-normal-4-0-db-node-38f7fe64-3 !INFO    | scylla-jmx: Starting the JMX server

I investigated it a bit and that's what I found:

image

Full DB log here: https://cloudius-jenkins-test.s3.amazonaws.com/38f7fe64-67d0-43e1-b6e2-e16ef723e314/20200422_125722/db-cluster-38f7fe64.zip

penberg commented 4 years ago

There are no changes in scylla-jmx between 3.3 and 4.0. We have the following changes between 3.2 and 3.3 that might have some impact on startup:

The issue could be in scylla.git, of course. @slivne asked about systems slices, but those were introduced in Scylla 3.2 already.

penberg commented 4 years ago

@bentsi You wouldn't happen to have systemd logs stashed anywhere for the slow JMX startup case? To narrow down the issue, we need to first understand if the problem is jmx not starting, or jmx starting but stalling.

bentsi commented 4 years ago

@penberg

Full DB log here: https://cloudius-jenkins-test.s3.amazonaws.com/38f7fe64-67d0-43e1-b6e2-e16ef723e314/20200422_125722/db-cluster-38f7fe64.zip

Look for log of node gemini-with-nemesis-3h-normal-4-0-db-node-38f7fe64-3 and check timestamps that I mentioned in the issue description

penberg commented 4 years ago

If I looked up the timestamps, correctly, we have CQL ready to serve requests at 11:12:27:

2020-04-22T11:12:27+00:00  gemini-with-nemesis-3h-normal-4-0-db-node-38f7fe64-3 !INFO    | scylla: [shard 0] storage_service - Starting listening for CQL clients on 10.0.67.95:9042 (unencrypted)

JMX service starts 14 seconds later:

2020-04-22T11:12:41+00:00  gemini-with-nemesis-3h-normal-4-0-db-node-38f7fe64-3 !INFO    | scylla-jmx: Using config file: /etc/scylla/scylla.yaml

However, as you say, it takes another 24 seconds before JMX is able to connect to Scylla REST API and start serving JMX requests:

2020-04-22T11:13:05+00:00  gemini-with-nemesis-3h-normal-4-0-db-node-38f7fe64-3 !INFO    | scylla-jmx: Connecting to http://127.0.0.1:10000
2020-04-22T11:13:05+00:00  gemini-with-nemesis-3h-normal-4-0-db-node-38f7fe64-3 !INFO    | scylla-jmx: Starting the JMX server

It looks like connecting to the API server takes 24 seconds. If this is a regression, then the first thing I can think of is commit d8c47603d9ad042e18cea2f1fbaf0b6bf2c298b4. However, this commit is already part of Scylla 3.3, so it's not that recent.

One way to debug this further is to increase the logging of Jersey for Scylla JXM. It's been too long for me to remember how to configure that, but I think Jersey is using java.util logging (let's CC @elcallio and @tarzanek who usually remember this stuff).

tarzanek commented 4 years ago

those 24s are spent in static inits for API as Pekka correctly pointed out

private static final APIConfig config = new APIConfig();
public static final APIClient client = new APIClient(config);

    public static void main(String[] args) throws Exception {
        System.out.println("Connecting to " + config.getBaseUrl());

I checked release notes of jersey and they fixed bugs around this so it would make sense to update at least to 2.29.1 and retest

that said, I just tested with old 2.22.1 and with master from scylla-jmx and scylla and for me the jersey connection to API was instant so I think the problem might be in the environment too

(btw. I tested 2.29.1 which compiles well and works well for me too, perhaps it might make sense to run this test with 2.29.1 - how do I run just this test Bentsi? )

penberg commented 4 years ago

I tried to reproduce too and failed. @bentsi how can we check the issues @tarzanek pointed out? IOW, is there a problem with address resolution. One possible culprit could be Scylla's HTTP server too.