scylladb / scylla-ccm

Cassandra Cluster Manager, modified for Scylla
Apache License 2.0
22 stars 68 forks source link

JMX doesn't start #456

Closed nyh closed 1 year ago

nyh commented 1 year ago

I'm trying to run dtest on my machine, and the cluster fails to start reporting that it can't connect to JMX:

>       self.cluster.start(wait_other_notice=True, wait_for_binary_proto=True)
...
        if not self._wait_java_up(ip_addr, jmx_port):
            e_msg = "Error starting node {}: unable to connect to scylla-jmx port {}:{}".format(
                     [self.name](http://self.name/), ip_addr, jmx_port)
>           raise NodeError(e_msg, scylla_process)
E           ccmlib.node.NodeError: Error starting node node1: unable to connect to scylla-jmx port [127.0.71.1:7199](http://127.0.71.1:7199/)

In logs/system.log.jmx I see only this:

Starting scylla-jmx: args=['/home/nyh/.dtest/dtest-j7k_32_1/test/node1/bin/symlinks/scylla-jmx', '-Dapiaddress=127.0.93.1', '-Djavax.management.builder.initial=com.scylladb.jmx.utils.APIBuilder', '-Djava.rmi.server.hostname=127.0.93.1', '-Dcom.sun.management.jmxremote', '-Dcom.sun.management.jmxremote.host=127.0.93.1', '-Dcom.sun.management.jmxremote.port=7199', '-Dcom.sun.management.jmxremote.rmi.port=7199', '-Dcom.sun.management.jmxremote.local.only=false', '-Xmx256m', '-XX:+UseSerialGC', '-Dcom.sun.management.jmxremote.authenticate=false', '-Dcom.sun.management.jmxremote.ssl=false', '-jar', '/home/nyh/.dtest/dtest-j7k_32_1/test/node1/bin/scylla-jmx-1.0.jar']

I don't see any other output in this log, which I also want to report as a bug, but of course what bothers me more is why it doesn't start.

I see in the above command line that it runs node1/bin/symlinks/scylla-jmx as Java. But this file is (I checked now) a link to /usr/bin/java. But my /usr/bin/java is Java 17, and if I understand correctly, this isn't supported by JMX :-(

I see in the CCM code there is some attempt to find among the Javas I have installed if I have 8 or 11 installed (I have Fedora 37, and both Java 8 and 11 installed!), but perhaps for some reason this doesn't work?

CC @fruch

mykaul commented 1 year ago

Might be related to some changes @tchaikov has been doing in this area?

nyh commented 1 year ago

Might be related to some changes @tchaikov has been doing in this area?

I don't know, but now I'm having the pleasure of learning the implementations of dtest and ccm and debugging JMX :-)

nyh commented 1 year ago

The reason why I don't see any output is because Java (17) when run with the command line:

/usr/bin/java -Dapiaddress=127.0.25.1 -Djavax.management.builder.initial=com.scylladb.jmx.utils.APIBuilder -Djava.rmi.server.hostname=127.0.25.1 -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.host=127.0.25.1 -Dcom.sun.management.jmxremote.port=7199 -Dcom.sun.management.jmxremote.rmi.port=7199 -Dcom.sun.management.jmxremote.local.only=false -Xmx256m -XX:+UseSerialGC -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -jar /home/nyh/.dtest/dtest-5ufo3lr1/test/node1/bin/scylla-jmx-1.0.jar

Exits immediately and silently (but with an error code 1). I don't know why, but if I run it with Java 11 instead of my default Java 17, it does work correctly.

So the bug is why Java 17 was chosen, I'll look into this now.

fruch commented 1 year ago

you have this commit of JMX ? https://github.com/scylladb/scylla-jmx/commit/1fd23b60f144a8491993c3761c5a03d33bad3393

fruch commented 1 year ago

the select-java might be the cause of what you are seeing, jmx is trying to pick the correct java for it to work.

nyh commented 1 year ago

you have this commit of JMX ? scylladb/scylla-jmx@1fd23b6

Yes, I do, I took the most recent version of scylla (and its tools/jmx directory), compiled jmx, and also took the most recent versions of ccm and dtest. I'm not slowly familiarizing myself with all these codes and adding printouts to understand what exactly runs and why, until I get to the bottom of this.

fruch commented 1 year ago

last time I've debugged this one by add bash -x at the top of it

🟢 ❯ /home/fruch/.ccm/scylla-repository/unstable/master/2023-05-21T17_02_29Z/jmx/select-java --version
+ expected_java_versions='11 1.8'
++ select_java 11 1.8
++ local expected_java_versions
++ expected_java_versions='11 1.8'
++ for expected_java_version in $expected_java_versions
++ for java_home in /usr/lib/jvm/*
++ java=/usr/lib/jvm/default-java/bin/java
+++ /usr/lib/jvm/default-java/bin/java -XshowSettings:properties -version
+++ awk '-F = ' '/java.specification.version/ {print $NF}'
++ javaver=11
++ '[' 11 = 11 ']'
++ echo /usr/lib/jvm/default-java/bin/java
++ return
+ java=/usr/lib/jvm/default-java/bin/java
+ '[' -z /usr/lib/jvm/default-java/bin/java ']'
+ exec /usr/lib/jvm/default-java/bin/java --version
openjdk 11.0.19 2023-04-18
OpenJDK Runtime Environment (build 11.0.19+7-post-Ubuntu-0ubuntu122.04.1)
OpenJDK 64-Bit Server VM (build 11.0.19+7-post-Ubuntu-0ubuntu122.04.1, mixed mode, sharing)

seem like that at my case it's finding the correct java, even that I have multiple versions of it.

nyh commented 1 year ago
$ $HOME/scylla/tools/jmx/scripts/select-java --version
openjdk 11.0.18 2023-01-17

so I guess it's not being run because it were, it would have found the right version. Do you know off-hand who creates the symlink home/nyh/.dtest/dtest-5ufo3lr1/test/node1/bin/symlinks/scylla-jmx so I don't have to continue debugging this?

nyh commented 1 year ago

I checked, and neither scylla-ccm or scylla-dtest even use this "select-java" script :-( Instead, ccmlib/common.py has its own Java-hunting logic. Something is probably broken there.... I'm continuing to debug.

nyh commented 1 year ago

@fruch it seems code that you added in 66e9b4c28f7e72a27acfec6928233cb767f461e5 Does it actually work? :-(

fruch commented 1 year ago

@fruch it seems code that you added in 66e9b4c Does it actually work? :-(

that code was added for running cassandra, not for JMX

fruch commented 1 year ago

the assumption is that JMX on it's own would run correctly, but not that you face it, I can confirm that install JDK 17, does break CCM start...

fruch commented 1 year ago

I did debug that select-java script in artifact test, when scylla was installed by unified package, but ccm doesn't run scylla with systemd configuration, hence this was never used by CCM

nyh commented 1 year ago

@fruch it seems code that you added in 66e9b4c Does it actually work? :-(

that code was added for running cassandra, not for JMX

So nobody ever added similar code for running JMX? JMX is always run with /usr/bin/java? Again, $HOME/scylla/tools/jmx/scripts/select-java exists, and works, but ccm doesn't seem to use it at all.

I want to fix this, but don't know what to do, don't we already have too many copies of this java-detection code already? I don't think I should run the actual $HOME/scylla/tools/jmx/scripts/select-java because that will only work on Scylla's version of JMX, right? So I need to duplicate the code?

fruch commented 1 year ago

one can set JAVA_HOME to point to the Java he wants to use or sudo update-alternatives --config java

I'll look into trying to use select-java when available to run the JMX

mykaul commented 1 year ago

One day, we'll package all those Java stuff in their own container, with their own Java, and peace will prevail.

nyh commented 1 year ago

@fruch setting JAVA_HOME didn't do anythig, test/node1/bin/symlinks/scylla-jmx still points to /usr/bin/java.

@mykaul no, a container is not a replacement for code which isn't buggy. On the contrary - all this "container solves everything" philosophy is exactly the reason why nobody notices that this code is buggy :-(

nyh commented 1 year ago

By the way, to add insult to inury, it seems that ccm and dtest have separate Java-running code (and both are separate from the one in scylla/tools/jmx) - ccm is used to start JMX, and the dtest one (jmxutils.py) appears be used to send JMX commands. The dtest README.md explains how to use JAVA_HOME but it appears (?) it changes only the dtest uses, not the ccm ones.

fruch commented 1 year ago

@nyh

try https://github.com/scylladb/scylla-ccm/pull/457, I think it might solve the issue, and let JMX select on it's own the correct java to use.

nyh commented 1 year ago

Thanks @fruch your patch works and I approved it. Now dtest works for me, finally.

A silly (and broken) link test/node1/bin/symlinks/scylla-jmx, linking to /usr/bin/java, is still being created, but it's no longer used. It would have been even better (and less confusing when we go debug this again a year from now), if this link wasn't created when not needed (I couldn't even figure out who creates it, ccm or dtest).

fruch commented 1 year ago

Thanks @fruch your patch works and I approved it. Now dtest works for me, finally.

A silly (and broken) link test/node1/bin/symlinks/scylla-jmx, linking to /usr/bin/java, is still being created, but it's no longer used. It would have been even better (and less confusing when we go debug this again a year from now), if this link wasn't created when not needed (I couldn't even figure out who creates it, ccm or dtest).

it's CCM that creates it https://github.com/scylladb/scylla-ccm/commit/a9feefe8199cf5314f4f5171cf0ec228d0139aa0

nyh commented 1 year ago

it's CCM that creates it a9feefe

Interesting. Then I don't understand why setting JAVA_HOME didn't help.

Anyway, I commented in my review of your fix (#457) that perhaps if we create this link, we should just make it link to the right place (tools/jmx/select-java, if it exists), with too benefits - less confusion (creation of a link which we don't use), and "top" will continue showing "scylla-jmx" (which I assume was the original intention) instead of "select-java".

nyh commented 1 year ago

@fruch setting JAVA_HOME didn't do anythig, test/node1/bin/symlinks/scylla-jmx still points to /usr/bin/java.

By the way, after understanding what the code does, I couldn't figure out how setting JAVA_HOME didn't help. So I went back in my shell history... and... I just had a typo in the command to set that variable, which is why it didn't work :-( But anyway, the select-java thing is a better solution than requiring people to set JAVA_HOME, so thanks.

mykaul commented 1 year ago

@fruch setting JAVA_HOME didn't do anythig, test/node1/bin/symlinks/scylla-jmx still points to /usr/bin/java.

@mykaul no, a container is not a replacement for code which isn't buggy. On the contrary - all this "container solves everything" philosophy is exactly the reason why nobody notices that this code is buggy :-(

So with a container, we bring the correct, undisputed and not user-visible or changeable or configurable Java version that fits JMX, working cross platforms (at least OS-wise), etc. Bugs, should there be any, can be still fixed in the container. What am I missing?

fruch commented 1 year ago

@fruch setting JAVA_HOME didn't do anythig, test/node1/bin/symlinks/scylla-jmx still points to /usr/bin/java. @mykaul no, a container is not a replacement for code which isn't buggy. On the contrary - all this "container solves everything" philosophy is exactly the reason why nobody notices that this code is buggy :-(

So with a container, we bring the correct, undisputed and not user-visible or changeable or configurable Java version that fits JMX, working cross platforms (at least OS-wise), etc. Bugs, should there be any, can be still fixed in the container. What am I missing?

I've led such an effort in 2019 dev hackton (https://github.com/scylladb/scylla-ccm/pull/286) No one really cared back then, and we had bigger issues to handle back then (we we still using python2 and nosetests)

It wasn't "production" ready, and wasn't touch since.

I can understand this is less convenient to who want test scylla changes, i.e. you have more step to build until you have a docker image of scylla. and dev want the option to run straight out of there scylla source (back at the beginning of 2019 it was the only option to run dtest)

nyh commented 1 year ago

@mykaul you are missing the experience of a developer who wants to modify something in this setup, and needs to face a tower of dockers (yes, that's the natural result of this philosophy) and needing to change something in the 3rd level without understand what the heck it runs. I saw it very painfully when trying to modify SCT.

Anyway, I'm not suggesting that Docker can't work - it definitely can. What I'm suggesting is that by adopting the philosophy of "let's solve every small compatibility difficulty with docker", you just stop solving small problems that you could have solved easily instead of avoiding. Even more importantly, it's not like you can really avoid these problems forever - even if we always run JMX on a docker without Java 17, one day one user will want run JMX manually and encounter this problem and we will then need to rush to solve it. The irony is that we already did solve it (that select-java thing by @tchaikov), but docker and similar "let's all install exactly the same thing" philosophies just hid the problem (and its solution) from dtest developers.

mykaul commented 1 year ago

I've worked on a project that delivered 120+ Docker containers (OpenStack) and while it wasn't without faults, it was far easier to deliver and work on, on multiple platforms, than other solutions. I think we should look at our users first, then the developer experience (that can be vastly improved btw, using the right tools - some of which already exist, some need to be developed).

I honestly don't want to support 'user running JMX manually' use case. I'm trying to reduce the scope, not expand it.

nyh commented 1 year ago

But @mykaul, our customers don't use ccm or dtest. This discussion isn't about users (customers), it is about how to make R&D (core developers and QA) more productive and improve our test coverage and overall product quality - not how to make things convenient for end-users (for those, the current fashion is SaaS, not even docker). Developers need to run their own versions of Scylla and/or dtest while they new tests and/or new Scylla code to be tested. Wrapping up everything in docker makes everything harder to do, not easier.

The best, in my opinion, is always to allow both options. I'm not talking about spending months on developing or testing two options - just about fixing small usability problems when they happen once in a blue moon (like we did today) to allow each developer to use the most convenient approach for them.