riptano / ccm

A script to easily create and destroy an Apache Cassandra cluster on localhost
Apache License 2.0
1.21k stars 299 forks source link

Fix java version selection 2 #767

Closed jacek-lewandowski closed 1 month ago

michaelsembwever commented 1 month ago

@jacek-lewandowski don't forget to move the cassandra-tag

follow the instructions from CASSANDRA-16383

git tag -a -f cassandra-test
git push origin :refs/tags/cassandra-test
git push -f origin --tags

CI results were ccm_pull_767_69_ci.zip

aweisberg commented 1 month ago

Thanks for tackling upgrade test issues :-)

I think this might not work for all the tests.

Seems like it didn't find the correct java version for 2.2.19

Is this something that reproduces in your environment?

upgrade_tests/upgrade_through_versions_test.py::TestProtoV3Upgrade_AllVersions_EndsAt_Trunk_HEAD::test_parallel_upgrade_with_internode_ssl
-------------------------------- live log setup --------------------------------
15:00:24,516 conftest INFO Starting execution of test_parallel_upgrade_with_internode_ssl at 2024-05-13 15:00:24.516770
15:00:24,517 dtest_setup INFO cluster ccm directory: /tmp/dtest-wg2_r_8i
-------------------------------- live log call ---------------------------------
15:00:24,614 ccm INFO Downloading https://intentionally-removed/apache-dist-cache/cassandra/2.2.19/apache-cassandra-2.2.19-bin.tar.gz to /tmp/ccm-n1r0r_p0.tar.gz (28.911MB)
15:00:24,699 ccm INFO Extracting /tmp/ccm-n1r0r_p0.tar.gz as version 2.2.19 ...
15:00:27,533 ccm INFO Supported Java versions for Cassandra distribution in '/parallel-ci/work/.ccm/repository/2.2.19': None
15:00:27,629 ccm INFO node1: Using the current Java version 11 available on PATH for the current invocation of Cassandra 2.2.19.
15:00:27,630 ccm WARNING node1: The current Java version 11 is not supported by Cassandra 2.2.19 (supported versions: [8]).
15:00:27,721 ccm INFO Supported Java versions for Cassandra distribution in '/parallel-ci/work/.ccm/repository/2.2.19': None
15:00:27,810 ccm INFO node1: Using the current Java version 11 available on PATH for the current invocation of Cassandra 2.2.19.
15:00:27,810 ccm WARNING node1: The current Java version 11 is not supported by Cassandra 2.2.19 (supported versions: [8]).
15:00:27,853 ccm INFO Starting node1 with JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64 java_version=11 cassandra_version=2.2.19, install_dir=/parallel-ci/work/.ccm/repository/2.2.19
michaelsembwever commented 1 month ago

In ci-cassandra.a.o and circleci we're skipping tests so we didn't detect this. On 4.0

21:02:54 19:02:54,486 upgrade_tests.upgrade_manifest INFO Skipping version github:apache/cassandra-5.0 because it requires JDK (11, 17). Current JDK is 8 and none of ['JAVA11_HOME', 'JAVA17_HOME'] env variables are defined.
21:02:54 19:02:54,486 upgrade_tests.upgrade_manifest INFO Skipping version github:apache/trunk because it requires JDK (11, 17). Current JDK is 8 and none of ['JAVA11_HOME', 'JAVA17_HOME'] env variables are defined.

( https://ci-cassandra.apache.org/job/Cassandra-4.0-dtest-upgrade/643/ )

On 5.0

[2024-05-09T18:54:50.415Z] 18:54:49,165 upgrade_tests.upgrade_manifest INFO Skipping version 2.2.19 because it requires JDK (7, 8). Current JDK is 11 and none of ['JAVA7_HOME', 'JAVA8_HOME'] env variables are defined.
[2024-05-09T18:54:50.415Z] 18:54:49,165 upgrade_tests.upgrade_manifest INFO Skipping version 3.0.30 because it requires JDK (8,). Current JDK is 11 and none of ['JAVA8_HOME'] env variables are defined.
[2024-05-09T18:54:50.415Z] 18:54:49,166 upgrade_tests.upgrade_manifest INFO Skipping version 3.11.17 because it requires JDK (8,). Current JDK is 11 and none of ['JAVA8_HOME'] env variables are defined.

( https://ci-cassandra.apache.org/blue/organizations/jenkins/Cassandra-5.0/detail/Cassandra-5.0/233/pipeline/1180 )

back history is CASSANDRA-18106

(separately, i think I made a mistake here not opening a CASSANDRA issue for this work, given it's broad potential impact)

Adding back jdk switching (adding back the JAVAx_HOME variables) for just these tests, in circleci and ci-cassandra.a.o, is slated as a "later" task. But not having them meant we failed to catch the breakage @aweisberg reports.

jacek-lewandowski commented 1 month ago

There is no other way to run upgrade through versions than having JAVAx_HOME variables. CCM is works now in a predictable way - uses current Java or searches for explicitly provided Java version across JAVAx_HOME directories.

CCM job is not to select a Java version - CCM is just a tool and should do whatever its user tells to do. So, in order to run upgrade through versions tests we need to ensure those env variables are set properly and dtest pass the Java version we want to use for running a node - dtests are aware of which Java version is supported as it is defined in upgrade manifest.

aweisberg commented 1 month ago

@jacek-lewandowski I think you misunderstand. This is broken when you specify JAVAX_HOME because CCM no longer determines the version of the JVM that C* needs 15:00:27,533 ccm INFO Supported Java versions for Cassandra distribution in '/parallel-ci/work/.ccm/repository/2.2.19': None and so it defaults to whatever is provided in the PATH. This works when all Cassandra versions being tested can use the same JDK version like 11 which can be used from 4.0-trunk, but stops working as soon as you need to use a version that isn't compatible with 11.

RE CCM behavior. These changes are a regression to how CCM works (and CCM is not just used for dtests) as CCM used to correctly select a compatible JDK if the provided one was wasn't compatible and now it does not. This makes CCM harder to use for a variety of use cases and forces the logic for detecting the required JVM version to be duplicated outside CCM.

I think it would have been better if this was a flag, or at least auto-selecting a JDK was a flag so other usages continue to be easy and not require manual munging of environment variables when really no one cares which JDK is used as long as it is compatible.

jacek-lewandowski commented 1 month ago

That was the initial solution which was rejected by @michaelsembwever

aweisberg commented 1 month ago

Which is it though?

I also hear that JAVAX_HOME isn’t in use anymore, and the user of CCM (dtests or otherwise) should set nothing except JAVA_HOME and that upgrading across JDK versions is now unsupported.

There is no other way to run upgrade through versions than having JAVAx_HOME variables. CCM is works now in a predictable way - uses current Java or searches for explicitly provided Java version across JAVAx_HOME directories.

OK so you are saying that CCM is supposed to pick the correct version of the JDK here, but that isn’t what this code does. JAVA8_HOME does not work in versions < 5.0. Why should using JAVAX_HOME work with versions great than 5.0, but regress for versions < 5.0?

Why does Apache CI specify JAVA11_HOME and JAVA17_HOME? Why does CCM still contain code that will look up the appropriate JAVAX_HOME version if only it could actually correctly detect the version required by Cassandra?

From the outside this looks like we ended with a random smattering of approaches mashed together that break a whole bunch of ways of testing that used to work. You used to be able to test 4.0/4.1 with JDK 11 which is the JDK that most people run in production and now you can’t.

jacek-lewandowski commented 1 month ago

@aweisberg the first commit was https://github.com/riptano/ccm/commit/b3a3f98b8541e60d1cae9f424bf6d49e01d14f99 which added --jvm-version args and worked with the assumption that the CCM is responsible for selecting the right JVM for running tests. Then @michaelsembwever in a Slack conversation said that CCM should be contextual and use JAVA_HOME/current Java on the path.

Now I'm lost. The main motivation was that while having j8 on the PATH and having JAVA11_HOME, I couldn't run Cassandra 5.

michaelsembwever commented 1 month ago

I also hear that JAVAX_HOME isn’t in use anymore, and the user of CCM (dtests or otherwise) should set nothing except JAVA_HOME and that upgrading across JDK versions is now unsupported. …

This patch was intended to simplify a number of things:

Leaving it up DTests to do the jdk switching makes sense, as it has more information about supported jdks across C* upgrade paths.

The use of JAVAx_HOME should be reserved only for when multi-upgrade dtests (not to be confused with normal upgrade tests – different terminology here) is to include upgrade steps that require jdk switching.

The full possibility of multi-upgrade paths with jdks from 3.0 to 5.0 is…

2.2/j8 -> 3.0/j8 -> 3.11/j8 -> 4.0/j8 -> 4.1/j8 -> 5.0/j11
2.2/j8 -> 3.0/j8 -> 3.11/j8 -> 4.0/j8 -> 4.1/j11 -> 5.0/j11
2.2/j8 -> 3.0/j8 -> 3.11/j8 -> 4.0/j11 -> 4.1/j11 -> 5.0/j11
2.2/j8 -> 3.0/j8 -> 3.11/j8 -> 4.0/j8 -> 4.1/j8 -> 5.0/j17
2.2/j8 -> 3.0/j8 -> 3.11/j8 -> 4.0/j8 -> 4.1/j8 -> 5.0/j17
2.2/j8 -> 3.0/j8 -> 3.11/j8 -> 4.0/j8 -> 4.1/j11 -> 5.0/j17
2.2/j8 -> 3.0/j8 -> 3.11/j8 -> 4.0/j11 -> 4.1/j11 -> 5.0/j17

This should be possible with the use of JAVAx_HOME and how the path+java_home is set to define the preferred jdk. That is, It should be possible to

if path and java_home are set, only switch to a javaX_home if path and java_home is a not supported jdk

This is currently broken (by this patch).

And this logic should be in dtest, or the testing code, not within CCM. (A known limitation here is that only one preferred jdk can be specified, so with longer paths that involve four or more jdks we couldn't state two different preferred jdks.)

michaelsembwever commented 1 month ago

A separate discussion is also around the value of these multi-upgrade tests and jdk-switching. They are testing faults in serialisation that don't appear until two major version upgrades later. We have unit tests for this already (those that use the test/data/serialization/ resources), so these multi-upgrade tests are smoke tests to catch what the existing serialisation tests don't. Today neither circleci nor the jenkinsfile runs these smoke tests with jdk-switching (so their upgrade paths are bound to what can be done with the env's current jdk – i.e. JAVAx_HOME is not set anywhere).

aweisberg commented 1 month ago
The use of JAVAx_HOME should be reserved only for when multi-upgrade dtests (not to be confused with normal upgrade tests – different terminology here) is to include upgrade steps that require jdk switching.

I really feel lIke we got a bad deal here moving complexity from A to B where B is higher up the stack and this negatively impacts other users of A that also now need to duplicate that complexity.

If what we want is better management of preferred versions it seems like that is the thing to add.

jmckenzie-dev commented 1 month ago

I want to call out, as kindly and respectfully as possible because I'm really fond of everyone here, how disappointing it is to see a flub like this. Not the bug / technical aspect; we all do that all the time. The lack of a JIRA ticket for this or notification to the project that a change was happening deep in the guts of how CCM works and determines version numbers. A dev ML thread. Something other than a slack thread 😄

It strikes me that cassandra-dtest and ccm still existing in this weird non-ASF limbo probably doesn't help. But anyway, at the end of the day, let's keep trying our best to make sure that any changes to CCM or cassandra-dtests's deeper bowels are communicated with the project at large. This is probably only something that effects those of us running the only other CI env I'm aware of (for now), but when we can start getting more env onto that shared JenkinsFile the blast radius and risks of changes like this are only going to widen.

michaelsembwever commented 1 month ago

I really feel lIke we got a bad deal here moving complexity from A to B where B is higher up the stack and this negatively impacts other users of A that also now need to duplicate that complexity.

Yes, that this functionality was being used actively comes as a surprise :(

If what we want is better management of preferred versions it seems like that is the thing to add.

Given the low-maintenance mode of ccm, I'd rather not rush into this. We do want to donate ccm to the project, but still the use of it (at least for python dtests) seems to be stale/reducing… (in favour of the jvm dtests).

I know it's been a more significant change than expected, and should have been better communicated (jira issue and deprecation cycle), nonetheless it would still be great IMHO to see jdk-switching be a capability that only belongs to the testing code and not to the cluster provisioning tool…

(A key motivating factor here has been dealing with the number of changes and fixes required everytime a new C* version changes its jdk versions.)

This is probably only something that effects those of us running the only other CI env I'm aware of (for now)…

Being practical. We can roll back these changes if need be (and that can be initially quickly done just by moving the cassandra-test tag back to where it was).

But out of curiosity, what's actually a possible fix here to move forward…? My understanding is that it's now only a simple fix that's required in the cassandra-dtest code …?

I will create the jira ticket, post-mortem.

aweisberg commented 1 month ago

I want to be clear when I say other users of ccm I don’t mean people I mean use cases like local testing or other test harnesses that leverage CCM (not CI). Could have picked better terminology.

I am not very concerned about a little breakage happening here or there (glass houses) just because it’s a fact of life when things are loosely coupled and not tested along with CCM.

aweisberg commented 1 month ago

I am almost through updating CI so that it doesn’t run the upgrade paths that don’t work anymore so I don’t think we need to roll back at this point.

jacek-lewandowski commented 1 month ago

I don't know any place to create a ticket for CCM - could you @jmckenzie-dev or @michaelsembwever create a new component in Cassandra Jira project - CCM?

I'm fine with either approach as long as it is consistent:

  1. CCM is/can be responsible for JVM selection - in this case we should remove supported Java versions from the manifest (VersionMeta) and let Dtests retrieve them using a CCM lib method, the same which CCM uses to automatically select Java version.

  2. The caller (user/dtest) is responsible for JVM selection. In this case we can remove the Java selection code entirely from CCM (as it is implemented right now, it uses that approach as a last resort fallback, when there is no JAVA_HOME, no java command available on the path, and no Java version specified explicitly.

I have a patch for dtest to use the current CCM impl but I'm ready to implement whatever is needed. My goal is to:

Let's decide the whether we want to go 1 or 2 and I'll fix it.

michaelsembwever commented 1 month ago

(2) please.

jacek-lewandowski commented 1 month ago

I've realised that (2) is logically problematic since things like https://github.com/riptano/ccm/blob/d74db63d75112908a77b6c80757df9343fdc3338/ccmlib/common.py#L62 where we actually make CCM do things implicitly depending on Cassandra version, which contradicts the assumptions of (2).

michaelsembwever commented 1 month ago

CCM_41_YAML_OPTIONS isn't related to jvm selection or switching.

Upgrading a node can be expected to upgrade files. We could introduce flags to modify this behaviour, but I don't put it in the same category as changing the environment. The node owns its config and data files, and if you start a node up with a new version then it's to be expected these things are changed accordingly… Ideally CCM shouldn't have to provide this config compatibility, as C* can work with older config options, but IIUC because ccm was already doing stuff with config options the work in CASSANDRA-15234 needed to touch ccm too…

We don't provide configuration options (owned or contextual) to Cassandra for it to detect jdks. But it does have the ability to handle config versions.

This seems a reasonable definition between what is the callers responsibility and what ccm can be responsible for. I'd also ask we seperate discussion between jdk selection and config compatibilities, for the sake of simpler discussion and incremental work. e.g. i'm sure CCM_41_YAML_OPTIONS can be improved too, but that shouldn't block this work.

I'm also still open to just reverting these patches. The other patch is https://github.com/riptano/ccm/pull/766

I believe that original bug only needed --jvm-version arg added and a check that path and java_home env vars match.

michaelsembwever commented 1 month ago

@jacek-lewandowski created the jira ticket https://issues.apache.org/jira/browse/CASSANDRA-19636

jacek-lewandowski commented 1 month ago

https://github.com/riptano/ccm/pull/768 - revert of those two commits and a fix which does not change the logic (apart from minor fixes)