scylladb / scylla-ccm

Cassandra Cluster Manager, modified for Scylla
Apache License 2.0
20 stars 60 forks source link

scylla-node: prepare for the post-jmx world #565

Closed denesb closed 3 months ago

denesb commented 3 months ago

Soon the jmx package will be dropped from the unified package, and the scylla.git repo. Prepare for this:

Fixes: https://github.com/scylladb/scylla-dtest/issues/4093

denesb commented 3 months ago

We need a full dtest run with this PR, I don't know how to do that.

denesb commented 3 months ago

This PR is a 6.0 release blocker, so please give it attention.

tchaikov commented 3 months ago

We need a full dtest run with this PR, I don't know how to do that.

running at https://jenkins.scylladb.com/job/scylla-master/job/byo/job/dtest-byo/211/

denesb commented 3 months ago

Looks like I have to fix ccm's own tests too.

tchaikov commented 3 months ago

Looks like I have to fix ccm's own tests too.

@denesb see #566

denesb commented 3 months ago

@denesb is the next step to update the building system of scylla and then drop tools/jmx submodule?

Yes, see https://github.com/scylladb/scylladb/pull/17969.

denesb commented 3 months ago

New in v2:

fruch commented 3 months ago

We need a full dtest run with this PR, I don't know how to do that.

running at https://jenkins.scylladb.com/job/scylla-master/job/byo/job/dtest-byo/211/

This is just the gating (and seems like there is some work still needed, not sure if in this PR)

We should also do a run with enterprise

And a full run, to flush issues with non gating tests.

tchaikov commented 3 months ago

New in v2:

* Update package version in `tests/test_config.py`, to a version, which has the fully complete native nodetool.

but i think it would be great to continue using the cached binary as long as cache works.

We need a full dtest run with this PR, I don't know how to do that.

running at https://jenkins.scylladb.com/job/scylla-master/job/byo/job/dtest-byo/211/

This is just the gating (and seems like there is some work still needed, not sure if in this PR)

We should also do a run with enterprise

And a full run, to flush issues with non gating tests.

ah, right. rescheduling at https://jenkins.scylladb.com/job/scylla-master/job/byo/job/dtest-byo/212/

denesb commented 3 months ago

but i think it would be great to continue using the cached binary as long as cache works.

Not sure what do you mean, what cache?

denesb commented 3 months ago

I just discovered two new commands I didn't know before:

tchaikov commented 3 months ago

but i think it would be great to continue using the cached binary as long as cache works.

Not sure what do you mean, what cache?

IIUC, the github workflows try to reuse the cached binary. see https://github.com/scylladb/scylla-ccm/blob/903a2a65f70ebc9d1327c2f8af7159ef460664d7/.github/workflows/nix.yml#L26-L31 and https://github.com/scylladb/scylla-ccm/blob/903a2a65f70ebc9d1327c2f8af7159ef460664d7/.github/workflows/integration-tests.yml#L54-L72. in other words, i am trying to raise more attentions to #566 . i will apply the same change to integration-tests.yml once i get positive feedbacks.

denesb commented 3 months ago

Analysis of failed gating dtests. I will soon prepare scylla-dtest.git and scylla.git branches to rerun this with.

FullDtest / full-split003 / reversed_queries_test.TestReversedQueriesSelectorsDuringUpgrade.test_queries_during_upgrade

Version we upgrade to doesn't have feature complete nodetool. Need to change the test to use a more recent version.

[FullDtest / full-split001 / compaction_test.TestCompaction.test_compaction_delete_tombstone_gc[immediate-SizeTieredCompactionStrategy]](https://jenkins.scylladb.com/job/scylla-master/job/byo/job/dtest-byo/211/testReport/junit/compaction_test/TestCompaction/FullDtest___full_split001___test_compaction_delete_tombstone_gc_immediate_SizeTieredCompactionStrategy_/)
[FullDtest / full-split001 / compaction_test.TestCompaction.test_compaction_delete_tombstone_gc[immediate-TimeWindowCompactionStrategy]](https://jenkins.scylladb.com/job/scylla-master/job/byo/job/dtest-byo/211/testReport/junit/compaction_test/TestCompaction/

Flaky test (very possibly related to native nodetool somehow): https://github.com/scylladb/scylladb/issues/17947

FullDtest_fullsplit001test_compaction_delete_tombstone_gc_immediateTimeWindowCompactionStrategy/)

[FullDtest / full-split001 / nodetool_additional_test.TestNodetool.test_scrub_with_multi_nodes_expect_data_rebuild](https://jenkins.scylladb.com/job/scylla-master/job/byo/job/dtest-byo/211/testReport/junit/nodetool_additional_test/TestNodetool/FullDtest___full_split001___test_scrub_with_multi_nodes_expect_data_rebuild/)

Missing getsstables command.

[FullDtest / full-split001 / update_cluster_layout_tests.TestUpdateClusterLayout.test_change_node_ip](https://jenkins.scylladb.com/job/scylla-master/job/byo/job/dtest-byo/211/testReport/junit/update_cluster_layout_tests/TestUpdateClusterLayout/FullDtest___full_split001___test_change_node_ip/)

Version we upgrade to doesn't have feature complete nodetool. Need to change the test to use a more recent version.

[FullDtest / full-split002 / compaction_test.TestCompaction.test_compaction_delete_tombstone_gc[immediate-TimeWindowCompactionStrategy]](https://jenkins.scylladb.com/job/scylla-master/job/byo/job/dtest-byo/211/testReport/junit/compaction_test/TestCompaction/

Flaky test (very possibly related to native nodetool somehow): https://github.com/scylladb/scylladb/issues/17947

FullDtest_fullsplit002test_compaction_delete_tombstone_gc_immediateTimeWindowCompactionStrategy/)

[FullDtest / full-split000 / nodetool_additional_test.TestNodetool.test_get_sstable](https://jenkins.scylladb.com/job/scylla-master/job/byo/job/dtest-byo/211/testReport/junit/nodetool_additional_test/TestNodetool/FullDtest___full_split000___test_get_sstable/)

Need to implement getsstable command.

[FullDtest / full-split000 / nodetool_additional_test.TestNodetool.test_sstable_info](https://jenkins.scylladb.com/job/scylla-master/job/byo/job/dtest-byo/211/testReport/junit/nodetool_additional_test/TestNodetool/FullDtest___full_split000___test_sstable_info/)

Need to implement sstableinfo command.

[FullDtest / full-split000 / toppartitions_test.TestTopPartitions.test_one_keyspace_multiple_column_families_writes_sample_and_empty_reads_sample](https://jenkins.scylladb.com/job/scylla-master/job/byo/job/dtest-byo/211/testReport/junit/toppartitions_test/TestTopPartitions/FullDtest___full_split000___test_one_keyspace_multiple_column_families_writes_sample_and_empty_reads_sample/)
[FullDtest / full-split000 / toppartitions_test.TestTopPartitions.test_all_column_families_writes_only](https://jenkins.scylladb.com/job/scylla-master/job/byo/job/dtest-byo/211/testReport/junit/toppartitions_test/TestTopPartitions/FullDtest___full_split000___test_all_column_families_writes_only/)
[FullDtest / full-split000 / toppartitions_test.TestTopPartitions.test_writes_sample_for_10_partitions_with_100_op_and_empty_reads_sample](https://jenkins.scylladb.com/job/scylla-master/job/byo/job/dtest-byo/211/testReport/junit/toppartitions_test/TestTopPartitions/FullDtest___full_split000___test_writes_sample_for_10_partitions_with_100_op_and_empty_reads_sample/)
[FullDtest / full-split000 / toppartitions_test.TestTopPartitions.test_all_column_families_reads_only](https://jenkins.scylladb.com/job/scylla-master/job/byo/job/dtest-byo/211/testReport/junit/toppartitions_test/TestTopPartitions/FullDtest___full_split000___test_all_column_families_reads_only/)
[FullDtest / full-split000 / toppartitions_test.TestTopPartitions.test_reads_sample_for_10_partitions_with_100_op_and_empty_writes_sample](https://jenkins.scylladb.com/job/scylla-master/job/byo/job/dtest-byo/211/testReport/junit/toppartitions_test/TestTopPartitions/FullDtest___full_split000___test_reads_sample_for_10_partitions_with_100_op_and_empty_writes_sample/)
[FullDtest / full-split000 / toppartitions_test.TestTopPartitions.test_top_5_paritions_write_samplers_and_empty_read_sample](https://jenkins.scylladb.com/job/scylla-master/job/byo/job/dtest-byo/211/testReport/junit/toppartitions_test/TestTopPartitions/FullDtest___full_split000___test_top_5_paritions_write_samplers_and_empty_read_sample/)
[FullDtest / full-split000 / toppartitions_test.TestTopPartitions.test_top_5_paritions_for_read_samplers_and_empty_write_sample](https://jenkins.scylladb.com/job/scylla-master/job/byo/job/dtest-byo/211/testReport/junit/toppartitions_test/TestTopPartitions/FullDtest___full_split000___test_top_5_paritions_for_read_samplers_and_empty_write_sample/)
[FullDtest / full-split000 / toppartitions_test.TestTopPartitions.test_top_3_paritions_for_write_samplers_only](https://jenkins.scylladb.com/job/scylla-master/job/byo/job/dtest-byo/211/testReport/junit/toppartitions_test/TestTopPartitions/FullDtest___full_split000___test_top_3_paritions_for_write_samplers_only/)
[FullDtest / full-split000 / toppartitions_test.TestTopPartitions.test_top_3_paritions_for_read_samplers_only](https://jenkins.scylladb.com/job/scylla-master/job/byo/job/dtest-byo/211/testReport/junit/toppartitions_test/TestTopPartitions/FullDtest___full_split000___test_top_3_paritions_for_read_samplers_only/)
[FullDtest / full-split000 / toppartitions_test.TestTopPartitions.test_param_sampler_writes_and_capacity_size](https://jenkins.scylladb.com/job/scylla-master/job/byo/job/dtest-byo/211/testReport/junit/toppartitions_test/TestTopPartitions/FullDtest___full_split000___test_param_sampler_writes_and_capacity_size/)
[FullDtest / full-split000 / toppartitions_test.TestTopPartitions.test_param_sampler_read_and_capacity_size](https://jenkins.scylladb.com/job/scylla-master/job/byo/job/dtest-byo/211/testReport/junit/toppartitions_test/TestTopPartitions/FullDtest___full_split000___test_param_sampler_read_and_capacity_size/)
[FullDtest / full-split000 / toppartitions_test.TestTopPartitions.test_write_into_one_paritions_to_different_rows](https://jenkins.scylladb.com/job/scylla-master/job/byo/job/dtest-byo/211/testReport/junit/toppartitions_test/TestTopPartitions/FullDtest___full_split000___test_write_into_one_paritions_to_different_rows/)
[FullDtest / full-split000 / toppartitions_test.TestTopPartitions.test_one_keyspace_multiple_column_families_reads_sample_and_empty_writes_sample](https://jenkins.scylladb.com/job/scylla-master/job/byo/job/dtest-byo/211/testReport/junit/toppartitions_test/TestTopPartitions/FullDtest___full_split000___test_one_keyspace_multiple_column_families_reads_sample_and_empty_writes_sample/)

tools/toppartitions.py uses a regexp to search for running nodetool executable which doesn't match the native nodetool, regex needs minor adjustment

denesb commented 3 months ago

New in v3 - fix the tests:

denesb commented 3 months ago

I will have to roll back the removal of JMX, because it breaks upgrade tests. :(

denesb commented 3 months ago

New in v4:

denesb commented 3 months ago

Kicked off a new BYO: https://jenkins.scylladb.com/job/scylla-master/job/byo/job/dtest-byo/213/

fruch commented 3 months ago

@denesb

we are waiting for https://github.com/scylladb/scylladb/pull/17979 to be merged first, right ?

denesb commented 3 months ago

we are waiting for scylladb/scylladb#17979 to be merged first, right ?

Yes.

denesb commented 3 months ago

New in v4:

denesb commented 3 months ago

@tchaikov I'm confused as to why the integration is failing. I rebased on top of https://github.com/scylladb/scylla-ccm/pull/566. Do I also need to rebase on top of https://github.com/scylladb/scylla-ccm/pull/567?

tchaikov commented 3 months ago

@denesb hi Botond, i am sorry for this inconvenience. the test failure is indeed related the recent change to the nix workflow. in which, i set the SCYLLA_VERSION env var, see https://github.com/scylladb/scylla-ccm/blob/22283e1f59aa640b5e301395ead9b3dcf3ec0ff2/.github/workflows/nix.yml#L37

and in the test, we respect this env var over the setting hardwired in the code, see https://github.com/scylladb/scylla-ccm/blob/22283e1f59aa640b5e301395ead9b3dcf3ec0ff2/tests/test_config.py#L8-L9 .

so, if you want to bump up the scylla version to be tested, i'd suggest update workflows/nix.yml as well.

denesb commented 3 months ago

Ok, updated nix.yaml.

avikivity commented 3 months ago

It would be best to have ccm always use the relocatable package and install it, and just call nodetool without worrying if it's native or not.

denesb commented 3 months ago

It would be best to have ccm always use the relocatable package and install it, and just call nodetool without worrying if it's native or not.

It kinda has to know, because CCM is also the one taking care of scylla-jmx. And when using the native nodetool, you don't need JMX. Looking forward, having CCM "know" also allows us to drop all this JMX-related code at one point in the future.

tchaikov commented 3 months ago

also, i would prefer keeping the capability to test without reloc package. it's much more convenient for a faster development iteration to use the local build directory for testing.

denesb commented 3 months ago

Another advantage of CCM knowing about the native nodetool: currently (before this PR), ccm would pass the JMX port to the nodetool's -p parameter, so we had to add another sneaky parameter, which looks like a Java def, which is internally recognized and translated to --rest-api-port (see https://github.com/scylladb/scylladb/pull/17168/commits/c2425ca1354072ec733a369edd2402b33dadae2e). With CCM knowing which nodetool it is calling to, we can revert that commit, before release, so users do not have to wonder why there is a -p and --rest-api-port.

avikivity commented 3 months ago

It would be best to have ccm always use the relocatable package and install it, and just call nodetool without worrying if it's native or not.

It kinda has to know, because CCM is also the one taking care of scylla-jmx. And when using the native nodetool, you don't need JMX. Looking forward, having CCM "know" also allows us to drop all this JMX-related code at one point in the future.

Yes. In any case we don't want to drop ccm capabilities that we already have.

denesb commented 3 months ago

All dependencies are in, kicked off a new BYO job: https://jenkins.scylladb.com/job/scylla-master/job/byo/job/dtest-byo/220/

tchaikov commented 3 months ago

Ok, updated nix.yaml.

as #567 got merged, could you please update .github/workflows/integration-tests.yml as well.once #569 gets merged, it'd be simpler to update these versions -- in a single file.

denesb commented 3 months ago

All dependencies are in, kicked off a new BYO job: https://jenkins.scylladb.com/job/scylla-master/job/byo/job/dtest-byo/220/

Looks like I have to fill all the repo fields, I cannot leave them empty. Now a BYO which actually got beyond the checkout phase: https://jenkins.scylladb.com/job/scylla-master/job/byo/job/dtest-byo/223/

denesb commented 3 months ago

Latest BYO:

FullDtest / full-split001 / nodetool_additional_test.TestNodetool.test_scrub_with_multi_nodes_expect_data_rebuild FullDtest / full-split001 / update_cluster_layout_tests.TestUpdateClusterLayout.test_change_node_ip FullDtest / full-split000 / nodetool_additional_test.TestNodetool.test_get_sstable FullDtest / full-split000 / nodetool_additional_test.TestNodetool.test_sstable_info

For some reason, BYO didn't pull the latest master, the sstableinfo and getsstables commands are still missing, even though they are present on master. The 4th test -- update_cluster_layout_tests.TestUpdateClusterLayout.test_change_node_ip -- is something else, I have to look into it.

denesb commented 3 months ago

FullDtest / full-split001 / update_cluster_layout_tests.TestUpdateClusterLayout.test_change_node_ip

This is another test which fails because the native nodetool is too fast. Adding a 5s sleep makes the test pass.

denesb commented 3 months ago

FullDtest / full-split001 / update_cluster_layout_tests.TestUpdateClusterLayout.test_change_node_ip

This is another test which fails because the native nodetool is too fast. Adding a 5s sleep makes the test pass.

https://github.com/scylladb/scylla-dtest/pull/4114

denesb commented 3 months ago

New BYO with build is running: https://jenkins.scylladb.com/job/scylla-master/job/byo/job/byo_build_tests_dtest/2165/

denesb commented 3 months ago

There is one more nodetool command missing: checkAndRepairCdcStreams.

denesb commented 3 months ago

Rebased again, @tchaikov please check that the workflow yamls look ok.

denesb commented 3 months ago

There is one more nodetool command missing: checkAndRepairCdcStreams.

PR: https://github.com/scylladb/scylladb/pull/18076

denesb commented 3 months ago

New BYO, including the two above PRs: https://jenkins.scylladb.com/job/scylla-master/job/byo/job/byo_build_tests_dtest/2166/

avikivity commented 3 months ago

FullDtest / full-split001 / update_cluster_layout_tests.TestUpdateClusterLayout.test_change_node_ip

This is another test which fails because the native nodetool is too fast.

Where's the compatibility?!

fruch commented 3 months ago

FullDtest / full-split001 / update_cluster_layout_tests.TestUpdateClusterLayout.test_change_node_ip

This is another test which fails because the native nodetool is too fast.

Where's the compatibility?!

yeah, they should have shelp a few sleeps in the new code 🤣

denesb commented 3 months ago

It would be best to have ccm always use the relocatable package and install it, and just call nodetool without worrying if it's native or not.

I may have to go back to this, I'm hitting issues in upgrade tests and I ended up restoring the pre-PR code completely, just invoked in an else: branch. I will prepare a Scylla branch which installs the native nodetool as nodetool and lets see if that works.

I may have to discard this PR completely. At least it served as a good integration test for finding the last wrinkles to iron out in the native nodetool.

I will report back when I have results.

denesb commented 3 months ago

New in v5:

With this, CCM should be prepared for the disappearance of the java tools. Lets see how dtests will take this (I tested with a small sample, including an upgrade test, which is tricky).

denesb commented 3 months ago

BYO (with java packages removed): https://jenkins.scylladb.com/job/scylla-master/job/byo/job/byo_build_tests_dtest/2168/

fruch commented 3 months ago

BYO (with java packages removed): https://jenkins.scylladb.com/job/scylla-master/job/byo/job/byo_build_tests_dtest/2168/

That's quite ambitious

There quite a lot of tests using Cassandra-stress, you'll need to supply an alternative or in CCM or in dtest

denesb commented 3 months ago

BYO (with java packages removed): https://jenkins.scylladb.com/job/scylla-master/job/byo/job/byo_build_tests_dtest/2168/

That's quite ambitious

There quite a lot of tests using Cassandra-stress, you'll need to supply an alternative or in CCM or in dtest

Yes, it was more for curiosity and also because dropping the java packages is one of our goals for 6.0 (but we of course want to keep c-s). There are 280 failed tests already.

I already kicked off another BYO run, which runs with the java tools still in (it is in the queue).

denesb commented 3 months ago

I will hack https://github.com/scylladb/scylladb/pull/17969 to install only cassandra-stress, but nothing else, then adjust this PR to work with that. This is the agree-upon end-state: only cassandra-stress is part of the unified package, jmx and the rest of the java-tools content is dropped.

fruch commented 3 months ago

I will hack https://github.com/scylladb/scylladb/pull/17969 to install only cassandra-stress, but nothing else, then adjust this PR to work with that. This is the agree-upon end-state: only cassandra-stress is part of the unified package, jmx and the rest of the java-tools content is dropped.

I thought @syuu1228 was doing some work on repacking c-s on itself own, maybe you want to sync with him

denesb commented 3 months ago

I will hack scylladb/scylladb#17969 to install only cassandra-stress, but nothing else, then adjust this PR to work with that. This is the agree-upon end-state: only cassandra-stress is part of the unified package, jmx and the rest of the java-tools content is dropped.

I thought @syuu1228 was doing some work on repacking c-s on itself own, maybe you want to sync with him

Yes, I will. For now I just want to see what changes I need here in ccm, to keep cassandra-stress working, but not care about the other java tools.

denesb commented 3 months ago

v6: reduce the scope of the PR from preparing for the post-java world, to preparing to the post-jmx world.