scylladb / scylla-ccm

Cassandra Cluster Manager, modified for Scylla
Apache License 2.0

scylla_cluster: fix handling of wait_other_notice #462

Closed nyh closed 1 year ago

nyh commented 1 year ago

After starting a multi-node cluster, it's important to wait until all nodes are aware that all other nodes are available; otherwise, if the user sends a CL=ALL request to one node, that node might not be aware that the other replicas are usable and will fail the request.

For this reason a cluster's start() has a wait_other_notice=True option, and dtests correctly use it. However, the implementation doesn't wait for the right thing... To check if node A thinks that B is available, it checks that A's log contains the message "InetAddress B is now UP". But this message is unreliable - when it is printed, A still doesn't think that B is fully available - it can still think that B is in a "joining" state for a while longer. If wait_other_notice returns at this point, and the user sends a CL=ALL request to node A, it will fail.
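For reference, a rough illustration of what the log-based check boils down to (the actual ccm helper is more involved; the function below is just a sketch):

```python
import re

def log_reports_up(log_text, peer_addr):
    """Return True if the observing node's log contains the gossip
    'InetAddress <peer> is now UP' line for peer_addr. This is the signal the
    old wait_other_notice relied on, and it can appear before the peer has
    actually left the 'joining' state."""
    return re.search(rf"InetAddress {re.escape(peer_addr)} is now UP", log_text) is not None
```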

The solution I propose in this patch uses the REST API, instead of the log, to wait until node A thinks node B is both live and finished joining.
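A minimal sketch of the idea (not the exact code in this PR), assuming Scylla's REST API on its default port 10000 with the /gossiper/endpoint/live/ and /storage_service/nodes/joining endpoints; the helper name and polling interval are illustrative:

```python
import time
import requests

def wait_rest_for_alive(observer_addr, target_addrs, timeout=120, port=10000):
    """Poll observer_addr's REST API until it reports every address in
    target_addrs as live and no longer in the 'joining' state."""
    targets = set(target_addrs)
    deadline = time.time() + timeout
    while time.time() < deadline:
        live = set(requests.get(
            f"http://{observer_addr}:{port}/gossiper/endpoint/live/").json())
        joining = set(requests.get(
            f"http://{observer_addr}:{port}/storage_service/nodes/joining").json())
        if targets <= live and not (targets & joining):
            return
        time.sleep(0.1)  # 0.1s polling interval, as discussed later in the thread
    raise TimeoutError(
        f"{observer_addr} still does not consider {sorted(targets)} live and settled")
```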

This patch is needed if Scylla is modified to boot up faster: we start seeing dtests that use RF=ALL fail at the beginning of a test, because the node we contact doesn't know that the other nodes are usable.

Fixes #461

fruch commented 1 year ago

@nyh it seems like there are some places in dtest using watch_log_for_alive; we should probably replace them after this one gets merged

nyh commented 1 year ago

@nyh it seems like there are some places in dtest using watch_log_for_alive; we should probably replace them after this one gets merged

Do we run dtests also on Cassandra? If we do, my new function won't work there :-(

fruch commented 1 year ago

@nyh it seems like there are some places in dtest using watch_log_for_alive; we should probably replace them after this one gets merged

Do we run dtests also on Cassandra? If we do, my new function won't work there :-(

We could run with Cassandra, but we don't do that regularly.

We could overload watch_log_for_alive to always go to watch_rest_for_alive if it's Scylla, and then tests would still work with no change needed (see the sketch below).
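A rough sketch of that suggestion; the method names follow the ones mentioned in this thread, but the exact signatures in ccm/dtest may differ:

```python
def wait_for_alive(node, other_nodes, timeout=120):
    """Dispatch helper: on Scylla, use the REST-based wait added in this PR;
    otherwise fall back to the log-based check, so existing dtest call sites
    keep working unchanged."""
    if hasattr(node, "watch_rest_for_alive"):  # crude "is this Scylla?" probe (assumption)
        node.watch_rest_for_alive(other_nodes, timeout=timeout)
    else:
        node.watch_log_for_alive(other_nodes, timeout=timeout)
```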

nyh commented 1 year ago

Pushed new version with that "_" thing. Didn't change the sleep from 0.1 seconds - I can change it to anything, but I don't know what to pick...

nyh commented 1 year ago

Pushed a new version: the REST API wait now also waits for nodes to know the other nodes' tokens. With this patch, when dtest is modified to disable "wait for gossip to settle", we're down from the 170 failures I started with to "just" 38.
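The extra token condition could look roughly like this; the endpoint path below is an assumption based on Scylla's REST API, not copied from the patch:

```python
import requests

def observer_knows_tokens(observer_addr, target_addr, port=10000):
    """Ask observer_addr (via its REST API) which tokens it has recorded for
    target_addr; an empty list means the wait is not over yet."""
    url = f"http://{observer_addr}:{port}/storage_service/tokens/{target_addr}"
    return len(requests.get(url).json()) > 0
```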

nyh commented 1 year ago

Ping.

fruch commented 1 year ago

@bhalevy, as the one who refactored it a few times, care to also give it a look?

fruch commented 1 year ago

@nyh let's do a run of this change without the wait for gossip change, to see that we aren't breaking any tests with it.

And then we could merge it

mykaul commented 1 year ago

@eliransin - do you need to review this to get this moving forward?

fruch commented 1 year ago

@nyh let's do a run of this change without the wait for gossip change, to see that we aren't breaking any tests with it.

And then we could merge it

Running this with a full dtest suite to verify it's not causing regressions (or not too many of them): https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/fruch/job/new-dtest-pytest-parallel/314/

fruch commented 1 year ago

@eliransin - do you need to review this to get this moving forward?

It's a change that affects all of the tests; I would want it tested first (@nyh has so far tested it only with 'skip_wait_for_gossip_to_settle': 0).

nyh commented 1 year ago

Pushed a new version - rebased, and changed {} to set(), as @bhalevy noticed.
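For anyone skimming the diff, the reason for that change is a Python detail: the literal {} creates an empty dict, not an empty set, so an empty set must be spelled set():

```python
empty_dict = {}          # {} is an empty dict...
empty_set = set()        # ...an empty set needs the set() constructor
print(type(empty_dict))  # <class 'dict'>
print(type(empty_set))   # <class 'set'>
empty_set.add("127.0.0.1")     # fine
# empty_dict.add("127.0.0.1")  # AttributeError: 'dict' object has no attribute 'add'
```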

nyh commented 1 year ago

@eliransin - do you need to review this to get this moving forward?

It's a change that affects all of the tests; I would want it tested first (@nyh has so far tested it only with 'skip_wait_for_gossip_to_settle': 0).

I think you did this above, but I don't know how to read the results (if I understood correctly, we can't expect zero failures?).

fruch commented 1 year ago

@nyh let's do a run of this change without the wait for gossip change, to see that we aren't breaking any tests with it. And then we could merge it

Running this with a full dtest suite to verify it's not causing regressions (or not too many of them): https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/fruch/job/new-dtest-pytest-parallel/314/

@nyh only a handful of failures

Some are hitting the 2-minute timeout of the newly introduced wait function. I'm now running only the gating suite with it: https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/fruch/job/new-dtest-pytest-parallel/315/

fruch commented 1 year ago

https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/fruch/job/new-dtest-pytest-parallel/314/testReport/junit/bootstrap_test/TestBootstrap/Run_Dtest_Parallel_Cloud_Machines___FullDtest___full_split018___test_seeds_on_duty/

this test expects boot to fail; the test should enable wait_other_notice

https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/fruch/job/new-dtest-pytest-parallel/314/testReport/replace_address_test/TestReplaceAddress/Run_Dtest_Parallel_Cloud_Machines___HeavyDtest___heavy_split001___test_replace_node_diff_ip_take_write_use_host_id_rbo_disabled_/

fails a bit like failures in debug runs...

https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/fruch/job/new-dtest-pytest-parallel/314/testReport/compaction_additional_test/TestTimeWindowDataSegregation/Run_Dtest_Parallel_Cloud_Machines___FullDtest___full_split006___test_twcs_multiple_sstables_during_bootstrap/

missing data in node3, doesn't fail like that on master runs

failing as in master: https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/fruch/job/new-dtest-pytest-parallel/314/testReport/repair_based_node_operations_test/TestRepairBasedNodeOperations/Run_Dtest_Parallel_Cloud_Machines___FullDtest___full_split019___test_ignore_dead_nodes_for_replace_option_2/

https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/fruch/job/new-dtest-pytest-parallel/314/testReport/compaction_additional_test/TestTimeWindowDataSegregation/Run_Dtest_Parallel_Cloud_Machines___FullDtest___full_split021___test_enable_disable_optimized_query_for_twcs/

https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/fruch/job/new-dtest-pytest-parallel/314/testReport/update_cluster_layout_tests/TestUpdateClusterLayout/Run_Dtest_Parallel_Cloud_Machines___FullDtest___full_split010___test_decommission_after_changing_node_ip_2/

https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/fruch/job/new-dtest-pytest-parallel/314/testReport/cqlsh_tests.cqlsh_copy_tests/TestCqlshCopy/Run_Dtest_Parallel_Cloud_Machines___FullDtest___full_split000___test_copy_to_with_child_process_crashing/

fruch commented 1 year ago

@nyh gating passed o.k.

So we are left with 3 tests that seem to regress (and we didn't test enterprise features, even though most of them are single-node).

@bhalevy @eliransin, it's your call whether we are o.k. with fixing those tests later, or whether this should be blocked until someone is able to attend to them?

bhalevy commented 1 year ago

I don't see how it would be ok to introduce regressions

nyh commented 1 year ago

I don't see how it would be ok to introduce regressions

It's not ok, but to be honest, I don't have the peace of mind to fix random buggy dtests while I keep getting requests to do other things at top priority, so I'm not going to fix this any time soon.

roydahan commented 1 year ago

I started a new run with this: https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/fruch/job/new-dtest-pytest-parallel/

The 3 test failures @fruch mentioned are directly related but look quite easy to fix, so we can assign someone to fix them.

However, this PR by itself won't improve runtime, and may possibly even degrade it.

nyh commented 1 year ago

I started a new run with this: https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/fruch/job/new-dtest-pytest-parallel/

The 3 test failures @fruch mentioned are directly related but look quite easy to fix, so we can assign someone to fix them.

Thanks.

However, this PR by itself won't improve runtime, and may possibly even degrade it.

The idea is to have a dtest PR which runs Scylla without huge sleeps. The branch is https://github.com/nyh/scylla-dtest/commits/quick-boot-2. There is a one-line patch https://github.com/nyh/scylla-dtest/commit/0b23742b4622c9b6e18f4613722ce433640cf41c plus one fix for one test's bug - https://github.com/nyh/scylla-dtest/commit/0b23742b4622c9b6e18f4613722ce433640cf41c. There will need to be more patches because a few dozen tests still fail (not as many as before the ccm patches, but still a non-zero list).

The next step could be to make the dtest patch an optional parameter (so developers can use it), but @fruch and @eliransin suggested that it could be the default, with a tag added for tests that can't work in this "sleepless" mode so that only they run in the slow mode.

fruch commented 1 year ago

I'm handling the tests that need to be fixed; it looks like the issue is identical in all 3 tests: we need to not wait_other_notice, and that solves it.

So this last run had 8 failures; 6 of them would be solved by https://github.com/scylladb/scylla-dtest/pull/3313, and the remaining 2 are identical to master failures (which means we have only 2 failures on master ????)

fruch commented 1 year ago

I'm handling the tests that need to be fixed; it looks like the issue is identical in all 3 tests: we need to not wait_other_notice, and that solves it.

So this last run had 8 failures; 6 of them would be solved by scylladb/scylla-dtest#3313, and the remaining 2 are identical to master failures (which means we have only 2 failures on master ????)

I was skeptical, so I went to check: https://jenkins.scylladb.com/job/scylla-master/job/dtest-daily-release/306/testReport/

for the last 3 runs, we have only 2 failing tests

fruch commented 1 year ago

A run with the fixes from https://github.com/scylladb/scylla-dtest/pull/3313: https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/fruch/job/new-dtest-pytest-parallel/317

fruch commented 1 year ago

One more test that needs to be addressed: https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/fruch/job/new-dtest-pytest-parallel/317/testReport/throughput_limits_test/TestStreamingLimitThroughput/Run_Dtest_Parallel_Cloud_Machines___FullDtest___full_split005___test_limit_rbno_bootstrap_throughput/

kbr-scylla commented 4 months ago

@nyh This is incorrect. It may fix some problems but introduces others

Consider the following:

  • 2 nodes A, B
  • kill node A
  • start node A with wait_other_notice=True

The start call will often return quickly, but not because node B noticed that A restarted but because node B didn't even notice that A was killed yet!

That's why the old check used the log markers --- the log markers ensure that we don't depend on obsolete state for wait_other_notice. That is, other nodes really have to notice this new node restart!

We're actually now observing this scenario trying to unflake one dtest with @temichus (https://github.com/scylladb/scylla-dtest/pull/3991)...

temichus commented 4 months ago

@nyh This is incorrect. It may fix some problems but introduces others

Consider the following:

  • 2 nodes A, B
  • kill node A
  • start node A with wait_other_notice=True

The start call will often return quickly, but not because node B noticed that A restarted but because node B didn't even notice that A was killed yet!

That's why the old check used the log markers --- the log markers ensure that we don't depend on obsolete state for wait_other_notice. That is, other nodes really have to notice this new node restart!

We're actually now observing this scenario trying to unflake one dtest with @temichus (scylladb/scylla-dtest#3991)...

@kbr-scylla

I've created a PR where the issue can be reproduced: https://github.com/scylladb/scylla-dtest/pull/4042/files. With approximately 500 runs, we should get at least 1 error.

bhalevy commented 4 months ago

@nyh This is incorrect. It may fix some problems but introduces others

Consider the following:

  • 2 nodes A, B
  • kill node A
  • start node A with wait_other_notice=True

The start call will often return quickly, but not because node B noticed that A restarted but because node B didn't even notice that A was killed yet!

You can stop A with wait_other_notice=True to ensure node B notices that A went down, before restarting it.

That's why the old check used the log markers --- the log markers ensure that we don't depend on obsolete state for wait_other_notice. That is, other nodes really have to notice this new node restart!

We're actually now observing this scenario trying to unflake one dtest with @temichus (scylladb/scylla-dtest#3991)...
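A sketch of the workaround described above, in dtest-style pseudocode (the node handles are illustrative): make B observe A going down before A restarts, so stale "A is live" state cannot satisfy the subsequent wait.

```python
# node_a is a ccm node object in a running 2-node cluster (illustrative)
node_a.stop(wait_other_notice=True)   # returns only after B has marked A as down
node_a.start(wait_other_notice=True)  # now "A is live" on B really reflects the restart
```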

kbr-scylla commented 4 months ago

You can stop A with wait_other_notice=True to ensure node B notices that A went down, before restarting it.

Which is only a workaround for the problem; the implementation of wait_other_notice is still incorrect.

bhalevy commented 4 months ago

You can stop A with wait_other_notice=True to ensure node B notices that A went down, before restarting it.

Which is only a workaround for the problem; the implementation of wait_other_notice is still incorrect.

Then it needs to be fixed. Issue nr?

kbr-scylla commented 4 months ago

I opened https://github.com/scylladb/scylla-ccm/issues/563