scylladb / scylla-ccm

Cassandra Cluster Manager, modified for Scylla
Apache License 2.0

scylla_cluster: fix handling of wait_other_notice #459

Closed - nyh closed this 1 year ago

nyh commented 1 year ago

After starting a multi-node cluster, it's important to wait until all nodes are aware that all other nodes are UP; otherwise, if the user sends a CL=ALL request to one node, that node might not be aware that the other replicas are alive and will fail the request.

For this reason, a cluster's start() has a wait_other_notice=True option, and dtests correctly use it. However, the implementation seems to have a bug: when ccm starts the Nth node, instead of waiting for all nodes 1..N-1 to learn that node N is up, it only waits for those nodes among 1..N-1 which it already believes are up (node.status == Status.UP). But this status information appears to be updated with a delay, so we often end up not waiting for some of the nodes to learn about node N!

The solution I propose here is simple: the cluster-starting code already waits for each node to become UP before continuing to the next one. So after we have waited for a node to become UP, let's mark its status as UP immediately, without waiting for some background operation to notice it. This way, when the next iteration starts the next node, it already knows that the previous node is up and is therefore one of the nodes that must be waited on.
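
Concretely, the proposed change is roughly the following, in the node-starting loop of ccmlib/scylla_cluster.py (a sketch of that loop; only the last line is new):

    for node in nodes:
        if not node.is_running():
            if started:
                last_node, _, last_mark = started[-1]
                # Wait until the previously-started node reports that it is in
                # normal status / serving CQL before starting the next node.
                last_node.watch_log_for("node is now in normal status|Starting listening for CQL clients",
                                        verbose=verbose, from_mark=last_mark,
                                        process=last_node._process_scylla)
                # New: mark that node UP right away, so the wait_other_notice
                # logic for the next node includes it instead of skipping it.
                last_node.status = Status.UP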

This patch is needed if Scylla is modified to boot up faster: dtests that send CL=ALL requests at the beginning of a test start failing, because the node we contact doesn't yet know that the other nodes are UP.

nyh commented 1 year ago

@fruch, this CCM patch appears to be needed for the dtest effort of booting Scylla faster (https://github.com/scylladb/scylla-dtest/issues/2429), to prevent some tests from becoming flaky:

An example is the dtest consistency_test.py::TestConsistency::test_short_read_delete, which before this CCM patch seems to fail about half the time, and after this patch seems to pass all the time (I tried a few dozen times, and they all passed).

The problem causing this test to fail is that it starts a cluster with wait_other_notice=True, expecting that all nodes know that all the other nodes are up, and then immediately sends a CL=ALL request. But because of the CCM bug (which I fix here), the cluster claims it has finished starting while some nodes do NOT yet know the other nodes are up, so the CL=ALL request can fail because the coordinator thinks the other replicas are down.
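
To make the failure mode concrete, here is a rough sketch of the pattern (not the actual test code; the ccm_cluster fixture name, the ks.t table and the contact address are made up for illustration):

    # Hypothetical sketch of the failing pattern, not taken from the dtest.
    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    # wait_other_notice=True is supposed to guarantee that every node already
    # sees all the other nodes as UP when start() returns.
    ccm_cluster.populate(3).start(wait_other_notice=True)

    session = Cluster(['127.0.0.1']).connect()
    stmt = SimpleStatement("SELECT * FROM ks.t WHERE pk = 1",
                           consistency_level=ConsistencyLevel.ALL)
    # Without the fix, the coordinator may still believe a replica is DOWN and
    # fail this request with Unavailable, even though all nodes are running.
    session.execute(stmt)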

My patch is trivial, and seems to work, but I'm worried that I don't fully understand the code I fixed, so the fix might not be the best one or might not be 100% correct; I need someone familiar with this code to review it. In particular, I found several similar versions of this code throughout CCM and fixed only the one that seems to be called from dtest. I hope that's enough.

nyh commented 1 year ago

@fruch I need your help. It seems my patch fails some ccm unit test run by the CI, but I don't understand whether the failure is really related to my change or is a pre-existing problem, and I don't know how to debug it :-(

fruch commented 1 year ago

> @fruch I need your help. It seems my patch fails some ccm unit test run by the CI, but I don't understand whether the failure is really related to my change or is a pre-existing problem, and I don't know how to debug it :-(

I'll look at the CCM CI. Meanwhile, we can do a sweep with this fix in the job you were using, to make sure it doesn't break anything else (that would be enough to prove it's correct).

fruch commented 1 year ago

> @fruch I need your help. It seems my patch fails some ccm unit test run by the CI, but I don't understand whether the failure is really related to my change or is a pre-existing problem, and I don't know how to debug it :-(

you can run it with:

 pytest tests/ -k test_ccm_status[ccm_reloc_cluster]

Anyhow, I understand the issue: the status is updated only in memory, so when CCM is used from the command line, the next invocation (i.e. ccm status) doesn't know about this change, since it reads the status from the node.conf file.

So this change would work OK for dtest, but it might hurt other users' driver tests that use ccm from the command line (Java, CPP).

I'll look into a fix
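
Roughly, the problem looks like this (a hypothetical illustration, not the actual ccmlib code; it only assumes, as described above, that a node's status is persisted in its node.conf file):

    # Hypothetical illustration: every "ccm ..." invocation is a new process,
    # so state changed only in memory is lost as soon as the process exits.
    class Node:
        def __init__(self, conf_path):
            self.conf_path = conf_path
            self.status = self._read_status()      # what a fresh "ccm status" sees

        def _read_status(self):
            with open(self.conf_path) as f:
                return f.read().strip()

        def persist_status(self, status):
            self.status = status                    # update in memory...
            with open(self.conf_path, 'w') as f:    # ...and on disk, so the next
                f.write(status)                     # CLI invocation sees it too

    # The patch in this PR only does the in-memory part (node.status = Status.UP),
    # which is enough for dtest (a single long-lived process) but not for
    # command-line users of ccm.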

fruch commented 1 year ago

@nyh

Can you confirm whether this change also fixes the issue you had with test_short_read_delete?

diff --git a/ccmlib/scylla_cluster.py b/ccmlib/scylla_cluster.py
index 4d99a29..633c625 100644
--- a/ccmlib/scylla_cluster.py
+++ b/ccmlib/scylla_cluster.py
@@ -117,11 +117,11 @@ class ScyllaCluster(Cluster):
         for node in nodes:
             if not node.is_running():
                 if started:
-                    last_node, _, last_mark = started[-1]
+                    last_node, last_process, last_mark = started[-1]
                     last_node.watch_log_for("node is now in normal status|Starting listening for CQL clients",
                                             verbose=verbose, from_mark=last_mark,
                                             process=last_node._process_scylla)
-                    last_node.status = Status.UP
+                    last_node._update_pid(last_process)
                 mark = 0
                 if os.path.exists(node.logfilename()):
                     mark = node.mark_log()

nyh commented 1 year ago

Yes, it seems to work too, thanks.

nyh commented 1 year ago

> Yes, it seems to work too, thanks.

Hmm, I spoke too soon! :-( I ran it a few more times and saw it fail twice :-( So no, it doesn't help...

fruch commented 1 year ago

> Yes, it seems to work too, thanks.

> Hmm, I spoke too soon! :-( I ran it a few more times and saw it fail twice :-( So no, it doesn't help...

I've also tried only your fix, and it doesn't seem to be enough. I'm not sure it's a bug in CCM per se, but rather that CCM isn't aware of something it needs to wait for.

from a log of node1 in that test:

INFO  2023-05-25 11:06:58,859 [shard 0] gossip - InetAddress 127.0.66.2 is now UP, status = UNKNOWN
INFO  2023-05-25 11:06:58,868 [shard 0] compaction - [Compact system.scylla_local 1f4f2f40-fad3-11ed-a0f6-aeff6c913dc2] Compacting [/home/fruch/.dtest/dtest-fssejtto/test/node1/data/system/scylla_local-2972ec7ffb2038ddaac1d876f2e3fcbd/me-12-big-Data.db:level=0:origin=memtable,/home/fruch/.dtest/dtest-fssejtto/test/node1/data/system/scylla_local-2972ec7ffb2038ddaac1d876f2e3fcbd/me-10-big-Data.db:level=0:origin=compaction]
INFO  2023-05-25 11:06:58,880 [shard 0] compaction - [Compact system.scylla_local 1f4f2f40-fad3-11ed-a0f6-aeff6c913dc2] Compacted 2 sstables to [/home/fruch/.dtest/dtest-fssejtto/test/node1/data/system/scylla_local-2972ec7ffb2038ddaac1d876f2e3fcbd/me-14-big-Data.db:level=0]. 11kB to 6kB (~51% of original) in 9ms = 680kB/s. ~256 total partitions merged to 1.
INFO  2023-05-25 11:06:59,689 [shard 0] migration_manager - Requesting schema pull from 127.0.66.2:0
INFO  2023-05-25 11:06:59,689 [shard 0] migration_manager - Pulling schema from 127.0.66.2:0
INFO  2023-05-25 11:06:59,697 [shard 0] schema_tables - Altering system_distributed.cdc_streams_descriptions_v2 id=0bf73fd7-65b2-36b0-85e5-658131d5df36 version=fb523305-9bdb-366e-bc1c-50f6c33145a4
INFO  2023-05-25 11:06:59,697 [shard 0] schema_tables - Altering system_distributed_everywhere.cdc_generation_descriptions_v2 id=234d2227-dd63-3d37-ac5f-c013e2ea9e6e version=590a2766-0667-3dab-b70f-91e8f6dc6fe8
INFO  2023-05-25 11:06:59,698 [shard 0] schema_tables - Altering system_distributed.view_build_status id=5582b59f-8e4e-35e1-b913-3acada51eb04 version=029e6c7b-a0d1-394c-adf0-7fb77bda46e7
INFO  2023-05-25 11:06:59,698 [shard 0] schema_tables - Altering system_distributed.service_levels id=b8c556bd-212d-37ad-9484-690c73a5994b version=e9e2e7ab-be2a-3a46-a98d-99b2a970ad8f
INFO  2023-05-25 11:06:59,698 [shard 0] schema_tables - Altering system_distributed.cdc_generation_timestamps id=fdf455c4-cfec-3e00-9719-d7a45436c89d version=ea3340b4-3f54-32eb-84db-babb135e8168
INFO  2023-05-25 11:06:59,698 [shard 0] query_processor - Column definitions for system_distributed.cdc_streams_descriptions_v2 changed, invalidating related prepared statements
INFO  2023-05-25 11:06:59,698 [shard 0] query_processor - Column definitions for system_distributed_everywhere.cdc_generation_descriptions_v2 changed, invalidating related prepared statements
INFO  2023-05-25 11:06:59,698 [shard 0] query_processor - Column definitions for system_distributed.view_build_status changed, invalidating related prepared statements
INFO  2023-05-25 11:06:59,698 [shard 0] query_processor - Column definitions for system_distributed.service_levels changed, invalidating related prepared statements
INFO  2023-05-25 11:06:59,698 [shard 0] query_processor - Column definitions for system_distributed.cdc_generation_timestamps changed, invalidating related prepared statements
INFO  2023-05-25 11:06:59,699 [shard 1] query_processor - Column definitions for system_distributed.cdc_streams_descriptions_v2 changed, invalidating related prepared statements
INFO  2023-05-25 11:06:59,699 [shard 1] query_processor - Column definitions for system_distributed_everywhere.cdc_generation_descriptions_v2 changed, invalidating related prepared statements
INFO  2023-05-25 11:06:59,699 [shard 1] query_processor - Column definitions for system_distributed.view_build_status changed, invalidating related prepared statements
INFO  2023-05-25 11:06:59,699 [shard 1] query_processor - Column definitions for system_distributed.service_levels changed, invalidating related prepared statements
INFO  2023-05-25 11:06:59,699 [shard 1] query_processor - Column definitions for system_distributed.cdc_generation_timestamps changed, invalidating related prepared statements
INFO  2023-05-25 11:06:59,702 [shard 0] schema_tables - Schema version changed to fdd2e2c5-9854-3de5-8bba-3903f206465d
INFO  2023-05-25 11:06:59,702 [shard 0] migration_manager - Schema merge with 127.0.66.2:0 completed

ccm looks for "gossip - InetAddress 127.0.66.2 is now UP" and assumes that the nodes know about each other, but the schema pull between them takes about one more second. I'll try to add code to wait for the schema pull.

assuming that's the only extra thing we need to wait for.
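
Something along these lines, as a rough sketch (the log pattern is taken from the node1 log above; last_node, last_mark, verbose and new_node are assumed to follow the names used in the existing start loop, and new_node.address() is assumed to return the new node's IP):

    # Sketch only: after seeing the gossip "is now UP" line, also wait for the
    # schema pull/merge from the newly started node to complete.
    last_node.watch_log_for("Schema merge with {}.* completed".format(new_node.address()),
                            verbose=verbose, from_mark=last_mark,
                            process=last_node._process_scylla)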

nyh commented 1 year ago

> Yes, it seems to work too, thanks.

> Hmm, I spoke too soon! :-( I ran it a few more times and saw it fail twice :-( So no, it doesn't help...

> I've also tried only your fix, and it doesn't seem to be enough.

For the dtest I was running it seemed to be enough - I ran it dozens of times and it passed.

> I'm not sure it's a bug in CCM per se, but rather that CCM isn't aware of something it needs to wait for.

As I tried to explain in my commit, that is a CCM bug. If CCM returns to the user before the nodes are aware that the rest are up (that's the whole point of the wait_other_notice parameter!), the cluster doesn't work for CL=ALL requests. So not waiting for the others to notice, despite this parameter, is a CCM bug.

> ccm looks for "gossip - InetAddress 127.0.66.2 is now UP" and assumes that the nodes know about each other, but the schema pull between them takes about one more second. I'll try to add code to wait for the schema pull.

Maybe the schema pull is another bug that will eventually surface and need to be fixed, but I think it is orthogonal to the bug I discovered here. The bug here is that although the code appears to make ccm, when starting node B after node A, look for "gossip - InetAddress B is now UP" in A's log, it doesn't actually do this check in practice. The problem I saw is that when it wants to do that check, it still incorrectly believes that A isn't yet up, so it doesn't inspect A's log at all! I wrote "incorrectly believes" because the cluster start loop (where my patch and yours go) already waits to see in A's log that it became up, so this code does know that A is up - it shouldn't think that A isn't yet up.

> assuming that's the only extra thing we need to wait for.

CCM is probably riddled with bugs. It has so much code duplication that I don't even know what code actually runs ;-) But I think we'll need to fix them one at a time...

I admit I don't know if my fix is the correct one (is setting UP in memory the right thing? what's that "disk file" you discovered, and who sets it to UP, and when?), but in practice, when applying my patch for zero-sleep Scylla and then running the test consistency_test.py::TestConsistency::test_short_read_delete, the failure rate went from 50% down to 0%.

(this specific test was simply the first one on the list of 176 failing tests... I have no idea how many other tests this fix will help, if at all...).

fruch commented 1 year ago

> Yes, it seems to work too, thanks.

> Hmm, I spoke too soon! :-( I ran it a few more times and saw it fail twice :-( So no, it doesn't help...

> I've also tried only your fix, and it doesn't seem to be enough.

> For the dtest I was running it seemed to be enough - I ran it dozens of times and it passed.

> I'm not sure it's a bug in CCM per se, but rather that CCM isn't aware of something it needs to wait for.

> As I tried to explain in my commit, that is a CCM bug. If CCM returns to the user before the nodes are aware that the rest are up (that's the whole point of the wait_other_notice parameter!), the cluster doesn't work for CL=ALL requests. So not waiting for the others to notice, despite this parameter, is a CCM bug.

> ccm looks for "gossip - InetAddress 127.0.66.2 is now UP" and assumes that the nodes know about each other, but the schema pull between them takes about one more second. I'll try to add code to wait for the schema pull.

> Maybe the schema pull is another bug that will eventually surface and need to be fixed, but I think it is orthogonal to the bug I discovered here. The bug here is that although the code appears to make ccm, when starting node B after node A, look for "gossip - InetAddress B is now UP" in A's log, it doesn't actually do this check in practice. The problem I saw is that when it wants to do that check, it still incorrectly believes that A isn't yet up, so it doesn't inspect A's log at all! I wrote "incorrectly believes" because the cluster start loop (where my patch and yours go) already waits to see in A's log that it became up, so this code does know that A is up - it shouldn't think that A isn't yet up.

I can confirm that this is an issue in CCM, but the fix you suggested doesn't solve it. I'll prepare a different fix.

> assuming that's the only extra thing we need to wait for.

> CCM is probably riddled with bugs. It has so much code duplication that I don't even know what code actually runs ;-) But I think we'll need to fix them one at a time...

all code is riddled with bugs :)

> I admit I don't know if my fix is the correct one (is setting UP in memory the right thing? what's that "disk file" you discovered, and who sets it to UP, and when?), but in practice, when applying my patch for zero-sleep Scylla and then running the test consistency_test.py::TestConsistency::test_short_read_delete, the failure rate went from 50% down to 0%.

When running it with only your fix, I don't get a 0% failure rate - I get much more than 0.

> (this specific test was simply the first one on the list of 176 failing tests... I have no idea how many other tests this fix will help, if at all...).

fruch commented 1 year ago

@nyh

This achieves the same thing as the fix you suggested here: https://github.com/scylladb/scylla-ccm/pull/460

Nodes are now checked correctly in wait_other_notice both at the cluster level and at the node level, i.e. there are two checks (cluster.start vs. node.start) to make sure we don't miss it.

Both work as expected, but the test test_short_read_delete is still not stable; we still need to figure out what else we should be waiting for...

nyh commented 1 year ago

Closing this PR. As @fruch rightly noticed, my patch solved a bug in the wrong way, and it didn't actually solve the right bug. @fruch and I will post two other patches instead of this one.