scylladb / scylla-ccm

Cassandra Cluster Manager, modified for Scylla
Apache License 2.0

`node.start(wait_other_notice=True, ...)` may return too early because the implementation of `wait_other_notice` may observe past state #563

Closed: kbr-scylla closed this issue 3 months ago

kbr-scylla commented 3 months ago

wait_other_notice in ccmlib/scylla_node.py currently works by querying the /gossiper/live REST API endpoint: https://github.com/scylladb/scylla-ccm/blob/bd519ed12cccd8186dedf8613656f21dd156354e/ccmlib/scylla_node.py#L334 https://github.com/scylladb/scylla-ccm/blob/bd519ed12cccd8186dedf8613656f21dd156354e/ccmlib/scylla_node.py#L1361

If other nodes see the node as UP, and the node sees other nodes as UP, the check will succeed.
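For reference, a check of this kind boils down to polling the REST API until the expected nodes show up as live. The sketch below is illustrative only: the endpoint path, the port, and the helper name are assumptions, not a copy of the actual watch_rest_for_alive code.

```python
import time
import requests

def wait_until_live(api_addr, expected_nodes, timeout=60):
    """Illustrative sketch: poll a node's REST API until all expected
    nodes appear in its live set. The endpoint path and port 10000 are
    assumptions; the real watch_rest_for_alive also checks NORMAL state."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        resp = requests.get(f"http://{api_addr}:10000/gossiper/endpoint/live/")
        live = set(resp.json())
        if set(expected_nodes) <= live:
            return
        time.sleep(0.5)
    raise TimeoutError(f"not all of {expected_nodes} became live within {timeout}s")
```

Note that a poll like this only observes the current state: if node B never removed A from its live set after the kill, the check passes immediately, which is exactly the problem described next.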

But consider the following scenario:

* 2 nodes A, B
* kill node A
* start node A with `wait_other_notice=True`

The start call will often return quickly, not because node B noticed that A restarted, but because node B hasn't even noticed that A was killed yet!

So node B may still consider A as UP because it was UP before it was killed; B's failure detector simply hasn't reacted to the kill yet.

That's why the old implementation in ccmlib/node.py used log markers: it watched other nodes' logs from a point after the node was restarted, effectively serializing the events.
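As an illustration, the log-marker approach looks roughly like the following. mark_log and watch_log_for_alive are existing ccmlib helpers, but treat the exact signatures used here as assumptions:

```python
# Take marks on the other nodes' logs *before* restarting node A, then
# wait for each of them to log that A is UP past that mark. A match
# found after the mark cannot refer to the pre-kill state, so the
# events are serialized.
marks = [(other, other.mark_log()) for other in other_nodes]

node_a.start()  # start without the built-in wait; we wait explicitly below

for other, mark in marks:
    other.watch_log_for_alive(node_a, from_mark=mark)
```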

This bug was introduced in https://github.com/scylladb/scylla-ccm/pull/462.

cc @temichus

kbr-scylla commented 3 months ago

We could fix it by combining the old behavior (log markers) with the new behavior (watching REST). For example, we could wait for the "is now UP" log message first, then use watch_rest_for_alive.

Although, looking at the Scylla code, I don't really understand why "is now UP" is not sufficient; it is logged strictly after the live_endpoints set in the gossiper is modified, and that set is what the /gossiper/live REST API reports. Perhaps the REST check is necessary for some older versions of Scylla. And I see that watch_rest_for_alive contains additional logic for waiting for NORMAL too, so it is still useful.
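A sketch of that combined wait, reusing the log-marker idea from above and adding the REST check at the end (method names mirror existing ccmlib helpers, but the exact signatures are assumptions):

```python
mark = other.mark_log()  # taken before this node is started

node.start()

# Old behavior: an "is now UP" line logged after the pre-start mark
# proves the other node reacted to *this* start event.
other.watch_log_for_alive(node, from_mark=mark)

# New behavior: the REST check, still useful because it also waits for
# the node to be seen in NORMAL state.
node.watch_rest_for_alive([other])
```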

kbr-scylla commented 3 months ago

Example failure: https://github.com/scylladb/scylla-dtest/pull/4042/files

The test sometimes fails with:

        # starting node1 - it should reconnect and run as is
        node1.start(wait_other_notice=True, wait_for_binary_proto=True)
        #  due to CL=1 it does not make sense to read while not all nodes are UP
        # for node in cluster.nodelist():
        #     wait_for_nodes_status(node, ["UN", "UN", "UN"])

>       result = list(session.execute("SELECT * FROM cf"))

update_cluster_layout_tests.py:1136: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
cassandra/cluster.py:2726: in cassandra.cluster.Session.execute
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

>   ???
E   cassandra.Unavailable: Error from server: code=1000 [Unavailable exception] message="Cannot achieve consistency level for cl ONE. Requires 1, alive 0" info={'consistency': 'ONE', 'required_replicas': 1, 'alive_replicas': 0}

(https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/artsiom_mishuta/job/dtest-pytest-gating/240/testReport/junit/update_cluster_layout_tests/TestUpdateClusterLayout/test_simple_kill_remained_node_while_decommissioning_96_100_/)

This happens because we execute the SELECT too fast: start returned too early, node2 then marked node1 as DOWN, and the SELECT ran before node2 marked node1 as UP again.

nyh commented 3 months ago

> But consider the following scenario:
>
> * 2 nodes A, B
> * kill node A
> * start node A with `wait_other_notice=True`
>
> The start call will often return quickly, not because node B noticed that A restarted, but because node B hasn't even noticed that A was killed yet!

I see that you and Benny liked https://github.com/scylladb/scylla-ccm/pull/564 so I'll merge it, but I want to point out that I think the real problem in your example above is that "kill node A" neglects to wait for node A to actually be killed. If you want to do "kill node A" followed in quick succession by "start node A", it doesn't make sense not to wait for node A to be killed before starting it again.

The benefit of https://github.com/scylladb/scylla-ccm/pull/564 is that it allows the test to succeed despite the test bug of forgetting to wait for "kill node A" to finish, so in that sense it is a good patch: it makes tests harder to write incorrectly.
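For illustration, the test-side fix described here amounts to making the kill observable before restarting. ccm's stop() accepts wait_other_notice, though the exact parameter usage below should be treated as an assumption:

```python
node_a.stop(wait_other_notice=True)   # other nodes mark A as DOWN first
node_a.start(wait_other_notice=True)  # "notice" now refers to this start event
```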

kbr-scylla commented 3 months ago

> but I want to point out that I think the real problem in your example above is that "kill node A" neglects to wait for node A to actually be killed. If you want to do "kill node A" followed in quick succession by "start node A", it doesn't make sense not to wait for node A to be killed before starting it again.

I see it differently. The semantics of wait_other_notice should be as the name suggests: we wait until other nodes notice this particular start event. (Not some earlier start event.)

The test is correct when written under this assumption, with these semantics. (And it was indeed correct before https://github.com/scylladb/scylla-ccm/pull/462.)