Connection pool renewal after concurrent node bootstraps causes double statement execution

kbr-scylla commented 4 months ago

We boot 2 Scylla nodes concurrently into existing cluster. Python driver obtains two on_add notifications, one for each node. Each notification calls add_or_renew_pool, which creates connection pool to each node.

But then, for some reason, one of the on_adds may cause another add_or_renew_pool to be called for the other server. This happens from _finalize_add -> update_created_pools. This may cause a second pool to be created for the other server and the initially established pool to that server to be closed.

There could be a statement running on the initially established pool. The statement may have already been executed on Scylla side, but the driver didn't get a response yet. The pool is closed before response arrives. This causes driver to retry the statement on the new pool, leading to double execution.

In our tests, we observe this by "CREATE KEYSPACE" statement failing with "already exists" error message (https://github.com/scylladb/scylladb/issues/17654)

Reproducer:

Python driver branch with sleep + logging added: https://github.com/kbr-scylla/python-driver/tree/debug-double-execution

I added a tactical sleep there:

diff --git a/cassandra/cluster.py b/cassandra/cluster.py
index 8ed0647b..e79daf7e 100644
--- a/cassandra/cluster.py
+++ b/cassandra/cluster.py
@@ -3320,6 +3320,8 @@ class Session(object):
                         self._lock.acquire()
                         return False
                     self._lock.acquire()
+                if previous:
+                    time.sleep(2)
                 self._pools[host] = new_pool

             log.debug("Added pool for host %s to session", host)

ScyllaDB branch with sleep + logging added before "create keyspace" statement returns: https://github.com/kbr-scylla/scylladb/tree/debug-double-execution

this is just coroutinization of create_keyspace_statement::execute, then sleep + logging added:

diff --git a/cql3/statements/create_keyspace_statement.cc b/cql3/statements/create_keyspace_statement.cc
index e66779ac0d..f8b3b1f766 100644
--- a/cql3/statements/create_keyspace_statement.cc
+++ b/cql3/statements/create_keyspace_statement.cc
@@ -267,13 +267,15 @@ std::vector<sstring> check_against_restricted_replication_strategies(
 future<::shared_ptr<messages::result_message>>
 create_keyspace_statement::execute(query_processor& qp, service::query_state& state, const query_options& options, std::optional<service::group0_guard> guard) const {
     std::vector<sstring> warnings = check_against_restricted_replication_strategies(qp, keyspace(), *_attrs, qp.get_cql_stats());
-        return schema_altering_statement::execute(qp, state, options, std::move(guard)).then([warnings = std::move(warnings)] (::shared_ptr<messages::result_message> msg) {
-        for (const auto& warning : warnings) {
-            msg->add_warning(warning);
-            mylogger.warn("{}", warning);
-        }
-        return msg;
-    });
+    auto msg = co_await schema_altering_statement::execute(qp, state, options, std::move(guard));
+    for (const auto& warning : warnings) {
+        msg->add_warning(warning);
+        mylogger.warn("{}", warning);
+    }
+    mylogger.info("sleep before returning create keyspace message");
+    co_await seastar::sleep(std::chrono::seconds{2});
+    mylogger.info("return create keyspace message");
+    co_return std::move(msg);
 }

Test (included in the above branch):

 @pytest.mark.asyncio
async def test_double_execution(request, manager: ManagerClient):
    await manager.server_add()
    await manager.servers_add(2)

    logging.info(f'SLEEP 1')
    await asyncio.sleep(1)

    cql = manager.get_cql()
    hosts = cql.cluster.metadata.all_hosts()
    logging.info(f"hosts: {hosts}")

    logging.info(f'create ks')
    await cql.run_async("create keyspace ks with replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 3}")

I run it like this:

PYTHONPATH=$PYTHONPATH:/home/kbr/dev/python-driver ./test.py --mode dev test_double_execution --repeat 4

in /home/kbr/dev/python-driver I have the above Python driver branch checked out.

Logs from example run: scylla-10.log scylla-9.log scylla-3.log topology_custom.test_double_execution.3.log

Here are the relevant excerpts cut out from the test log (those are messages I added):

11:29:55.414 INFO> on_add add_or_renew_pool 127.58.145.9:9042
11:29:55.414 INFO> on_add add_or_renew_pool 127.58.145.10:9042
11:29:55.414 INFO> SLEEP 1
11:29:55.417 INFO> finalize_add update_created_pools 127.58.145.10:9042
11:29:55.417 INFO> update_created_pools add_or_renew_pool 127.58.145.9:9042
11:29:55.418 INFO> finalize_add update_created_pools 127.58.145.9:9042
11:29:56.415 INFO> create ks
11:29:57.421 DEBUG> set new pool 127.58.145.9:9042 previous True
11:29:57.421 DEBUG> Shutting down connections to 127.58.145.9:9042

What happened is that finalize_add for host 127.58.145.10:9042 (scylla-10) caused update_created_pools call, which called add_or_renew_pool for host 127.58.145.9:9042 (scylla-9). But pool for scylla-9 was already established. We start running "create keyspace" on scylla-9. In the meantime, add_or_renew_pool establishes a new pool and drops the old one, causing "create keyspace" to be retried on the new pool, leading to double execution:

>       await cql.run_async("create keyspace ks with replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 3}")
E       cassandra.AlreadyExists: Keyspace 'ks' already exists

kbr-scylla commented 4 months ago

This can be easily worked around in tests, so it's low priority.

sylwiaszunejko commented 3 months ago

I managed to reproduce the issue, but what's interesting it doesn't reproduce when using consistent-topology-changes feature. @kbr-scylla do you have any idea why? I am not sure if this is relevant.

kbr-scylla commented 3 months ago

I managed to reproduce the issue, but what's interesting it doesn't reproduce when using consistent-topology-changes feature. @kbr-scylla do you have any idea why? I am not sure if this is relevant.

Huh? The reproducer I posted above:

ScyllaDB branch with sleep + logging added before "create keyspace" statement returns: https://github.com/kbr-scylla/scylladb/tree/debug-double-execution

runs in consistent-topology-changes mode.

In fact it explicitly depends on consistent-topology-changes, because it bootstraps 2 nodes concurrently (it uses manager.servers_add(2)). The idea behind the reproducer is that notifications about the two concurrently booting nodes arrive at the driver at roughly the same time and race with each other. I did not manage to reproduce the issue by booting nodes sequentially - then the notifications are arriving sequentially and the problem does not seem to happen.

How did you reproduce the issue outside consistent-topology-changes -- did you use some different test case? Are you sure you reproduced this exact issue and not something different (maybe similar)?

sylwiaszunejko commented 3 months ago

Maybe I misunderstood something, I used exactly what is in your fork, but after I added cfg = {'experimental_features': ['consistent-topology-changes'],} in config option of servers_add it stopped reproducing. Probably some stupid mistake on my end, if so sorry for that.

kbr-scylla commented 3 months ago

When was that? cfg = {'experimental_features': ['consistent-topology-changes']} doesn't do anything after scylladb/scylladb@d8313dda43d7014460891e7da7dab785d4ef9030, and before scylladb/scylladb@d8313dda43d7014460891e7da7dab785d4ef9030 it was the default in topology tests.

So I suspect that your change didn't actually change anything, but then you were unlucky with your runs and failed to reproduce it, used a too small sample, and then concluded that the change was the reason.

kbr-scylla commented 3 months ago

Hm, actually cfg = {'experimental_features': ['consistent-topology-changes']} could do something, it disables some other experimental features. The usual set of experimental features on that branch was:

         'experimental_features': ['udf',
                                   'alternator-streams',
                                   'consistent-topology-changes',
                                   'broadcast-tables',
                                   'keyspace-storage-options'],

and you reduced it to simply ['consistent-topology-changes']

But neither of these other features seems likely to be related, so I would again recheck with higher sample of runs

(Preferably -- just rebase my reproducer branch on latest master, where consistent-topology-changes is no longer experimental, and check there)

kostja commented 2 months ago

Ping on this one, the connected issue https://github.com/scylladb/scylladb/issues/16219 is impacting our ability to run sct tests.

kbr-scylla commented 2 months ago

@kostja , modifying python driver will not help with java driver failures.

sylwiaszunejko commented 2 months ago

This can be easily worked around in tests, so it's low priority.

We though that this issue is a low priority one so it was not planned for this sprint, should we reconsider that?

scylladb / python-driver

Connection pool renewal after concurrent node bootstraps causes double statement execution #317