Closed fruch closed 1 year ago
@avelanarius Can you help us with this one? (Otherwise we might need to revert some recent cqlsh work.)
Quick update: Today me and @Lorak-mmk together looked at the issue. We were not able to reproduce the failure when running dtests locally, also we ran cqlsh "stress test" (invoking USE keyspace repeatedly) without hitting the problem and started on analyzing the driver code. I'll continue the work tomorrow.
Try with:
export SCYLLA_EXT_OPTS="--smp 2 --memory 1024M"
or even with a higher --smp; that might make it fail more quickly.
As for the stress test, I think it's mostly about the initial connections being made; once all connections are established, I don't think there's any problem doing USE keyspace.
I think it's a race: the USE keyspace happens before all of the connections to all shards are opened (or while they are being opened, but before the set-keyspace call has run on them).
The error, BTW, occurs on the next CQL command, i.e. a SELECT that doesn't mention the keyspace, just the table name.
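To make the suspected race concrete, here is a deliberately simplified sketch (all class and method names here are made up, not the driver's real ones) of a pool that publishes a new per-shard connection before copying the session keyspace to it; a request routed to that connection inside the window fails exactly like the reported SELECT:

```python
import threading

class FakeConnection:
    """Simplified stand-in for a per-shard connection (hypothetical class)."""
    def __init__(self, shard_id):
        self.shard_id = shard_id
        self.keyspace = None  # only set once a USE is executed on it

    def execute(self, query):
        if query.startswith("USE "):
            self.keyspace = query.split()[-1]
            return "ok"
        if self.keyspace is None:
            # Mirrors the server's InvalidRequest for an unqualified SELECT.
            raise RuntimeError("No keyspace has been specified")
        return "rows"

class FakePool:
    """A pool where a new shard connection becomes visible *before* the
    keyspace is copied to it - the suspected bug, in miniature."""
    def __init__(self):
        self.connections = {0: FakeConnection(0)}
        self.keyspace = None

    def use(self, ks):
        self.keyspace = ks
        for conn in self.connections.values():
            conn.execute(f"USE {ks}")

    def open_extra_shard(self, shard_id, added, proceed):
        conn = FakeConnection(shard_id)
        self.connections[shard_id] = conn  # visible to requests already!
        added.set()
        proceed.wait()  # stands in for the slow background USE
        conn.execute(f"USE {self.keyspace}")

def demo():
    pool = FakePool()
    pool.use("ks1")
    added, proceed = threading.Event(), threading.Event()
    t = threading.Thread(target=pool.open_extra_shard, args=(1, added, proceed))
    t.start()
    added.wait()  # shard 1 is now in the dict, but its USE hasn't run yet
    try:
        pool.connections[1].execute("SELECT * FROM tab")
        result = "no error"
    except RuntimeError as e:
        result = str(e)
    proceed.set()
    t.join()
    return result

print(demo())  # → No keyspace has been specified
```

The events make the race window deterministic here; in the real driver the window is just the time between the connection being added to the pool's dict and its background USE completing.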
If I understand @fruch's theory correctly, it is that when the USE is called by the user very early in the driver's life, the extra connections to other shards have not yet been opened, and they are opened a fraction of a second later in the background - and forget to copy the current USE. If this is the case then reproducing this problem may require very slow background opening of the other connections - and may only happen rarely on very slow debug builds on Jenkins. We also know that this problem happens rarely - we get a different test function failing each time, it's not like all of them are failing all the time.
I think the easiest approach to reproduce this problem during your debugging will be to hack the driver code to make it more prominent. If @fruch's theory is correct, then if you add a sleep at the beginning of the background opening of the extra sharded connections, and also add a sleep in the client, your test should fail every time. Or maybe still not every time - maybe the request reaches a random (?) shard, so you have some chance of reaching the old shard with the USE, and some chance of reaching the new one which doesn't know it.
I started trying out this hypothesis - introducing an artificial delay for connections to other shards. I haven't yet been able to see the issue - but tested it only for a short time.
I think to reproduce this problem during your debugging the easiest approach will be to hack the driver code to make the problem more prominent. If @fruch 's theory is correct, then if you add a sleep in the beginning of the background opening of the extra sharded connections.
Yes, we started trying out this scenario.
Don't forget it's not enough to delay the extra connections - you also need to delay the client's second request (the one trying to use the table without a keyspace) even longer. If you don't, and the client's second request comes quickly, I am guessing (?) it will just send the request to the original connection, the one with the correct USE, and everything will be fine. I think you need to sleep in the client too - so it only sends the second request after the new shards are connected. Even then, it might be (?) probabilistic which shard you reach so maybe you'll need to run the same test multiple times before seeing it fail.
Success! Reproduced!
Artificial delays added to the Python driver:
```diff
diff --git a/cassandra/connection.py b/cassandra/connection.py
index c3ba42d7..bba4568b 100644
--- a/cassandra/connection.py
+++ b/cassandra/connection.py
@@ -1501,6 +1501,12 @@ class Connection(object):
         if not keyspace or keyspace == self.keyspace:
             return
 
+        print("Yes, i'm sleeping (3)")
+        import time
+        import random
+        time.sleep(random.uniform(0.0, 0.25))
+        print("Yes, i'm sleeping (3) done")
+
         query = QueryMessage(query='USE "%s"' % (keyspace,),
                              consistency_level=ConsistencyLevel.ONE)
         try:
@@ -1555,6 +1561,12 @@ class Connection(object):
             callback(self, None)
             return
 
+        print("Yes, i'm sleeping (2)")
+        import time
+        import random
+        time.sleep(random.uniform(0.0, 0.25))
+        print("Yes, i'm sleeping (2) done")
+
         query = QueryMessage(query='USE "%s"' % (keyspace,),
                              consistency_level=ConsistencyLevel.ONE)
diff --git a/cassandra/pool.py b/cassandra/pool.py
index 2f3fea93..b2132d99 100644
--- a/cassandra/pool.py
+++ b/cassandra/pool.py
@@ -691,6 +691,11 @@ class HostConnection(object):
         the smaller the chance that further connections will be assigned
         to that shard.
         """
+        print("Yes, i'm sleeping")
+        import time
+        import random
+        time.sleep(random.uniform(0.0, 0.25))
+        print("Yes, i'm sleeping done")
         with self._lock:
             if self.is_shutdown:
                 return
```
Reproduction script:
```python
from cassandra.cluster import Cluster

# Before running create repro00, repro01, ..., repro05 keyspaces
# with corresponding tab00, tab01, ..., tab05 tables
for it in range(1000):
    cluster = Cluster(['127.0.0.2'])
    s = cluster.connect()
    s.execute(f"USE repro0{it % 6}")
    for i in range(700):
        if i % 100 == 0:
            print(i)
        s.execute(f"SELECT * FROM tab0{it % 6} WHERE pk = 1 AND ck = 2")
    s.shutdown()
    cluster.shutdown()
```
Typical output:
```
$ python3 repro.py
Yes, i'm sleeping
Yes, i'm sleeping
Yes, i'm sleeping (2)
Yes, i'm sleeping (2) done
Yes, i'm sleeping (2)
Yes, i'm sleeping (2) done
0
Yes, i'm sleeping done
Yes, i'm sleeping (3)
Yes, i'm sleeping done
Traceback (most recent call last):
  File "/home/piotrgrabowski/scylla/smieci/python-driver/repro.py", line 15, in <module>
    s.execute(f"SELECT * FROM tab0{it % 6} WHERE pk = 1 AND ck = 2")
  File "/home/piotrgrabowski/scylla/smieci/python-driver/cassandra/cluster.py", line 2699, in execute
    return self.execute_async(query, parameters, trace, custom_payload, timeout, execution_profile, paging_state, host, execute_as).result()
  File "/home/piotrgrabowski/scylla/smieci/python-driver/cassandra/cluster.py", line 5006, in result
    raise self._final_exception
cassandra.InvalidRequest: Error from server: code=2200 [Invalid query] message="No keyspace has been specified. USE a keyspace, or explicitly specify keyspace.tablename"
Yes, i'm sleeping (3) done
Yes, i'm sleeping
Yes, i'm sleeping (3)
Yes, i'm sleeping done
Yes, i'm sleeping (3) done
Yes, i'm sleeping
Yes, i'm sleeping (3)
Yes, i'm sleeping (3) done
```
I'll add to this:
Try to create a SELECT that must land on shard=1 (or any other shard not part of the initial connection).
I think the trick is sending this SELECT within a very small time frame: after the new connection is opened and added to the connections dict, but before the set-keyspace command has been executed on that connection.
I think we don't block such a connection from getting requests; if it's in the dict, it's considered open and ready.
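One way to close that window, sketched here with hypothetical names (this is not the actual patch that fixed the issue), is to reverse the ordering: execute the USE on the new connection first, and only then publish it in the shared dict, so a request can never be routed to a connection whose keyspace hasn't been set yet:

```python
import threading

class RecordingConn:
    """Hypothetical connection that just records the queries it receives."""
    def __init__(self):
        self.queries = []

    def execute(self, query):
        self.queries.append(query)

class SafePool:
    """Publish-after-USE ordering: the new shard connection only appears in
    self.connections once its keyspace has already been set."""
    def __init__(self):
        self._lock = threading.Lock()
        self.connections = {}
        self.keyspace = None

    def add_shard_connection(self, shard_id, conn):
        if self.keyspace is not None:
            conn.execute('USE "%s"' % self.keyspace)  # set the keyspace first...
        with self._lock:
            self.connections[shard_id] = conn         # ...then make it visible

pool = SafePool()
pool.keyspace = "newrepro"
conn = RecordingConn()
pool.add_shard_connection(1, conn)
print(conn.queries)  # → ['USE "newrepro"']
```

With this ordering the race window from the reproducer simply does not exist: any connection reachable through the dict has already run its USE.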
An even cleaner reproducer (it still needs the artificial sleeps in the driver, as posted here: https://github.com/scylladb/python-driver/issues/187#issuecomment-1329800783):
```python
from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.2'])
s = cluster.connect()
s.execute("CREATE KEYSPACE IF NOT EXISTS newrepro WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1}")
s.execute("CREATE TABLE IF NOT EXISTS newrepro.tab(pk int, PRIMARY KEY(pk))")
s.shutdown()
cluster.shutdown()

cluster = Cluster(['127.0.0.2'])
s = cluster.connect()
s.execute("USE newrepro")
for i in range(200):
    s.execute("SELECT * FROM tab WHERE pk = 1")
s.shutdown()
cluster.shutdown()
```
It also reproduces (even locally for me, though not 100% of the time) with the driver embedded inside cqlsh, together with dtest.
So you need to pick a specific Scylla version from master that has it in.
fixed by #190
Once https://github.com/scylladb/scylla-tools-java/pull/319 was merged, we ran into tests that started failing.
It seems like not all the connections are getting the
USE keyspace
, or they are getting it after other CQL commands have already executed on them. Ref: https://github.com/scylladb/scylla-dtest/issues/2923