scylladb / python-driver

ScyllaDB Python Driver, originally DataStax Python Driver for Apache Cassandra
https://python-driver.docs.scylladb.com
Apache License 2.0

After a node restarts, driver reconnects multiple times, causing queries to fail #295

Open · kbr-scylla opened this issue 6 months ago

kbr-scylla commented 6 months ago

In test.py tests, after restarting a node, we're often using the utility function wait_for_cql (or wait_for_cql_and_get_hosts) on a pre-existing driver session (which is created at the beginning of the test and connected to the cluster), to wait for that session to be able to query the restarted node.

This is the logic:

# Imports needed to make this self-contained (ServerInfo comes from the test.py framework itself):
import asyncio
import logging
import time
from typing import Any, Awaitable, Callable, Optional, TypeVar

from cassandra.cluster import NoHostAvailable, Session
from cassandra.pool import Host

T = TypeVar('T')

async def wait_for(
        pred: Callable[[], Awaitable[Optional[T]]],
        deadline: float,
        period: float = 1,
        before_retry: Optional[Callable[[], Any]] = None) -> T:
    while True:
        assert(time.time() < deadline), "Deadline exceeded, failing test."
        res = await pred()
        if res is not None:
            return res
        await asyncio.sleep(period)
        if before_retry:
            before_retry()

async def wait_for_cql(cql: Session, host: Host, deadline: float) -> None:
    async def cql_ready():
        try:
            await cql.run_async("select * from system.local", host=host)
        except NoHostAvailable:
            logging.info(f"Driver not connected to {host} yet")
            return None
        return True
    await wait_for(cql_ready, deadline)

async def wait_for_cql_and_get_hosts(cql: Session, servers: list[ServerInfo], deadline: float) \
        -> list[Host]:
    """Wait until every server in `servers` is available through `cql`
       and translate `servers` to a list of Cassandra `Host`s.
    """
    ip_set = set(str(srv.rpc_address) for srv in servers)
    async def get_hosts() -> Optional[list[Host]]:
        hosts = cql.cluster.metadata.all_hosts()
        remaining = ip_set - {h.address for h in hosts}
        if not remaining:
            return hosts

        logging.info(f"Driver hasn't yet learned about hosts: {remaining}")
        return None
    hosts = await wait_for(
        pred=get_hosts,
        deadline=deadline,
        before_retry=lambda: cql.cluster.refresh_nodes(force_token_rebuild=True),
    )

    # Take only hosts from `ip_set` (there may be more)
    hosts = [h for h in hosts if h.address in ip_set]
    await asyncio.gather(*(wait_for_cql(cql, h, deadline) for h in hosts))

    return hosts

In short, we're waiting until select * from system.local succeeds on that Host. wait_for_cql_and_get_hosts additionally translates an IP address in the form of a string into an object of the Host type (defined by the driver).

Then a test may attempt to perform a query -- immediately after calling wait_for_cql / wait_for_cql_and_get_hosts.

Unfortunately, that query may fail.
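To make the pattern concrete, here is a minimal sketch of what such a test does (the manager fixture and its method names are illustrative placeholders, not taken verbatim from test.py):

import time

async def test_restart_then_query(manager, cql):
    # `manager` and `servers` stand in for test.py fixtures; the names are illustrative.
    servers = await manager.running_servers()
    await manager.server_restart(servers[0].server_id)

    # Wait until the driver can query the restarted node again...
    hosts = await wait_for_cql_and_get_hosts(cql, servers, time.time() + 60)

    # ...and immediately run a query. This is the step that intermittently fails with
    # NoHostAvailable: the driver may drop and rebuild its connection pool right after
    # the wait above has succeeded.
    await cql.run_async("select * from system.local", host=hosts[0])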

The reason (judging from driver logs) is that even after the driver has reconnected and select * from system.local has succeeded, the driver may for some reason decide to reconnect again, which involves dropping the existing connections and creating new ones. This might happen in the middle of the next query, causing that query to fail. In fact, the driver may reconnect multiple times in this way.

Here's a fragment of relevant logs from one of test.py test runs:

10:04:15.630 INFO> Server Server(597, 127.167.175.35, 127.167.175.35) restarted
10:04:15.645 DEBUG> Connection 281470374861840: '('127.0.0.1', 34094)' -> '('127.167.175.35', 9042)'
10:04:15.645 DEBUG> Connection 281470374907600: '('127.0.0.1', 34106)' -> '('127.167.175.35', 9042)'
10:04:15.645 DEBUG> Sending initial options message for new connection (281470374861840) to 127.167.175.35:9042
10:04:15.645 DEBUG> Sending initial options message for new connection (281470374907600) to 127.167.175.35:9042
10:04:15.649 DEBUG> Parsing sharding info from message options {'COMPRESSION': ['lz4', 'snappy'], 'SCYLLA_LWT_ADD_METADATA_MARK': ['LWT_OPTIMIZATION_META_BIT_MASK=2147483648'], 'SCYLLA_NR_SHARDS': ['2'], 'SCYLLA_PARTITIONER': ['org.apache.cassandra.dht.Murmur3Partitioner'], 'SCYLLA_RATE_LIMIT_ERROR': ['ERROR_CODE=61440'], 'SCYLLA_SHARD': ['0'], 'SCYLLA_SHARDING_ALGORITHM': ['biased-token-round-robin'], 'SCYLLA_SHARDING_IGNORE_MSB': ['12'], 'TABLETS_ROUTING_V1': ['']}
10:04:15.650 DEBUG> Received options response on new connection (281470374861840) from 127.167.175.35:9042
10:04:15.650 DEBUG> No available compression types supported on both ends. locally supported: odict_keys([]). remotely supported: ['lz4', 'snappy']
10:04:15.650 DEBUG> Sending StartupMessage on <CustomConnection(281470374861840) 127.167.175.35:9042>
10:04:15.650 DEBUG> Sent StartupMessage on <CustomConnection(281470374861840) 127.167.175.35:9042>
10:04:15.650 DEBUG> Parsing sharding info from message options {'COMPRESSION': ['lz4', 'snappy'], 'SCYLLA_LWT_ADD_METADATA_MARK': ['LWT_OPTIMIZATION_META_BIT_MASK=2147483648'], 'SCYLLA_NR_SHARDS': ['2'], 'SCYLLA_PARTITIONER': ['org.apache.cassandra.dht.Murmur3Partitioner'], 'SCYLLA_RATE_LIMIT_ERROR': ['ERROR_CODE=61440'], 'SCYLLA_SHARD': ['0'], 'SCYLLA_SHARDING_ALGORITHM': ['biased-token-round-robin'], 'SCYLLA_SHARDING_IGNORE_MSB': ['12'], 'TABLETS_ROUTING_V1': ['']}
10:04:15.650 DEBUG> Received options response on new connection (281470374907600) from 127.167.175.35:9042
10:04:15.651 DEBUG> No available compression types supported on both ends. locally supported: odict_keys([]). remotely supported: ['lz4', 'snappy']
10:04:15.651 DEBUG> Sending StartupMessage on <CustomConnection(281470374907600) 127.167.175.35:9042>
10:04:15.651 DEBUG> Sent StartupMessage on <CustomConnection(281470374907600) 127.167.175.35:9042>
10:04:15.658 DEBUG> Got ReadyMessage on new connection (281470374907600) from 127.167.175.35:9042
10:04:15.658 INFO> Successful reconnection to 127.167.175.35:9042, marking node up if it isn't already
10:04:15.658 DEBUG> Waiting to acquire lock for handling up status of node 127.167.175.35:9042
10:04:15.659 DEBUG> Starting to handle up status of node 127.167.175.35:9042
10:04:15.659 DEBUG> Got ReadyMessage on new connection (281470374861840) from 127.167.175.35:9042
10:04:15.659 INFO> Host 127.167.175.35:9042 may be up; will prepare queries and open connection pool
10:04:15.659 INFO> Successful reconnection to 127.167.175.35:9042, marking node up if it isn't already
10:04:15.659 DEBUG> Now that host 127.167.175.35:9042 is up, cancelling the reconnection handler
10:04:15.659 DEBUG> Waiting to acquire lock for handling up status of node 127.167.175.35:9042
10:04:15.660 DEBUG> Done preparing all queries for host 127.167.175.35:9042, 
10:04:15.660 DEBUG> Starting to handle up status of node 127.167.175.35:9042
10:04:15.660 DEBUG> Removed connection pool for <Host: 127.167.175.35:9042>
10:04:15.660 INFO> Host 127.167.175.35:9042 may be up; will prepare queries and open connection pool
10:04:15.660 DEBUG> Signalling to load balancing policies that host 127.167.175.35:9042 is up
10:04:15.660 DEBUG> Now that host 127.167.175.35:9042 is up, cancelling the reconnection handler
10:04:15.661 DEBUG> Signalling to control connection that host 127.167.175.35:9042 is up
10:04:15.661 DEBUG> Done preparing all queries for host 127.167.175.35:9042, 
10:04:15.661 DEBUG> Attempting to open new connection pools for host 127.167.175.35:9042
10:04:15.661 DEBUG> Signalling to load balancing policies that host 127.167.175.35:9042 is up
10:04:15.661 DEBUG> Closing connection (281470374907600) to 127.167.175.35:9042
10:04:15.663 DEBUG> Signalling to control connection that host 127.167.175.35:9042 is up
10:04:15.664 DEBUG> Attempting to open new connection pools for host 127.167.175.35:9042
10:04:15.664 DEBUG> Closing connection (281470374861840) to 127.167.175.35:9042
10:04:15.665 DEBUG> Closed socket to 127.167.175.35:9042
10:04:15.665 DEBUG> Closed socket to 127.167.175.35:9042
10:04:15.666 DEBUG> Shutting down connections to 127.167.175.35:9042
10:04:15.666 DEBUG> Initializing connection for host 127.167.175.35:9042
10:04:15.666 DEBUG> Closing connection (281470374626576) to 127.167.175.35:9042
10:04:15.667 DEBUG> Closing connection (281470374626576) to 127.167.175.35:9042
10:04:15.667 DEBUG> Closed socket to 127.167.175.35:9042
10:04:15.667 DEBUG> Closing excess connection (281470374625424) to 127.167.175.35:9042
10:04:15.668 DEBUG> Connection 281470371258256: '('127.0.0.1', 34114)' -> '('127.167.175.35', 9042)'
10:04:15.668 DEBUG> Closing connection (281470374625424) to 127.167.175.35:9042
10:04:15.668 DEBUG> Sending initial options message for new connection (281470371258256) to 127.167.175.35:9042
10:04:15.668 DEBUG> Closed socket to 127.167.175.35:9042
10:04:15.669 DEBUG> Initializing connection for host 127.167.175.35:9042
10:04:15.670 DEBUG> Connection 281470373696208: '('127.0.0.1', 34116)' -> '('127.167.175.35', 9042)'
10:04:15.670 DEBUG> Sending initial options message for new connection (281470373696208) to 127.167.175.35:9042
10:04:15.677 DEBUG> Parsing sharding info from message options {'COMPRESSION': ['lz4', 'snappy'], 'SCYLLA_LWT_ADD_METADATA_MARK': ['LWT_OPTIMIZATION_META_BIT_MASK=2147483648'], 'SCYLLA_NR_SHARDS': ['2'], 'SCYLLA_PARTITIONER': ['org.apache.cassandra.dht.Murmur3Partitioner'], 'SCYLLA_RATE_LIMIT_ERROR': ['ERROR_CODE=61440'], 'SCYLLA_SHARD': ['0'], 'SCYLLA_SHARDING_ALGORITHM': ['biased-token-round-robin'], 'SCYLLA_SHARDING_IGNORE_MSB': ['12'], 'TABLETS_ROUTING_V1': ['']}
10:04:15.677 DEBUG> Received options response on new connection (281470373696208) from 127.167.175.35:9042
10:04:15.677 DEBUG> No available compression types supported on both ends. locally supported: odict_keys([]). remotely supported: ['lz4', 'snappy']
10:04:15.677 DEBUG> Sending StartupMessage on <CustomConnection(281470373696208) 127.167.175.35:9042>
10:04:15.677 DEBUG> Sent StartupMessage on <CustomConnection(281470373696208) 127.167.175.35:9042>
10:04:15.678 DEBUG> Parsing sharding info from message options {'COMPRESSION': ['lz4', 'snappy'], 'SCYLLA_LWT_ADD_METADATA_MARK': ['LWT_OPTIMIZATION_META_BIT_MASK=2147483648'], 'SCYLLA_NR_SHARDS': ['2'], 'SCYLLA_PARTITIONER': ['org.apache.cassandra.dht.Murmur3Partitioner'], 'SCYLLA_RATE_LIMIT_ERROR': ['ERROR_CODE=61440'], 'SCYLLA_SHARD': ['0'], 'SCYLLA_SHARDING_ALGORITHM': ['biased-token-round-robin'], 'SCYLLA_SHARDING_IGNORE_MSB': ['12'], 'TABLETS_ROUTING_V1': ['']}
10:04:15.678 DEBUG> Received options response on new connection (281470371258256) from 127.167.175.35:9042
10:04:15.678 DEBUG> No available compression types supported on both ends. locally supported: odict_keys([]). remotely supported: ['lz4', 'snappy']
10:04:15.678 DEBUG> Sending StartupMessage on <CustomConnection(281470371258256) 127.167.175.35:9042>
10:04:15.678 DEBUG> Sent StartupMessage on <CustomConnection(281470371258256) 127.167.175.35:9042>
10:04:15.683 DEBUG> Got ReadyMessage on new connection (281470371258256) from 127.167.175.35:9042
10:04:15.684 DEBUG> First connection created to 127.167.175.35:9042 for shard_id=0
10:04:15.686 DEBUG> Finished initializing connection for host 127.167.175.35:9042
10:04:15.686 DEBUG> Got ReadyMessage on new connection (281470373696208) from 127.167.175.35:9042
10:04:15.687 DEBUG> Added pool for host 127.167.175.35:9042 to session
10:04:15.687 DEBUG> First connection created to 127.167.175.35:9042 for shard_id=0
10:04:15.687 INFO> Connection pools established for node 127.167.175.35:9042
10:04:15.687 DEBUG> Finished initializing connection for host 127.167.175.35:9042
10:04:15.688 DEBUG> Host 127.167.175.35:9042 is now marked up
10:04:15.688 DEBUG> Added pool for host 127.167.175.35:9042 to session
10:04:15.688 DEBUG> shard_aware_endpoint=None
10:04:15.688 DEBUG> Shutting down connections to 127.167.175.35:9042
10:04:15.689 DEBUG> Closing connection (281470371258256) to 127.167.175.35:9042
10:04:15.689 DEBUG> Closing connection (281470371258256) to 127.167.175.35:9042
10:04:15.689 DEBUG> Closed socket to 127.167.175.35:9042
10:04:15.689 INFO> Connection pools established for node 127.167.175.35:9042
10:04:15.689 DEBUG> Host 127.167.175.35:9042 is now marked up
10:04:15.689 DEBUG> shard_aware_endpoint=None
10:04:15.690 DEBUG> Connection 281470371266320: '('127.0.0.1', 34118)' -> '('127.167.175.35', 9042)'
10:04:15.690 DEBUG> Sending initial options message for new connection (281470371266320) to 127.167.175.35:9042
10:04:15.690 DEBUG> Connection 281470371261456: '('127.0.0.1', 34134)' -> '('127.167.175.35', 9042)'
10:04:15.691 DEBUG> Sending initial options message for new connection (281470371261456) to 127.167.175.35:9042
10:04:15.706 DEBUG> Parsing sharding info from message options {'COMPRESSION': ['lz4', 'snappy'], 'SCYLLA_LWT_ADD_METADATA_MARK': ['LWT_OPTIMIZATION_META_BIT_MASK=2147483648'], 'SCYLLA_NR_SHARDS': ['2'], 'SCYLLA_PARTITIONER': ['org.apache.cassandra.dht.Murmur3Partitioner'], 'SCYLLA_RATE_LIMIT_ERROR': ['ERROR_CODE=61440'], 'SCYLLA_SHARD': ['0'], 'SCYLLA_SHARDING_ALGORITHM': ['biased-token-round-robin'], 'SCYLLA_SHARDING_IGNORE_MSB': ['12'], 'TABLETS_ROUTING_V1': ['']}
10:04:15.707 DEBUG> Received options response on new connection (281470371261456) from 127.167.175.35:9042
10:04:15.707 DEBUG> No available compression types supported on both ends. locally supported: odict_keys([]). remotely supported: ['lz4', 'snappy']
10:04:15.707 DEBUG> Sending StartupMessage on <CustomConnection(281470371261456) 127.167.175.35:9042>
10:04:15.707 DEBUG> Sent StartupMessage on <CustomConnection(281470371261456) 127.167.175.35:9042>
10:04:15.707 DEBUG> Parsing sharding info from message options {'COMPRESSION': ['lz4', 'snappy'], 'SCYLLA_LWT_ADD_METADATA_MARK': ['LWT_OPTIMIZATION_META_BIT_MASK=2147483648'], 'SCYLLA_NR_SHARDS': ['2'], 'SCYLLA_PARTITIONER': ['org.apache.cassandra.dht.Murmur3Partitioner'], 'SCYLLA_RATE_LIMIT_ERROR': ['ERROR_CODE=61440'], 'SCYLLA_SHARD': ['1'], 'SCYLLA_SHARDING_ALGORITHM': ['biased-token-round-robin'], 'SCYLLA_SHARDING_IGNORE_MSB': ['12'], 'TABLETS_ROUTING_V1': ['']}
10:04:15.708 DEBUG> Received options response on new connection (281470371266320) from 127.167.175.35:9042
10:04:15.708 DEBUG> No available compression types supported on both ends. locally supported: odict_keys([]). remotely supported: ['lz4', 'snappy']
10:04:15.708 DEBUG> Sending StartupMessage on <CustomConnection(281470371266320) 127.167.175.35:9042>
10:04:15.708 DEBUG> Sent StartupMessage on <CustomConnection(281470371266320) 127.167.175.35:9042>
10:04:15.715 DEBUG> Got ReadyMessage on new connection (281470371266320) from 127.167.175.35:9042
10:04:15.716 DEBUG> Received a connection 281470371266320 for shard_id=1 on host 127.167.175.35:9042
10:04:15.716 DEBUG> Pool for host 127.167.175.35:9042 is in shutdown, closing the new connection (281470371266320)
10:04:15.716 DEBUG> Closing connection (281470371266320) to 127.167.175.35:9042
10:04:15.716 DEBUG> Closed socket to 127.167.175.35:9042
10:04:15.720 DEBUG> Got ReadyMessage on new connection (281470371261456) from 127.167.175.35:9042
10:04:15.724 DEBUG> Received a connection 281470371261456 for shard_id=0 on host 127.167.175.35:9042
10:04:15.725 DEBUG> Putting a connection 281470371261456 to shard 0 to the excess pool of host 127.167.175.35:9042

We can clearly see that the driver managed to reconnect

10:04:15.658 INFO> Successful reconnection to 127.167.175.35:9042, marking node up if it isn't already
10:04:15.658 DEBUG> Waiting to acquire lock for handling up status of node 127.167.175.35:9042
10:04:15.659 DEBUG> Starting to handle up status of node 127.167.175.35:9042
10:04:15.659 DEBUG> Got ReadyMessage on new connection (281470374861840) from 127.167.175.35:9042

but then, for some reason, drops connections...

10:04:15.661 DEBUG> Attempting to open new connection pools for host 127.167.175.35:9042
10:04:15.661 DEBUG> Signalling to load balancing policies that host 127.167.175.35:9042 is up
10:04:15.661 DEBUG> Closing connection (281470374907600) to 127.167.175.35:9042

One curious thing I noticed is this:

10:04:15.659 DEBUG> Now that host 127.167.175.35:9042 is up, cancelling the reconnection handler

This suggests that there is some kind of reconnection logic -- perhaps reacting to events? -- and that logic is supposed to be cancelled. However, reconnection still happens.

I have a suspicion that the driver is perhaps reacting to notifications from Scylla. Different nodes may be noticing that our node is UP at different times, and sending notifications to the driver (additionally, these notifications can also be delayed for whatever reason). Whenever the driver receives such a notification, this triggers reconnection.

But -- perhaps that reconnection logic should be cancelled once the driver first manages to reconnect? The message above suggests such a mechanism is already in place, but apparently it's not, or it doesn't work. Or perhaps, upon receiving such a notification, the driver should first check that the existing connections (if present) are alive (exchange a packet back and forth?), and only if they're not, attempt reconnection?
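To make that last idea concrete, here is a rough, hypothetical sketch of "probe the existing pool before reconnecting" -- this is not the driver's actual code, and all names are made up for illustration:

class PoolGuard:
    """Hypothetical helper: decide whether an UP notification should trigger a pool rebuild."""

    def __init__(self, session, probe_timeout: float = 1.0):
        self.session = session
        self.probe_timeout = probe_timeout

    def pool_is_alive(self, host) -> bool:
        # Exchange a trivial request/response over the existing pool; if it answers,
        # the connections are healthy and rebuilding them would only cause churn.
        try:
            self.session.execute("select * from system.local", host=host,
                                 timeout=self.probe_timeout)
            return True
        except Exception:
            return False

    def on_up_notification(self, host) -> None:
        if self.pool_is_alive(host):
            # The notification is obsolete: we already reconnected after that restart.
            return
        self.rebuild_pool(host)

    def rebuild_pool(self, host) -> None:
        # Placeholder for the driver's actual reconnection path.
        raise NotImplementedError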

kbr-scylla commented 6 months ago

Note: in the above run, the query failed with

E       cassandra.cluster.NoHostAvailable: ('Unable to complete the operation against any hosts', {<Host: 127.167.175.35:9042 datacenter1>: ConnectionException('Host has been marked down or removed')})

kbr-scylla commented 6 months ago

Perhaps a workaround in the tests would be to first wait for all nodes to see the restarted node as UP, and only then do the wait_for_cql. This way, by the time select * from system.local succeeds, there should be no notifications from Scylla any more, thus no driver reconnects...

Except I'm not sure. It may be true that the notifications are sent as nodes observe the restarted node as UP -- but they may be received by the driver much later; I don't think there's an ordering guarantee. I'm 99.999% sure there isn't --- if we wanted such a guarantee, the /gossiper/endpoint/live Scylla API would have to keep returning DOWN until all notifications to all driver sessions are delivered. I doubt the gossiper blocks on drivers; that would make internal Scylla behavior depend too much on an external component.

Still, the workaround could potentially minimize the probability of the issue.
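A rough sketch of this workaround, assuming the Scylla REST API listens on port 10000 on each node and that GET /gossiper/endpoint/live returns a JSON list of live IP addresses (the exact response shape may differ):

import asyncio
import time

import aiohttp

async def wait_until_seen_as_live(observer_ips: list[str], restarted_ip: str, deadline: float) -> None:
    """Poll every other node's gossiper state until all of them report the restarted node as live."""
    async with aiohttp.ClientSession() as http:
        for observer in observer_ips:
            while True:
                assert time.time() < deadline, "Deadline exceeded"
                async with http.get(f"http://{observer}:10000/gossiper/endpoint/live/") as resp:
                    live = await resp.json()
                if restarted_ip in live:
                    break
                await asyncio.sleep(1)

# Intended usage (illustrative): only after gossip has converged, wait for CQL:
#   await wait_until_seen_as_live(other_ips, restarted_ip, deadline)
#   await wait_for_cql_and_get_hosts(cql, servers, deadline)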

Another workaround is to recreate the driver session every time we restart a node, and this was applied to some tests already (big story here: https://github.com/scylladb/scylladb/issues/14746)

Unfortunately IIRC it's not a 100% workaround, and the driver may still decide to reconnect again (perhaps the newly created driver session is again receiving notifications from Scylla), although the probability is much smaller.
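For reference, the session-recreation workaround is roughly this (a sketch; contact points and cleanup details depend on the test framework):

from cassandra.cluster import Cluster

def recreate_session(old_cluster: Cluster, contact_points: list[str]):
    # Drop all driver state tied to the pre-restart topology...
    old_cluster.shutdown()
    # ...and build a fresh Cluster/Session that rediscovers the cluster from scratch.
    new_cluster = Cluster(contact_points=contact_points)
    return new_cluster.connect()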

Also, dtests likely don't have this problem because instead of using a single cluster-wide session, they use "exclusive" sessions, i.e. every node has its own session connected to it, and thanks to a load balancing policy the driver does not connect to other nodes (?). If that's the case, there should be no notifications from other nodes. But I'm not sure whether using an execution profile or policy actually prevents a session from connecting to other nodes, or whether it only prevents querying those nodes while the pools are still established. (cc @fruch -- maybe you could confirm or deny my reasoning about how dtests use the driver)

kbr-scylla commented 6 months ago

cc @kostja @nyh

fruch commented 6 months ago

> Perhaps a workaround in the tests would be to first wait for all nodes to see the restarted node as UP, and only then do the wait_for_cql. This way, by the time select * from system.local succeeds, there should be no notifications from Scylla any more, thus no driver reconnects...
>
> Except I'm not sure. It may be true that the notifications are sent as nodes observe the restarted node as UP -- but they may be received by the driver much later; I don't think there's an ordering guarantee. I'm 99.999% sure there isn't --- if we wanted such a guarantee, the /gossiper/endpoint/live Scylla API would have to keep returning DOWN until all notifications to all driver sessions are delivered. I doubt the gossiper blocks on drivers; that would make internal Scylla behavior depend too much on an external component.
>
> Still, the workaround could potentially minimize the probability of the issue.
>
> Another workaround is to recreate the driver session every time we restart a node, and this was applied to some tests already (big story here: scylladb/scylladb#14746)
>
> Unfortunately IIRC it's not a 100% workaround, and the driver may still decide to reconnect again (perhaps the newly created driver session is again receiving notifications from Scylla), although the probability is much smaller.
>
> Also, dtests likely don't have this problem because instead of using a single cluster-wide session, they use "exclusive" sessions, i.e. every node has its own session connected to it, and thanks to a load balancing policy the driver does not connect to other nodes (?). If that's the case, there should be no notifications from other nodes. But I'm not sure whether using an execution profile or policy actually prevents a session from connecting to other nodes, or whether it only prevents querying those nodes while the pools are still established. (cc @fruch -- maybe you could confirm or deny my reasoning about how dtests use the driver)

This last paragraph about dtest isn't completely accurate: we have "exclusive" connections, but tests use both "exclusive" connections and ones that are not. Until not that long ago, restarting a node took quite a long time because of wait_for_gossip and its friends, and some tests still suffer from issues similar to what you are describing; @temichus was trying to address it, in a direction similar to your suggested workaround. We have seen it in cases where all of the nodes were taken down, which is very rare in dtest tests; if only one node is down, we don't notice such issues.

Also note that dtest doesn't use the same session across tests, and in a lot of places uses a session context manager, i.e. the session is closed right after its usage is done.

As for the pool in exclusive mode: it holds only the nodes which are whitelisted, and the filtering happens when listing the nodes from the system tables. So when the driver works in that mode, it's completely blind to all other nodes: it doesn't open connections to them and doesn't get notifications from them (it might be getting notifications about them, UP/DOWN and such, but it shouldn't be opening connections based on those).
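For reference, an "exclusive" session along these lines can be built with the driver's WhiteListRoundRobinPolicy (a sketch; dtest's actual helper may differ). Hosts outside the whitelist are reported as IGNORED to the driver, so it should not open pools to them:

from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
from cassandra.policies import WhiteListRoundRobinPolicy

def exclusive_session(node_ip: str):
    # Only `node_ip` is considered by the load balancing policy; all other hosts are
    # IGNORED, so no connection pools are opened to them and no queries are routed there.
    profile = ExecutionProfile(load_balancing_policy=WhiteListRoundRobinPolicy([node_ip]))
    cluster = Cluster(contact_points=[node_ip],
                      execution_profiles={EXEC_PROFILE_DEFAULT: profile})
    return cluster.connect()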

As for the actual issue: are the tests you are referring to running with smp=2 or more? It might be that we have one connection to a node up and ready while the connection for the 2nd CPU isn't ready yet, and if a query is executed at that time, it might land on a connection that is going to be replaced because of the shard-aware logic, i.e. the driver doesn't know which CPU it isn't connected to, and hence closes and reopens connections until it hits the right CPU. All of that happens in the background, and this might be the cause of it.

Are you disabling the shard-aware port in those tests? (Using the shard-aware port might minimize that dance of opening/closing connections.)

fruch commented 6 months ago

@kbr-scylla

see https://github.com/search?q=repo%3Ascylladb%2Fscylla-dtest%20wait_reconnection&type=code which is kind of an equivalent of wait_for_cql; it's used only in 4 specific tests in dtest

kbr-scylla commented 6 months ago

> As for the actual issue: are the tests you are referring to running with smp=2 or more?

Yes, smp=2

> It might be that we have one connection to a node up and ready while the connection for the 2nd CPU isn't ready yet, and if a query is executed at that time, it might land on a connection that is going to be replaced because of the shard-aware logic, i.e. the driver doesn't know which CPU it isn't connected to, and hence closes and reopens connections until it hits the right CPU. All of that happens in the background, and this might be the cause of it.

> Are you disabling the shard-aware port in those tests? (Using the shard-aware port might minimize that dance of opening/closing connections.)

The shard aware port is enabled.

But you did point out an important fact: doing select * from system.local is not enough, as it only tests the connection to one shard, while the connection to the other shard might not even be established yet... and the following query may go to the other shard. So we also have a problem in test.py: we need a better way to verify that connections are established, one which checks all shards. I'm not sure if that's possible; we may need better guarantees / an API from the driver after all. Also, this doesn't change the fact that the driver seems to be reconnecting like crazy -- if it were only one reconnect per shard I would understand, but the logs show more.
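One possible test-side heuristic, building on the wait_for helper above: run the probe query several times per host so that, with high probability, each shard's connection has served at least one of them. The driver picks the connection internally, so this does not guarantee full shard coverage, and the shard count is passed in as a parameter rather than read from the driver:

async def wait_for_cql_all_shards(cql: Session, host: Host, deadline: float,
                                  shards: int = 2, probes_per_shard: int = 4) -> None:
    async def all_probes_ok():
        try:
            # Issue many probes; each may be served by a different per-shard connection.
            for _ in range(shards * probes_per_shard):
                await cql.run_async("select * from system.local", host=host)
        except NoHostAvailable:
            logging.info(f"Driver not (fully) connected to {host} yet")
            return None
        return True
    await wait_for(all_probes_ok, deadline)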

sylwiaszunejko commented 6 months ago

@kbr-scylla I wrote a test to reproduce the issue, https://github.com/sylwiaszunejko/scylladb/tree/multiple_driver_reconnections, but it did not occur for me. I ran it many times. Maybe there is something wrong with the test; could you look at it and give some advice?

avelanarius commented 6 months ago

The cause of this issue could be the same as https://github.com/scylladb/scylladb/issues/16709. However, reproducing the issue is very tough (@sylwiaszunejko managed to reproduce it about once per hour, or less often). @kbr-scylla ping, do you happen to know a good reproducer?

kbr-scylla commented 6 months ago

The cause of this issue could be the same as https://github.com/scylladb/scylladb/issues/16709.

Seems unlikely, because the symptoms are completely opposite --- here we have too many reconnects and in that other issue we have 0 reconnects (i.e. https://github.com/scylladb/python-driver/issues/230).

But who knows, maybe it's related.

However, reproducing the issue is very tough (@sylwiaszunejko managed to reproduce it once every 1h or less often). @kbr-scylla ping, do you happen to know a good reproducer?

No.

kbr-scylla commented 5 months ago

@avelanarius @sylwiaszunejko one test that is failing often in CI is https://github.com/scylladb/scylladb/issues/17353, but it's a bit of a different situation: it's a single-node test and we don't restart the node, but disable and re-enable the CQL server, which causes driver reconnection. The test then waits until the driver reconnects and executes a query. Unfortunately the driver may reconnect again (?), causing the query to fail... Or perhaps the query is attempted on a different shard that hasn't connected yet?

kbr-scylla commented 5 months ago

We have a good reproducer https://github.com/scylladb/scylladb/issues/17353#issuecomment-1970969872

kbr-scylla commented 5 months ago

Branch: https://github.com/kbr-scylla/scylladb/tree/connection-reconnect-repro -- compile in debug mode and run:

./test.py --mode debug --cpus=0,1 --jobs=32 test_native_transport --repeat 100

note: this is a single-node test

kbr-scylla commented 5 months ago

Updated the branch, reducing the shard number on Scylla to 1. Run with:

./test.py --mode debug --cpus=0 --jobs=32 test_native_transport --repeat 100

It reproduces even more often!

So apparently, it's not (only) a problem of having connections to multiple shards. Even with a single shard the driver is still dropping connections.

kbr-scylla commented 5 months ago

Note: The reproducer mentioned above is for a single-node case, where the node disables the CQL server and re-enables it. We bisected the regression for that single-node case to d735957e3a7b7178f4fff26f6dbae588e58cf314.

However, originally, this issue was reported due to https://github.com/scylladb/scylladb/issues/14746#issuecomment-1713739316

There it was a rolling restart of the entire cluster. But that was reported on Jul 18 --- so before d735957e3a7b7178f4fff26f6dbae588e58cf314.

So d735957e3a7b7178f4fff26f6dbae588e58cf314 cannot be the root cause of https://github.com/scylladb/scylladb/issues/14746#issuecomment-1713739316 which triggered this issue originally.

avelanarius commented 5 months ago

Reopening the issue since there might still be a problem for multi-node clusters.

kbr-scylla commented 2 months ago

This is reproducing, as can be seen in https://github.com/scylladb/scylladb/issues/19154#issuecomment-2163506387

In fact, it happens even if, after restarting nodes, we start a completely new driver session and verify that it can perform queries with wait_for_cql_and_get_hosts -- the new session decides to recreate connection pools immediately after, causing subsequent queries to fail.

This is probably caused by notifications arising from Scylla.

The driver should be sufficiently smart to recognize obsolete notifications -- i.e. if it gets a notification about node restart, but it has already reconnected after that restart, then it should not reconnect again.

If this requires the notification messages originating at Scylla to be augmented with some additional info, we can discuss this and perhaps open a new feature request in Scylla. But perhaps it's not necessary. Maybe there is a way on the driver side to compare the order of events based on already available information.
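Purely as an illustration of that last idea (not a description of existing driver internals), such a comparison could look like the sketch below, assuming the notifications were extended to carry an origination timestamp; a gossip generation number would serve the same purpose without cross-machine clock comparison:

import time

class StaleNotificationFilter:
    """Hypothetical per-host filter: ignore UP notifications describing a restart
       that the driver has already recovered from."""

    def __init__(self):
        self._last_pool_rebuild = float("-inf")

    def pool_rebuilt(self) -> None:
        # Record when the connection pool for this host was last (re)established.
        self._last_pool_rebuild = time.time()

    def should_reconnect(self, notification_origin_time: float) -> bool:
        # A notification about an event that happened before our last successful
        # rebuild is obsolete: we already reconnected after that restart.
        return notification_origin_time > self._last_pool_rebuild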