Connection to a ScyllaDB cluster is delayed as it tries with unsupported versions first

mykaul commented 1 year ago

Looking at a pcap, the ScyllaDB Python driver tries to connect with versions we do not support - 66 (DSE_V2), 65 (DSE_V1), 5. That means 3 connections that are responded with an ERROR, until we succeed to connect. Each connection is of course the whole 3 way TCP conn, then closed, etc... cc @fruch (I've seen it in cqlsh, now I understand it's due to the Python driver defaults)

mykaul commented 1 year ago

We can (should?) also change:

SUPPORTED_VERSIONS = (66, 65, 6, 5, 4, 3, 2, 1)

To what we actually support.

Lorak-mmk commented 1 year ago

We can change the order in which protocols are tried, I'll try to get to it soon. However, I wanted to note that this doesn't really matter if there aren't many connections (like when using clqsh). Performance of connecting will matter if there are a lot of connections made at the same time. And it that case, protocol_version should be explicitly set when constructing Cluster object instead of relying on driver guessing the version - if our tests or clients don't do it, they should be instructed to. If someone for some reason needs connection setup performance while using cqlsh - it has a --protocol-version parameter which should be used in that case.

Lorak-mmk commented 1 year ago

We can (should?) also change:
SUPPORTED_VERSIONS = (66, 65, 6, 5, 4, 3, 2, 1)
To what we actually support.

We only test protocol v4 in CI, but I think all the others should be supported by driver (I may be wrong here) - which ones do you think should be removed from this list?

mykaul commented 1 year ago

We (ScyllaDB) do not support version >4, but I reckon you raise an interesting argument - what if someone wants to use our driver against both ScyllaDB and some DSE version... I don't think we test or should focus on the usecase. I prefer faster connection out of the box.

mykaul commented 1 year ago

We can change the order in which protocols are tried, I'll try to get to it soon. However, I wanted to note that this doesn't really matter if there aren't many connections (like when using clqsh). Performance of connecting will matter if there are a lot of connections made at the same time. And it that case, protocol_version should be explicitly set when constructing Cluster object instead of relying on driver guessing the version - if our tests or clients don't do it, they should be instructed to. If someone for some reason needs connection setup performance while using cqlsh - it has a --protocol-version parameter which should be used in that case.

Yes, I was lazy so used cqlsh, but the issue is in the Python driver (and perhaps others?) and I'm looking at the impact when hundreds or thousands of connection attempts are made. The argument about specifying the exact protocol version is valid, but assumes someone takes this into consideration in the first place.

nyh commented 1 year ago

An easy solution is put the version that Scylla supports (4) as the first one in the long list.

But this change may break a guarantee made in the driver documentation https://docs.datastax.com/en/developer/python-driver/3.25/api/cassandra/cluster/:

protocol_version = 66 - The maximum version of the native protocol to use. If not set in the constructor, the driver will automatically downgrade version based on a negotiation with the server, but it is most efficient to set this to the maximum supported by your version of Cassandra. Setting this will also prevent conflicting versions negotiated if your cluster is upgraded.

This text suggests that the driver defaults to trying version 66 and then downgrading. If it starts with 4, it will never try anything higher. This text suggest that it's the user's responsibility to set the protocol_version parameter in the cluster (and as @Lorak-mmk said, we indeed do this in test/cql-pytest/util.py and other places).

We could, of course, change this documentation to explain we actually always use 4, and if you want a higher protocol, you should set it manually.

By the way, we have Scylla issues on how Scylla handles any protocol versions except 4 incorrectly :( See https://github.com/scylladb/scylladb/issues/9791 and https://github.com/scylladb/scylladb/issues/11899

mykaul commented 1 year ago

This text suggests that the driver defaults to trying version 66 and then downgrading. If it starts with 4, it will never try anything higher. This text suggest that it's the user's responsibility to set the protocol_version parameter in the cluster (and as @Lorak-mmk said, we indeed do this in test/cql-pytest/util.py and other places).

The above is not our driver's text. There's a good open question about ScyllaDB driver support for Cassandra latest versions. I assume 'not tested' (certainly not for their enterprise versions!) is an accurate statement, though in reality we are probably fine...

nyh commented 1 year ago

The above is not our driver's text.

There's a separate argument why we maintain our own driver with random improvements (and not just Scylla-specific features), and whether Scylla users must use the Scylla verision of the driver for every language (do we even have a Scylla version of every language?). I guess this issue is not the place to hold this discussion.

There's a good open question about ScyllaDB driver support for Cassandra latest versions. I assume 'not tested' (certainly not for their enterprise versions!) is an accurate statement, though in reality we are probably fine...

I don't think anybody will seriously consider using Scylla's driver for running against Cassandra, so the answer to this question is not important.

Lorak-mmk commented 1 year ago

The above is not our driver's text. There's a good open question about ScyllaDB driver support for Cassandra latest versions. I assume 'not tested' (certainly not for their enterprise versions!) is an accurate statement, though in reality we are probably fine...

I don't think anybody will seriously consider using Scylla's driver for running against Cassandra, so the answer to this question is not important.

Please see: https://github.com/scylladb/python-driver/blob/master/docs/upgrading.rst I'll paste the relevant fragment below:

Since 3.21.0, scylla-driver fully supports DataStax products. dse-driver and
dse-graph users should now migrate to scylla-driver to benefit from latest bug fixes
and new features. The upgrade to this new unified driver version is straightforward
with no major API changes.

Changing the default version of the protocol would break backwards compatibility (because our docs say what version is the default, as @nyh noticed) and it would also break above promise.

There's a separate argument why we maintain our own driver with random improvements (and not just Scylla-specific features), and whether Scylla users must use the Scylla verision of the driver for every language (do we even have a Scylla version of every language?). I guess this issue is not the place to hold this discussion.

We don't - most notably, we don't have C# and Node.js drivers.

fruch commented 1 year ago

@Lorak-mmk

the docs mostly copy-paste from upstream with search and replace for the driver name, we did removed lots of parts, this part can be removed as well.

I don't think we need to support or test DSE versions, if we are compatible with cassandra it's probably enough for our users.

Also using CQL v4, should be working with cassandra and DSE,

starting with the version we support best, sounds o.k., especially since user would be able to select anything else they might need, if then need any of the advanced features of those versions.

Lorak-mmk commented 1 year ago

If we are ok with not supporting DSE and dropping this promise from docs, then awesome - we can easily drop default protocol version to V5, and potentially remove a lot of DSE-specific code from the driver in the future. That would decrease failed attempts from 3 to 1 when connecting to Scylla without specifying protocol version. Dropping to V4 would still be a breaking change (as we want to be compatible with Cassandra), wouldn't it? Unless there are no new features there.

mykaul commented 1 year ago

@tzach , @annastuchlik - please see https://github.com/scylladb/python-driver/issues/244#issuecomment-1649731361 - this is probably a bug in our documentation.

mykaul commented 1 year ago

If we are ok with not supporting DSE and dropping this promise from docs, then awesome - we can easily drop default protocol version to V5, and potentially remove a lot of DSE-specific code from the driver in the future. That would decrease failed attempts from 3 to 1 when connecting to Scylla without specifying protocol version. Dropping to V4 would still be a breaking change (as we want to be compatible with Cassandra), wouldn't it? Unless there are no new features there.

The downside is that syncing with upstream changes are going to be more challenging - either you resolve conflicts, or you (re)apply patches after syncing (the latter is easier?) - but it becomes a process. I assume we have the former today, in some cases, anyway.

Lorak-mmk commented 1 year ago

This issue affects control connection, not any of subsequent connections. So if your client connects to a cluster which has 8 nodes, 32 shards each, then the control connection will have 3 failures, as described, but the rest of shard connections (256) should connect at first try. Initially after reading description I thought that each connection is affected, so I'm noting it here in case anyone else misunderstands.

mykaul commented 1 year ago

This issue affects control connection, not any of subsequent connections. So if your client connects to a cluster which has 8 nodes, 32 shards each, then the control connection will have 3 failures, as described, but the rest of shard connections (256) should connect at first try. Initially after reading description I thought that each connection is affected, so I'm noting it here in case anyone else misunderstands.

Isn't it per node given (the control connection)? Or once an initial connection is made, for the very 1st time, everything else works with the correct version, across nodes and shards? (I've tried with cqlsh against a single shard, single node docker instance...)

mykaul commented 1 year ago

(The change in cqlsh is https://github.com/scylladb/scylla-cqlsh/pull/45 )

Lorak-mmk commented 1 year ago

After reading the code I'm pretty sure it's once per Cluster object - so only one connection, not one per node, and if you create multiple Session objects from one Cluster, only the first one will encounter this issue. I'll verify this in Wireshark later.

mykaul commented 1 year ago

After reading the code I'm pretty sure it's once per Cluster object - so only one connection, not one per node, and if you create multiple Session objects from one Cluster, only the first one will encounter this issue. I'll verify this in Wireshark later.

If that's the case, it is really benign and not worth fixing (as it'll introduce conflict with upstream on every rebase and I've offered a fix for cqlsh already)

mykaul commented 1 year ago

Not worth handling, per the above discussion - closing.

scylladb / python-driver

Connection to a ScyllaDB cluster is delayed as it tries with unsupported versions first #244