scylladb / python-driver

ScyllaDB Python Driver, originally DataStax Python Driver for Apache Cassandra
https://python-driver.docs.scylladb.com
Apache License 2.0

test_request_too_large[SSL] gets stuck in an endless loop of trying to connect to Scylla #261

Open fruch opened 9 months ago

fruch commented 9 months ago

The test goes into a loop like the following:

01:34:30,919 790     cassandra.cluster              WARNING  thread.py           :58   | test_request_too_large[SSL]: Node 127.0.46.1:9042 is reporting a schema disagreement: None
01:34:59,357 790     cassandra.connection           WARNING  threading.py        :1038 | test_request_too_large[SSL]: Heartbeat failed for connection (140630522150416) to 127.0.46.1:9042
01:35:04,358 790     cassandra.cluster              WARNING  thread.py           :58   | test_request_too_large[SSL]: [control connection] Error connecting to 127.0.46.1:9042:
Traceback (most recent call last):
  File "cassandra/cluster.py", line 3607, in cassandra.cluster.ControlConnection._connect_host_in_lbp
  File "cassandra/cluster.py", line 3653, in cassandra.cluster.ControlConnection._try_connect
  File "cassandra/cluster.py", line 1689, in cassandra.cluster.Cluster.connection_factory
  File "cassandra/connection.py", line 885, in cassandra.connection.Connection.factory
cassandra.OperationTimedOut: errors=Timed out creating connection (5 seconds), last_host=None
01:35:09,360 790     cassandra.cluster              WARNING  thread.py           :58   | test_request_too_large[SSL]: [control connection] Error connecting to 127.0.46.1:9042:
Traceback (most recent call last):
  File "cassandra/cluster.py", line 3607, in cassandra.cluster.ControlConnection._connect_host_in_lbp
  File "cassandra/cluster.py", line 3653, in cassandra.cluster.ControlConnection._try_connect
  File "cassandra/cluster.py", line 1689, in cassandra.cluster.Cluster.connection_factory
  File "cassandra/connection.py", line 885, in cassandra.connection.Connection.factory
cassandra.OperationTimedOut: errors=Timed out creating connection (5 seconds), last_host=None
01:35:14,361 790     cassandra.cluster              WARNING  thread.py           :58   | test_request_too_large[SSL]: [control connection] Error connecting to 127.0.46.1:9042:
Traceback (most recent call last):
  File "cassandra/cluster.py", line 3607, in cassandra.cluster.ControlConnection._connect_host_in_lbp
  File "cassandra/cluster.py", line 3653, in cassandra.cluster.ControlConnection._try_connect
  File "cassandra/cluster.py", line 1689, in cassandra.cluster.Cluster.connection_factory
  File "cassandra/connection.py", line 885, in cassandra.connection.Connection.factory
cassandra.OperationTimedOut: errors=Timed out creating connection (5 seconds), last_host=None
fruch commented 9 months ago

I can reproduce it locally (though not 100% of the time); trying to bisect whether a specific Scylla commit introduced it.

fruch commented 9 months ago

reproducer:

pytest limits_test.py::TestLimits::test_request_too_large -k "[SSL" --scylla-version=unstable/master:2023-09-26T21:50:54Z --log-cli-level=debug -n 2 --count=10
fruch commented 9 months ago

Reproduced on:

Seems like it's an issue that can happen regardless of which Scylla version is used.

elcallio commented 9 months ago

I can't repro this, neither locally (enterprise) nor in docker (2023.1). Do we have CI runs where this happened?

fruch commented 9 months ago

I can't repro this, neither locally (enterprise) nor in docker (2023.1). Do we have CI runs where this happened?

we are seeing it on master, and as I wrote above, with enough runs and parallelism of the same test it reproduces locally for me

happened on: https://jenkins.scylladb.com/job/scylla-master/job/dtest-release/386/ through https://jenkins.scylladb.com/job/scylla-master/job/dtest-release/389/

It's a bit hard to see on CI, since it causes the whole test run to time out, but here you can see the node log of it: https://jenkins.scylladb.com/job/scylla-master/job/dtest-release/389/artifact/logs-full.release.000/orphaned/dtest-2i5lu3jo/

elcallio commented 9 months ago

Well, the log indicates that the CQL connection gets an ENOTCONN on the Scylla side, right after the new table is created. This also happens when the test succeeds, and should maybe be considered part of the CONNRESET logic...

Why the next connection attempt fails is a mystery. I am leaning toward the client side being the issue, but as I still cannot repro this in either a relocatable or a local env, I really don't know.

I am also perplexed by the "Node 127.0.46.1:9042 is reporting a schema disagreement: None" in the failing log. This is not expected.

fruch commented 9 months ago

I've run it locally a few times, and my conclusion is that it's a driver bug.

failed one:

21:43:31,906 69      cassandra.pool                 DEBUG    thread.py           :58   | test_request_too_large[SSL-13-15]: Received a connection 140560399457232 for shard_id=0 on host 127.0.88.1:9042
21:43:31,910 69      cassandra.pool                 DEBUG    thread.py           :58   | test_request_too_large[SSL-13-15]: New connection (140560399457232) created to shard_id=0 on host 127.0.88.1:9042
21:43:32,119 69      cassandra.io.libevreactor      DEBUG    libevreactor.py     :372  | test_request_too_large[SSL-13-15]: Connection <LibevConnection(140560398787984) 127.0.88.1:9042> closed by server
21:43:32,119 69      cassandra.io.libevreactor      DEBUG    libevreactor.py     :287  | test_request_too_large[SSL-13-15]: Closing connection (140560398787984) to 127.0.88.1:9042
21:43:32,119 69      cassandra.io.libevreactor      DEBUG    libevreactor.py     :291  | test_request_too_large[SSL-13-15]: Closed socket to 127.0.88.1:9042
21:43:32,120 69      cassandra.pool                 DEBUG    libevreactor.py     :295  | test_request_too_large[SSL-13-15]: Defunct or closed connection (140560398787984) returned to pool, potentially marking host 127.0.88.1:9042 as down
21:43:32,120 69      cassandra.pool                 DEBUG    libevreactor.py     :295  | test_request_too_large[SSL-13-15]: Shutting down connections to 127.0.88.1:9042
21:43:32,720 69      cassandra.cluster              DEBUG    thread.py           :58   | test_request_too_large[SSL-13-15]: [control connection] Waiting for schema agreement
21:43:38,721 69      cassandra.cluster              DEBUG    thread.py           :58   | test_request_too_large[SSL-13-15]: [control connection] Timed out waiting for response during schema agreement check: errors=None, last_host=None
21:43:44,726 69      cassandra.cluster              DEBUG    thread.py           :58   | test_request_too_large[SSL-13-15]: [control connection] Timed out waiting for response during schema agreement check: errors=None, last_host=None
21:43:50,729 69      cassandra.cluster              DEBUG    thread.py           :58   | test_request_too_large[SSL-13-15]: [control connection] Timed out waiting for response during schema agreement check: errors=None, last_host=None
21:43:56,730 69      cassandra.cluster              DEBUG    thread.py           :58   | test_request_too_large[SSL-13-15]: [control connection] Timed out waiting for response during schema agreement check: errors=None, last_host=None
21:44:01,711 69      cassandra.connection           DEBUG    threading.py        :1038 | test_request_too_large[SSL-13-15]: Cannot send heartbeat message on connection (140560398787984) to 127.0.88.1:9042
21:44:02,731 69      cassandra.cluster              DEBUG    thread.py           :58   | test_request_too_large[SSL-13-15]: [control connection] Timed out waiting for response during schema agreement check: errors=None, last_host=None
21:44:08,733 69      cassandra.cluster              DEBUG    thread.py           :58   | test_request_too_large[SSL-13-15]: [control connection] Timed out waiting for response during schema agreement check: errors=None, last_host=None
21:44:14,735 69      cassandra.cluster              DEBUG    thread.py           :58   | test_request_too_large[SSL-13-15]: [control connection] Timed out waiting for response during schema agreement check: errors=None, last_host=None
21:44:20,736 69      cassandra.cluster              DEBUG    thread.py           :58   | test_request_too_large[SSL-13-15]: [control connection] Timed out waiting for response during schema agreement check: errors=None, last_host=None
21:44:26,737 69      cassandra.cluster              DEBUG    thread.py           :58   | test_request_too_large[SSL-13-15]: [control connection] Timed out waiting for response during schema agreement check: errors=None, last_host=None
21:44:31,713 69      cassandra.connection           DEBUG    threading.py        :1038 | test_request_too_large[SSL-13-15]: Cannot send heartbeat message on connection (140560398787984) to 127.0.88.1:9042
21:44:31,714 69      cassandra.connection           DEBUG    threading.py        :1038 | test_request_too_large[SSL-13-15]: Sending options message heartbeat on idle connection (140560390349456) 127.0.88.1:9042
21:44:32,720 69      cassandra.cluster              DEBUG    thread.py           :58   | test_request_too_large[SSL-13-15]: [control connection] Timed out waiting for response during schema agreement check: errors=None, last_host=None
21:44:32,721 69      cassandra.cluster              WARNING  thread.py           :58   | test_request_too_large[SSL-13-15]: Node 127.0.88.1:9042 is reporting a schema disagreement: None
21:44:32,722 69      cassandra.cluster              DEBUG    thread.py           :58   | test_request_too_large[SSL-13-15]: Skipping schema refresh due to lack of schema agreement
21:44:32,733 69      cassandra.cluster              DEBUG    thread.py           :58   | test_request_too_large[SSL-13-15]: [control connection] Waiting for schema agreement
21:44:38,737 69      cassandra.cluster              DEBUG    thread.py           :58   | test_request_too_large[SSL-13-15]: [control connection] Timed out waiting for response during schema agreement check: errors=None, last_host=None
21:44:44,739 69      cassandra.cluster              DEBUG    thread.py           :58   | test_request_too_large[SSL-13-15]: [control connection] Timed out waiting for response during schema agreement check: errors=None, last_host=None
21:44:50,740 69      cassandra.cluster              DEBUG    thread.py           :58   | test_request_too_large[SSL-13-15]: [control connection] Timed out waiting for response during schema agreement check: errors=None, last_host=None
21:44:56,741 69      cassandra.cluster              DEBUG    thread.py           :58   | test_request_too_large[SSL-13-15]: [control connection] Timed out waiting for response during schema agreement check: errors=None, last_host=None
21:45:01,715 69      cassandra.connection           WARNING  threading.py        :1038 | test_request_too_large[SSL-13-15]: Heartbeat failed for connection (140560390349456) to 127.0.88.1:9042
21:45:01,715 69      cassandra.connection           DEBUG    threading.py        :1038 | test_request_too_large[SSL-13-15]: Defuncting connection (140560390349456) to 127.0.88.1:9042: errors=Connection heartbeat timeout after 30 seconds, last_host=127.0.88.1:9042
21:45:01,716 69      cassandra.io.libevreactor      DEBUG    libevreactor.py     :287  | test_request_too_large[SSL-13-15]: Closing connection (140560390349456) to 127.0.88.1:9042
21:45:01,716 69      cassandra.io.libevreactor      DEBUG    libevreactor.py     :291  | test_request_too_large[SSL-13-15]: Closed socket to 127.0.88.1:9042
21:45:01,717 69      cassandra.cluster              DEBUG    thread.py           :58   | test_request_too_large[SSL-13-15]: [control connection] Error refreshing schema

working one:

21:43:45,413 66      limits_test                    INFO     limits_test.py      :377  | test_request_too_large[SSL-7-15]: Trying to send a large request to database (insert a large string into table)...
21:43:45,507 66      cassandra.protocol_features    DEBUG    protocol_features.py:56   | test_request_too_large[SSL-7-15]: Parsing sharding info from message options {'COMPRESSION': ['lz4', 'snappy'], 'SCYLLA_LWT_ADD_METADATA_MARK': ['LWT_OPTIMIZATION_META_BIT_MASK=2147483648'], 'SCYLLA_NR_SHARDS': ['2'], 'SCYLLA_PARTITIONER': ['org.apache.cassandra.dht.Murmur3Partitioner'], 'SCYLLA_RATE_LIMIT_ERROR': ['ERROR_CODE=61440'], 'SCYLLA_SHARD': ['0'], 'SCYLLA_SHARDING_ALGORITHM': ['biased-token-round-robin'], 'SCYLLA_SHARDING_IGNORE_MSB': ['12'], 'SCYLLA_SHARD_AWARE_PORT': ['19042']}
21:43:45,507 66      cassandra.connection           DEBUG    libevreactor.py     :370  | test_request_too_large[SSL-7-15]: Received options response on new connection (140588010907728) from 127.0.29.1:9042
21:43:45,507 66      cassandra.connection           DEBUG    libevreactor.py     :370  | test_request_too_large[SSL-7-15]: No available compression types supported on both ends. locally supported: odict_keys([]). remotely supported: ['lz4', 'snappy']
21:43:45,508 66      cassandra.connection           DEBUG    libevreactor.py     :370  | test_request_too_large[SSL-7-15]: Sending StartupMessage on <LibevConnection(140588010907728) 127.0.29.1:9042>
21:43:45,508 66      cassandra.connection           DEBUG    libevreactor.py     :370  | test_request_too_large[SSL-7-15]: Sent StartupMessage on <LibevConnection(140588010907728) 127.0.29.1:9042>
21:43:45,508 66      cassandra.io.libevreactor      DEBUG    libevreactor.py     :372  | test_request_too_large[SSL-7-15]: Connection <LibevConnection(140588240189008) 127.0.29.1:9042> closed by server
21:43:45,508 66      cassandra.io.libevreactor      DEBUG    libevreactor.py     :287  | test_request_too_large[SSL-7-15]: Closing connection (140588240189008) to 127.0.29.1:9042
21:43:45,508 66      cassandra.io.libevreactor      DEBUG    libevreactor.py     :291  | test_request_too_large[SSL-7-15]: Closed socket to 127.0.29.1:9042
21:43:45,508 66      cassandra.pool                 DEBUG    libevreactor.py     :295  | test_request_too_large[SSL-7-15]: Defunct or closed connection (140588240189008) returned to pool, potentially marking host 127.0.29.1:9042 as down
21:43:45,508 66      cassandra.pool                 DEBUG    libevreactor.py     :295  | test_request_too_large[SSL-7-15]: Shutting down connections to 127.0.29.1:9042
21:43:45,508 66      cassandra.pool                 DEBUG    libevreactor.py     :295  | test_request_too_large[SSL-7-15]: Closing connection (140588240189008) to 127.0.29.1:9042
21:43:45,508 66      cassandra.cluster              WARNING  libevreactor.py     :295  | test_request_too_large[SSL-7-15]: Host 127.0.29.1:9042 has been marked down
21:43:45,508 66      cassandra.cluster              DEBUG    thread.py           :58   | test_request_too_large[SSL-7-15]: [control connection] Control connection host (127.0.29.1:9042) is considered down, starting reconnection
21:43:45,508 66      cassandra.cluster              DEBUG    thread.py           :58   | test_request_too_large[SSL-7-15]: Removed connection pool for <Host: 127.0.29.1:9042 datacenter1>
21:43:45,509 66      cassandra.cluster              DEBUG    thread.py           :58   | test_request_too_large[SSL-7-15]: Starting reconnector for host 127.0.29.1:9042
21:43:45,509 66      cassandra.cluster              DEBUG    thread.py           :58   | test_request_too_large[SSL-7-15]: [control connection] Attempting to reconnect
21:43:45,509 66      cassandra.connection           DEBUG    libevreactor.py     :370  | test_request_too_large[SSL-7-15]: Got ReadyMessage on new connection (140588010907728) from 127.0.29.1:9042
21:43:45,509 66      cassandra.pool                 DEBUG    thread.py           :58   | test_request_too_large[SSL-7-15]: Host 127.0.29.1:9042 is now marked up
21:43:45,509 66      cassandra.pool                 DEBUG    thread.py           :58   | test_request_too_large[SSL-7-15]: Received a connection 140588010907728 for shard_id=0 on host 127.0.29.1:9042
21:43:45,510 66      cassandra.cluster              DEBUG    thread.py           :58   | test_request_too_large[SSL-7-15]: [control connection] Opening new connection to 127.0.29.1:9042
21:43:45,510 66      cassandra.pool                 DEBUG    thread.py           :58   | test_request_too_large[SSL-7-15]: Pool for host 127.0.29.1:9042 is in shutdown, closing the new connection (140588010907728)
21:43:45,510 66      cassandra.io.libevreactor      DEBUG    libevreactor.py     :287  | test_request_too_large[SSL-7-15]: Closing connection (140588010907728) to 127.0.29.1:9042
21:43:45,510 66      cassandra.io.libevreactor      DEBUG    libevreactor.py     :291  | test_request_too_large[SSL-7-15]: Closed socket to 127.0.29.1:9042
21:43:45,511 66      cassandra.pool                 DEBUG    thread.py           :58   | test_request_too_large[SSL-7-15]: Shutting down connections to 127.0.29.1:9042
21:43:45,511 66      cassandra.pool                 DEBUG    thread.py           :58   | test_request_too_large[SSL-7-15]: Initializing connection for host 127.0.29.1:9042
21:43:45,512 66      cassandra.cluster              DEBUG    dtest_setup.py      :436  | test_request_too_large[SSL-7-15]: Connecting to cluster, contact points: ['127.0.29.1']; protocol version: 4
21:43:45,512 66      cassandra.pool                 DEBUG    dtest_setup.py      :436  | test_request_too_large[SSL-7-15]: Host 127.0.29.1:9042 is now marked up
21:43:45,513 66      cassandra.cluster              DEBUG    dtest_setup.py      :436  | test_request_too_large[SSL-7-15]: [control connection] Opening new connection to 127.0.29.1:9042
21:43:45,514 66      cassandra.connection           DEBUG    libevreactor.py     :267  | test_request_too_large[SSL-7-15]: Connection 140588010976016 127.0.0.1:40320 -> 127.0.29.1:9042
21:43:45,515 66      cassandra.connection           DEBUG    libevreactor.py     :274  | test_request_too_large[SSL-7-15]: Sending initial options message for new connection (140588010976016) to 127.0.29.1:9042
21:43:45,515 66      cassandra.connection           DEBUG    libevreactor.py     :267  | test_request_too_large[SSL-7-15]: Connection 140588011015248 127.0.0.1:40332 -> 127.0.29.1:9042
21:43:45,516 66      cassandra.connection           DEBUG    libevreactor.py     :274  | test_request_too_large[SSL-7-15]: Sending initial options message for new connection (140588011015248) to 127.0.29.1:9042
21:43:45,519 66      cassandra.connection           DEBUG    libevreactor.py     :267  | test_request_too_large[SSL-7-15]: Connection 140588011024912 127.0.0.1:40348 -> 127.0.29.1:9042
21:43:45,519 66      cassandra.connection           DEBUG    libevreactor.py     :274  | test_request_too_large[SSL-7-15]: Sending initial options message for new connection (140588011024912) to 127.0.29.1:9042
21:43:45,560 66      cassandra.protocol_features    DEBUG    protocol_features.py:56   | test_request_too_large[SSL-7-15]: Parsing sharding info from message options {'COMPRESSION': ['lz4', 'snappy'], 'SCYLLA_LWT_ADD_METADATA_MARK': ['LWT_OPTIMIZATION_META_BIT_MASK=2147483648'], 'SCYLLA_NR_SHARDS': ['2'], 'SCYLLA_PARTITIONER': ['org.apache.cassandra.dht.Murmur3Partitioner'], 'SCYLLA_RATE_LIMIT_ERROR': ['ERROR_CODE=61440'], 'SCYLLA_SHARD': ['0'], 'SCYLLA_SHARDING_ALGORITHM': ['biased-token-round-robin'], 'SCYLLA_SHARDING_IGNORE_MSB': ['12'], 'SCYLLA_SHARD_AWARE_PORT': ['19042']}
21:43:45,560 66      cassandra.connection           DEBUG    libevreactor.py     :370  | test_request_too_large[SSL-7-15]: Received options response on new connection (140588011015248) from 127.0.29.1:9042
21:43:45,560 66      cassandra.connection           DEBUG    libevreactor.py     :370  | test_request_too_large[SSL-7-15]: No available compression types supported on both ends. locally supported: odict_keys([]). remotely supported: ['lz4', 'snappy']
21:43:45,560 66      cassandra.connection           DEBUG    libevreactor.py     :370  | test_request_too_large[SSL-7-15]: Sending StartupMessage on <LibevConnection(140588011015248) 127.0.29.1:9042>
21:43:45,560 66      cassandra.connection           DEBUG    libevreactor.py     :370  | test_request_too_large[SSL-7-15]: Sent StartupMessage on <LibevConnection(140588011015248) 127.0.29.1:9042>
21:43:45,561 66      cassandra.protocol_features    DEBUG    protocol_features.py:56   | test_request_too_large[SSL-7-15]: Parsing sharding info from message options {'COMPRESSION': ['lz4', 'snappy'], 'SCYLLA_LWT_ADD_METADATA_MARK': ['LWT_OPTIMIZATION_META_BIT_MASK=2147483648'], 'SCYLLA_NR_SHARDS': ['2'], 'SCYLLA_PARTITIONER': ['org.apache.cassandra.dht.Murmur3Partitioner'], 'SCYLLA_RATE_LIMIT_ERROR': ['ERROR_CODE=61440'], 'SCYLLA_SHARD': ['1'], 'SCYLLA_SHARDING_ALGORITHM': ['biased-token-round-robin'], 'SCYLLA_SHARDING_IGNORE_MSB': ['12'], 'SCYLLA_SHARD_AWARE_PORT': ['19042']}
21:43:45,561 66      cassandra.connection           DEBUG    libevreactor.py     :370  | test_request_too_large[SSL-7-15]: Received options response on new connection (140588010976016) from 127.0.29.1:9042
21:43:45,561 66      cassandra.connection           DEBUG    libevreactor.py     :370  | test_request_too_large[SSL-7-15]: No available compression types supported on both ends. locally supported: odict_keys([]). remotely supported: ['lz4', 'snappy']
21:43:45,561 66      cassandra.connection           DEBUG    libevreactor.py     :370  | test_request_too_large[SSL-7-15]: Sending StartupMessage on <LibevConnection(140588010976016) 127.0.29.1:9042>
21:43:45,561 66      cassandra.connection           DEBUG    libevreactor.py     :370  | test_request_too_large[SSL-7-15]: Sent StartupMessage on <LibevConnection(140588010976016) 127.0.29.1:9042>
21:43:45,561 66      cassandra.connection           DEBUG    libevreactor.py     :370  | test_request_too_large[SSL-7-15]: Got ReadyMessage on new connection (140588011015248) from 127.0.29.1:9042
21:43:45,561 66      cassandra.pool                 DEBUG    thread.py           :58   | test_request_too_large[SSL-7-15]: First connection created to 127.0.29.1:9042 for shard_id=0
21:43:45,561 66      cassandra.connection           DEBUG    libevreactor.py     :370  | test_request_too_large[SSL-7-15]: Got ReadyMessage on new connection (140588010976016) from 127.0.29.1:9042
21:43:45,561 66      cassandra.cluster              DEBUG    thread.py           :58   | test_request_too_large[SSL-7-15]: [control connection] Established new connection <LibevConnection(140588010976016) 127.0.29.1:9042>, registering watchers and refreshing schema and topology
21:43:45,562 66      cassandra.pool                 DEBUG    thread.py           :58   | test_request_too_large[SSL-7-15]: Finished initializing connection for host 127.0.29.1:9042
21:43:45,562 66      cassandra.cluster              DEBUG    thread.py           :58   | test_request_too_large[SSL-7-15]: Added pool for host 127.0.29.1:9042 to session
21:43:45,562 66      cassandra.pool                 DEBUG    thread.py           :58   | test_request_too_large[SSL-7-15]: shard_aware_endpoint=None

It looks like there are cases where the driver doesn't reconnect after the connection is closed by Scylla (because of the too-large request) and keeps failing to connect; it's not clear why.

@roydahan @mykaul, can you move this one to scylladb/python-driver ?

fruch commented 9 months ago

After debugging it more on the driver end, it seems to happen as follows:
1) we connect to one shard
2) the connection to the 2nd shard is being opened in the background
3) we send the big request, it goes out on the 2nd connection, and Scylla closes that connection while we are inside set_keyspace_blocking and holding the pool lock

Also, it seems we are issuing a query inside set_keyspace_blocking without any timeout.
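
To make the failure mode concrete, here is a minimal sketch of the hazard; Conn and send_use are hypothetical stand-ins for the driver's internals, not its actual code:

import threading

pool_lock = threading.Lock()

class Conn:
    """Hypothetical stand-in for a driver connection."""
    def __init__(self):
        self.response_ready = threading.Event()

    def send_use(self, keyspace):
        # In the failing scenario Scylla has already closed this socket
        # (it received the oversized request), so no response ever arrives
        # and response_ready is never set.
        pass

def set_keyspace_blocking(conn, keyspace):
    with pool_lock:                    # lock held across the whole round-trip
        conn.send_use(keyspace)
        conn.response_ready.wait()     # no timeout: blocks forever, the pool
                                       # lock is never released, and every
                                       # reconnect attempt stalls behind it

def set_keyspace_bounded(conn, keyspace, timeout=10):
    with pool_lock:
        conn.send_use(keyspace)
        if not conn.response_ready.wait(timeout):   # bounded wait: fail loudly
            raise TimeoutError(f"USE {keyspace} timed out after {timeout}s")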

avikivity commented 9 months ago

Aha, very tricky.

avikivity commented 9 months ago

Any idea why the connection is closed by the server? Of course, the driver should be prepared for it.

fruch commented 9 months ago

Any idea why the connection is closed by the server? Of course, the driver should be prepared for it.

Well, the test is sending a huge request on purpose, but the connection getting stuck here isn't the connection that should be closed because of the too-large request.

It's also not yet clear to me why we are seeing it in the SSL case only, BTW.

My hunch is that we recently started sharing a single ssl_context, so that might be affecting it too.

Anyhow, just putting a timeout on the stuck request fixes the situation. Each request should have some timeout, especially when issued while holding a lock.

I just need to verify it doesn't break the other cases.
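
For reference, the driver's public API already exposes the same idea: Session.execute() takes a per-request timeout in seconds and raises OperationTimedOut when it elapses, instead of waiting indefinitely. This is illustrative usage, not the internal fix itself:

from cassandra.cluster import Cluster
from cassandra import OperationTimedOut

cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

try:
    # Bounded wait: give up after 5 seconds instead of hanging forever.
    session.execute("SELECT release_version FROM system.local", timeout=5)
except OperationTimedOut as exc:
    print(f"request timed out: {exc}")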

fruch commented 9 months ago

And the plot thickens a bit:

we recently updated the SSL protocol used in this test from ssl.PROTOCOL_TLSv1_2 to ssl.PROTOCOL_TLS_CLIENT (which means the latest version both sides support, i.e. TLS 1.3),

and with ssl.PROTOCOL_TLSv1_2 I can't reproduce the issue at all,

so there is something related to the TLS protocol version that causes this change of behavior...
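
For context, a rough sketch of the two SSL configurations being compared; this is an assumed setup, not the actual dtest fixture:

import ssl

# Old setup: protocol pinned to TLS 1.2 -- the hang does not reproduce.
ctx_old = ssl.SSLContext(ssl.PROTOCOL_TLSv1_2)

# New setup: PROTOCOL_TLS_CLIENT negotiates the highest version both sides
# support (TLS 1.3 with a modern OpenSSL) and, unlike the pinned constant,
# enables hostname checking and certificate verification by default.
ctx_new = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
ctx_new.check_hostname = False    # assumption: the test uses self-signed certs
ctx_new.verify_mode = ssl.CERT_NONE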

avikivity commented 9 months ago

@elcallio any idea what's going on?

elcallio commented 9 months ago

Is there a driver version where things work, and one where it does not? Again, I cannot repro this at all. PROTOCOL_TLS_CLIENT should mean we negotiate TLS 1.3, but I don't see how that would impact detection of disconnects or error propagation. IIUC, python SSL is just an OpenSSL wrapper.

fruch commented 9 months ago

Looking at recent Seastar changes, this one might be affecting the connection-closing logic? scylladb/seastar@7164ae143e0d0493a1c8bd91b45e9c0c87ab5deb

bhalevy commented 9 months ago

Cc @xemul

fruch commented 9 months ago

Is there a driver version where things work, and one where it does not? Again, I cannot repro this at all. PROTOCOL_TLS_CLIENT should mean we negotiate TLS 1.3, but I don't see how that would impact detection of disconnects or error propagation. IIUC, python SSL is just an OpenSSL wrapper.

It happens with the latest driver version, 3.26.3, and the latest Scylla master.

It happens in about 1 of 10 runs, like this:

pytest limits_test.py::TestLimits::test_request_too_large -k "[SSL" --scylla-version=unstable/master:latest --log-cli-level=debug -n 2 --count=10

You can play with -n or --count to make it happen a bit more often.

I can easily reproduce it locally, multiple times, this way.

elcallio commented 9 months ago

Looking at recent Seastar changes, this one might be affecting the connection-closing logic? scylladb/seastar@7164ae1

I think that change is safe, unless some code in scylla decided to actually turn off wait-for-eof...

elcallio commented 9 months ago

It happens with the latest driver version, 3.26.3, and the latest Scylla master.

It happens in about 1 of 10 runs, like this:

pytest limits_test.py::TestLimits::test_request_too_large -k "[SSL" --scylla-version=unstable/master:latest --log-cli-level=debug -n 2 --count=10

You can play with -n or --count to make it happen a bit more often.

I can easily reproduce it locally, multiple times, this way.

Running the same, with 10, 20, more iterations... No result whatsoever. Same driver, same Scylla. Weird. Most likely a network setup difference...

fruch commented 9 months ago

It happens with the latest driver version, 3.26.3, and the latest Scylla master.

It happens in about 1 of 10 runs, like this:

pytest limits_test.py::TestLimits::test_request_too_large -k "[SSL" --scylla-version=unstable/master:latest --log-cli-level=debug -n 2 --count=10

You can play with -n or --count to make it happen a bit more often.

I can easily reproduce it locally, multiple times, this way.

Running the same, with 10, 20, more iterations... No result whatsoever. Same driver, same Scylla. Weird. Most likely a network setup difference...

Maybe different OpenSSL versions? Try running with docker.
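
One quick way to check that (stdlib only) is to print the OpenSSL build Python is linked against in each environment:

import ssl

# A difference here between the reproducing and non-reproducing hosts would
# point at the TLS stack rather than the driver logic.
print(ssl.OPENSSL_VERSION)        # e.g. "OpenSSL 3.0.9 30 May 2023"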

elcallio commented 9 months ago

I did. :-) Inside and outside the dtest container.

fruch commented 9 months ago

https://github.com/scylladb/scylla-dtest/pull/3670 is hiding this issue for now, as far as I can see.

TODO: run the test with tcpdump, so we have a working network dump and a failing network dump to compare.

fruch commented 9 months ago

and now with tcpdumps:

aws s3 cp s3://scylla-qa-public/fruch/python-driver-issue-261/log_and_tcpdump.tar.gz .

the failed dump:

logs/failure/ccm_node.crl
logs/failure/ccm_node.cer
logs/failure/ccm_node.pem
logs/failure/ccm_node.key
logs/failure/tcpdump__127.0.29.1_19042.pcap
logs/failure/tcpdump__127.0.29.1_9042.pcap

Two examples of the test passing:

logs/1697665964613_limits_test.py::TestLimits::test_request_too_large[SSL-8-10]/
logs/1697665964613_limits_test.py::TestLimits::test_request_too_large[SSL-8-10]/node1.log
logs/1697665964613_limits_test.py::TestLimits::test_request_too_large[SSL-8-10]/ccm_node.key
logs/1697665964613_limits_test.py::TestLimits::test_request_too_large[SSL-8-10]/ccm_node.crl
logs/1697665964613_limits_test.py::TestLimits::test_request_too_large[SSL-8-10]/ccm_node.cer
logs/1697665964613_limits_test.py::TestLimits::test_request_too_large[SSL-8-10]/tcpdump__127.0.62.1_19042.pcap
logs/1697665964613_limits_test.py::TestLimits::test_request_too_large[SSL-8-10]/node1_system.log.jmx
logs/1697665964613_limits_test.py::TestLimits::test_request_too_large[SSL-8-10]/ccm_node.pem
logs/1697665964613_limits_test.py::TestLimits::test_request_too_large[SSL-8-10]/tcpdump__127.0.62.1_9042.pcap
logs/1697665950976_limits_test.py::TestLimits::test_request_too_large[SSL-10-10]/tcpdump__127.0.62.1_19042.pcap
logs/1697665950976_limits_test.py::TestLimits::test_request_too_large[SSL-10-10]/node1_system.log.jmx
logs/1697665950976_limits_test.py::TestLimits::test_request_too_large[SSL-10-10]/ccm_node.key
logs/1697665950976_limits_test.py::TestLimits::test_request_too_large[SSL-10-10]/ccm_node.pem
logs/1697665950976_limits_test.py::TestLimits::test_request_too_large[SSL-10-10]/ccm_node.cer
logs/1697665950976_limits_test.py::TestLimits::test_request_too_large[SSL-10-10]/ccm_node.crl
logs/1697665950976_limits_test.py::TestLimits::test_request_too_large[SSL-10-10]/tcpdump__127.0.62.1_9042.pcap
logs/1697665950976_limits_test.py::TestLimits::test_request_too_large[SSL-10-10]/node1.log
fruch commented 9 months ago

I figured I'd captured only one direction of the communication, so here's one that captures both directions (similar directory structure):

aws s3 cp s3://scylla-qa-public/fruch/python-driver-issue-261/log_and_tcpdump_2_direction.tar.gz .
fruch commented 9 months ago

It can also be downloaded over HTTPS:

wget https://scylla-qa-public.s3.amazonaws.com/fruch/python-driver-issue-261/log_and_tcpdump_2_direction.tar.gz
fruch commented 1 month ago

@elcallio

did you have a chance to look at those tcpdumps?

mykaul commented 1 month ago

@fruch - did you manage to decode those?

INFO  2023-10-18 21:52:33,782 [shard 1:stat] cql_server - exception while processing connection: std::system_error (error system:107, Transport endpoint is not connected)
WARN  2023-10-18 21:52:33,983 [shard 0:stat] seastar_memory - oversized allocation: 8388608 bytes. This is non-fatal, but could lead to latency and/or fragmentation issues. Please report: at 0x5fa1cbe 0x5fa2280 0x5fa2558 0x5a5ffa6 0x5a602f1 0x5a62c05 0x15668fc 0x27d961c 0x27d6ec8 0x27f5268 0x27cb602 0x2828ae4 0x2828a35 0x283df5f 0x283d8ab 0x283cc31 0x283be2d 0x5a4fd82 0x5aa4edf 0x5aa61b7 0x5aa5529 0x5a47e07 0x5a46fbc 0x1315b0f 0x1317580 0x131405c /home/fruch/.ccm/scylla-repository/unstable/master/latest/libreloc/libc.so.6+0x27b89 /home/fruch/.ccm/scylla-repository/unstable/master/latest/libreloc/libc.so.6+0x27c4a 0x1311ca4
   --------
   seastar::lambda_task<seastar::execution_stage::flush()::$_0>
WARN  2023-10-18 21:52:33,993 [shard 0:stat] seastar_memory - oversized allocation: 1970176 bytes. This is non-fatal, but could lead to latency and/or fragmentation issues. Please report: at 0x5fa1cbe 0x5fa2280 0x5fa2558 0x5a5ffa6 0x5a602f1 0x5a62ff1 0x5284faf 0x53536b8 0x5351dd2 0x52a3494 0x52aef67 0x33e49a5 0x2da2cac 0x339913f 0x27d8bad 0x27d6ec8 0x27f5268 0x27cb602 0x2828ae4 0x2828a35 0x283df5f 0x283d8ab 0x283cc31 0x283be2d 0x5a4fd82 0x5aa4edf 0x5aa61b7 0x5aa5529 0x5a47e07 0x5a46fbc 0x1315b0f 0x1317580 0x131405c /home/fruch/.ccm/scylla-repository/unstable/master/latest/libreloc/libc.so.6+0x27b89 /home/fruch/.ccm/scylla-repository/unstable/master/latest/libreloc/libc.so.6+0x27c4a 0x1311ca4
   --------
   seastar::lambda_task<seastar::execution_stage::flush()::$_0>
WARN  2023-10-18 21:52:33,997 [shard 0:stat] seastar_memory - oversized allocation: 3936256 bytes. This is non-fatal, but could lead to latency and/or fragmentation issues. Please report: at 0x5fa1cbe 0x5fa2280 0x5fa2558 0x5a5ffa6 0x5a602f1 0x5a62ff1 0x5284faf 0x53536b8 0x5351dd2 0x52a3494 0x52aef67 0x33e49a5 0x2da2cac 0x339913f 0x27d8bad 0x27d6ec8 0x27f5268 0x27cb602 0x2828ae4 0x2828a35 0x283df5f 0x283d8ab 0x283cc31 0x283be2d 0x5a4fd82 0x5aa4edf 0x5aa61b7 0x5aa5529 0x5a47e07 0x5a46fbc 0x1315b0f 0x1317580 0x131405c /home/fruch/.ccm/scylla-repository/unstable/master/latest/libreloc/libc.so.6+0x27b89 /home/fruch/.ccm/scylla-repository/unstable/master/latest/libreloc/libc.so.6+0x27c4a 0x1311ca4
   --------
   seastar::lambda_task<seastar::execution_stage::flush()::$_0>
WARN  2023-10-18 21:52:34,003 [shard 0:stat] seastar_memory - oversized allocation: 7868416 bytes. This is non-fatal, but could lead to latency and/or fragmentation issues. Please report: at 0x5fa1cbe 0x5fa2280 0x5fa2558 0x5a5ffa6 0x5a602f1 0x5a62ff1 0x5284faf 0x53536b8 0x5351dd2 0x52a3494 0x52aef67 0x33e49a5 0x2da2cac 0x339913f 0x27d8bad 0x27d6ec8 0x27f5268 0x27cb602 0x2828ae4 0x2828a35 0x283df5f 0x283d8ab 0x283cc31 0x283be2d 0x5a4fd82 0x5aa4edf 0x5aa61b7 0x5aa5529 0x5a47e07 0x5a46fbc 0x1315b0f 0x1317580 0x131405c /home/fruch/.ccm/scylla-repository/unstable/master/latest/libreloc/libc.so.6+0x27b89 /home/fruch/.ccm/scylla-repository/unstable/master/latest/libreloc/libc.so.6+0x27c4a 0x1311ca4
   --------
fruch commented 1 month ago

@fruch - did you manage to decode those?


They are irrelevant; this test intentionally creates huge requests, so those warnings are expected. What is not expected is for the client to get stuck because of them.