scylladb / scylla-cluster-tests

Tests for Scylla Clusters
GNU Affero General Public License v3.0

C-S Not using advanced port-based shard awareness #5853

Open soyacz opened 1 year ago

soyacz commented 1 year ago

Issue description

Recently we have been observing warnings like this in c-s:

===== Using optimized driver!!! =====
Connected to cluster: longevity-5000-tables-2023-1-db-cluster-d89f20f6, max pending requests per connection null, max connections per host 8
Datatacenter: eu-west; Host: /10.4.2.130; Rack: 1a
Datatacenter: eu-west; Host: /10.4.0.228; Rack: 1a
Datatacenter: eu-west; Host: /10.4.2.158; Rack: 1a
Datatacenter: eu-west; Host: /10.4.0.238; Rack: 1a
Datatacenter: eu-west; Host: /10.4.1.23; Rack: 1a
Datatacenter: eu-west; Host: /10.4.0.247; Rack: 1a
Created schema. Sleeping 6s for propagation.
WARN  12:42:59,057 Not using advanced port-based shard awareness with /10.4.1.23:9042 because we're missing port-based shard awareness port on the server
WARN  12:42:59,157 Not using advanced port-based shard awareness with /10.4.2.130:9042 because we're missing port-based shard awareness port on the server
WARN  12:42:59,173 Not using advanced port-based shard awareness with /10.4.0.228:9042 because we're missing port-based shard awareness port on the server
WARN  12:42:59,198 Not using advanced port-based shard awareness with /10.4.2.158:9042 because we're missing port-based shard awareness port on the server
WARN  12:42:59,207 Not using advanced port-based shard awareness with /10.4.0.238:9042 because we're missing port-based shard awareness port on the server
WARN  12:42:59,215 Not using advanced port-based shard awareness with /10.4.0.247:9042 because we're missing port-based shard awareness port on the server
Created extra schema. Sleeping 6s for propagation.

It is visible in all c-s logs, and also in other recent jobs (this one might be hard to investigate as the logs are big).
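The warning suggests the driver did not learn a shard-aware port from the server, so it falls back to reaching every shard through the regular native port (9042). One quick triage step, sketched below, is to probe the advanced shard-aware port from the loader node. The default port of 19042 is an assumption about this setup, and `port_open` is a hypothetical helper, not part of the driver:

```python
import socket

# Hypothetical helper: probe whether a TCP port is reachable. If the
# shard-aware port (assumed default: 19042) is closed or filtered, the
# driver would fall back exactly as the warnings above describe.
def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

From the loader, one would run something like `port_open("10.4.1.23", 19042)` (IP taken from the log above) and compare against `port_open("10.4.1.23", 9042)` to tell a networking problem apart from a configuration one.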

Impact

Possibly worse performance, and an important ScyllaDB feature not being exercised by c-s.

How frequently does it reproduce?

I have been seeing it often recently.

Installation details

Kernel Version: 5.15.0-1028-aws

Scylla version (or git commit hash): 2023.1.0~rc1-20230208.fe3cc281ec73 with build-id ff20df9822b5b6397724a6ff6caadde419b383e6

Cluster size: 1 node (i3.8xlarge)

Scylla Nodes used in this run:

OS / Image: ami-056165f482cc8e0d8 (aws: eu-west-1)

Test: scale-5000-tables-test

Test id: d89f20f6-1ec1-4bca-a7f2-1e8c1d0bacda

Test name: enterprise-2023.1/scale/scale-5000-tables-test

Test config file(s):

Logs and commands

- Restore Monitor Stack command: `$ hydra investigate show-monitor d89f20f6-1ec1-4bca-a7f2-1e8c1d0bacda`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=d89f20f6-1ec1-4bca-a7f2-1e8c1d0bacda)
- Show all stored logs command: `$ hydra investigate show-logs d89f20f6-1ec1-4bca-a7f2-1e8c1d0bacda`

Logs:

- **db-cluster-d89f20f6.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/d89f20f6-1ec1-4bca-a7f2-1e8c1d0bacda/20230222_190116/db-cluster-d89f20f6.tar.gz
- **email_data-d89f20f6.json.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/d89f20f6-1ec1-4bca-a7f2-1e8c1d0bacda/20230222_190116/email_data-d89f20f6.json.tar.gz
- **output-d89f20f6.log.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/d89f20f6-1ec1-4bca-a7f2-1e8c1d0bacda/20230222_190116/output-d89f20f6.log.tar.gz
- **debug-d89f20f6.log.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/d89f20f6-1ec1-4bca-a7f2-1e8c1d0bacda/20230222_190116/debug-d89f20f6.log.tar.gz
- **events-d89f20f6.log.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/d89f20f6-1ec1-4bca-a7f2-1e8c1d0bacda/20230222_190116/events-d89f20f6.log.tar.gz
- **sct-d89f20f6.log.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/d89f20f6-1ec1-4bca-a7f2-1e8c1d0bacda/20230222_190116/sct-d89f20f6.log.tar.gz
- **normal-d89f20f6.log.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/d89f20f6-1ec1-4bca-a7f2-1e8c1d0bacda/20230222_190116/normal-d89f20f6.log.tar.gz
- **argus-d89f20f6.log.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/d89f20f6-1ec1-4bca-a7f2-1e8c1d0bacda/20230222_190116/argus-d89f20f6.log.tar.gz
- **raw_events-d89f20f6.log.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/d89f20f6-1ec1-4bca-a7f2-1e8c1d0bacda/20230222_190116/raw_events-d89f20f6.log.tar.gz
- **critical-d89f20f6.log.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/d89f20f6-1ec1-4bca-a7f2-1e8c1d0bacda/20230222_190116/critical-d89f20f6.log.tar.gz
- **warning-d89f20f6.log.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/d89f20f6-1ec1-4bca-a7f2-1e8c1d0bacda/20230222_190116/warning-d89f20f6.log.tar.gz
- **summary-d89f20f6.log.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/d89f20f6-1ec1-4bca-a7f2-1e8c1d0bacda/20230222_190116/summary-d89f20f6.log.tar.gz
- **left_processes-d89f20f6.log.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/d89f20f6-1ec1-4bca-a7f2-1e8c1d0bacda/20230222_190116/left_processes-d89f20f6.log.tar.gz
- **error-d89f20f6.log.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/d89f20f6-1ec1-4bca-a7f2-1e8c1d0bacda/20230222_190116/error-d89f20f6.log.tar.gz
- **monitor-set-d89f20f6.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/d89f20f6-1ec1-4bca-a7f2-1e8c1d0bacda/20230222_190116/monitor-set-d89f20f6.tar.gz
- **loader-set-d89f20f6.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/d89f20f6-1ec1-4bca-a7f2-1e8c1d0bacda/20230222_190116/loader-set-d89f20f6.tar.gz

[Jenkins job URL](https://jenkins.scylladb.com/job/enterprise-2023.1/job/scale/job/scale-5000-tables-test/2/)
fruch commented 1 year ago

@soyacz this might only affect the speed of connecting to all of the shards of a node; once it's connected there's no effect.

We need to check the configuration of the nodes. I think that for some reason we are not enabling the shard-aware port correctly; it should be enabled by default.
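The configuration check above can be sketched as a scan of scylla.yaml on each node. The key name `native_shard_aware_transport_port` and its default of 19042 are assumptions about the relevant setting, and the helper below is illustrative, not part of SCT:

```python
import re
from typing import Optional

# Hypothetical helper: scan scylla.yaml text for the shard-aware port
# setting. A commented-out line does not count as enabled; None means
# the key is absent and Scylla would use its own default behaviour.
def shard_aware_port(scylla_yaml_text: str) -> Optional[int]:
    for line in scylla_yaml_text.splitlines():
        line = line.strip()
        if line.startswith("#"):  # skip commented-out settings
            continue
        m = re.match(r"native_shard_aware_transport_port:\s*(\d+)", line)
        if m:
            return int(m.group(1))
    return None

sample = """\
native_transport_port: 9042
# native_shard_aware_transport_port: 19042
"""
print(shard_aware_port(sample))  # None: the setting is commented out
```

Running this against the scylla.yaml collected in the db-cluster log archive would show whether the port was explicitly disabled, left at default, or moved to a non-standard value.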

fgelcer commented 1 year ago

could it be because the docker's shard aware port is blocked or not mapped correctly?

fruch commented 1 year ago

> could it be because the docker's shard aware port is blocked or not mapped correctly?

We are using host networking, so we shouldn't have any networking issues; we need to check scylla.yaml.