scylladb / scylla-manager

The Scylla Manager
https://manager.docs.scylladb.com/stable/

Number of connections established by Scylla Manager outgrows system capabilities #3769

Closed zimnx closed 3 months ago

zimnx commented 3 months ago

In one of the QA suites for Scylla Operator, a test scheduled 20 ad-hoc repairs for a single cluster, and then Scylla Operator tried to update most of them. Manager wasn't able to validate the tasks because it couldn't connect to healthy Scylla nodes due to a lack of sockets.

2024-03-25T16:52:35.149228320Z {"L":"INFO","T":"2024-03-25T16:52:35.148Z","N":"cluster.client","M":"HTTP retry backoff","operation":"ColumnFamilyMetricsTotalDiskSpaceUsedByNameGet","wait":"1s","error":"dial tcp 172.20.163.164:10001: connect: cannot assign requested address","_trace_id":"rUqZzZC4RhWCQVPNkEdIHg"}

There are a lot of log entries suggesting that a new HTTP client is being created.

2024-03-25T17:07:31.364346000Z {"L":"INFO","T":"2024-03-25T17:07:31.364Z","N":"cluster","M":"Creating new Scylla HTTP client","cluster_id":"b4f10047-8d7f-490c-b829-5f60b5b56a1f","_trace_id":"cHtrJ-8rQlWJl-Rt9CMYJg"}

Looks like an HTTP client/connection leak or an unbounded HTTP client.

Run: https://jenkins.scylladb.com/job/scylla-operator/job/operator-1.12/job/eks/job/longevity-scylla-operator-3h-eks-repair-test/3/
Logs: https://cloudius-jenkins-test.s3.amazonaws.com/b1c45503-3f4f-4197-ae76-359835e2c0b2/20240325_174610/kubernetes-must-gather-b1c45503.tar.gz
Scylla Manager logs are under must-gather/namespaces/scylla-manager/pods/scylla-manager-6f6556fd64-jcrrx

Michal-Leszczynski commented 3 months ago

Even though SM allows only a single repair to run at a time, it allows generating multiple targets at the same time (they are needed for the task validation required for new/updated tasks). Getting a repair target creates a CQL session to the cluster (if credentials are available) and a Scylla client. From the logs it looks like SM didn't have credentials, so CQL sessions shouldn't have been created.

Usually SM caches Scylla clients for a given cluster, so there shouldn't be an issue. The problem is SM's CQL session creation. In order to create a CQL session, SM needs to check nodeInfo for the cql_password_protected field. Changes introduced in b136689d made it so that SM doesn't use the cached client, but always creates a fresh one for fetching the required nodeInfo. So basically, with or without credentials, SM always creates a fresh Scylla client on task creation/update, which could be the ~root cause of this issue~ reason why this issue is visible. The root cause is probably connected to not closing the created clients.
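To illustrate the difference between a cached client and a fresh one per call, here is a minimal Go sketch. It is not Scylla Manager's code: `clientCache`, `newClientCache`, `Get`, and the transport settings are all hypothetical names and values chosen for the example. It only shows the general idea that 20 lookups for the same cluster should construct one HTTP client, not 20.

```go
package main

import (
	"fmt"
	"net/http"
	"sync"
	"time"
)

// clientCache is a hypothetical per-cluster cache of HTTP clients.
// It is an illustration only, not Scylla Manager's actual type.
type clientCache struct {
	mu      sync.Mutex
	clients map[string]*http.Client
	created int // number of clients constructed, tracked for the demo
}

func newClientCache() *clientCache {
	return &clientCache{clients: make(map[string]*http.Client)}
}

// Get returns the cached client for clusterID and constructs one
// only on the first request for that cluster.
func (c *clientCache) Get(clusterID string) *http.Client {
	c.mu.Lock()
	defer c.mu.Unlock()

	if cl, ok := c.clients[clusterID]; ok {
		return cl
	}
	cl := &http.Client{
		Timeout: 30 * time.Second, // assumed value
		Transport: &http.Transport{
			MaxIdleConnsPerHost: 8,                // assumed value
			IdleConnTimeout:     90 * time.Second, // assumed value
		},
	}
	c.clients[clusterID] = cl
	c.created++
	return cl
}

func main() {
	cache := newClientCache()
	// Simulate 20 task validations against the same cluster.
	for i := 0; i < 20; i++ {
		cache.Get("b4f10047-8d7f-490c-b829-5f60b5b56a1f")
	}
	fmt.Println("clients created:", cache.created)
}
```

Because every caller shares one `http.Transport` per cluster, connections are pooled and reused instead of each validation opening its own set of sockets.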

cc: @karol-kokoszka

karol-kokoszka commented 3 months ago

@zimnx @Michal-Leszczynski Indeed, session creation didn't close the client ... argh...