Closed by zimnx 3 months ago
Even though SM allows only a single repair to run at a time, it can generate multiple repair targets concurrently (they are needed for the task validation performed on new/updated tasks). Getting a repair target creates a CQL session to the cluster (if credentials are available) and a Scylla client. From the logs it looks like SM didn't have credentials, so CQL sessions shouldn't have been created.
Usually SM caches Scylla clients per cluster, so this shouldn't be an issue. The problem lies in SM's CQL session creation: to create a CQL session, SM needs to check nodeInfo for the `cql_password_protected` field. Changes introduced in b136689d made SM stop using the cached client and instead always create a fresh one to fetch the required nodeInfo. So, with or without credentials, SM always creates a fresh Scylla client on task creation/update, which could be the ~~root cause of this issue~~ reason why this issue is visible. The root cause is probably that the created clients are never closed.
cc: @karol-kokoszka
@zimnx @Michal-Leszczynski Indeed, session creation didn't close the client ... argh...
In one of the QA suites for Scylla Operator, a test scheduled 20 ad-hoc repairs for a single cluster, and then Scylla Operator tried to update most of them. Manager wasn't able to validate the tasks, as it couldn't connect to healthy Scylla nodes due to a lack of sockets.
There are a lot of log entries suggesting that a new HTTP client is created each time. It looks like an HTTP client/connection leak, or an unbounded number of HTTP clients.
Run: https://jenkins.scylladb.com/job/scylla-operator/job/operator-1.12/job/eks/job/longevity-scylla-operator-3h-eks-repair-test/3/
Logs: https://cloudius-jenkins-test.s3.amazonaws.com/b1c45503-3f4f-4197-ae76-359835e2c0b2/20240325_174610/kubernetes-must-gather-b1c45503.tar.gz

Scylla Manager logs are under `must-gather/namespaces/scylla-manager/pods/scylla-manager-6f6556fd64-jcrrx`.