scylladb / scylla-manager

The Scylla Manager
https://manager.docs.scylladb.com/stable/
Other
48 stars 33 forks source link

sctool takes more time (~3min) then before to retrieve manager-server version after upgrade #3849

Closed mikliapko closed 1 month ago

mikliapko commented 1 month ago

Steps:

  1. Manager 3.1.2 is installed.
  2. Upgrade manager server (to 3.2.7-0.20240513.769fc4f1a-SNAPSHOT)
  3. Upgrade and start manager agents
  4. Start manager server systemctl start scylla-manager
  5. Sleep 30 seconds and get manager version via sctool version

Actual result: Error Error: Get "http://127.0.0.1:12345/api/v1/version": dial tcp 127.0.0.1:12345: connect: connection refused while getting the version of client.

11:54:38  sdcm.mgmt.common.ScyllaManagerError: Encountered an error on sctool command: version: Encountered a bad command exit code!
11:54:38  Command: 'sudo sctool version'
11:54:38  Exit code: 1
11:54:38  Stdout:
11:54:38  Client version: 3.2.7-0.20240509.769fc4f1a
11:54:38  Stderr:
11:54:38  Error: Get "http://127.0.0.1:12345/api/v1/version": dial tcp 127.0.0.1:12345: connect: connection refused

Expected result: In older versions it was enough to sleep 30 seconds before running sctool version to get the versions. So, expected the same behavior here as well.

Additional info:

Michal-Leszczynski commented 1 month ago

From SM logs:

May 13 09:51:39  {"L":"INFO","T":"2024-05-13T09:51:39.603Z","M":"Schema up to date","keyspace":"scylla_manager","_trace_id":"GY1OgBL-SwWHDGF5bnfDhw"}
May 13 09:51:39  {"L":"INFO","T":"2024-05-13T09:51:39.627Z","M":"Initializing the clusters configuration cache","_trace_id":"GY1OgBL-SwWHDGF5bnfDhw"}
May 13 09:51:39  {"L":"INFO","T":"2024-05-13T09:51:39.632Z","N":"cluster","M":"Creating new Scylla HTTP client","cluster_id":"c0e3ca23-81c5-4275-a1d6-9e8a94f305b2","_trace_id":"GY1OgBL-SwWHDGF5bnfDhw"}
May 13 09:51:39  {"L":"INFO","T":"2024-05-13T09:51:39.639Z","N":"cluster.client","M":"HTTP retry backoff","operation":"StorageServiceHostIdGet","wait":"927.211008ms","error":"dial tcp :10001: connect: connection refused","_trace_id":"GY1OgBL-SwWHDGF5bnfDhw"}
May 13 09:51:40  {"L":"INFO","T":"2024-05-13T09:51:40.567Z","N":"cluster.client","M":"HTTP retry backoff","operation":"StorageServiceHostIdGet","wait":"2.000883071s","error":"dial tcp :10001: connect: connection refused","_trace_id":"GY1OgBL-SwWHDGF5bnfDhw"}
May 13 09:51:42  {"L":"INFO","T":"2024-05-13T09:51:42.569Z","N":"cluster.client","M":"HTTP retry backoff","operation":"StorageServiceHostIdGet","wait":"4.495824842s","error":"dial tcp :10001: connect: connection refused","_trace_id":"GY1OgBL-SwWHDGF5bnfDhw"}
May 13 09:51:47  {"L":"INFO","T":"2024-05-13T09:51:47.066Z","N":"cluster.client","M":"HTTP retry backoff","operation":"StorageServiceHostIdGet","wait":"7.07442966s","error":"dial tcp :10001: connect: connection refused","_trace_id":"GY1OgBL-SwWHDGF5bnfDhw"}
May 13 09:51:54  {"L":"INFO","T":"2024-05-13T09:51:54.142Z","N":"cluster.client","M":"HTTP retry backoff","operation":"StorageServiceHostIdGet","wait":"13.149685425s","error":"dial tcp :10001: connect: connection refused","_trace_id":"GY1OgBL-SwWHDGF5bnfDhw"}
May 13 09:52:07  {"L":"INFO","T":"2024-05-13T09:52:07.293Z","N":"cluster.client","M":"HTTP retry backoff","operation":"StorageServiceHostIdGet","wait":"32.521097873s","error":"dial tcp :10001: connect: connection refused","_trace_id":"GY1OgBL-SwWHDGF5bnfDhw"}
May 13 09:52:39  {"L":"INFO","T":"2024-05-13T09:52:39.816Z","N":"cluster.client","M":"HTTP retry backoff","operation":"StorageServiceHostIdGet","wait":"35.929166916s","error":"dial tcp :10001: connect: connection refused","_trace_id":"GY1OgBL-SwWHDGF5bnfDhw"}
May 13 09:53:15  {"L":"INFO","T":"2024-05-13T09:53:15.745Z","N":"cluster.client","M":"HTTP retry backoff","operation":"StorageServiceHostIdGet","wait":"31.28088056s","error":"dial tcp :10001: connect: connection refused","_trace_id":"GY1OgBL-SwWHDGF5bnfDhw"}
May 13 09:53:47  {"L":"INFO","T":"2024-05-13T09:53:47.027Z","N":"cluster.client","M":"HTTP retry backoff","operation":"StorageServiceHostIdGet","wait":"27.507559245s","error":"dial tcp :10001: connect: connection refused","_trace_id":"GY1OgBL-SwWHDGF5bnfDhw"}
May 13 09:54:14  {"L":"ERROR","T":"2024-05-13T09:54:14.537Z","N":"cluster","M":"Cannot find known hosts using coordinator host","host":"","error":"giving up after 10 attempts: dial tcp :10001: connect: connection refused","_trace_id":"GY1OgBL-SwWHDGF5bnfDhw","errorStack":"github.com/scylladb/scylla-manager/v3/pkg/scyllaclient.(*retryableOperation).submit\n\tgithub.com/scylladb/scylla-manager/v3/pkg/scyllaclient/retry.go:177\ngithub.com/scylladb/scylla-manager/v3/pkg/scyllaclient.retryableTransport.Submit\n\tgithub.com/scylladb/scylla-manager/v3/pkg/scyllaclient/retry.go:155\ngithub.com/scylladb/scylla-manager/v3/swagger/gen/scylla/v1/client/operations.(*Client).StorageServiceHostIDGet\n\tgithub.com/scylladb/scylla-manager/v3/swagger/gen/scylla/v1/client/operations/operations_client.go:11791\ngithub.com/scylladb/scylla-manager/v3/pkg/scyllaclient.(*Client).Datacenters\n\tgithub.com/scylladb/scylla-manager/v3/pkg/scyllaclient/client_scylla.go:161\ngithub.com/scylladb/scylla-manager/v3/pkg/service/cluster.(*Service).discoverHosts\n\tgithub.com/scylladb/scylla-manager/v3/pkg/service/cluster/service.go:188\ngithub.com/scylladb/scylla-manager/v3/pkg/service/cluster.(*Service).CreateClientNoCache\n\tgithub.com/scylladb/scylla-manager/v3/pkg/service/cluster/service.go:162\ngithub.com/scylladb/scylla-manager/v3/pkg/service/configcache.(*Service).updateSingle\n\tgithub.com/scylladb/scylla-manager/v3/pkg/service/configcache/service.go:131\ngithub.com/scylladb/scylla-manager/v3/pkg/service/configcache.(*Service).updateAll.func1\n\tgithub.com/scylladb/scylla-manager/v3/pkg/service/configcache/service.go:182\nruntime.goexit\n\truntime/asm_amd64.s:1695\n","S":"github.com/scylladb/go-log.Logger.log\n\tgithub.com/scylladb/go-log@v0.0.7/logger.go:101\ngithub.com/scylladb/go-log.Logger.Error\n\tgithub.com/scylladb/go-log@v0.0.7/logger.go:84\ngithub.com/scylladb/scylla-manager/v3/pkg/service/cluster.(*Service).CreateClientNoCache\n\tgithub.com/scylladb/scylla-manager/v3/pkg/service/cluster/service.go:164\ngithub.com/scylladb/scylla-manager/v3/pkg/service/configcache.(*Service).updateSingle\n\tgithub.com/scylladb/scylla-manager/v3/pkg/service/configcache/service.go:131\ngithub.com/scylladb/scylla-manager/v3/pkg/service/configcache.(*Service).updateAll.func1\n\tgithub.com/scylladb/scylla-manager/v3/pkg/service/configcache/service.go:182"}
May 13 09:54:14  {"L":"INFO","T":"2024-05-13T09:54:14.541Z","N":"cluster.client","M":"Measuring datacenter latencies","dcs":["us-east"],"_trace_id":"GY1OgBL-SwWHDGF5bnfDhw"}
May 13 09:54:14  {"L":"INFO","T":"2024-05-13T09:54:14.560Z","M":"Clusters config cache initialized","_trace_id":"GY1OgBL-SwWHDGF5bnfDhw"}

As it can be seen above, SM tries to connect to the host represented by an empty string "". It comes from:

func (s *Service) CreateClientNoCache(ctx context.Context, clusterID uuid.UUID) (*scyllaclient.Client, error) {
    s.logger.Info(ctx, "Creating new Scylla HTTP client", "cluster_id", clusterID)

    c, err := s.GetClusterByID(ctx, clusterID)
    if err != nil {
        return nil, err
    }

    config := scyllaclient.DefaultConfigWithTimeout(s.timeoutConfig)
    if c.Port != 0 {
        config.Port = fmt.Sprint(c.Port)
    }
    config.AuthToken = c.AuthToken
    config.Hosts = append([]string{c.Host}, c.KnownHosts...) // <- bug

This can't happen with clusters added to SM 3.2, as it persists the --host flag value in DB. That's not the case for SM 3.1, where only obtained KnownHosts were saved and --host discarded. This results in SM trying to connect to an empty host first, which slows down the process of creating scylla client and creates lots of error logs. The client is eventually created when SM moves to connecting to KnowHosts, so this is not a correctness issue.