Michal-Leszczynski closed this issue 2 months ago.
@Michal-Leszczynski yes, it seems that it creates a client per single host update. It could be shared among all hosts. https://github.com/scylladb/scylla-manager/blob/6c5b723b72a5128a6c6cbb4f66f474dbf51396c0/pkg/service/configcache/service.go#L152-L192
^^ It could accept a client and just reuse it.
> @Michal-Leszczynski yes, it seems that it creates a client per single host update. It could be shared among all hosts.
I don't get it. The code pasted above shows that the client is created per cluster and not per host. Or am I missing something?
It's for every host update.
Actually you are right, it's per cluster.
Refinement notes
We just need to run the test and, in the test environment, log information about who triggered the client creation. It can be even a simple stack trace dump:
```go
import (
	"runtime/debug"
)

...

debug.PrintStack()
```
This test scenario has 3 nodes, and we can see 3 almost simultaneous client creations originating from the healthcheck's PingRest method, which uses the cluster svc cached clients:
```go
// Client is the cached ProviderFunc.
func (p *CachedProvider) Client(ctx context.Context, clusterID uuid.UUID) (*Client, error) {
	p.mu.Lock()
	c, ok := p.clients[clusterID]
	p.mu.Unlock()

	// Cache hit
	if ok {
		// Check if hosts did not change before returning
		changed, err := c.client.CheckHostsChanged(ctx)
		if err != nil {
			p.logger.Error(ctx, "Cannot check if hosts changed", "error", err)
		}
		if c.ttl.After(timeutc.Now()) && !changed && err == nil {
			return c.client, nil
		}
	}

	// If not found or hosts changed create a new one
	client, err := p.inner(ctx, clusterID) // <- all calls get here before any of them manages to set the new client
	if err != nil {
		return nil, err
	}

	c = clientTTL{
		client: client,
		ttl:    timeutc.Now().Add(p.validity),
	}

	p.mu.Lock()
	p.clients[clusterID] = c
	p.mu.Unlock()

	return c.client, nil
}
```
The problem is that when there are 3 simultaneous calls to get an invalidated client, all of them result in client creation. Client creation should be done under the mutex, and the other calls should just wait for it to finish and take the already created client.
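A minimal sketch of one way to do that (not the actual SM fix): hold the mutex across the whole recreation path and re-check the cache after acquiring it, so that concurrent callers wait for the first one and then reuse its client. The names `p.mu`, `p.clients`, `p.inner`, `p.validity` and `clientTTL` are taken from the snippet above and the same imports are assumed; the `clientLocked` helper itself is made up for illustration:

```go
// Sketch only: create the client under the mutex so that concurrent callers
// for the same cluster don't all end up calling p.inner. The cache is
// re-checked after the lock is acquired (double-checked locking), because
// another goroutine may have refreshed the entry while we were waiting.
// For simplicity the re-check looks at the TTL only.
func (p *CachedProvider) clientLocked(ctx context.Context, clusterID uuid.UUID) (*Client, error) {
	p.mu.Lock()
	defer p.mu.Unlock()

	if c, ok := p.clients[clusterID]; ok && c.ttl.After(timeutc.Now()) {
		return c.client, nil
	}

	client, err := p.inner(ctx, clusterID)
	if err != nil {
		return nil, err
	}
	p.clients[clusterID] = clientTTL{
		client: client,
		ttl:    timeutc.Now().Add(p.validity),
	}
	return client, nil
}
```

Holding a single mutex during p.inner serializes client creation for all clusters, though; a per-cluster lock or golang.org/x/sync/singleflight keyed by clusterID would avoid that.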
Another problem is that the function used to check whether hosts changed compares the length of c.config.Hosts, which may contain duplicates:
```go
func (c *Client) CheckHostsChanged(ctx context.Context) (bool, error) {
	cur, err := c.hosts(ctx)
	if err != nil {
		return false, err
	}
	if len(cur) != len(c.config.Hosts) { // <- c.config.Hosts may contain duplicates
		return true, err
	}
	return !strset.New(c.config.Hosts...).Has(cur...), nil
}
```
This results in recreating the client, because the cache provider believes that the hosts have changed even though that's not the case.
Example fragment which shows that duplicates are possible:
```go
if c.Host != "" {
	config.Hosts = []string{c.Host}
}
config.Hosts = append(config.Hosts, c.KnownHosts...)
```
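A hedged sketch of a more robust comparison: comparing deduplicated sets on both sides (instead of comparing slice lengths) makes duplicates in c.config.Hosts harmless. It assumes strset is the github.com/scylladb/go-set/strset package already used in CheckHostsChanged; the hostsChanged helper name is made up:

```go
// Sketch only: compare deduplicated host sets instead of raw slice lengths,
// so duplicates in the configured host list don't look like a host change.
func hostsChanged(current, configured []string) bool {
	cur := strset.New(current...)
	cfg := strset.New(configured...)
	// Equal sets have the same size and every current host is configured.
	return cur.Size() != cfg.Size() || !cfg.Has(current...)
}
```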
The last part of the code which creates many clients (that are closed right after) is connected to GetSession:
```go
// GetSession returns CQL session to provided cluster.
func (s *Service) GetSession(ctx context.Context, clusterID uuid.UUID, opts ...SessionConfigOption) (session gocqlx.Session, err error) {
	s.logger.Info(ctx, "Get session", "cluster_id", clusterID)

	client, err := s.CreateClientNoCache(ctx, clusterID)
	if err != nil {
		return session, errors.Wrap(err, "get client")
	}
	defer logutil.LogOnError(ctx, s.logger, client.Close, "Couldn't close scylla client")

	cfg := gocql.NewCluster()
	for _, opt := range opts {
		if err := opt(ctx, clusterID, client, cfg); err != nil {
			return session, err
		}
	}

	// Fill hosts if they weren't specified by the options
	if len(cfg.Hosts) == 0 {
		sessionHosts, err := GetRPCAddresses(ctx, client, client.Config().Hosts)
		if err != nil {
			s.logger.Info(ctx, "Gets session", "err", err)
			if errors.Is(err, ErrNoRPCAddressesFound) {
				return session, err
			}
		}
		cfg.Hosts = sessionHosts
	}

	ni, err := client.AnyNodeInfo(ctx)
```
@karol-kokoszka do you know why we can't use the cached client here? The cached client checks for changed hosts. Also, the session discovers hosts on its own, so it should be fine even if we miss some of them.
> @karol-kokoszka do you know why we can't use the cached client here? The cached client checks for changed hosts. Also, the session discovers hosts on its own, so it should be fine even if we miss some of them.
I wanted to keep the cache available for the healthcheck service only, to first validate that everything is fine before applying it to other services. Let's change and test it in the next release.
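For reference, that change would roughly amount to swapping the client lookup inside GetSession. This is a sketch only; the s.client accessor name is an assumption standing in for whatever returns the CachedProvider-backed client:

```go
// Sketch only: reuse the cached client instead of creating and closing a
// throwaway one on every GetSession call.
client, err := s.client(ctx, clusterID) // hypothetical cached-provider accessor
if err != nil {
	return session, errors.Wrap(err, "get client")
}
// Note: a shared cached client must not be closed here.
```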
These are two fragments of SM logs:
75 node cluster
```
Jun 12 18:46:26 : {"L":"INFO","T":"2024-06-12T18:46:26.061Z","N":"cluster","M":"Creating new Scylla HTTP client","cluster_id":"…
```
12 node cluster
```
Jul 06 07:09:10 {"L":"INFO","T":"2024-07-06T07:09:10.005Z","N":"cluster","M":"Creating new Scylla HTTP client","cluster_id":"…
```
It looks like SM is creating a client per node. My suspicion is that something strange happens with the config cache service.
@karol-kokoszka