Queries failing due to `no hosts available in the pool`.

ouamer-dahmani commented 3 weeks ago

Hello,

I am encountering issues where queries are not being retried despite a retry policy being configured when creating a new Cluster object.

Reads and writes work fine, but at some point some of them start failing with `gocql: no hosts available in the pool`. Delving into the code, I see that it should indeed retry the queries (I forced a query execution error in the debugger).

I then added logging to the cluster:

cluster.Logger = logger
cluster.QueryObserver = logger
cluster.BatchObserver = logger
cluster.ConnectObserver = logger

The logger gets called for queries that succeed but never for those that fail. I wonder whether that is because the queries are not even run once, due to there being no hosts in the connection pool?

I sometimes see connection events before the failures (anywhere from a few milliseconds to a few minutes earlier), but that is not always the case, and they are not error logs either. For example: `Connect: Dial Duration: 5.383348ms, Host: 10.173.92.242`

I know that the network on my Kubernetes cluster is a bit flaky sometimes, but I assume this should be handled gracefully by reconnections in the connection pool and retries on the queries.

I am running version v1.13.0 of the driver. I see that the v1.14.X releases have changes around connections, but I am unsure whether they are related to the issues I am having, and I have held off on updating because I have not had time to test it.

dkropachev commented 3 weeks ago

Could you please provide your ClusterConfig, including the HostSelectionPolicy and retry policy?

ouamer-dahmani commented 3 weeks ago

Hello!

It is equivalent to the following. I used high values to see whether they would help ride out the potential instability.

cluster := gocql.NewCluster(cfg.Hosts...)
cluster.Keyspace = cfg.Keyspace
cluster.Timeout = 5 * time.Second
cluster.RetryPolicy = &gocql.ExponentialBackoffRetryPolicy{
    Min:        500 * time.Millisecond,
    Max:        5 * time.Second,
    NumRetries: 5,
}
cluster.Consistency = gocql.LocalQuorum
cluster.Authenticator = cfg.Authenticator
cluster.PoolConfig.HostSelectionPolicy = gocql.RoundRobinHostPolicy()
cluster.DisableInitialHostLookup = false
cluster.DisableShardAwarePort = true

dkropachev commented 3 weeks ago

@ouamer-dahmani, what most likely happens is this:

Due to the unstable connection, the driver loses its connections to all nodes at some point.
When that happens, the executor does not even get to the retry policy. It just iterates over the hosts provided by RoundRobinHostPolicy, looking for one that has connections and could be used to execute the query. Since it finds no such host, it ends up returning &Iter{err: ErrNoConnections}

It works the same way on recent versions as well, so you can't fix it by upgrading the driver. I would suggest manually retrying on this error until we fix the retry logic.

I am closing this issue in favor of https://github.com/scylladb/gocql/issues/326, but feel free to continue the discussion here if it is related to this case.

scylladb / gocql

Queries failing due to `no hosts available in the pool`. #325