k3s-io / kine

Run Kubernetes on MySQL, Postgres, sqlite, dqlite, not etcd.
Apache License 2.0

Make connection pooling adjustable #37

Closed david-becher closed 4 years ago

david-becher commented 4 years ago

This PR should solve the constant error messages (as described here: https://github.com/rancher/k3s/issues/1459) that appear when using a small DB instance, e.g. a db.t2.micro PostgreSQL instance on AWS, which can only handle up to ~80 connections. This happens especially when listing all pods/workloads of the System project, but I expect it will occur on any page that lists a lot of k3s resources.

The error messages also surfaced in the k3s journal logs:

ERRO[2020-05-28T10:17:52.444132093Z] error while range on /registry/configmaps/kube-system/cert-manager-cainjector-leader-election-core : pq: sorry, too many clients already
ERRO[2020-05-28T10:17:52.444589524Z] error while range on /registry/configmaps/kube-system/cert-manager-cainjector-leader-election : pq: sorry, too many clients already
E0528 10:17:52.452075   25763 status.go:71] apiserver received an error that is not an metav1.Status: &status.statusError{Code:2, Message:"pq: sorry, too many clients already", Details:[]*any.Any(nil), XXX_NoUnkeyedLiteral:struct {}{}, XXX_unrecognized:[]uint8(nil), XXX_sizecache:0}

Solution

My PR makes connection pooling, which kine currently does not configure at all, adjustable.

I will also mention this in the linked issue so it can get picked up by k3s as well. Ideally, k3s server would expose command line flags for this, which could look something like the following in production:

| Flag | Default | Description |
| --- | --- | --- |
| --datastore-max-idle-connections | 2 | Maximum number of idle connections retained by the datastore. If num <= 0, no idle connections are retained. |
| --datastore-max-open-connections | 0 (unlimited) | Maximum number of open connections used by the datastore. If num <= 0, there is no limit. |
| --datastore-connection-max-lifetime | 0s | Maximum amount of time a connection may be reused. Defined as a parsable duration string, e.g. 1s, 2500ms, and 1h30m are all accepted values. |

Note: The default values are chosen to reflect the current behavior. sql.DB.maxIdle in database/sql defaults to 2. sql.DB.maxLifetime is never set explicitly in database/sql, so it is 0 when the struct is created; the same is true for sql.DB.maxOpen.
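For context, here is a minimal sketch (not the exact code in this PR) of how these three settings map onto Go's database/sql pool knobs; the ConnectionPoolConfig field names here are illustrative, not necessarily the ones used in the PR:

```go
// Minimal sketch of how the three settings map onto database/sql.
// The ConnectionPoolConfig field names are illustrative, not necessarily
// the ones used in this PR.
package pool

import (
	"database/sql"
	"time"

	_ "github.com/lib/pq" // Postgres driver, as in the logs above
)

type ConnectionPoolConfig struct {
	MaxIdle     int           // <= 0: no idle connections are retained
	MaxOpen     int           // <= 0: unlimited open connections
	MaxLifetime time.Duration // 0: connections may be reused forever
}

func openWithPool(dsn string, cfg ConnectionPoolConfig) (*sql.DB, error) {
	db, err := sql.Open("postgres", dsn)
	if err != nil {
		return nil, err
	}
	db.SetMaxIdleConns(cfg.MaxIdle)        // database/sql default is 2
	db.SetMaxOpenConns(cfg.MaxOpen)        // 0 means unlimited
	db.SetConnMaxLifetime(cfg.MaxLifetime) // 0 means no maximum lifetime
	return db, nil
}
```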

This way, we could use:

k3s server --datastore-max-idle-connections 10 --datastore-max-open-connections 80 --datastore-connection-max-lifetime 10s

I also imagine that using environment variables would be quite nice:


K3S_DATASTORE_MAX_IDLE_CONNECTIONS=10 K3S_DATASTORE_MAX_OPEN_CONNECTIONS=80 K3S_DATASTORE_CONNECTION_MAX_LIFETIME=10s k3s server
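A rough sketch of how such environment variables could be read on the k3s side, reusing the ConnectionPoolConfig struct from the sketch above; the fallback behavior shown here is an assumption for illustration, not something this PR implements:

```go
// Illustrative only, not part of this PR: read the proposed environment
// variables and fall back to the current defaults when they are unset or
// unparsable.
package pool

import (
	"os"
	"strconv"
	"time"
)

// poolConfigFromEnv builds a ConnectionPoolConfig (see the sketch above)
// from the K3S_DATASTORE_* environment variables.
func poolConfigFromEnv() ConnectionPoolConfig {
	cfg := ConnectionPoolConfig{MaxIdle: 2} // matches today's database/sql default
	if v := os.Getenv("K3S_DATASTORE_MAX_IDLE_CONNECTIONS"); v != "" {
		if n, err := strconv.Atoi(v); err == nil {
			cfg.MaxIdle = n
		}
	}
	if v := os.Getenv("K3S_DATASTORE_MAX_OPEN_CONNECTIONS"); v != "" {
		if n, err := strconv.Atoi(v); err == nil {
			cfg.MaxOpen = n
		}
	}
	if v := os.Getenv("K3S_DATASTORE_CONNECTION_MAX_LIFETIME"); v != "" {
		if d, err := time.ParseDuration(v); err == nil {
			cfg.MaxLifetime = d
		}
	}
	return cfg
}
```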

I compiled it into the k3s executable and tested it with various settings (the important part was limiting --datastore-max-open-connections, since with small DBs that is the bottleneck), and I could no longer reproduce any of the mentioned errors in the Rancher UI.

Let me know if the PR is good as-is or needs adjustments :)

I think in most cases this won't even be an issue if you have a DB that can handle lots of connections. Otherwise, reusing idle connections can really help keep the DB's memory usage lower and also allows small-sized DBs to be used by k3s/kine.

Cheers! 🎉

brandond commented 4 years ago

I've done some basic testing of this, and the actual impact seems to have been to increase latency and connection rate. I'm looking at the apiserver's etcd_request_duration_seconds_bucket metric, as well as mysql_global_status_connections from the database server. The comparison here is between git master and master with just this kine change.

[Screenshots: Screen Shot 2020-07-29 at 12 35 52 AM, Screen Shot 2020-07-29 at 12 37 46 AM]
david-becher commented 4 years ago

Oh, that’s unfortunate. Can you give me some indication of which parameters you used to configure the connection pooling? Or did you leave it as is and just use the default values from my proposed changes? Then I would take a look at how to improve the behavior. At least the timeouts no longer occurred for me with the proposed changes (which, however, only seem to show up on small, resource-restricted DBs).

brandond commented 4 years ago

I had just pulled in the updated kine and let it use the default ConnectionPoolConfig without patching it into the server args, which meant that maxIdle was set to 0. I'll finish my hack job and test again with sane defaults. It might be worth preventing maxIdle from being set to less than the default, just in case someone else tries to use kine with the defaults.

brandond commented 4 years ago

After retesting with maxIdle set to 2, performance is exactly the same (as expected). It might be worth putting a lower bound of 2 on that setting, as performance with maxIdle 0 is abysmal.
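A sketch of the kind of lower bound suggested here (illustrative only; whether kine should clamp silently or instead log a warning is a separate design question):

```go
// Illustrative clamp, not the actual kine code: never let maxIdle drop below
// the database/sql default of 2, so that an unset or zero value does not
// disable idle-connection reuse entirely.
func clampMaxIdle(maxIdle int) int {
	const defaultMaxIdle = 2
	if maxIdle < defaultMaxIdle {
		return defaultMaxIdle
	}
	return maxIdle
}
```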