k3os internal dqlite HA unstable "failed to create dqlite connection: no available dqlite leader server found"

k1n6b0b commented 4 years ago

Version (k3OS / kernel)

k3os version v0.10.0 5.0.0-43-generic #47~18.04.1 SMP Wed Apr 1 16:27:01 UTC 2020

Architecture

k3os x86_64

3x hosts:

k3os-1
k3os-2
k3os-3

General cloud-config:

hostname: "k3os-1.REDACTED"
k3os:
  data_sources:
  modules:
    - kvm
    - nvme
  sysctl:
    kernel.printk: 4 4 1 7
    kernel.kptr_restrict: 1
  network:
    interfaces:
      eth0:
         dhcp: true
  ntp_servers:
    - ntp.[REDACTED]
  dns_nameservers:
    - xx.xx.xx.xx
  # Rancher User PW
  password: "REDACTED"
  token: "REDACTED"
  labels:
    region: na-us-01
    purpose: testing
  k3s_args:
    - server
    - "-v=3"
    - "--cluster-init"

Subsequent hosts (k3os-2 and k3os-3) are joined with "--server=https://k3os-1.k3s.[REDACTED]:6443" in place of "--cluster-init".
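
For illustration, the config on a joining host might look roughly like the fragment below. This is a sketch, not the actual file from the affected hosts: it assumes the joining hosts use the same shared token as k3os-1, that only the hostname and k3s_args differ, and that all other keys match the general cloud-config above (they are omitted here).

```yaml
hostname: "k3os-2.REDACTED"   # assumed; k3os-3 would differ only in this value
k3os:
  # Assumed to be the same shared token that k3os-1 was started with
  token: "REDACTED"
  k3s_args:
    - server
    - "-v=3"
    # Join the existing cluster rather than initializing a new one
    - "--server=https://k3os-1.k3s.[REDACTED]:6443"
```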

Describe the bug

I've installed k3os multiple times using the internal HA database, and each time I have ended up with a corrupted install after deploying Rancher.

Error from server: rpc error: code = Unknown desc = failed to create dqlite connection: no available dqlite leader server found

To Reproduce

Deploy 3x k3os hosts with --cluster-init on the first, joining the other two to the first

Deploy rancher via helm

kubectl.exe create namespace cattle-system
kubectl -n cattle-system create secret generic tls-ca --from-file=cacerts.pem
kubectl -n cattle-system create secret generic tls-ca-additional --from-file=cacerts.pem
kubectl.exe -n cattle-system create secret tls tls-rancher-ingress --cert=rancher.k3s.[REDACTED] --key=rancher.k3s.[REDACTED]
helm install rancher rancher-stable/rancher --namespace cattle-system --set hostname=rancher.k3s.[REDACTED] --set ingress.tls.source=secret --set privateCA=true --set additionalTrustedCAs=true

Expected behavior

Rancher and k3s to stay stable

Actual behavior

Rancher starts and appears to deploy successfully
Unable to proceed; the cluster becomes unstable (tried deploying syslog configs, or making any changes)
dqlite blows up, I think. See logs
Rebooting any of the hosts does not result in a successful failover
All hosts eventually fall into this "cannot connect to db" state

k3os-2 [~]$ kubectl cluster-info

To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.
Error from server: rpc error: code = Unknown desc = failed to create dqlite connection: no available dqlite leader server found

level=info msg="Starting k3s v1.17.4+k3s1 (3eee8ac3)"
level=info msg="Cluster bootstrap already complete"
level=info msg="Testing connection to peers [10.10.11.13:6443 10.10.11.11:6443 10.10.11.12:6443]"
level=info msg="Connection OK to peers [10.10.11.13:6443 10.10.11.11:6443 10.10.11.12:6443]"
level=error msg="failed to setup db: not an error"
level=error msg="failed to setup db: not an error"
level=error msg="failed to setup db: not an error"
level=error msg="failed to setup db: not an error"
level=error msg="failed to setup db: not an error"
level=error msg="failed to setup db: not an error"
level=error msg="failed to setup db: not an error"
level=error msg="failed to setup db: not an error"

Also receiving a lot of TLS errors (Not sure if they are related)

http: TLS handshake error from 10.10.11.12:60032: read tcp 10.10.11.13:6443->10.10.11.12:60032: read: connection reset by peer
http: TLS handshake error from 10.10.11.12:37580: EOF

Additional context

See log files: putty-k3os-1.k3s.REDACTED.log putty-k3os-2.k3s.REDACTED.log putty-k3os-3.k3s.REDACTED.log

2stacks commented 4 years ago

Looks like I'm having the same or a similar issue. I have a 3-node k3os cluster and have installed Rancher with Helm 3. However, I'm using Let's Encrypt for my certificate. I was able to get Rancher up long enough to pull a certificate with cert-manager and was able to log in to the UI. Very shortly after, everything crashed. After rebooting all of my nodes I'm able to access the cluster, but the Rancher pods all fail to start. Some of the errors I'm seeing seem DB related.

K3s Host

E0526 01:28:09.324679    2493 pod_workers.go:191] Error syncing pod 5c449ad0-eff1-4733-a9f8-eab5a4c47e14 ("rancher-66b5cfc7f5-mkq4p_cattle-system(5c449ad0-eff1-4733-a9f8-eab5a4c47e14)"), skipping: failed to "StartContainer" for "rancher" with CrashLoopBackOff: "back-off 5m0s restarting failed container=rancher pod=rancher-66b5cfc7f5-mkq4p_cattle-system(5c449ad0-eff1-4733-a9f8-eab5a4c47e14)"
time="2020-05-26T01:28:14.688340754Z" level=error msg="failed to record compact revision: database is locked"
E0526 01:30:39.121645    5199 pod_workers.go:191] Error syncing pod 66d84ddb-d7df-434c-befa-8aad17c32b2c ("cattle-cluster-agent-8497bbc7cc-dpw4j_cattle-system(66d84ddb-d7df-434c-befa-8aad17c32b2c)"), skipping: failed to "StartContainer" for "cluster-register" with CrashLoopBackOff: "back-off 5m0s restarting failed container=cluster-register pod=cattle-cluster-agent-8497bbc7cc-dpw4j_cattle-system(66d84ddb-d7df-434c-befa-8aad17c32b2c)"
time="2020-05-26T01:30:46.905084419Z" level=error msg="error in txn: database is locked"
E0526 01:30:46.905381    5199 status.go:71] apiserver received an error that is not an metav1.Status: &status.statusError{Code:2, Message:"database is locked", Details:[]*any.Any(nil), XXX_NoUnkeyedLiteral:struct {}{}, XXX_unrecognized:[]uint8(nil), XXX_sizecache:0}
E0526 01:30:46.905965    5199 autoregister_controller.go:194] v1.monitoring.coreos.com failed with : rpc error: code = Unknown desc = database is locked

Rancher Pod

E0526 01:38:22.729275       7 reflector.go:153] github.com/rancher/norman/controller/generic_controller.go:229: Failed to list *v3.ClusterTemplate: the server could not find the requested resource (get clustertemplates.meta.k8s.io)
E0526 01:38:22.731363       7 reflector.go:153] github.com/rancher/norman/controller/generic_controller.go:229: Failed to list *v3.CisBenchmarkVersion: the server could not find the requested resource (get cisbenchmarkversions.meta.k8s.io)
E0526 01:38:22.731887       7 reflector.go:153] github.com/rancher/norman/controller/generic_controller.go:229: Failed to list *v3.CatalogTemplate: the server could not find the requested resource (get catalogtemplates.meta.k8s.io)

k1n6b0b commented 4 years ago

@dweomer How does this get assigned/noticed? I'd love to leverage this platform, but I need it to work 😬 I'm happy to help, debug, provide info -- my skillsets aren't in programming, but I can build/test systems

thehedgefrog commented 4 years ago

Getting similar instability with a dqlite setup, with K3os as well as K3s on Ubuntu 18.04, both installed manually and with k3sup. Getting varied results, but I cannot find a stable workaround.

On K3os, I am able to set up a cluster and install Rancher, but it never deploys 3/3.
On Ubuntu, I get "database is locked" messages for cert-manager and Rancher, and cannot get a working install.
At one point I got Rancher working but it was prompting me for my password while I hadn't set it yet.

If I install Rancher on one K3s and/or K3os node, it works fine, but then I can't add more nodes.

rancher / k3os

k3os internal dqlite HA unstable "failed to create dqlite connection: no available dqlite leader server found" #454