rancher / k3os

Purpose-built OS for Kubernetes, fully managed by Kubernetes.
https://k3os.io
Apache License 2.0

k3os internal dqlite HA unstable "failed to create dqlite connection: no available dqlite leader server found" #454

Open k1n6b0b opened 4 years ago

k1n6b0b commented 4 years ago

Version (k3OS / kernel)

k3os version v0.10.0 5.0.0-43-generic #47~18.04.1 SMP Wed Apr 1 16:27:01 UTC 2020

Architecture

k3os x86_64

3x hosts:

General cloud-config:

hostname: "k3os-1.REDACTED"
k3os:
  data_sources:
  modules:
    - kvm
    - nvme
  sysctl:
    kernel.printk: 4 4 1 7
    kernel.kptr_restrict: 1
  network:
    interfaces:
      eth0:
        dhcp: true
  ntp_servers:
    - ntp.[REDACTED]
  dns_nameservers:
    - xx.xx.xx.xx
  # Rancher User PW
  password: "REDACTED"
  token: "REDACTED"
  labels:
    region: na-us-01
    prupose: testing
  k3s_args:
    - server
    - "-v=3"
    - "--cluster-init"

subsequent hosts are joined with "--server=https://k3os-1.k3s.[REDACTED]:6443"

Describe the bug

I've installed k3os multiple times using the internal HA DB, and each time I have ended up with a corrupted install after deploying Rancher.

Error from server: rpc error: code = Unknown desc = failed to create dqlite connection: no available dqlite leader server found

To Reproduce

  1. Deploy 3x k3os hosts with --cluster-init on the first, joining the other two to the first
  2. Deploy rancher via helm
    kubectl.exe create namespace cattle-system
    kubectl -n cattle-system create secret generic tls-ca --from-file=cacerts.pem
    kubectl -n cattle-system create secret generic tls-ca-additional --from-file=cacerts.pem
    kubectl.exe -n cattle-system create secret tls tls-rancher-ingress --cert=rancher.k3s.[REDACTED] --key=rancher.k3s.[REDACTED]
    helm install rancher rancher-stable/rancher --namespace cattle-system --set hostname=rancher.k3s.[REDACTED] --set ingress.tls.source=secret --set privateCA=true --set additionalTrustedCAs=true
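
A minimal sketch of how the cluster state could be checked right after the helm install, so the point at which the API stops answering is easier to pin down. It uses only standard kubectl subcommands; the `check_rancher` function name, the `KUBECTL` override variable and the 300s timeout are illustrative assumptions, not part of the original report.

```shell
#!/bin/sh
# Sketch only: verify the nodes and the Rancher deployment after the install.
# KUBECTL is a hypothetical override so a different binary (e.g. kubectl.exe)
# can be substituted; it defaults to plain kubectl.
KUBECTL="${KUBECTL:-kubectl}"

check_rancher() {
    # All three servers should be listed and Ready.
    "$KUBECTL" get nodes -o wide || return 1
    # Wait for the Rancher deployment in cattle-system to finish rolling out.
    # The 300s timeout is an arbitrary choice for illustration.
    "$KUBECTL" -n cattle-system rollout status deploy/rancher --timeout=300s
}
```

Running `check_rancher` on each server in turn would show whether the dqlite error appears on all of them or only on some.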

Expected behavior

Actual behavior

k3os-2 [~]$ kubectl cluster-info

To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.
Error from server: rpc error: code = Unknown desc = failed to create dqlite connection: no available dqlite leader server found
level=info msg="Starting k3s v1.17.4+k3s1 (3eee8ac3)"
level=info msg="Cluster bootstrap already complete"
level=info msg="Testing connection to peers [10.10.11.13:6443 10.10.11.11:6443 10.10.11.12:6443]"
level=info msg="Connection OK to peers [10.10.11.13:6443 10.10.11.11:6443 10.10.11.12:6443]"
level=error msg="failed to setup db: not an error"
level=error msg="failed to setup db: not an error"
level=error msg="failed to setup db: not an error"
level=error msg="failed to setup db: not an error"
level=error msg="failed to setup db: not an error"
level=error msg="failed to setup db: not an error"
level=error msg="failed to setup db: not an error"
level=error msg="failed to setup db: not an error"
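
Since the same error repeats many times, a tally of the distinct error messages can be easier to read than the raw log. The following is a rough sketch using standard grep, sort and uniq; the `count_k3s_errors` function name is hypothetical, and the log path is whatever file the k3s output was saved to.

```shell
#!/bin/sh
# Sketch only: count distinct level=error messages in a saved k3s log.
# Usage: count_k3s_errors <path-to-saved-log>
count_k3s_errors() {
    # Extract just the level=error msg="..." part of each matching line,
    # then group identical messages and list the most frequent first.
    grep -o 'level=error msg="[^"]*"' "$1" \
        | sort \
        | uniq -c \
        | sort -rn
}
```

Comparing the tallies from all three hosts would show whether every server hits "failed to setup db" or whether one of them reports something different.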

I'm also receiving a lot of TLS errors (not sure if they are related):

http: TLS handshake error from 10.10.11.12:60032: read tcp 10.10.11.13:6443->10.10.11.12:60032: read: connection reset by peer
http: TLS handshake error from 10.10.11.12:37580: EOF

Additional context

See log files: putty-k3os-1.k3s.REDACTED.log putty-k3os-2.k3s.REDACTED.log putty-k3os-3.k3s.REDACTED.log

2stacks commented 4 years ago

Looks like I'm having the same or a similar issue. I have a 3 node k3os cluster and have installed Rancher with Helm 3. However, I'm using LetsEncrypt for my certificate. I was able to get Rancher up long enough to pull a certificate with cert-manager and was able to log in to the UI. Very shortly after, everything crashed. After rebooting all of my nodes I'm able to access the cluster, but the Rancher pods all fail to start. Some of the errors I'm seeing seem DB related.

K3s Host

E0526 01:28:09.324679    2493 pod_workers.go:191] Error syncing pod 5c449ad0-eff1-4733-a9f8-eab5a4c47e14 ("rancher-66b5cfc7f5-mkq4p_cattle-system(5c449ad0-eff1-4733-a9f8-eab5a4c47e14)"), skipping: failed to "StartContainer" for "rancher" with CrashLoopBackOff: "back-off 5m0s restarting failed container=rancher pod=rancher-66b5cfc7f5-mkq4p_cattle-system(5c449ad0-eff1-4733-a9f8-eab5a4c47e14)"
time="2020-05-26T01:28:14.688340754Z" level=error msg="failed to record compact revision: database is locked"
E0526 01:30:39.121645    5199 pod_workers.go:191] Error syncing pod 66d84ddb-d7df-434c-befa-8aad17c32b2c ("cattle-cluster-agent-8497bbc7cc-dpw4j_cattle-system(66d84ddb-d7df-434c-befa-8aad17c32b2c)"), skipping: failed to "StartContainer" for "cluster-register" with CrashLoopBackOff: "back-off 5m0s restarting failed container=cluster-register pod=cattle-cluster-agent-8497bbc7cc-dpw4j_cattle-system(66d84ddb-d7df-434c-befa-8aad17c32b2c)"
time="2020-05-26T01:30:46.905084419Z" level=error msg="error in txn: database is locked"
E0526 01:30:46.905381    5199 status.go:71] apiserver received an error that is not an metav1.Status: &status.statusError{Code:2, Message:"database is locked", Details:[]*any.Any(nil), XXX_NoUnkeyedLiteral:struct {}{}, XXX_unrecognized:[]uint8(nil), XXX_sizecache:0}
E0526 01:30:46.905965    5199 autoregister_controller.go:194] v1.monitoring.coreos.com failed with : rpc error: code = Unknown desc = database is locked

Rancher Pod

E0526 01:38:22.729275       7 reflector.go:153] github.com/rancher/norman/controller/generic_controller.go:229: Failed to list *v3.ClusterTemplate: the server could not find the requested resource (get clustertemplates.meta.k8s.io)
E0526 01:38:22.731363       7 reflector.go:153] github.com/rancher/norman/controller/generic_controller.go:229: Failed to list *v3.CisBenchmarkVersion: the server could not find the requested resource (get cisbenchmarkversions.meta.k8s.io)
E0526 01:38:22.731887       7 reflector.go:153] github.com/rancher/norman/controller/generic_controller.go:229: Failed to list *v3.CatalogTemplate: the server could not find the requested resource (get catalogtemplates.meta.k8s.io)

k1n6b0b commented 4 years ago

@dweomer How does this get assigned/noticed? I'd love to leverage this platform, but I need it to work 😬 I'm happy to help, debug, provide info -- my skill set isn't in programming, but I can build/test systems

thehedgefrog commented 4 years ago

I'm getting similar instability with a dqlite setup, with k3OS as well as K3s on Ubuntu 18.04, both installed manually and with k3sup. I get varied results but cannot find a stable workaround.

If I install Rancher on one K3s and/or K3os node, it works fine, but then I can't add more nodes.