k3s-io / k3s

Lightweight Kubernetes
https://k3s.io
Apache License 2.0

Upgrade to v1.19.3+k3s2 killed master. #2477

Closed: dragon2611 closed this issue 4 years ago

dragon2611 commented 4 years ago

Environmental Info: k3s version v1.19.3+k3s2 (f8a4547b)

Node(s) CPU architecture, OS, and Version: 5.3.0-1038-gke (Ubuntu 18.04)

Cluster Configuration: 1 master (Postgres), 1 worker; networking replaced with Calico

Describe the bug: Crashlooping after upgrading from v1.19.3+k3s1 to v1.19.3+k3s2

Steps To Reproduce: Upgraded k3s from v1.19.3+k3s1 to v1.19.3+k3s2 (f8a4547b)

Expected behavior: Cluster starts after the upgrade

Actual behavior: The server appears to be crashlooping.

Additional context / logs: Lots of errors like the following:

Note: system:node:<ipremoved>.in-addr.arpa is the master node itself.

Nov 5 17:44:27 14 k3s[16002]: E1105 17:44:27.754208 16002 reflector.go:127] object-"default"/"reflector-token-rhn9x": Failed to watch *v1.Secret: failed to list *v1.Secret: secrets "reflector-token-rhn9x" is forbidden: User "system:node:<ipremoved>.in-addr.arpa" cannot list resource "secrets" in API group "" in the namespace "default": no relationship found between node '<ipremoved>.in-addr.arpa' and this object

I did manage to stabilise things by starting k3s with --disable-agent, then deleting the master node (kubectl delete node) and re-adding it to the cluster (my master runs very little).
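Roughly, that recovery looks like the following. This is just a sketch assuming a standard systemd install of k3s, with <master-node> as a placeholder for the node's registered name:

# 1. Add --disable-agent to the k3s server arguments (e.g. in the systemd unit)
#    so the control plane can come up without the local kubelet, then restart:
sudo systemctl restart k3s

# 2. Delete the stale registration for the master node:
kubectl delete node <master-node>

# 3. If the master should run workloads again, remove --disable-agent and
#    restart so the node re-registers:
sudo systemctl restart k3s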

Edit:

I'm actually considering that maybe the master should not run the agent at all and solely be responsible for managing the worker.

dragon2611 commented 4 years ago

Not sure if this helps. I also have the syslog, but it's rather large and I don't really want to post it to GitHub; happy to share it with one of the main contributors if it helps. K3s seemed to log a huge amount to /var/log/syslog, including what I think might be a rather long stack trace.

Nov 5 17:02:10 14 k3s[659]: F1105 17:02:10.619226 659 controllermanager.go:262] Failed to wait for apiserver being healthy: timed out waiting for the condition: failed to get apiserver /healthz status: an error on the server ("[+]ping ok\n[+]log ok\n[+]etcd ok\n[+]poststarthook/start-kube-apiserver-admission-initializer ok\n[+]poststarthook/generic-apiserver-start-informers ok\n[+]poststarthook/start-apiextensions-informers ok\n[-]poststarthook/start-apiextensions-controllers failed: reason withheld\n[-]poststarthook/crd-informer-synced failed: reason withheld\n[+]poststarthook/bootstrap-controller ok\n[+]poststarthook/rbac/bootstrap-roles ok\n[+]poststarthook/scheduling/bootstrap-system-priority-classes ok\n[+]poststarthook/start-cluster-authentication-info-controller ok\n[+]poststarthook/aggregator-reload-proxy-client-cert ok\n[+]poststarthook/start-kube-aggregator-informers ok\n[+]poststarthook/apiservice-registration-controller ok\n[+]poststarthook/apiservice-status-available-controller ok\n[+]poststarthook/kube-apiserver-autoregistration ok\n[+]autoregister-completion ok\n[+]poststarthook/apiservice-openapi-controller ok\nhealthz check failed") has prevented the request from succeeding
Nov 5 17:02:11 14 k3s[659]: goroutine 5482 [running]:
Nov 5 17:02:11 14 k3s[659]: github.com/rancher/k3s/vendor/k8s.io/klog/v2.stacks(0xc000120001, 0xc01297c000, 0x4a2, 0x4e9)
Nov 5 17:02:11 14 k3s[659]: #011/go/src/github.com/rancher/k3s/vendor/k8s.io/klog/v2/klog.go:996 +0xb9
Nov 5 17:02:11 14 k3s[659]: github.com/rancher/k3s/vendor/k8s.io/klog/v2.(*loggingT).output(0x6ea13a0, 0xc000000003, 0x0, 0x0, 0xc0034f89a0, 0x6b26164, 0x14, 0x106, 0x0)
Nov 5 17:02:11 14 k3s[659]: #011/go/src/github.com/rancher/k3s/vendor/k8s.io/klog/v2/klog.go:945 +0x191
Nov 5 17:02:11 14 k3s[659]: github.com/rancher/k3s/vendor/k8s.io/klog/v2.(*loggingT).printf(0x6ea13a0, 0xc000000003, 0x0, 0x0, 0x456a9f0, 0x2e, 0xc015d077b0, 0x1, 0x1)
Nov 5 17:02:11 14 k3s[659]: #011/go/src/github.com/rancher/k3s/vendor/k8s.io/klog/v2/klog.go:733 +0x17a
Nov 5 17:02:11 14 k3s[659]: github.com/rancher/k3s/vendor/k8s.io/klog/v2.Fatalf(...)

brandond commented 4 years ago

The error embedded in there indicates that one of the core controllers has failed; the full logs might show why. I appreciate you not just pasting them all into the issue text - can you attach them instead?

[+]ping ok
[+]log ok
[+]etcd ok
[+]poststarthook/start-kube-apiserver-admission-initializer ok
[+]poststarthook/generic-apiserver-start-informers ok
[+]poststarthook/start-apiextensions-informers ok
[-]poststarthook/start-apiextensions-controllers failed: reason withheld
[-]poststarthook/crd-informer-synced failed: reason withheld
[+]poststarthook/bootstrap-controller ok
[+]poststarthook/rbac/bootstrap-roles ok
[+]poststarthook/scheduling/bootstrap-system-priority-classes ok
[+]poststarthook/start-cluster-authentication-info-controller ok
[+]poststarthook/aggregator-reload-proxy-client-cert ok
[+]poststarthook/start-kube-aggregator-informers ok
[+]poststarthook/apiservice-registration-controller ok
[+]poststarthook/apiservice-status-available-controller ok
[+]poststarthook/kube-apiserver-autoregistration ok
[+]autoregister-completion ok
[+]poststarthook/apiservice-openapi-controller ok
healthz check failed

FWIW, a similar error was reported in #2425 on v1.19.2+k3s1, so I don't think the issue is necessarily related to the upgrade. Logs from that issue had:

[+]ping ok
[+]log ok
[+]etcd ok
[+]poststarthook/start-kube-apiserver-admission-initializer ok
[+]poststarthook/generic-apiserver-start-informers ok
[+]poststarthook/start-apiextensions-informers ok
[+]poststarthook/start-apiextensions-controllers ok
[+]poststarthook/crd-informer-synced ok
[+]poststarthook/bootstrap-controller ok
[-]poststarthook/rbac/bootstrap-roles failed: reason withheld
[+]poststarthook/scheduling/bootstrap-system-priority-classes ok
[+]poststarthook/start-cluster-authentication-info-controller ok
[+]poststarthook/aggregator-reload-proxy-client-cert ok
[+]poststarthook/start-kube-aggregator-informers ok
[+]poststarthook/apiservice-registration-controller ok
[+]poststarthook/apiservice-status-available-controller ok
[+]poststarthook/kube-apiserver-autoregistration ok
[+]autoregister-completion ok
[+]poststarthook/apiservice-openapi-controller ok
healthz check failed

dragon2611 commented 4 years ago

Is it likely to have logged anywhere else other than syslog? I'm not overly keen on posting the entire syslog file to a public post, just in case anything sensitive made its way in there (ideally it shouldn't have, of course).

brandond commented 4 years ago

On systemd-based systems you can get it from the journal: journalctl -u k3s --no-pager
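If you want to scrub addresses before attaching it, something along these lines should work (just a sketch; the sed pattern is illustrative and only masks IPv4 addresses):

journalctl -u k3s --no-pager \
  | sed -E 's/([0-9]{1,3}\.){3}[0-9]{1,3}/<ipremoved>/g' \
  > k3s-journal-redacted.log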

dragon2611 commented 4 years ago

That would have been a better way to do it. I noticed the syslog lines all started with k3s, so I pulled them based on that.

I've also removed (I think) the public IPs/rDNS names. Log removed.

brandond commented 4 years ago

It looks like you are either CPU- or IO-bound - even before the upgrade, at a steady state, you were getting warnings logged for slow etcd operations; anything over 500ms is considered slow:

Nov  5 16:58:51 14 k3s[10225]: I1105 16:58:51.424700   10225 trace.go:205] Trace[1933635332]: "GuaranteedUpdate etcd3" type:*unstructured.Unstructured (05-Nov-2020 16:58:50.706) (total time: 718ms):
Nov  5 16:58:51 14 k3s[10225]: Trace[1933635332]: [718.131197ms] [718.131197ms] END
Nov  5 16:58:51 14 k3s[10225]: I1105 16:58:51.556005   10225 trace.go:205] Trace[1745591315]: "Update" url:/apis/longhorn.io/v1beta1/namespaces/longhorn-system/volumes/seafile/status,user-agent:longhorn-manager/v0.0.0 (linux/amd64) kubernetes/$Format,client:ipremoved (05-Nov-2020 16:58:51.014) (total time: 541ms):
Nov  5 16:58:51 14 k3s[10225]: Trace[1745591315]: ---"Object stored in database" 539ms (16:58:00.555)
Nov  5 16:58:51 14 k3s[10225]: Trace[1745591315]: [541.584455ms] [541.584455ms] END
Nov  5 16:58:51 14 k3s[10225]: I1105 16:58:51.575535   10225 trace.go:205] Trace[205073008]: "Update" url:/apis/longhorn.io/v1beta1/namespaces/longhorn-system/engines/seafile-e-09ddd2bf/status,user-agent:longhorn-manager/v0.0.0 (linux/amd64) kubernetes/$Format,client:ipremoved (05-Nov-2020 16:58:50.697) (total time: 877ms):
Nov  5 16:58:51 14 k3s[10225]: Trace[205073008]: ---"Object stored in database" 723ms (16:58:00.424)
Nov  5 16:58:51 14 k3s[10225]: Trace[205073008]: [877.464578ms] [877.464578ms] END
Nov  5 16:58:52 14 k3s[10225]: I1105 16:58:52.100528   10225 trace.go:205] Trace[118517014]: "Delete" url:/apis/management.cattle.io/v3/namespaces/p-tkfpk/projectalertgroups (05-Nov-2020 16:58:51.510) (total time: 589ms):
Nov  5 16:58:52 14 k3s[10225]: Trace[118517014]: [589.861886ms] [589.861886ms] END
Nov  5 16:58:53 14 k3s[10225]: I1105 16:58:53.695170   10225 trace.go:205] Trace[1314379382]: "Delete" url:/apis/management.cattle.io/v3/namespaces/local/clusteralertgroups (05-Nov-2020 16:58:52.469) (total time: 1225ms):
Nov  5 16:58:53 14 k3s[10225]: Trace[1314379382]: [1.225468452s] [1.225468452s] END
Nov  5 16:59:00 14 k3s[10225]: E1105 16:59:00.802822   10225 crd_finalizer.go:299] clusters.management.cattle.io failed with: timed out waiting for the condition
Nov  5 16:59:03 14 k3s[10225]: I1105 16:59:03.858929   10225 trace.go:205] Trace[1561730586]: "Update" url:/apis/longhorn.io/v1beta1/namespaces/longhorn-system/engines/thelounge-e-dd667621/status,user-agent:longhorn-manager/v0.0.0 (linux/amd64) kubernetes/$Format,client:ipremoved (05-Nov-2020 16:59:03.356) (total time: 502ms):
Nov  5 16:59:03 14 k3s[10225]: Trace[1561730586]: ---"Object stored in database" 496ms (16:59:00.854)
Nov  5 16:59:03 14 k3s[10225]: Trace[1561730586]: [502.678312ms] [502.678312ms] END

Immediately after the upgrade, operations were dangerously close to the 10 second max after which Kubernetes internal operations start raising critical faults that will cause the system to exit:

Nov  5 17:02:02 14 k3s[659]: I1105 17:02:02.004927     659 trace.go:205] Trace[55813725]: "Delete" url:/apis/apiregistration.k8s.io/v1/apiservices/v1.operator.tigera.io,user-agent:k3s/v1.19.3+k3s2 (linux/amd64) kubernetes/f8a4547,client:127.0.0.1 (05-Nov-2020 17:02:00.090) (total time: 1914ms):
Nov  5 17:02:02 14 k3s[659]: Trace[55813725]: ---"Object deleted from database" 1913ms (17:02:00.004)
Nov  5 17:02:02 14 k3s[659]: Trace[55813725]: [1.914228583s] [1.914228583s] END
Nov  5 17:02:02 14 k3s[659]: I1105 17:02:02.167549     659 trace.go:205] Trace[1364862330]: "GuaranteedUpdate etcd3" type:*coordination.Lease (05-Nov-2020 17:01:53.622) (total time: 8544ms):
Nov  5 17:02:02 14 k3s[659]: Trace[1364862330]: ---"Transaction prepared" 6406ms (17:02:00.029)
Nov  5 17:02:02 14 k3s[659]: Trace[1364862330]: ---"Transaction committed" 2138ms (17:02:00.167)
Nov  5 17:02:02 14 k3s[659]: Trace[1364862330]: [8.544659518s] [8.544659518s] END
Nov  5 17:02:02 14 k3s[659]: I1105 17:02:02.167738     659 trace.go:205] Trace[59372318]: "Update" url:/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/ipremoved2.in-addr.arpa,user-agent:k3s/v1.19.3+k3s1 (linux/amd64) kubernetes/974ad30,client:ipremoved (05-Nov-2020 17:01:53.622) (total time: 8545ms):
Nov  5 17:02:02 14 k3s[659]: Trace[59372318]: ---"Object stored in database" 8544ms (17:02:00.167)
Nov  5 17:02:02 14 k3s[659]: Trace[59372318]: [8.545045834s] [8.545045834s] END
Nov  5 17:02:02 14 k3s[659]: I1105 17:02:02.582205     659 trace.go:205] Trace[648715365]: "Create" url:/api/v1/namespaces/calico-system/events,user-agent:k3s/v1.19.3+k3s1 (linux/amd64) kubernetes/974ad30,client:ipremoved (05-Nov-2020 17:01:53.452) (total time: 9129ms):
Nov  5 17:02:02 14 k3s[659]: Trace[648715365]: ---"Object stored in database" 9129ms (17:02:00.582)
Nov  5 17:02:02 14 k3s[659]: Trace[648715365]: [9.129398633s] [9.129398633s] END
Nov  5 17:02:03 14 k3s[659]: I1105 17:02:03.384053     659 controller.go:132] OpenAPI AggregationController: action for item : Nothing (removed from the queue).
Nov  5 17:02:03 14 k3s[659]: I1105 17:02:03.384121     659 controller.go:132] OpenAPI AggregationController: action for item k8s_internal_local_delegation_chain_0000000000: Nothing (removed from the queue).
Nov  5 17:02:04 14 k3s[659]: E1105 17:02:04.488510     659 controller.go:156] Unable to remove old endpoints from kubernetes service: no master IPs were listed in storage, refusing to erase all endpoints for the kubernetes service
Nov  5 17:02:04 14 k3s[659]: I1105 17:02:04.767020     659 trace.go:205] Trace[1211414567]: "Delete" url:/apis/apiregistration.k8s.io/v1/apiservices/v1.helm.cattle.io,user-agent:k3s/v1.19.3+k3s2 (linux/amd64) kubernetes/f8a4547,client:127.0.0.1 (05-Nov-2020 17:02:00.090) (total time: 4676ms):
Nov  5 17:02:04 14 k3s[659]: Trace[1211414567]: ---"Object deleted from database" 4675ms (17:02:00.766)
Nov  5 17:02:04 14 k3s[659]: Trace[1211414567]: [4.676718984s] [4.676718984s] END
Nov  5 17:02:04 14 k3s[659]: I1105 17:02:04.949506     659 trace.go:205] Trace[338465704]: "GuaranteedUpdate etcd3" type:*core.Node (05-Nov-2020 17:01:57.034) (total time: 7914ms):
Nov  5 17:02:04 14 k3s[659]: Trace[338465704]: ---"Transaction prepared" 2963ms (17:01:00.998)
Nov  5 17:02:04 14 k3s[659]: Trace[338465704]: ---"Transaction committed" 4950ms (17:02:00.949)
Nov  5 17:02:04 14 k3s[659]: Trace[338465704]: [7.914914321s] [7.914914321s] END
Nov  5 17:02:04 14 k3s[659]: I1105 17:02:04.950230     659 trace.go:205] Trace[1326105865]: "Update" url:/api/v1/nodes/ipremoved.in-addr.arpa,user-agent:k3s/v1.19.3+k3s2 (linux/amd64) kubernetes/f8a4547,client:127.0.0.1 (05-Nov-2020 17:01:57.033) (total time: 7916ms):
Nov  5 17:02:04 14 k3s[659]: Trace[1326105865]: ---"Object stored in database" 7915ms (17:02:00.949)
Nov  5 17:02:04 14 k3s[659]: Trace[1326105865]: [7.91664474s] [7.91664474s] END
Nov  5 17:02:04 14 k3s[659]: time="2020-11-05T17:02:04.956541406-05:00" level=info msg="labels have been set successfully on node: ipremoved.in-addr.arpa"

Can you provide iotop -a and dstat 5 output for this system?
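One way to capture those for attaching (a sketch; the sample counts are arbitrary):

# Accumulated per-process I/O in batch mode, 120 one-second samples:
sudo iotop -a -b -o -n 120 > iotop-a.txt
# CPU/disk/network stats every 5 seconds, 24 samples (about 2 minutes):
dstat 5 24 > dstat-5.txt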

brandond commented 4 years ago

Oh, also, you said you're using postgres but the logs say:

Nov  5 17:01:36 14 k3s[659]: time="2020-11-05T17:01:36.249175862-05:00" level=info msg="Configuring sqlite3 database connection pooling: maxIdleConns=2, maxOpenConns=0, connMaxLifetime=0s"

Are you intending to use an external postgres database instead of sqlite? If so, how did you configure it?

dragon2611 commented 4 years ago

Ah right, it's running as a VM, but I think the host node is fairly old and uses HDDs; it wouldn't surprise me if the BBU in the RAID card has gone bad and disabled write caching. It is the only VM on that machine.

Yes, I had intended to use Postgres; it's running on the same VM (the idea was to move it later if needed).

I can see Postgres mentioned in the systemd file. I can post it later, but I'm on mobile at the moment; I think there's a space where there shouldn't be, though.

I suspect what's happened is that I messed up passing the Postgres argument to the install script, and rather than erroring out it just ignored the argument and started with sqlite - and I rather embarrassingly never noticed.

dragon2611 commented 4 years ago

' --datastore-endpoint=postgres://' \
'--disable=traefik' \
'--flannel-backend=none' \
'--disable=local-storage' \
'--docker' \
'--kube-apiserver-arg' 'service-node-port-range=900-65535' \
'--enable-admission-plugins' 'AlwaysPullImages' \'

https://rancher.com/docs/k3s/latest/en/installation/datastore/#datastore-endpoint-format-and-functionality

I'm guessing it's that space between the ' and the --datastore-endpoint?

dragon2611 commented 4 years ago

@brandond For now I've managed to migrate the master and put it on a CX11-Ceph over at Hetzner, where it seems to be happier. It's running as an agentless master, so the little CX11 doesn't have to worry about running any pods.

At some point I'll probably flatten everything and redeploy, as I built it when I was fairly early on in learning Kubernetes and there are certain things I would do somewhat differently now.

Using Sqlite was unintentional and caused by a mistake that went unnoticed when installing K3S.

brandond commented 4 years ago

Yeah, that leading quoted space is causing the datastore-endpoint flag to get ignored. Definitely funny that it ran so long like that!
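For reference, a fixed ExecStart fragment would look something like this; the connection string below is only a placeholder, and the real change is dropping the leading space inside the quoted --datastore-endpoint argument (remaining flags unchanged):

ExecStart=/usr/local/bin/k3s \
    server \
        '--datastore-endpoint=postgres://user:pass@dbhost:5432/k3s' \
        '--disable=traefik' \
        '--flannel-backend=none' \
        '--disable=local-storage' \
        '--docker' \
        '--kube-apiserver-arg' 'service-node-port-range=900-65535'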