stormshift / support

This repo should serve as a central source for reporting issues with stormshift
GNU General Public License v3.0

RHACM cluster API not responding #77

Closed DanielFroehlich closed 2 years ago

DanielFroehlich commented 2 years ago

After the COE lab upgrade, the RHACM cluster is not responding ("Connection refused"). It seems the API server is not coming up. The VMs are online and I can SSH into the nodes (even via the API VIP), so the infrastructure looks good.

dfroehli@dfroehli-mac21 ~ % ssh root@stormshiftdeploy.coe.muc.redhat.com
Last login: Wed Apr 13 10:07:48 2022 from 10.39.193.110
[root@stormshiftdeploy ~]# export KUBECONFIG=/root/rhacm_install/auth/kubeconfig
[root@stormshiftdeploy ~]# oc get nodes
The connection to the server api.rhacm.stormshift.coe.muc.redhat.com:6443 was refused - did you specify the right host or port?
[root@stormshiftdeploy ~]# nslookup api.rhacm.stormshift.coe.muc.redhat.com
Server:     10.32.96.1
Address:    10.32.96.1#53

Non-authoritative answer:
Name:   api.rhacm.stormshift.coe.muc.redhat.com
Address: 10.32.105.47

[root@stormshiftdeploy ~]# ssh core@10.32.105.47
Last login: Mon Apr 11 10:53:31 2022 from 10.32.98.111
[core@rhacm-ncrfd-master-2 ~]$ 
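
The symptom above (SSH to the VIP works, port 6443 is refused) can be confirmed with a plain TCP probe. The following is a minimal sketch, assuming bash and coreutils `timeout` are available on the deploy host; the function name `port_state` is hypothetical and not part of any tool used in this thread.

```shell
#!/usr/bin/env bash
# Report whether a TCP port accepts connections, using bash's /dev/tcp redirection.
# Prints "open" or "closed"; "closed" also covers timeouts and unreachable hosts.
port_state() {
  local host="$1" port="$2"
  # The inner bash only opens the socket and exits; 3 seconds is an arbitrary limit.
  if timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "open"
  else
    echo "closed"
  fi
}
```

Calling `port_state 10.32.105.47 22` and `port_state 10.32.105.47 6443` in turn would show whether only the API port is affected, which points at the kube-apiserver rather than at the network or the VIP.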
github-actions[bot] commented 2 years ago

Heads up @cluster/rhacm-admin - the "cluster/rhacm" label was applied to this issue.

DanielFroehlich commented 2 years ago

@rbo would you be able to take a look after lunch?

rbo commented 2 years ago

I am having problems with DNS: it looks like the forwarding is not working reliably, so I opened a ticket with internal IT: INC2187912

DanielFroehlich commented 2 years ago

Cluster API is still not reachable. DNS issue seems to be resolved:

[root@stormshiftdeploy ~]# export KUBECONFIG=/root/rhacm_install/auth/kubeconfig
[root@stormshiftdeploy ~]# oc get nodes
The connection to the server api.rhacm.stormshift.coe.muc.redhat.com:6443 was refused - did you specify the right host or port?
[root@stormshiftdeploy ~]# nslookup 
> ^C[root@stormshiftdeploy ~]# nslookup api.rhacm.stormshift.coe.muc.redhat.com
Server:     10.32.96.1
Address:    10.32.96.1#53

Non-authoritative answer:
Name:   api.rhacm.stormshift.coe.muc.redhat.com
Address: 10.32.105.47

I see the VIP ....105.47 active on master-2 (screenshot omitted).

Something is seriously broken. @rbo, can you please take a look?

rbo commented 2 years ago

Phew, it looks like master-0 got a new IP address:

[root@rhacm-ncrfd-master-2 etcd-pod-25]# cat etcd-pod.yaml  | jq | grep -A1 'IP",'
            "name": "NODE_rhacm_ncrfd_master_0_IP",
            "value": "10.32.111.43" => Current IP in RHEV: 10.32.111.47 ❌
--
            "name": "NODE_rhacm_ncrfd_master_1_IP",
            "value": "10.32.111.21" => Current IP in RHEV: 10.32.111.21 ✅
--
            "name": "NODE_rhacm_ncrfd_master_2_IP", 
            "value": "10.32.111.42" => Current IP in RHEV: 10.32.111.42 ✅

--
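
The comparison above was annotated by hand. It could be scripted roughly as follows. This is a sketch only: it assumes the manifest has already been pretty-printed by `jq` (one `"name"` line followed by its `"value"` line, as in the output above) and that the current addresses are kept in a hypothetical inventory file with one `<env var name> <ip>` pair per line. The function names are illustrative.

```shell
# Print "<env var name> <ip>" for every NODE_*_IP entry in pretty-printed JSON on stdin.
list_node_ips() {
  awk -F'"' '
    /"name": "NODE_.*_IP"/ { name = $4; next }
    name != "" && /"value":/ { print name, $4; name = "" }
  '
}

# Compare recorded IPs (stdin, "name ip" per line) against an inventory file
# in the same format and report every entry that differs.
diff_node_ips() {
  inventory="$1"
  while read -r name ip; do
    current=$(awk -v n="$name" '$1 == n { print $2 }' "$inventory")
    if [ "$ip" != "$current" ]; then
      echo "MISMATCH $name recorded=$ip current=$current"
    fi
  done
}
```

A possible invocation on the node would be `jq . etcd-pod.yaml | list_node_ips | diff_node_ips current-ips.txt`, where `current-ips.txt` is the assumed inventory file filled in from RHEV.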
rbo commented 2 years ago

And etcd is not starting properly on master-2:

[root@rhacm-ncrfd-master-2 ~]# crictl ps -a | grep etcd
96dd9dd2abb53       b055bdb1d2da181f6ec211d7131cb7ae7ed702c7aeb60be4359cd0e3be24f15d                                                         2 minutes ago       Exited              etcd                                          6862                15ea69c99ac16
112fab83ab29e       30d3ef38a509ceb186e08f78ce28419fa06e4a3a32323704d7252fe267ccbdac                                                         5 minutes ago       Exited              etcd-health-monitor                           6540                15ea69c99ac16
9943b7c4e9cf8       b055bdb1d2da181f6ec211d7131cb7ae7ed702c7aeb60be4359cd0e3be24f15d                                                         35 hours ago        Running             etcd-metrics                                  7                   15ea69c99ac16
1f37282d71c3c       b055bdb1d2da181f6ec211d7131cb7ae7ed702c7aeb60be4359cd0e3be24f15d                                                         35 hours ago        Running             etcdctl                                       7                   15ea69c99ac16
98e3c871d16c9       b055bdb1d2da181f6ec211d7131cb7ae7ed702c7aeb60be4359cd0e3be24f15d                                                         35 hours ago        Exited              etcd-resources-copy                           7                   15ea69c99ac16
2984024de8294       b055bdb1d2da181f6ec211d7131cb7ae7ed702c7aeb60be4359cd0e3be24f15d                                                         35 hours ago        Exited              etcd-ensure-env-vars                          7                   15ea69c99ac16
[root@rhacm-ncrfd-master-2 ~]# crictl logs --tail 25 96dd9dd2abb53
{"level":"info","ts":1652295800.773414,"caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_HEARTBEAT_INTERVAL","variable-value":"100"}
{"level":"info","ts":1652295800.7734299,"caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_INITIAL_CLUSTER_STATE","variable-value":"existing"}
{"level":"info","ts":1652295800.773445,"caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_NAME","variable-value":"rhacm-ncrfd-master-2"}
{"level":"info","ts":1652295800.773465,"caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_QUOTA_BACKEND_BYTES","variable-value":"8589934592"}
{"level":"info","ts":1652295800.7734802,"caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_SOCKET_REUSE_ADDRESS","variable-value":"true"}
{"level":"warn","ts":1652295800.773507,"caller":"flags/flag.go:93","msg":"unrecognized environment variable","environment-variable":"ETCD_STATIC_POD_VERSION=26"}
{"level":"warn","ts":1652295800.7735207,"caller":"flags/flag.go:93","msg":"unrecognized environment variable","environment-variable":"ETCD_INITIAL_CLUSTER="}
{"level":"warn","ts":1652295800.77353,"caller":"flags/flag.go:93","msg":"unrecognized environment variable","environment-variable":"ETCD_IMAGE=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e345ddf8d5858ecea620a71b614283f00b8bf46e13261f5c6e7800f151693fab"}
{"level":"info","ts":"2022-05-11T19:03:20.773Z","caller":"etcdmain/etcd.go:72","msg":"Running: ","args":["etcd","--logger=zap","--log-level=info","--initial-advertise-peer-urls=https://10.32.111.42:2380","--cert-file=/etc/kubernetes/static-pod-certs/secrets/etcd-all-certs/etcd-serving-rhacm-ncrfd-master-2.crt","--key-file=/etc/kubernetes/static-pod-certs/secrets/etcd-all-certs/etcd-serving-rhacm-ncrfd-master-2.key","--trusted-ca-file=/etc/kubernetes/static-pod-certs/configmaps/etcd-serving-ca/ca-bundle.crt","--client-cert-auth=true","--peer-cert-file=/etc/kubernetes/static-pod-certs/secrets/etcd-all-certs/etcd-peer-rhacm-ncrfd-master-2.crt","--peer-key-file=/etc/kubernetes/static-pod-certs/secrets/etcd-all-certs/etcd-peer-rhacm-ncrfd-master-2.key","--peer-trusted-ca-file=/etc/kubernetes/static-pod-certs/configmaps/etcd-peer-client-ca/ca-bundle.crt","--peer-client-cert-auth=true","--advertise-client-urls=https://10.32.111.42:2379","--listen-client-urls=https://0.0.0.0:2379,unixs://10.32.111.42:0","--listen-peer-urls=https://0.0.0.0:2380","--metrics=extensive","--listen-metrics-urls=https://0.0.0.0:9978"]}
{"level":"info","ts":"2022-05-11T19:03:20.773Z","caller":"etcdmain/etcd.go:115","msg":"server has been already initialized","data-dir":"/var/lib/etcd","dir-type":"member"}
{"level":"info","ts":"2022-05-11T19:03:20.773Z","caller":"embed/etcd.go:125","msg":"configuring socket options","reuse-address":true,"reuse-port":false}
{"level":"info","ts":"2022-05-11T19:03:20.773Z","caller":"embed/etcd.go:131","msg":"configuring peer listeners","listen-peer-urls":["https://0.0.0.0:2380"]}
{"level":"info","ts":"2022-05-11T19:03:20.773Z","caller":"embed/etcd.go:478","msg":"starting with peer TLS","tls-info":"cert = /etc/kubernetes/static-pod-certs/secrets/etcd-all-certs/etcd-peer-rhacm-ncrfd-master-2.crt, key = /etc/kubernetes/static-pod-certs/secrets/etcd-all-certs/etcd-peer-rhacm-ncrfd-master-2.key, client-cert=, client-key=, trusted-ca = /etc/kubernetes/static-pod-certs/configmaps/etcd-peer-client-ca/ca-bundle.crt, client-cert-auth = true, crl-file = ","cipher-suites":["TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256","TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256","TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384","TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384","TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256","TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256"]}
{"level":"info","ts":"2022-05-11T19:03:20.774Z","caller":"embed/etcd.go:139","msg":"configuring client listeners","listen-client-urls":["https://0.0.0.0:2379","unixs://10.32.111.42:0"]}
{"level":"info","ts":"2022-05-11T19:03:20.774Z","caller":"embed/etcd.go:598","msg":"pprof is enabled","path":"/debug/pprof"}
{"level":"info","ts":"2022-05-11T19:03:20.775Z","caller":"embed/etcd.go:307","msg":"starting an etcd server","etcd-version":"3.5.0","git-sha":"GitNotFound","go-version":"go1.16.12","go-os":"linux","go-arch":"amd64","max-cpu-set":4,"max-cpu-available":4,"member-initialized":true,"name":"rhacm-ncrfd-master-2","data-dir":"/var/lib/etcd","wal-dir":"","wal-dir-dedicated":"","member-dir":"/var/lib/etcd/member","force-new-cluster":false,"heartbeat-interval":"100ms","election-timeout":"1s","initial-election-tick-advance":true,"snapshot-count":100000,"snapshot-catchup-entries":5000,"initial-advertise-peer-urls":["https://10.32.111.42:2380"],"listen-peer-urls":["https://0.0.0.0:2380"],"advertise-client-urls":["https://10.32.111.42:2379"],"listen-client-urls":["https://0.0.0.0:2379","unixs://10.32.111.42:0"],"listen-metrics-urls":["https://0.0.0.0:9978"],"cors":["*"],"host-whitelist":["*"],"initial-cluster":"","initial-cluster-state":"existing","initial-cluster-token":"","quota-size-bytes":8589934592,"pre-vote":true,"initial-corrupt-check":false,"corrupt-check-time-interval":"0s","auto-compaction-mode":"periodic","auto-compaction-retention":"0s","auto-compaction-interval":"0s","discovery-url":"","discovery-proxy":"","downgrade-check-interval":"5s"}
{"level":"warn","ts":1652295800.7758572,"caller":"fileutil/fileutil.go:57","msg":"check file permission","error":"directory \"/var/lib/etcd\" exist, but the permission is \"drwxr-xr-x\". The recommended permission is \"-rwx------\" to prevent possible unprivileged access to the data"}
panic: freepages: failed to get all reachable pages (page 3689637105831851568: out of bounds: 211979)

goroutine 105 [running]:
go.etcd.io/bbolt.(*DB).freepages.func2(0xc000172060)
    /remote-source/cachito-gomod-with-deps/deps/gomod/pkg/mod/go.etcd.io/bbolt@v1.3.6/db.go:1056 +0xe9
created by go.etcd.io/bbolt.(*DB).freepages
    /remote-source/cachito-gomod-with-deps/deps/gomod/pkg/mod/go.etcd.io/bbolt@v1.3.6/db.go:1054 +0x1cd
[root@rhacm-ncrfd-master-2 ~]#
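
The restart counts in the `crictl ps -a` listing above (6862 for `etcd`, 6540 for `etcd-health-monitor`) are what mark these containers as crash-looping. A small filter can pull such entries out of the listing. This is a sketch under the assumption that the column layout matches the output shown here; the function name `crash_looping` and the default threshold of 100 are arbitrary choices for illustration.

```shell
# List exited containers whose restart count (ATTEMPT column) is at or above a threshold.
# Reads `crictl ps -a` output on stdin. Columns are counted from the right because
# the CREATED column (for example "2 minutes ago") contains spaces.
crash_looping() {
  threshold="${1:-100}"
  awk -v min="$threshold" '
    NF >= 7 && $(NF-3) == "Exited" && $(NF-1) + 0 >= min { print $(NF-2), $(NF-1) }
  '
}
```

Used as `crictl ps -a | crash_looping 100`, it prints the container name and attempt count for each match. Containers that exited after only a few attempts, such as `etcd-resources-copy`, stay below the threshold and are not reported.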
rbo commented 2 years ago

Even with changed permissions, etcd will not start.

etcd is running on master-1.

I don't know if it makes sense to spend more time recovering the cluster: it's a simple IPI installation and, if I remember correctly, nothing fancy is deployed.

rbo commented 2 years ago

Cluster destroyed:

[root@stormshiftdeploy ~]# ./ocp49binaries/openshift-install destroy cluster --dir rhacm_install/
INFO Stopping VM rhacm-ncrfd-master-1
INFO Stopping VM rhacm-ncrfd-worker-0-kwr25
INFO Stopping VM rhacm-ncrfd-worker-0-8tz4l
INFO Stopping VM rhacm-ncrfd-master-2
INFO Stopping VM rhacm-ncrfd-master-0
INFO Stopping VM rhacm-ncrfd-worker-0-fp8m5
INFO VM rhacm-ncrfd-worker-0-kwr25 powered off
INFO VM rhacm-ncrfd-worker-0-8tz4l powered off
INFO Removing VM rhacm-ncrfd-worker-0-kwr25
INFO Removing VM rhacm-ncrfd-worker-0-8tz4l
INFO VM rhacm-ncrfd-master-2 powered off
INFO Removing VM rhacm-ncrfd-master-2
INFO VM rhacm-ncrfd-master-0 powered off
INFO Removing VM rhacm-ncrfd-master-0
INFO VM rhacm-ncrfd-master-1 powered off
INFO VM rhacm-ncrfd-worker-0-fp8m5 powered off
INFO Removing VM rhacm-ncrfd-master-1
INFO Removing VM rhacm-ncrfd-worker-0-fp8m5
INFO Removing tag rhacm-ncrfd
INFO Removing Template rhacm-ncrfd-rhcos
INFO Time elapsed: 26s
[root@stormshiftdeploy ~]#
rbo commented 2 years ago

I will set up a new one when I have time.