stormshift / support

This repo should serve as a central source for reporting issues with stormshift

Noobaa Mgmt not available #100

Closed: rbo closed this issue 2 years ago

rbo commented 2 years ago

https://noobaa-mgmt-openshift-storage.apps.cluster.coe.muc.redhat.com/

=> Application is not available

The application pod is running, but it cannot connect to the database:

$ oc logs noobaa-core-0  --tail=10
Jul-20 15:26:49.725 [Upgrade/20] [ERROR] core.util.postgres_client:: _connect: initial connect failed, will retry getaddrinfo ENOTFOUND noobaa-db-pg-0.noobaa-db-pg
Jul-20 15:26:52.725 [Upgrade/20]    [L0] core.util.postgres_client:: _connect: called with { host: 'noobaa-db-pg-0.noobaa-db-pg', user: 'noobaa', password: '139IFeBAV5huAg==', database: 'nbcore', port: 5432 }
Jul-20 15:26:52.734 [Upgrade/20] [ERROR] core.util.postgres_client:: apply_sql_functions execute error Error: getaddrinfo ENOTFOUND noobaa-db-pg-0.noobaa-db-pg
    at GetAddrInfoReqWrap.onlookup [as oncomplete] (dns.js:71:26) {
  errno: -3008,
  code: 'ENOTFOUND',
  syscall: 'getaddrinfo',
  hostname: 'noobaa-db-pg-0.noobaa-db-pg'
}
Jul-20 15:26:52.734 [Upgrade/20] [ERROR] core.util.postgres_client:: _connect: initial connect failed, will retry getaddrinfo ENOTFOUND noobaa-db-pg-0.noobaa-db-pg
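
The ENOTFOUND error means the hostname noobaa-db-pg-0.noobaa-db-pg does not resolve, which typically happens while the database pod behind its headless service is not ready. A quick sanity check (a sketch; the service name noobaa-db-pg is assumed from the hostname in the log above):

# the headless service should list noobaa-db-pg-0 as an endpoint once the pod is ready
oc -n openshift-storage get svc noobaa-db-pg
oc -n openshift-storage get endpoints noobaa-db-pg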

Let's check the database:

oc get pods  | grep pg
noobaa-db-pg-0                                    0/1     Init:0/2   0          10m

 oc describe pod noobaa-db-pg-0 | grep -A100  ^Events
Events:
  Type     Reason                  Age                    From                     Message
  ----     ------                  ----                   ----                     -------
  Normal   Scheduled               11m                    default-scheduler        Successfully assigned openshift-storage/noobaa-db-pg-0 to sf2
  Warning  FailedAttachVolume      11m                    attachdetach-controller  Multi-Attach error for volume "pvc-db5a7fa8-df37-4587-9bc5-df5b7addf5a6" Volume is already exclusively attached to one node and can't be attached to another
  Normal   SuccessfulAttachVolume  10m                    attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-db5a7fa8-df37-4587-9bc5-df5b7addf5a6"
  Warning  FailedMount             2m28s (x3 over 9m16s)  kubelet                  Unable to attach or mount volumes: unmounted volumes=[db], unattached volumes=[noobaa-postgres-config-volume db kube-api-access-n27kw noobaa-postgres-initdb-sh-volume]: timed out waiting for the condition
  Warning  FailedMount             76s (x9 over 9m59s)    kubelet                  MountVolume.MountDevice failed for volume "pvc-db5a7fa8-df37-4587-9bc5-df5b7addf5a6" : rpc error: code = Internal desc = rbd image rbd_sdd/csi-vol-c70016bc-af4a-11ec-92f9-0a580a810051 is still being used
  Warning  FailedMount             11s (x2 over 4m43s)    kubelet                  Unable to attach or mount volumes: unmounted volumes=[db], unattached volumes=[db kube-api-access-n27kw noobaa-postgres-initdb-sh-volume noobaa-postgres-config-volume]: timed out waiting for the condition
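
The Multi-Attach warning means the RBD volume is still attached to another node. Besides the rbd device list check in the next comment, the attachment can also be inspected through the VolumeAttachment objects (a sketch, using the PV name from the events above):

# cluster-scoped list of CSI attachments; the NODE column shows where the PV is attached
oc get volumeattachments | grep pvc-db5a7fa8-df37-4587-9bc5-df5b7addf5a6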
rbo commented 2 years ago

Let's check with:

for pod in `kubectl -n openshift-storage get pods|grep rbdplugin|grep -v provisioner|awk '{print $1}'`; do echo $pod; kubectl exec -it -n openshift-storage $pod -c csi-rbdplugin -- rbd device list; done

to see on which node each RBD device is attached.

The volume is attached to node inf4, but the pg pod is running on sf2.
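
For reference, the node the pod was scheduled to can be double-checked from the pod side (a sketch; -o wide prints the NODE column):

oc -n openshift-storage get pod noobaa-db-pg-0 -o wide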

I don't know how to detach the volume by hand, so let's reboot the node...

oc adm drain inf4.coe.muc.redhat.com --ignore-daemonsets --delete-emptydir-data --force
....

oc debug node/inf4.coe.muc.redhat.com
Starting pod/inf4coemucredhatcom-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.32.96.4
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# reboot
Terminated
sh-4.4# 
Removing debug pod ...
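
While the node reboots, its status can be watched until it reports Ready again (a sketch):

oc get node inf4.coe.muc.redhat.com -w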
rbo commented 2 years ago

After the reboot, the node is ready again.

$ oc adm uncordon inf4.coe.muc.redhat.com
node/inf4.coe.muc.redhat.com uncordoned
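
A quick verification that everything recovered (a sketch; the database pod should reach Running and the management route should answer again):

oc -n openshift-storage get pod noobaa-db-pg-0
curl -k -o /dev/null -w '%{http_code}\n' https://noobaa-mgmt-openshift-storage.apps.cluster.coe.muc.redhat.com/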

Problem solved.