openebs / mayastor

Dynamically provision Stateful Persistent Replicated Cluster-wide Fabric Volumes & Filesystems for Kubernetes, provisioned from an optimized NVMe SPDK backend data storage stack.
Apache License 2.0

After rebooting worker nodes, MSPs are in pending state and daemonset pods are unavailable #931

Closed: krishnakekan619 closed this issue 2 years ago

krishnakekan619 commented 3 years ago

Describe the bug: We had previously installed Mayastor on our on-premise RKE cluster. After rebooting the worker nodes, we faced the following issue.

To Reproduce: Reboot the worker nodes where the storage pools are located.

Expected behavior: After rebooting a worker node, Mayastor should keep working smoothly and the MSPs should show an online status.

Screenshots: After rebooting the worker nodes, the Mayastor MSP status shows as pending:

~# kubectl get msp -n mayastor -o wide
NAME                NODE   STATE     AGE
pool-on-node-wl01   wl01   pending   100d
pool-on-node-wl02   wl02   online    100d
pool-on-node-wl03   wl03   pending   100d

Also, the daemonset below shows only one mayastor pod in ready status:

NAME           DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                         AGE
mayastor       3         3         1       2            1           kubernetes.io/arch=amd64,openebs.io/engine=mayastor   63d
mayastor-csi   3         3         3       3            3           kubernetes.io/arch=amd64                              97d


Additional context: we have already checked a few things on our side.

Could you suggest a workaround so that Mayastor keeps working smoothly even if someone reboots the worker nodes in the future? @jkryl Thanks in advance.

krishnakekan619 commented 3 years ago

Hi @jkryl @gila, hope you guys are doing well today!

kubectl logs -f pod/mayastor-brgrw -n mayastor

[2021-06-11T07:22:43.668504167+00:00  INFO mayastor:main.rs:46] Starting Mayastor ..
[2021-06-11T07:22:43.668613480+00:00  INFO mayastor:main.rs:47] kernel io_uring support: no
[2021-06-11T07:22:43.668629087+00:00  INFO mayastor:main.rs:51] free_pages: 2048 nr_pages: 2048
thread 'main' panicked at 'Invalid Host Name: Custom { kind: Other, error: "failed to lookup address information: Name or service not known" }', mayastor/src/subsys/mbus/mod.rs:41:49
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

kubectl logs -n mayastor pod/mayastor-czrdt

[2021-06-11T10:34:22.085561709+00:00  INFO mayastor:main.rs:46] Starting Mayastor ..
[2021-06-11T10:34:22.085666183+00:00  INFO mayastor:main.rs:47] kernel io_uring support: no
[2021-06-11T10:34:22.085686827+00:00  INFO mayastor:main.rs:51] free_pages: 1024 nr_pages: 1024
thread 'main' panicked at 'Invalid Host Name: Custom { kind: Other, error: "failed to lookup address information: Name or service not known" }', mayastor/src/subsys/mbus/mod.rs:41:49
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace


2. Before creating any pools on the nodes, it picks up the previously created node pools (from before the reinstall):

kubectl get msp -n mayastor
NAME                NODE   STATE     AGE
pool-on-node-wl01   wl01   pending   102d
pool-on-node-wl02   wl02   pending   102d
pool-on-node-wl03   wl03   pending   102d


Could you suggest anything here? Thanks
tiagolobocastro commented 3 years ago

Hey @krishnakekan619, it seems like Mayastor is not able to reach the nats service. The error message has been improved recently, but maybe you're still on an older version. Could you please check the state of the nats deployment?
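
For reference, a quick way to check could look like this sketch (assuming the deployment and service are both named nats, as in the stock deploy manifests):

kubectl -n mayastor get deployment nats
kubectl -n mayastor get svc,endpoints nats
kubectl -n mayastor logs deploy/nats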

krishnakekan619 commented 3 years ago

Hi @jkryl @gila, thanks for your quick response.

kubectl describe pod/nats-6fdd6dfb4f-58rtp -n mayastor
Name:         nats-6fdd6dfb4f-58rtp
Namespace:    mayastor
Priority:     0
Node:         cpocwl01/xx.xx.xx.1
Start Time:   Fri, 11 Jun 2021 13:59:26 +0800
Labels:       app=nats
              pod-template-hash=6fdd6dfb4f
Annotations:  cni.projectcalico.org/podIP: 10.42.87.194/32
              cni.projectcalico.org/podIPs: 10.42.87.194/32
Status:       Running
IP:           10.42.87.194
IPs:
  IP:           10.42.87.194
Controlled By:  ReplicaSet/nats-6fdd6dfb4f
Containers:
  nats:
    Container ID:   docker://5a3acc52d5093e394729b98157adc92b9b67067168ce594a103ab32f08f0015d
    Image:          nats:2.1-alpine3.11
    Image ID:       docker-pullable://nats@sha256:ebe6d1b23a177223608c68d8617049228b00ee54d4e758d2eca44238326b141b
    Port:           4222/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      Fri, 11 Jun 2021 15:38:32 +0800
    Last State:     Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Fri, 11 Jun 2021 15:30:27 +0800
      Finished:     Fri, 11 Jun 2021 15:37:29 +0800
    Ready:          True
    Restart Count:  3
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-x47dq (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  default-token-x47dq:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-x47dq
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:          <none>

[1] 2021/06/11 07:38:32.192405 [INF] Starting nats-server version 2.1.8
[1] 2021/06/11 07:38:32.192440 [INF] Git commit [c0b574f]
[1] 2021/06/11 07:38:32.192747 [INF] Starting http monitor on 0.0.0.0:8222
[1] 2021/06/11 07:38:32.192815 [INF] Listening for client connections on 0.0.0.0:4222
[1] 2021/06/11 07:38:32.192831 [INF] Server id is NBZZ7GL3DVUCGAK6TRHJ6CMBF6A4IR5TUJVR5JV3XFJ5YOA2FW2QDJ
[1] 2021/06/11 07:38:32.192835 [INF] Server is ready
[1] 2021/06/11 07:38:32.193029 [INF] Listening for route connections on 0.0.0.0:6222


Any suggestions on this? Thanks
tiagolobocastro commented 3 years ago

Could you please describe the nats service as well? Also, could you run a separate container and try to reach nats, e.g. nc -vz nats 4222? Alternatively, you could kubectl -n mayastor delete pod mayastor-xxxxxx, which should trigger the init-container to run again; the init-container will probe nats for you.
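
A one-off debug pod for that check could look like this sketch (the pod name nats-check is arbitrary; busybox ships the same nc the init-container uses):

kubectl -n mayastor run nats-check --rm -it --restart=Never --image=busybox -- nc -vz nats 4222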

krishnakekan619 commented 3 years ago

Hi @jkryl @gila, thanks for your quick response.

I ran another pod, tried the netcat command, and it is working. Below is the output:

[root@centos-01 /]# nc -vz 10.43.5.130 4222
Ncat: Version 7.50 ( https://nmap.org/ncat )
Ncat: Connected to 10.43.5.130:4222.
Ncat: 0 bytes sent, 0 bytes received in 0.02 seconds
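
Note that this check targets the service ClusterIP directly, so it does not exercise DNS at all. A name-based check from the same pod would be, as a sketch (the short name only resolves from inside the mayastor namespace; the fully qualified form should work from anywhere, assuming the default cluster.local domain):

nc -vz nats 4222
nc -vz nats.mayastor.svc.cluster.local 4222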

Deleted the mayastor pods:

kubectl delete pod/mayastor-4g9sv -n mayastor
pod "mayastor-4g9sv" deleted

kubectl delete pod/mayastor-ks4cd -n mayastor
pod "mayastor-ks4cd" deleted

Status of the daemonset pods:

kubectl get pods -n mayastor -w
NAME                    READY   STATUS     RESTARTS   AGE
mayastor-5dr45          0/1     Init:0/1   0          105s
mayastor-5qf82          0/1     Init:0/1   0          81s

Events of the daemonset pods are still the same:

Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  2m1s  default-scheduler  Successfully assigned mayastor/mayastor-5dr45 to cpocwl01
  Normal  Pulled     2m1s  kubelet            Container image "busybox:latest" already present on machine
  Normal  Created    2m    kubelet            Created container message-bus-probe
  Normal  Started    2m    kubelet            Started container message-bus-probe

Could you please provide any other pointers on this? Thanks

tiagolobocastro commented 3 years ago

Can you get the logs from those pods now? It seems like they can't reach the nats service; perhaps the DNS service is not resolving nats?
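
A couple of DNS checks that could help narrow this down, as a sketch (the pod name dns-check is arbitrary, and k8s-app=kube-dns is the usual CoreDNS label):

kubectl -n mayastor run dns-check --rm -it --restart=Never --image=busybox -- nslookup nats
kubectl -n kube-system get pods -l k8s-app=kube-dns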

krishnakekan619 commented 3 years ago

Hi @gila, the logs of the previously running daemonset pods are below:

# kubectl logs -n mayastor pod/mayastor-brgrw
[2021-06-11T10:33:30.306704036+00:00  INFO mayastor:main.rs:46] Starting Mayastor ..
[2021-06-11T10:33:30.306872051+00:00  INFO mayastor:main.rs:47] kernel io_uring support: no
[2021-06-11T10:33:30.306909127+00:00  INFO mayastor:main.rs:51] free_pages: 2048 nr_pages: 2048
thread 'main' panicked at 'Invalid Host Name: Custom { kind: Other, error: "failed to lookup address information: Name or service not known" }', mayastor/src/subsys/mbus/mod.rs:41:49
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

# kubectl logs -n mayastor pod/mayastor-czrdt
[2021-06-11T10:34:22.085561709+00:00  INFO mayastor:main.rs:46] Starting Mayastor ..
[2021-06-11T10:34:22.085666183+00:00  INFO mayastor:main.rs:47] kernel io_uring support: no
[2021-06-11T10:34:22.085686827+00:00  INFO mayastor:main.rs:51] free_pages: 1024 nr_pages: 1024
thread 'main' panicked at 'Invalid Host Name: Custom { kind: Other, error: "failed to lookup address information: Name or service not known" }', mayastor/src/subsys/mbus/mod.rs:41:49
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Currently available daemonset pod logs:

kubectl logs -n mayastor pod/mayastor-5dr45
Error from server (BadRequest): container "mayastor" in pod "mayastor-5dr45" is waiting to start: PodInitializing

kubectl logs -n mayastor pod/mayastor-5qf82
Error from server (BadRequest): container "mayastor" in pod "mayastor-5qf82" is waiting to start: PodInitializing

The nats logs are already given in the comment above; let me know if you want any other logs.

@gila, FYR, the overall log details are in the attached text file mayastor-logs-gila-1.txt

tiagolobocastro commented 3 years ago

I need the logs from the init-container; you need to specify it: kubectl -n mayastor logs mayastor-5qf82 -c message-bus-probe

krishnakekan619 commented 3 years ago

Hi @gila, below are the logs you asked for:

# kubectl -n mayastor logs mayastor-5qf82 -c message-bus-probe
Waiting for message bus...
nc: bad address 'nats'
Waiting for message bus...
nc: bad address 'nats'
nc: bad address 'nats'
Waiting for message bus...

# kubectl -n mayastor logs mayastor-5dr45 -c message-bus-probe
nc: bad address 'nats'
Waiting for message bus...
nc: bad address 'nats'
Waiting for message bus...
nc: bad address 'nats'
Waiting for message bus...
nc: bad address 'nats'
Waiting for message bus...
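
For context, judging by the output above, the message-bus-probe init-container is essentially a retry loop along these lines (reconstructed from the log output, so treat it as a sketch):

until nc -vz nats 4222; do echo "Waiting for message bus..."; sleep 1; done

bad address is busybox nc's way of reporting that the host name failed to resolve, which points at DNS rather than at the nats server itself.
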
tiagolobocastro commented 3 years ago

Hmm, this seems to be some kind of DNS issue if we can't access nats by name.
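
One way to confirm would be to compare the resolver configuration the pod actually received with the cluster DNS service, e.g. (using one of the pod names from above, and assuming the conventional kube-dns service name):

kubectl -n mayastor exec mayastor-5dr45 -c message-bus-probe -- cat /etc/resolv.conf
kubectl -n kube-system get svc kube-dns

The nameserver in resolv.conf should match the ClusterIP of the DNS service.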

krishnakekan619 commented 3 years ago

Hi @tiagolobocastro @jkryl, I can access the nats service by its name, i.e. nats.

gila commented 3 years ago

@krishnakekan619 I've had a similar issue today, strangely enough. I had to restart the kube-router and coredns pods, and then the services were able to resolve properly.
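
In case it helps, a restart along those lines could look like this sketch (assuming CoreDNS runs as the usual coredns deployment, and kube-router as a daemonset, in kube-system):

kubectl -n kube-system rollout restart deployment coredns
kubectl -n kube-system rollout restart daemonset kube-router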

krishnakekan619 commented 3 years ago

@gila we have restarted our dns-utils pods several times, but the error still persists.