sensu / sensu-operator

An operator to manage Sensu 2.0 clusters
MIT License

Pod sometimes stuck with first init container #5

Open schu opened 5 years ago

schu commented 5 years ago

Pods by default get a check-dns init container. Sometimes a new pod doesn't get past that first init container and hangs:

NAME                                                   READY     STATUS     RESTARTS   AGE
example-sensu-cluster-gzwbrntcd6                       0/2       Init:0/3   0          4m

Example description (kubectl describe pod ...):

Init Containers:
  check-dns:
    Container ID:  docker://5a1629435ad0067359b80ff5c82c8f058aa90f8a03cd36bd835751eb6da34340
    Image:         busybox:1.28.0-glibc
    Image ID:      docker-pullable://busybox@sha256:0b55a30394294ab23b9afd58fab94e61a923f5834fba7ddbae7f8e0c11ba85e6
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/sh
      -c

          TIMEOUT_READY=0
          while ( ! nslookup example-sensu-cluster-gzwbrntcd6.example-sensu-cluster.default.svc )
          do
            # If TIMEOUT_READY is 0 we should never time out and exit
            TIMEOUT_READY=$(( TIMEOUT_READY-1 ))
            if [ $TIMEOUT_READY -eq 0 ];
            then
              echo "Timed out waiting for DNS entry"
              exit 1
            fi
            sleep 1
          done
    State:          Running
      Started:      Wed, 18 Jul 2018 14:06:05 +0200
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:         <none>
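
The countdown logic in the command above can be tried outside the cluster. This is only a sketch: `lookup` and `wait_for_dns` are hypothetical names, `lookup` is a stub standing in for nslookup, and the `sleep 1` is left out so it runs instantly. It illustrates that the counter is decremented before it is compared with 0, so a starting value of 0 never triggers the timeout and the container waits for as long as the lookup keeps failing.

```shell
#!/bin/sh
# Sketch of the check-dns countdown with the resolver stubbed out.
# `lookup` and `wait_for_dns` are hypothetical names, not operator code.

# Stand-in for `nslookup <name>`: fails until it has been called
# SUCCEED_AFTER times (SUCCEED_AFTER=0 means it never succeeds).
LOOKUP_CALLS=0
lookup() {
  LOOKUP_CALLS=$(( LOOKUP_CALLS + 1 ))
  [ "$SUCCEED_AFTER" -gt 0 ] && [ "$LOOKUP_CALLS" -ge "$SUCCEED_AFTER" ]
}

# Same structure as the init container loop: decrement first, then
# compare with 0. Starting from 0 the counter goes -1, -2, ... and never
# equals 0, so a timeout of 0 means "wait forever".
wait_for_dns() {
  TIMEOUT_READY=$1
  while ! lookup
  do
    TIMEOUT_READY=$(( TIMEOUT_READY - 1 ))
    if [ "$TIMEOUT_READY" -eq 0 ]; then
      echo "Timed out waiting for DNS entry"
      return 1
    fi
    # sleep 1 omitted so the sketch finishes immediately
  done
  return 0
}

# Lookup succeeds on the third attempt, well inside a timeout of 10.
SUCCEED_AFTER=3 LOOKUP_CALLS=0
wait_for_dns 10 && echo "resolved after $LOOKUP_CALLS lookups"

# Lookup never succeeds; a timeout of 5 gives up after five attempts.
SUCCEED_AFTER=0 LOOKUP_CALLS=0
wait_for_dns 5 || echo "gave up after $LOOKUP_CALLS lookups"
```

Calling `wait_for_dns 0` with a lookup that never succeeds would not return, which is the behaviour the comment in the init container script describes.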

The response from kube-dns is NXDOMAIN, but that is also the case in a working setup, where nslookup still exits with status 0:

Server:     10.96.0.10
Address:    10.96.0.10:53

** server can't find example-sensu-cluster-gzwbrntcd6.example-sensu-cluster.default.svc: NXDOMAIN

*** Can't find example-sensu-cluster-gzwbrntcd6.example-sensu-cluster.default.svc: No answer
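
Because the exit status alone does not tell these cases apart, one way to compare a stuck pod with a working one is to record both the exit status and whether the output mentions NXDOMAIN. A minimal sketch: `describe_lookup` is a hypothetical helper, and the sample text is taken from the output quoted above rather than from a live query.

```shell
#!/bin/sh
# describe_lookup STATUS OUTPUT
# Summarise one nslookup run from its exit status and captured output.
# Hypothetical helper for debugging; not part of the operator.
describe_lookup() {
  if printf '%s\n' "$2" | grep -q 'NXDOMAIN'; then
    echo "exit=$1 nxdomain=yes"
  else
    echo "exit=$1 nxdomain=no"
  fi
}

# Sample line copied from the output in this issue, paired with the
# exit status 0 reported for the working setup.
sample="** server can't find example-sensu-cluster-gzwbrntcd6.example-sensu-cluster.default.svc: NXDOMAIN"
describe_lookup 0 "$sample"
```

Inside a test pod the same helper could be fed a real run, for example `output=$(nslookup "$name" 2>&1); describe_lookup $? "$output"`, where `$name` is the pod's DNS name.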

I see the issue, for example, after creating a new cluster and doing a restore:

kubectl apply -f example/example-sensu-cluster.yaml
./example/restore-operator/restore-backup --cluster-name=example-sensu-cluster --aws-bucket-name=sensu-backup-test --backup-name=sensu-cluster-backup-1531893564

If the pod gets stuck, I redo the restore operation, which usually works:

kubectl delete sensurestore example-sensu-cluster
./example/restore-operator/restore-backup --cluster-name=example-sensu-cluster --aws-bucket-name=sensu-backup-test --backup-name=sensu-cluster-backup-1531893564

schu commented 5 years ago

So the NXDOMAIN above is actually an unrelated problem, caused by https://github.com/docker-library/busybox/issues/48. But the operator uses an older image, busybox:1.28.0-glibc, so this issue is caused by something else.

To test with the same image: kubectl run --rm -ti --restart=Never --image=busybox:1.28.0-glibc busybox

iaguis commented 5 years ago

Have you tried using -type=a, as suggested in https://bugs.busybox.net/show_bug.cgi?id=11161#c4? Maybe it's worth a try?

schu commented 5 years ago

No, I haven't. I meant to say that https://github.com/docker-library/busybox/issues/48 is not an issue for us, since we use an older image, so I don't think we need the -type=a workaround.

schu commented 5 years ago

I haven't managed to reproduce the bug today, despite a few attempts.