rabbitmq / discussions

Please use the RabbitMQ mailing list for questions. Issues that are questions, discussions, or that lack the details necessary to investigate are moved to this repository.

Cannot Setup Rabbitmq container after node's docker service restart #166

Closed 01045972746 closed 3 years ago

01045972746 commented 3 years ago

Describe the bug

After the Docker service on dev-kube-node002-qew was restarted, the RabbitMQ pod cannot be initialized.

$ kubectl get po -n myns -o wide
NAME                                 READY   STATUS                  RESTARTS   AGE    IP             NODE                   NOMINATED NODE   READINESS GATES
abc-rabbitmq-server-0                1/1     Running                 0          53m    192.135.70.70   dev-kube-node005-qew   <none>           <none>
abc-rabbitmq-server-1                1/1     Running                 0          52m    192.135.68.66   dev-kube-node003-qew   <none>           <none>
abc-rabbitmq-server-2                0/1     Init:CrashLoopBackOff   8          51m    192.135.71.59   dev-kube-node002-qew   <none>           <none>

Please check this kubectl describe output.

$ kubectl describe -n myns pod abc-rabbitmq-server-2
..........................
  initContainerStatuses:
  - containerID: docker://20dca5b69e41a8451c0dff4e33c14f78b7ce427886418bcd7b4a0547154d917c
    image: curlimages/curl:latest
    imageID: docker-pullable://curlimages/curl@sha256:5329ee280d3d91f3e48885f18c884af5907b68c6aa80f411927a5a28c4f5df07
    lastState: {}
    name: copy-community-plugins
    ready: true
    restartCount: 1
    state:
      terminated:
        containerID: docker://20dca5b69e41a8451c0dff4e33c14f78b7ce427886418bcd7b4a0547154d917c
        exitCode: 0
        finishedAt: "2020-12-09T10:00:07Z"
        reason: Completed
        startedAt: "2020-12-09T10:00:06Z"
  - containerID: docker://0a1baed54312990e0b9256e5d96b6ca5ff059edfbccc39e2dc0c08969e5d62f9
    image: rabbitmq:3.8.9-management
    imageID: docker-pullable://rabbitmq@sha256:48421d944fcb6efb7508b288e0737958cc9091395ff5979b2d80ed457001f648
    lastState: {}
    name: rabbitmq
    ready: false
    restartCount: 0
    started: false
    state:
      terminated:
        containerID: docker://e81e72fb24cbf635ca4f452b8945aa727771c2ea21cf7d44e411a04ddc07d945
        exitCode: 0
        finishedAt: "2020-12-09T09:59:11Z"
        reason: Completed
        startedAt: "2020-12-09T09:22:48Z"
  hostIP: 10.113.85.11
  initContainerStatuses:
  - containerID: docker://20dca5b69e41a8451c0dff4e33c14f78b7ce427886418bcd7b4a0547154d917c
    image: curlimages/curl:latest
    imageID: docker-pullable://curlimages/curl@sha256:5329ee280d3d91f3e48885f18c884af5907b68c6aa80f411927a5a28c4f5df07
    lastState: {}
    name: copy-community-plugins
    ready: true
    restartCount: 1
    state:
      terminated:
        containerID: docker://20dca5b69e41a8451c0dff4e33c14f78b7ce427886418bcd7b4a0547154d917c
        exitCode: 0
        finishedAt: "2020-12-09T10:00:07Z"
        reason: Completed
        startedAt: "2020-12-09T10:00:06Z"
  - containerID: docker://0a1baed54312990e0b9256e5d96b6ca5ff059edfbccc39e2dc0c08969e5d62f9
    image: rabbitmq:3.8.9-management
    imageID: docker-pullable://rabbitmq@sha256:48421d944fcb6efb7508b288e0737958cc9091395ff5979b2d80ed457001f648
    lastState:
      terminated:
        containerID: docker://0a1baed54312990e0b9256e5d96b6ca5ff059edfbccc39e2dc0c08969e5d62f9
        exitCode: 2
        finishedAt: "2020-12-09T10:05:50Z"
        reason: Error
        startedAt: "2020-12-09T10:05:50Z"
    name: setup-container
    ready: false
    restartCount: 6
    state:
      waiting:
        message: back-off 5m0s restarting failed container=setup-container pod=abc-rabbitmq-server-2_myns(a20b245c-b767-4b97-b8c0-e8fe241ae5ae)
        reason: CrashLoopBackOff

Please check this error message:

$ kubectl -n myns logs abc-rabbitmq-server-2 -c setup-container
cp: cannot create regular file '/var/lib/rabbitmq/.erlang.cookie': Permission denied
cp: cannot create regular file '/operator/enabled_plugins': Permission denied
sh: 1: cannot create /var/lib/rabbitmq/.rabbitmqadmin.conf: Permission denied
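
For what it's worth, these paths are written by the operator's setup container onto the persistent volume mount, so the errors suggest the volume comes back with ownership the non-root setup container can no longer write to after the Docker restart. A possible workaround sketch, unverified: the init container name is hypothetical, and it assumes the operator names its PVC volume "persistence" and that RabbitMQ runs as UID/GID 999.

```yaml
# Hypothetical override sketch -- restores ownership of the data volume
# before the operator's setup-container runs. Assumes the operator's PVC
# volume is named "persistence" and RabbitMQ runs as UID/GID 999.
override:
  statefulSet:
    spec:
      template:
        spec:
          initContainers:
            - name: fix-volume-ownership   # hypothetical name
              image: busybox
              command: ["sh", "-c", "chown -R 999:999 /var/lib/rabbitmq"]
              securityContext:
                runAsUser: 0               # must run as root to chown
              volumeMounts:
                - name: persistence        # assumed PVC volume name
                  mountPath: /var/lib/rabbitmq
```

Whether the override merges with or replaces the operator-generated init containers depends on the operator's strategic-merge behavior, so this would need to be verified against the generated StatefulSet.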

To Reproduce

Refer to my YAML file.

apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  namespace: myns
  name: abc-rabbitmq
spec:
  service:
    type: LoadBalancer
  replicas: 3
  resources:
    requests:
      cpu: 4000m
      memory: 8Gi
    limits:
      cpu: 4000m
      memory: 8Gi
  persistence:
    storageClassName: rabbitmq-sc
    storage: 20Gi
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
            - key: "app"
              operator: In
              values:
              - abc-rabbitmq
        topologyKey: "kubernetes.io/hostname"
  override:
    statefulSet:
      spec:
        template:
          spec:
            containers:
              - name: rabbitmq
                volumeMounts:
                  - mountPath: /opt/rabbitmq/community-plugins
                    name: community-plugins
            volumes:
              - name: community-plugins
                emptyDir: {}
            initContainers:
              - command:
                  - sh
                  - -c
                  - curl -L -v https://github.com/rabbitmq/rabbitmq-delayed-message-exchange/releases/download/3.8.9/rabbitmq_delayed_message_exchange-3.8.9-0199d11c.ez --output /community-plugins/rabbitmq_delayed_message_exchange-3.8.9.ez
                image: curlimages/curl
                imagePullPolicy: IfNotPresent
                name: copy-community-plugins
                resources:
                  limits:
                    cpu: 100m
                    memory: 500Mi
                  requests:
                    cpu: 100m
                    memory: 500Mi
                terminationMessagePolicy: FallbackToLogsOnError
                volumeMounts:
                  - mountPath: /community-plugins/
                    name: community-plugins
  rabbitmq:
    additionalPlugins:
      - rabbitmq_delayed_message_exchange
    envConfig: |
      PLUGINS_DIR=/opt/rabbitmq/plugins:/opt/rabbitmq/community-plugins

And please check my StorageClass, which uses NetApp Trident.

$ kubectl get sc
NAME                  PROVISIONER             RECLAIMPOLICY   VOLUMEBINDINGMODE   ALLOWVOLUMEEXPANSION   AGE
rabbitmq-sc   csi.trident.netapp.io   Delete          Immediate           false                  24h

The first time I applied the YAML, the cluster was set up successfully. But when any such event occurs, it fails with this error.

Expected behavior

The RabbitMQ pod is set up successfully after a node restart or a Docker service restart.

Version and environment information

Gsantomaggio commented 3 years ago

Hi @01045972746, thank you for reporting the problem. To analyze it we'd need some more information; the message "restarting failed" doesn't say much on its own.

There should be something in the logs/events.

01045972746 commented 3 years ago

Thanks for your help. Please check this.

Rabbit Log

2020-12-09 09:23:17.657 [info] <0.44.0> Application rabbitmq_peer_discovery_common started on node 'rabbit@abc-rabbitmq-server-2.abc-rabbitmq-nodes.myns'
2020-12-09 09:23:17.671 [info] <0.44.0> Application rabbitmq_peer_discovery_k8s started on node 'rabbit@abc-rabbitmq-server-2.abc-rabbitmq-nodes.myns'
2020-12-09 09:23:17.689 [info] <0.839.0> Prometheus metrics: HTTP (non-TLS) listener started on port 15692
2020-12-09 09:23:17.690 [info] <0.44.0> Application rabbitmq_prometheus started on node 'rabbit@abc-rabbitmq-server-2.abc-rabbitmq-nodes.myns'
2020-12-09 09:23:17.690 [info] <0.623.0> Ready to start client connection listeners
2020-12-09 09:23:17.693 [info] <0.958.0> started TCP listener on [::]:5672
 completed with 7 plugins.
2020-12-09 09:23:18.085 [info] <0.623.0> Server startup complete; 7 plugins started.
 * rabbitmq_prometheus
 * rabbitmq_peer_discovery_k8s
 * rabbitmq_peer_discovery_common
 * rabbitmq_delayed_message_exchange
 * rabbitmq_management
 * rabbitmq_web_dispatch
 * rabbitmq_management_agent
2020-12-09 09:23:18.085 [info] <0.623.0> Resetting node maintenance status
2020-12-09 09:23:21.183 [info] <0.973.0> Feature flag `drop_unroutable_metric`: mark as enabled=state_changing
2020-12-09 09:23:21.193 [info] <0.973.0> Feature flags: list of feature flags found:
2020-12-09 09:23:21.193 [info] <0.973.0> Feature flags:   [~] drop_unroutable_metric
2020-12-09 09:23:21.193 [info] <0.973.0> Feature flags:   [ ] empty_basic_get_metric
2020-12-09 09:23:21.193 [info] <0.973.0> Feature flags:   [x] implicit_default_bindings
2020-12-09 09:23:21.193 [info] <0.973.0> Feature flags:   [x] maintenance_mode_status
2020-12-09 09:23:21.194 [info] <0.973.0> Feature flags:   [x] quorum_queue
2020-12-09 09:23:21.194 [info] <0.973.0> Feature flags:   [x] virtual_host_metadata
2020-12-09 09:23:21.194 [info] <0.973.0> Feature flags: feature flag states written to disk: yes
2020-12-09 09:23:21.288 [info] <0.975.0> Feature flag `drop_unroutable_metric`: mark as enabled=true
2020-12-09 09:23:21.323 [info] <0.975.0> Feature flags: list of feature flags found:
2020-12-09 09:23:21.324 [info] <0.975.0> Feature flags:   [x] drop_unroutable_metric
2020-12-09 09:23:21.324 [info] <0.975.0> Feature flags:   [ ] empty_basic_get_metric
2020-12-09 09:23:21.324 [info] <0.975.0> Feature flags:   [x] implicit_default_bindings
2020-12-09 09:23:21.324 [info] <0.975.0> Feature flags:   [x] maintenance_mode_status
2020-12-09 09:23:21.324 [info] <0.975.0> Feature flags:   [x] quorum_queue
2020-12-09 09:23:21.324 [info] <0.975.0> Feature flags:   [x] virtual_host_metadata
2020-12-09 09:23:21.324 [info] <0.975.0> Feature flags: feature flag states written to disk: yes
2020-12-09 09:23:21.858 [info] <0.979.0> Feature flag `empty_basic_get_metric`: mark as enabled=state_changing
2020-12-09 09:23:21.868 [info] <0.979.0> Feature flags: list of feature flags found:
2020-12-09 09:23:21.868 [info] <0.979.0> Feature flags:   [x] drop_unroutable_metric
2020-12-09 09:23:21.868 [info] <0.979.0> Feature flags:   [~] empty_basic_get_metric
2020-12-09 09:23:21.868 [info] <0.979.0> Feature flags:   [x] implicit_default_bindings
2020-12-09 09:23:21.868 [info] <0.979.0> Feature flags:   [x] maintenance_mode_status
2020-12-09 09:23:21.868 [info] <0.979.0> Feature flags:   [x] quorum_queue
2020-12-09 09:23:21.868 [info] <0.979.0> Feature flags:   [x] virtual_host_metadata
2020-12-09 09:23:21.868 [info] <0.979.0> Feature flags: feature flag states written to disk: yes
2020-12-09 09:23:22.747 [info] <0.981.0> Feature flag `empty_basic_get_metric`: mark as enabled=true
2020-12-09 09:23:22.780 [info] <0.981.0> Feature flags: list of feature flags found:
2020-12-09 09:23:22.780 [info] <0.981.0> Feature flags:   [x] drop_unroutable_metric
2020-12-09 09:23:22.780 [info] <0.981.0> Feature flags:   [x] empty_basic_get_metric
2020-12-09 09:23:22.780 [info] <0.981.0> Feature flags:   [x] implicit_default_bindings
2020-12-09 09:23:22.780 [info] <0.981.0> Feature flags:   [x] maintenance_mode_status
2020-12-09 09:23:22.780 [info] <0.981.0> Feature flags:   [x] quorum_queue
2020-12-09 09:23:22.780 [info] <0.981.0> Feature flags:   [x] virtual_host_metadata
2020-12-09 09:23:22.780 [info] <0.981.0> Feature flags: feature flag states written to disk: yes
2020-12-09 09:59:06.915 [info] <0.60.0> SIGTERM received - shutting down
2020-12-09 09:59:06.928 [warning] <0.683.0> HTTP listener registry could not find context rabbitmq_prometheus_tls
2020-12-09 09:59:06.939 [warning] <0.683.0> HTTP listener registry could not find context rabbitmq_management_tls
2020-12-09 09:59:06.947 [info] <0.273.0> Will unregister with peer discovery backend rabbit_peer_discovery_k8s
2020-12-09 09:59:06.947 [info] <0.958.0> stopped TCP listener on [::]:5672
2020-12-09 09:59:06.949 [info] <0.571.0> Closing all connections in vhost '/' on node 'rabbit@abc-rabbitmq-server-2.abc-rabbitmq-nodes.myns' because the vhost is stopping
2020-12-09 09:59:06.949 [info] <0.587.0> Stopping message store for directory '/var/lib/rabbitmq/mnesia/rabbit@abc-rabbitmq-server-2.abc-rabbitmq-nodes.myns/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L/msg_store_persistent'
2020-12-09 09:59:07.054 [info] <0.587.0> Message store for directory '/var/lib/rabbitmq/mnesia/rabbit@abc-rabbitmq-server-2.abc-rabbitmq-nodes.myns/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L/msg_store_persistent' is stopped
2020-12-09 09:59:07.055 [info] <0.583.0> Stopping message store for directory '/var/lib/rabbitmq/mnesia/rabbit@abc-rabbitmq-server-2.abc-rabbitmq-nodes.myns/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L/msg_store_transient'
2020-12-09 09:59:07.146 [info] <0.583.0> Message store for directory '/var/lib/rabbitmq/mnesia/rabbit@abc-rabbitmq-server-2.abc-rabbitmq-nodes.myns/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L/msg_store_transient' is stopped

K8S events

$ kubectl get event --namespace myns --field-selector involvedObject.name=abc-rabbitmq-server-2
LAST SEEN   TYPE      REASON    OBJECT                      MESSAGE
33s         Warning   BackOff   pod/abc-rabbitmq-server-2   Back-off restarting failed container

I wondered whether this is systemic, but other pods (e.g. Prometheus) using the same StorageClass have not hit this permission error. And the permission error did not occur when this pod was initialized for the first time. Thanks.

01045972746 commented 3 years ago

@Gsantomaggio, I tested with these docs: https://github.com/rabbitmq/cluster-operator/tree/main/docs/examples/community-plugins

I think it may be caused by a PVC re-creation issue. When I use my customized image, built with the plugin file installed:

FROM rabbitmq:3.8.9-management
RUN apt-get update
RUN apt-get install -y curl
RUN curl -L https://github.com/rabbitmq/rabbitmq-delayed-message-exchange/releases/download/3.8.9/rabbitmq_delayed_message_exchange-3.8.9-0199d11c.ez > $RABBITMQ_HOME/plugins/rabbitmq_delayed_message_exchange-3.8.9-0199d11c.ez
RUN chown rabbitmq:rabbitmq $RABBITMQ_HOME/plugins/rabbitmq_delayed_message_exchange-3.8.9-0199d11c.ez

And the corresponding RabbitmqCluster manifest:

apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  namespace: myns
  name: aab-rabbitmq
spec:
  service:
    type: LoadBalancer
  replicas: 3
  resources:
    requests:
      cpu: 4000m
      memory: 8Gi
    limits:
      cpu: 4000m
      memory: 8Gi
  persistence:
    storageClassName: rabbitmq-sc
    storage: 20Gi
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
            - key: "app"
              operator: In
              values:
              - aab-rabbitmq
        topologyKey: "kubernetes.io/hostname"
  image: {PRIVATE_URL}/{PRIVATE}/rabbitmq-delayed-message
  imagePullSecrets:
  - name: aab-secret
  rabbitmq:
    additionalPlugins:
    - rabbitmq_delayed_message_exchange

With this pod, there are no errors or issues after a Docker service restart or a node restart.

michaelklishin commented 3 years ago

All the above log says is that the node was told to shut down using a SIGTERM:

2020-12-09 09:59:06.915 [info] <0.60.0> SIGTERM received - shutting down
01045972746 commented 3 years ago

All the above log says is that the node was told to shut down using a SIGTERM:

2020-12-09 09:59:06.915 [info] <0.60.0> SIGTERM received - shutting down

Yes. That timestamp is when I stopped the Docker service. After restarting the Docker service, the pod does not return to the Running state.

Gsantomaggio commented 3 years ago

We'd need the logs during the startup (when the error happens), not the logs during the shutdown.

michaelklishin commented 3 years ago

No evidence of an issue in the Operator provided after several days => moving to a discussion.

michaelklishin commented 3 years ago

@01045972746 please provide full logs from all RabbitMQ nodes at debug level after they restart. We will not be able to help you without this information. We do not guess in this community.
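
For reference, debug-level logging can be enabled through the cluster definition; a minimal sketch, assuming the spec.rabbitmq.additionalConfig field of the RabbitmqCluster CRD and the standard rabbitmq.conf logging keys:

```yaml
# Sketch: raise the console log level to debug via the RabbitmqCluster CRD.
# additionalConfig is appended verbatim to the generated rabbitmq.conf.
spec:
  rabbitmq:
    additionalConfig: |
      log.console.level = debug
```

The logs can then be collected from each replica with kubectl -n myns logs abc-rabbitmq-server-0 (and likewise for servers 1 and 2).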

01045972746 commented 3 years ago

node 0

2020-12-12 05:30:02.134 [info] <0.604.0> rabbit on node 'rabbit@abc-rabbitmq-server-1.abc-rabbitmq-nodes.myns' down
2020-12-12 05:30:02.177 [debug] <0.908.0> POST https://kubernetes.default:443/api/v1/namespaces/myns/events/ ["{\"count\":1,\"involvedObject\":{\"apiVersion\":\"v1\",\"kind\":\"RabbitMQ\",\"name\":\"pod/abc-rabbitmq-server-0\",\"namespace\":\"myns\"},\"lastTimestamp\":\"2020-12-12T05:30:02+00:00\",\"message\":\"Node rabbit@abc-rabbitmq-server-1.abc-rabbitmq-nodes.myns is down or disconnected \",\"metadata\":{\"name\":\"abc-rabbitmq-server-0.1607751002176\",\"namespace\":\"myns\"},\"reason\":\"NodeDown\",\"source\":{\"component\":\"abc-rabbitmq-server-0/rabbitmq_peer_discovery\",\"host\":\"abc-rabbitmq-server-0\"},\"type\":\"Warning\"}"]
2020-12-12 05:30:02.187 [debug] <0.908.0> Response: [{ok,{{"HTTP/1.1",201,"Created"},[{"cache-control","no-cache, private"},{"date","Sat, 12 Dec 2020 05:30:02 GMT"},{"content-length","1130"},{"content-type","application/json"}],"{\"kind\":\"Event\",\"apiVersion\":\"v1\",\"metadata\":{\"name\":\"abc-rabbitmq-server-0.1607751002176\",\"namespace\":\"myns\",\"selfLink\":\"/api/v1/namespaces/myns/events/abc-rabbitmq-server-0.1607751002176\",\"uid\":\"a5017596-97d9-4458-b11a-4d38d20c0d6e\",\"resourceVersion\":\"10056147\",\"creationTimestamp\":\"2020-12-12T05:30:02Z\",\"managedFields\":[{\"manager\":\"unknown\",\"operation\":\"Update\",\"apiVersion\":\"v1\",\"time\":\"2020-12-12T05:30:02Z\",\"fieldsType\":\"FieldsV1\",\"fieldsV1\":{\"f:count\":{},\"f:involvedObject\":{\"f:apiVersion\":{},\"f:kind\":{},\"f:name\":{},\"f:namespace\":{}},\"f:lastTimestamp\":{},\"f:message\":{},\"f:reason\":{},\"f:source\":{\"f:component\":{},\"f:host\":{}},\"f:type\":{}}}]},\"involvedObject\":{\"kind\":\"RabbitMQ\",\"namespace\":\"myns\",\"name\":\"pod/abc-rabbitmq-server-0\",\"apiVersion\":\"v1\"},\"reason\":\"NodeDown\",\"message\":\"Node rabbit@abc-rabbitmq-server-1.abc-rabbitmq-nodes.myns is down or disconnected \",\"source\":{\"component\":\"abc-rabbitmq-server-0/rabbitmq_peer_discovery\",\"host\":\"abc-rabbitmq-server-0\"},\"firstTimestamp\":null,\"lastTimestamp\":\"2020-12-12T05:30:02Z\",\"count\":1,\"type\":\"Warning\",\"eventTime\":null,\"reportingComponent\":\"\",\"reportingInstance\":\"\"}\n"}}]
2020-12-12 05:30:02.244 [info] <0.604.0> Node rabbit@abc-rabbitmq-server-1.abc-rabbitmq-nodes.myns is down, deleting its listeners
2020-12-12 05:30:02.245 [info] <0.604.0> node 'rabbit@abc-rabbitmq-server-1.abc-rabbitmq-nodes.myns' down: connection_closed
2020-12-12 05:30:03.603 [debug] <0.600.0> Asked to [re-]register this node (rabbit@abc-rabbitmq-server-0.abc-rabbitmq-nodes.myns) with epmd...
2020-12-12 05:30:03.779 [debug] <0.600.0> [Re-]registered this node (rabbit@abc-rabbitmq-server-0.abc-rabbitmq-nodes.myns) with epmd at port 25672
2020-12-12 05:31:03.779 [debug] <0.600.0> Asked to [re-]register this node (rabbit@abc-rabbitmq-server-0.abc-rabbitmq-nodes.myns) with epmd...
2020-12-12 05:31:03.939 [debug] <0.600.0> [Re-]registered this node (rabbit@abc-rabbitmq-server-0.abc-rabbitmq-nodes.myns) with epmd at port 25672

node 2

2020-12-12 05:30:02.134 [info] <0.550.0> rabbit on node 'rabbit@abc-rabbitmq-server-1.abc-rabbitmq-nodes.myns' down
2020-12-12 05:30:02.177 [debug] <0.836.0> POST https://kubernetes.default:443/api/v1/namespaces/myns/events/ ["{\"count\":1,\"involvedObject\":{\"apiVersion\":\"v1\",\"kind\":\"RabbitMQ\",\"name\":\"pod/abc-rabbitmq-server-2\",\"namespace\":\"myns\"},\"lastTimestamp\":\"2020-12-12T05:30:02+00:00\",\"message\":\"Node rabbit@abc-rabbitmq-server-1.abc-rabbitmq-nodes.myns is down or disconnected \",\"metadata\":{\"name\":\"abc-rabbitmq-server-2.1607751002176\",\"namespace\":\"myns\"},\"reason\":\"NodeDown\",\"source\":{\"component\":\"abc-rabbitmq-server-2/rabbitmq_peer_discovery\",\"host\":\"abc-rabbitmq-server-2\"},\"type\":\"Warning\"}"]
2020-12-12 05:30:02.183 [debug] <0.836.0> Response: [{ok,{{"HTTP/1.1",201,"Created"},[{"cache-control","no-cache, private"},{"date","Sat, 12 Dec 2020 05:30:02 GMT"},{"content-length","1130"},{"content-type","application/json"}],"{\"kind\":\"Event\",\"apiVersion\":\"v1\",\"metadata\":{\"name\":\"abc-rabbitmq-server-2.1607751002176\",\"namespace\":\"myns\",\"selfLink\":\"/api/v1/namespaces/myns/events/abc-rabbitmq-server-2.1607751002176\",\"uid\":\"4a1bf2d8-e4ee-4598-8c06-b1e5dc476caf\",\"resourceVersion\":\"10056146\",\"creationTimestamp\":\"2020-12-12T05:30:02Z\",\"managedFields\":[{\"manager\":\"unknown\",\"operation\":\"Update\",\"apiVersion\":\"v1\",\"time\":\"2020-12-12T05:30:02Z\",\"fieldsType\":\"FieldsV1\",\"fieldsV1\":{\"f:count\":{},\"f:involvedObject\":{\"f:apiVersion\":{},\"f:kind\":{},\"f:name\":{},\"f:namespace\":{}},\"f:lastTimestamp\":{},\"f:message\":{},\"f:reason\":{},\"f:source\":{\"f:component\":{},\"f:host\":{}},\"f:type\":{}}}]},\"involvedObject\":{\"kind\":\"RabbitMQ\",\"namespace\":\"myns\",\"name\":\"pod/abc-rabbitmq-server-2\",\"apiVersion\":\"v1\"},\"reason\":\"NodeDown\",\"message\":\"Node rabbit@abc-rabbitmq-server-1.abc-rabbitmq-nodes.myns is down or disconnected \",\"source\":{\"component\":\"abc-rabbitmq-server-2/rabbitmq_peer_discovery\",\"host\":\"abc-rabbitmq-server-2\"},\"firstTimestamp\":null,\"lastTimestamp\":\"2020-12-12T05:30:02Z\",\"count\":1,\"type\":\"Warning\",\"eventTime\":null,\"reportingComponent\":\"\",\"reportingInstance\":\"\"}\n"}}]
2020-12-12 05:30:02.255 [info] <0.550.0> Node rabbit@abc-rabbitmq-server-1.abc-rabbitmq-nodes.myns is down, deleting its listeners
2020-12-12 05:30:02.256 [info] <0.550.0> node 'rabbit@abc-rabbitmq-server-1.abc-rabbitmq-nodes.myns' down: connection_closed
2020-12-12 05:30:13.476 [debug] <0.546.0> Asked to [re-]register this node (rabbit@abc-rabbitmq-server-2.abc-rabbitmq-nodes.myns) with epmd...
2020-12-12 05:30:13.647 [debug] <0.546.0> [Re-]registered this node (rabbit@abc-rabbitmq-server-2.abc-rabbitmq-nodes.myns) with epmd at port 25672
2020-12-12 05:31:13.647 [debug] <0.546.0> Asked to [re-]register this node (rabbit@abc-rabbitmq-server-2.abc-rabbitmq-nodes.myns) with epmd...
2020-12-12 05:31:13.792 [debug] <0.546.0> [Re-]registered this node (rabbit@abc-rabbitmq-server-2.abc-rabbitmq-nodes.myns) with epmd at port 25672

node 1 (the node whose Docker service was restarted)

2020-12-12 05:29:32.748 [info] <0.1194.0> Feature flags: feature flag states written to disk: yes
2020-12-12 05:30:01.913 [info] <0.60.0> SIGTERM received - shutting down
2020-12-12 05:30:01.926 [warning] <0.678.0> HTTP listener registry could not find context rabbitmq_prometheus_tls
2020-12-12 05:30:01.935 [warning] <0.678.0> HTTP listener registry could not find context rabbitmq_management_tls
2020-12-12 05:30:01.942 [info] <0.273.0> Will unregister with peer discovery backend rabbit_peer_discovery_k8s
2020-12-12 05:30:01.942 [info] <0.953.0> stopped TCP listener on [::]:5672
2020-12-12 05:30:01.944 [info] <0.567.0> Closing all connections in vhost '/' on node 'rabbit@abc-rabbitmq-server-1.abc-rabbitmq-nodes.myns' because the vhost is stopping
2020-12-12 05:30:01.944 [info] <0.583.0> Stopping message store for directory '/var/lib/rabbitmq/mnesia/rabbit@abc-rabbitmq-server-1.abc-rabbitmq-nodes.myns/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L/msg_store_persistent'
2020-12-12 05:30:02.044 [info] <0.583.0> Message store for directory '/var/lib/rabbitmq/mnesia/rabbit@abc-rabbitmq-server-1.abc-rabbitmq-nodes.myns/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L/msg_store_persistent' is stopped
2020-12-12 05:30:02.044 [info] <0.579.0> Stopping message store for directory '/var/lib/rabbitmq/mnesia/rabbit@abc-rabbitmq-server-1.abc-rabbitmq-nodes.myns/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L/msg_store_transient'
2020-12-12 05:30:02.131 [info] <0.579.0> Message store for directory '/var/lib/rabbitmq/mnesia/rabbit@abc-rabbitmq-server-1.abc-rabbitmq-nodes.myns/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L/msg_store_transient' is stopped

pod status

NAME                    READY   STATUS                  RESTARTS   AGE     IP             NODE                   NOMINATED NODE   READINESS GATES
abc-rabbitmq-server-0  1/1     Running                 0          12m     {PRIVATE}   {PRIVATE}   <none>           <none>
abc-rabbitmq-server-1   0/1     Init:CrashLoopBackOff   6          11m     {PRIVATE}   {PRIVATE}   <none>           <none>
abc-rabbitmq-server-2   1/1     Running                 0          9m43s   {PRIVATE}   {PRIVATE}     <none>           <none>

node 1 describe pod output

Events:
  Type     Reason                  Age                  From                     Message
  ----     ------                  ----                 ----                     -------
  Warning  FailedScheduling        3m5s                 default-scheduler        running "VolumeBinding" filter plugin for pod "abc-rabbitmq-server-1": pod has unbound immediate PersistentVolumeClaims
  Normal   Scheduled               2m43s                default-scheduler        Successfully assigned myns/abc-rabbitmq-server-1 to dev-kube-node002-ncl
  Warning  FailedScheduling        3m5s                 default-scheduler        running "VolumeBinding" filter plugin for pod "abc-rabbitmq-server-1": pod has unbound immediate PersistentVolumeClaims
  Normal   SuccessfulAttachVolume  2m43s                attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-d0ac2a62-2762-4227-9445-da2b5136d293"
  Normal   Pulled                  2m33s                kubelet                  Container image "rabbitmq:3.8.9-management" already present on machine
  Normal   Created                 2m33s                kubelet                  Created container setup-container
  Normal   Started                 2m33s                kubelet                  Started container setup-container
  Normal   Pulled                  2m32s                kubelet                  Container image "rabbitmq:3.8.9-management" already present on machine
  Normal   Created                 2m32s                kubelet                  Created container rabbitmq
  Normal   Started                 2m32s                kubelet                  Started container rabbitmq
  Warning  Unhealthy               20s (x3 over 2m20s)  kubelet                  Readiness probe failed: dial tcp {PRIVATE}:5672: connect: connection refused
  Normal   Killing                 20s                  kubelet                  Stopping container rabbitmq
  Warning  FailedPreStopHook       20s                  kubelet                  Exec lifecycle hook ([/bin/bash -c if [ ! -z "$(cat /etc/pod-info/skipPreStopChecks)" ]; then exit 0; fi; rabbitmq-upgrade await_online_quorum_plus_one -t 604800; rabbitmq-upgrade await_online_synchronized_mirror -t 604800; rabbitmq-upgrade drain -t 604800]) for Container "rabbitmq" in Pod "abc-rabbitmq-server-1_myns(d01fe635-899a-47a5-911b-93e219b0352a)" failed - error: rpc error: code = Unknown desc = cannot connect to the Docker daemon. Is 'docker daemon' running on this host?: dial unix /var/run/docker.sock: connect: no such file or directory, message: ""
  Normal   Created                 16s (x2 over 2m35s)  kubelet                  Created container copy-community-plugins
  Normal   Started                 16s (x2 over 2m35s)  kubelet                  Started container copy-community-plugins
  Normal   Pulled                  16s (x2 over 2m35s)  kubelet                  Container image "curlimages/curl" already present on machine
  Warning  Unhealthy               10s                  kubelet                  Readiness probe failed: dial tcp :5672: connect: connection refused
  Normal   SandboxChanged          5s (x2 over 20s)     kubelet                  Pod sandbox changed, it will be killed and re-created.

node 1 setup-container (init container) log

$ kubectl -n myns logs abc-rabbitmq-server-1 -c setup-container
cp: cannot create regular file '/var/lib/rabbitmq/.erlang.cookie': Permission denied
cp: cannot create regular file '/operator/enabled_plugins': Permission denied
sh: 1: cannot create /var/lib/rabbitmq/.rabbitmqadmin.conf: Permission denied

node 1 copy-community-plugins log

$ kubectl -n myns logs abc-rabbitmq-server-1 -c copy-community-plugins
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying 52.78.231.108:443...
* Connected to github.com (52.78.231.108) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*  CAfile: /cacert.pem
*  CApath: none
} [5 bytes data]
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
} [512 bytes data]
* TLSv1.3 (IN), TLS handshake, Server hello (2):
{ [122 bytes data]
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
{ [25 bytes data]
* TLSv1.3 (IN), TLS handshake, Certificate (11):
{ [2976 bytes data]
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
{ [264 bytes data]
* TLSv1.3 (IN), TLS handshake, Finished (20):
{ [36 bytes data]
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
} [1 bytes data]
* TLSv1.3 (OUT), TLS handshake, Finished (20):
} [36 bytes data]
* SSL connection using TLSv1.3 / TLS_AES_128_GCM_SHA256
* ALPN, server accepted to use http/1.1
* Server certificate:
*  subject: C=US; ST=California; L=San Francisco; O=GitHub, Inc.; CN=github.com
*  start date: May  5 00:00:00 2020 GMT
*  expire date: May 10 12:00:00 2022 GMT
*  subjectAltName: host "github.com" matched cert's "github.com"
*  issuer: C=US; O=DigiCert Inc; OU=www.digicert.com; CN=DigiCert SHA2 High Assurance Server CA
*  SSL certificate verify ok.
} [5 bytes data]
> GET /rabbitmq/rabbitmq-delayed-message-exchange/releases/download/3.8.9/rabbitmq_delayed_message_exchange-3.8.9-0199d11c.ez HTTP/1.1
> Host: github.com
> User-Agent: curl/7.73.0-DEV
> Accept: */*
>
{ [5 bytes data]
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
{ [57 bytes data]
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
{ [57 bytes data]
* old SSL session ID is stale, removing
{ [5 bytes data]
* Mark bundle as not supporting multiuse
< HTTP/1.1 302 Found
< date: Sat, 12 Dec 2020 05:28:56 GMT
< content-type: text/html; charset=utf-8
< server: GitHub.com
< status: 302 Found
< vary: X-PJAX, Accept-Encoding, Accept, X-Requested-With, Accept-Encoding
< location: https://github-production-release-asset-2e65be.s3.amazonaws.com/32327910/fb8cf200-1884-11eb-8a3a-86e8864bf669?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20201212%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20201212T052856Z&X-Amz-Expires=300&X-Amz-Signature=8e621a0a1314ff0cda2aa52ab6d14d4d351f91ef085a5a5bdc1513a7f28b7cca&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=32327910&response-content-disposition=attachment%3B%20filename%3Drabbitmq_delayed_message_exchange-3.8.9-0199d11c.ez&response-content-type=application%2Foctet-stream
< cache-control: no-cache
< strict-transport-security: max-age=31536000; includeSubdomains; preload
< x-frame-options: deny
< x-content-type-options: nosniff
< x-xss-protection: 1; mode=block
< referrer-policy: no-referrer-when-downgrade
< expect-ct: max-age=2592000, report-uri="https://api.github.com/_private/browser/errors"
{ [5 bytes data]
< content-security-policy: default-src 'none'; base-uri 'self'; block-all-mixed-content; connect-src 'self' uploads.github.com www.githubstatus.com collector.githubapp.com api.github.com github-cloud.s3.amazonaws.com github-production-repository-file-5c1aeb.s3.amazonaws.com github-production-upload-manifest-file-7fdce7.s3.amazonaws.com github-production-user-asset-6210df.s3.amazonaws.com cdn.optimizely.com logx.optimizely.com/v1/events wss://alive.github.com; font-src github.githubassets.com; form-action 'self' github.com gist.github.com; frame-ancestors 'none'; frame-src render.githubusercontent.com; img-src 'self' data: github.githubassets.com identicons.github.com collector.githubapp.com github-cloud.s3.amazonaws.com *.githubusercontent.com; manifest-src 'self'; media-src 'none'; script-src github.githubassets.com; style-src 'unsafe-inline' github.githubassets.com; worker-src github.com/socket-worker-5029ae85.js gist.github.com/socket-worker-5029ae85.js
< Set-Cookie: _gh_sess=mFpZCrjnzwgwCEkSNoqoAkewi1bOg%2FjR7uxrEcISg0ErwJm29O%2F6dduGBFPerMAKlGNfoF64KYnKiye%2FkgG%2B1nNvwglbUar8anut4ohKam2oZzn17tutx0%2FfnweyVtKFCKPdG%2FztVLui%2FlIVU1bA5axZBFawJr%2BX0i4efznQd9R439Xg3sQrBFWdIy%2FOWo9M67N4ec3EF6clXahp15oBOuc3VWUnPfyThldR0tz9a5XN6jzkAGhiI43VpDMmrz3sCGZeA6FqSULyo9AiNS3dOQ%3D%3D--TjcmOZLAJpVh3jd9--L1cBs8hVk0nGoozl7iyRUw%3D%3D; Path=/; HttpOnly; Secure; SameSite=Lax
< Set-Cookie: _octo=GH1.1.1255426443.1607751020; Path=/; Domain=github.com; Expires=Sun, 12 Dec 2021 05:30:20 GMT; Secure; SameSite=Lax
{ [5 bytes data]
< Set-Cookie: logged_in=no; Path=/; Domain=github.com; Expires=Sun, 12 Dec 2021 05:30:20 GMT; HttpOnly; Secure; SameSite=Lax
< Content-Length: 675
< X-GitHub-Request-Id: 42E5:5801:DB0DA:13D36B:5FD4556C
<
* Ignoring the response-body
{ [675 bytes data]
100   675  100   675    0     0   4753      0 --:--:-- --:--:-- --:--:--  4787
* Connection #0 to host github.com left intact
* Issue another request to this URL: 'https://github-production-release-asset-2e65be.s3.amazonaws.com/32327910/fb8cf200-1884-11eb-8a3a-86e8864bf669?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20201212%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20201212T052856Z&X-Amz-Expires=300&X-Amz-Signature=8e621a0a1314ff0cda2aa52ab6d14d4d351f91ef085a5a5bdc1513a7f28b7cca&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=32327910&response-content-disposition=attachment%3B%20filename%3Drabbitmq_delayed_message_exchange-3.8.9-0199d11c.ez&response-content-type=application%2Foctet-stream'
*   Trying 52.217.89.100:443...
* Connected to github-production-release-asset-2e65be.s3.amazonaws.com (52.217.89.100) port 443 (#1)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*  CAfile: /cacert.pem
*  CApath: none
} [5 bytes data]
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
} [512 bytes data]
* TLSv1.3 (IN), TLS handshake, Server hello (2):
{ [91 bytes data]
* TLSv1.2 (IN), TLS handshake, Certificate (11):
{ [2900 bytes data]
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
{ [333 bytes data]
* TLSv1.2 (IN), TLS handshake, Server finished (14):
{ [4 bytes data]
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
} [70 bytes data]
* TLSv1.2 (OUT), TLS change cipher, Change cipher spec (1):
} [1 bytes data]
* TLSv1.2 (OUT), TLS handshake, Finished (20):
} [16 bytes data]
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0* TLSv1.2 (IN), TLS handshake, Finished (20):
{ [16 bytes data]
* SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
* ALPN, server did not agree to a protocol
* Server certificate:
*  subject: C=US; ST=Washington; L=Seattle; O=Amazon.com, Inc.; CN=*.s3.amazonaws.com
*  start date: Nov  9 00:00:00 2019 GMT
*  expire date: Mar 12 12:00:00 2021 GMT
*  subjectAltName: host "github-production-release-asset-2e65be.s3.amazonaws.com" matched cert's "*.s3.amazonaws.com"
*  issuer: C=US; O=DigiCert Inc; OU=www.digicert.com; CN=DigiCert Baltimore CA-2 G2
*  SSL certificate verify ok.
} [5 bytes data]
> GET /32327910/fb8cf200-1884-11eb-8a3a-86e8864bf669?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20201212%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20201212T052856Z&X-Amz-Expires=300&X-Amz-Signature=8e621a0a1314ff0cda2aa52ab6d14d4d351f91ef085a5a5bdc1513a7f28b7cca&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=32327910&response-content-disposition=attachment%3B%20filename%3Drabbitmq_delayed_message_exchange-3.8.9-0199d11c.ez&response-content-type=application%2Foctet-stream HTTP/1.1
> Host: github-production-release-asset-2e65be.s3.amazonaws.com
> User-Agent: curl/7.73.0-DEV
> Accept: */*
>
{ [5 bytes data]
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< x-amz-id-2: 4SVfUbs4FriGSszLaNdWjZSnooFD8TDwaUI6Os8vCR4YhEUb4jWi2X+uFjkEXlDbvNLlAhdmgNA=
< x-amz-request-id: A677B4EA0A34C63B
< Date: Sat, 12 Dec 2020 05:30:22 GMT
< Last-Modified: Tue, 27 Oct 2020 15:48:22 GMT
< ETag: "f88f04dec9376b2a4da60d2dbed0131f"
< Content-Disposition: attachment; filename=rabbitmq_delayed_message_exchange-3.8.9-0199d11c.ez
< Accept-Ranges: bytes
< Content-Type: application/octet-stream
< Content-Length: 51047
< Server: AmazonS3
<
{ [5 bytes data]
100 51047  100 51047    0     0  28249      0  0:00:01  0:00:01 --:--:-- 49083
* Connection #1 to host github-production-release-asset-2e65be.s3.amazonaws.com left intact

If there is anything else you need, please let me know. Thanks :blush:

Gsantomaggio commented 3 years ago

The problem is: the pod has unbound immediate PersistentVolumeClaims, and node1 is not able to bind the PVC. This kind of problem is often related to the accessModes configuration.

I suppose that during the reboot the PVC is still in use, so the pod is not able to bind it. A workaround could be setting accessModes: ReadWriteMany.

There are several posts about that, by the way.

Also, restarting the docker service on a node of a running cluster generally isn't good practice.
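For reference, the accessModes setting lives in the PersistentVolumeClaim spec. A minimal sketch of such a claim (the claim name, storage class, and size below are placeholders, not taken from this cluster; ReadWriteMany also requires a storage backend that supports it, such as NFS or CephFS):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: abc-rabbitmq-data          # placeholder name
  namespace: myns
spec:
  accessModes:
    - ReadWriteMany                # instead of the default ReadWriteOnce
  storageClassName: my-rwx-class   # placeholder; must support RWX volumes
  resources:
    requests:
      storage: 10Gi
```

Note that accessModes is immutable on an existing PVC, so applying this change means deleting and recreating the claim rather than editing it in place.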

Gsantomaggio commented 3 years ago

...and please also check the PVC status while the issue is occurring:

kubectl describe pvc <PVC_NAME>

01045972746 commented 3 years ago

The problem is: the pod has unbound immediate PersistentVolumeClaims, and node1 is not able to bind the PVC. This kind of problem is often related to the accessModes configuration.

I suppose that during the reboot the PVC is still in use, so the pod is not able to bind it. A workaround could be setting accessModes: ReadWriteMany.

There are several posts about that, by the way.

Also, restarting the docker service on a node of a running cluster generally isn't good practice.

Yeah, I know it isn't a normal situation. At the service level, though, K8s nodes can be rebooted at any time when a node fails the Kubernetes health check, so I tested it intentionally.

Anyway, thank you very much for the help. When I changed the PVC to ReadWriteMany, it works. :blush:

Gsantomaggio commented 3 years ago

good :)