rabbitmq / rabbitmq-peer-discovery-k8s

Kubernetes-based peer discovery mechanism for RabbitMQ

If 3 pods start at the same time, sometimes cluster becomes partitioned #27

Closed · gjcarneiro closed this issue 6 years ago

gjcarneiro commented 6 years ago

This is my k8s Deployment:


apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  labels:
    app: api-celery-rabbit
  name: api-celery-rabbit
spec:
  replicas: 3
  revisionHistoryLimit: 2
  selector:
    matchLabels:
      app: api-celery-rabbit
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: api-celery-rabbit
    spec:
      containers:
      - command:
        - sh
        - -c
        - |
          set -e

          cat <<EOF > /etc/rabbitmq/rabbitmq.conf
          ## Clustering
          cluster_formation.peer_discovery_backend = rabbit_peer_discovery_k8s
          cluster_formation.k8s.service_name = api-celery-rabbit
          cluster_formation.k8s.address_type = ip
          cluster_formation.k8s.host = kubernetes.default
          cluster_formation.node_cleanup.interval = 10
          cluster_formation.node_cleanup.only_log_warning = false
          cluster_partition_handling = autoheal
          ## queue master locator
          queue_master_locator = min-masters
          ## enable guest user
          loopback_users.guest = false
          EOF

          echo "[rabbitmq_management,rabbitmq_peer_discovery_k8s]." > /etc/rabbitmq/enabled_plugins

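          # Stagger startup by a random 0-29 second delay so that three pods
          # booting at once are less likely to race each other forming a cluster.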
          sleep $(awk 'BEGIN {srand(); printf "%d\n", rand()*30}')
          exec docker-entrypoint.sh rabbitmq-server
        env:
        - name: RABBITMQ_VM_MEMORY_HIGH_WATERMARK
          value: "0.50"
        - name: MY_POD_IP
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.podIP
        - name: RABBITMQ_ERLANG_COOKIE
          value: secretcookiehere
        - name: RABBITMQ_NODENAME
          value: rabbit@$(MY_POD_IP)
        - name: RABBITMQ_USE_LONGNAME
          value: "true"
        image: docker.gambit/rabbitmq:3.7.3
        imagePullPolicy: IfNotPresent
        lifecycle:
          postStart:
            exec:
              command:
              - sh
              - -c
              - |
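                # Wait for the node to boot, then mirror every queue whose name
                # starts with "celery" across all cluster nodes.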
                sleep 30
                rabbitmqctl set_policy ha-all "^celery" '{"ha-mode":"all"}'
        livenessProbe:
          exec:
            command:
            - rabbitmqctl
            - status
          failureThreshold: 3
          initialDelaySeconds: 30
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 10
        name: rabbitmq
        ports:
        - containerPort: 5672
          name: amqp
          protocol: TCP
        - containerPort: 15672
          name: http
          protocol: TCP
        readinessProbe:
          exec:
            command:
            - rabbitmqctl
            - status
          failureThreshold: 3
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 10
        resources:
          limits:
            cpu: "1"
            memory: 1Gi
          requests:
            cpu: 20m
            memory: 512Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /var/lib/rabbitmq
          name: data
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: api-celery-rabbit
      serviceAccountName: api-celery-rabbit
      terminationGracePeriodSeconds: 10
      volumes:
      - emptyDir: {}
        name: data

It often needs manual nursing: restarting the pods slowly, one by one, until they are all part of the same cluster.
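
For reference, the manual workaround is roughly the following; the pod names are placeholders, since a Deployment generates random suffixes:

# recycle pods one at a time, letting each replacement rejoin before the next
kubectl get pods -l app=api-celery-rabbit
kubectl delete pod <pod-name>
kubectl exec <another-pod> -- rabbitmqctl cluster_status   # verify membership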

michaelklishin commented 6 years ago

Thank you for your time.

Team RabbitMQ uses GitHub issues for specific actionable items engineers can work on. GitHub issues are not used for questions, investigations, root cause analysis, discussions of potential issues, and so on (as defined by this team).

We get at least a dozen questions through various venues every single day, often light on details. At that rate, GitHub issues can very quickly turn into something impossible to navigate and make sense of, even for our team. Because GitHub is a tool our team uses heavily nearly every day, the signal-to-noise ratio of issues is something we care about a lot.

Please post this to rabbitmq-users.

Thank you.

michaelklishin commented 6 years ago

There is an entire documentation section on this, and a note in the Kubernetes one.

As of #23 it will be possible to use a randomized startup delay with this plugin, although we strongly recommend stateful sets (or a combination of the two, if you want to be extra sure).
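
For illustration, the two recommendations combined look roughly like the sketches below. The randomized startup delay is a pair of rabbitmq.conf keys (values in seconds; the exact range is a judgment call):

## sketch: randomized startup delay range, per the 3.7 cluster formation docs
cluster_formation.randomized_startup_delay_range.min = 5
cluster_formation.randomized_startup_delay_range.max = 60

A stateful set replaces the Deployment above with a skeleton like this one (names reused from the original manifest). With the default OrderedReady pod management policy, pods start one at a time, which avoids the simultaneous-boot race entirely:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: api-celery-rabbit
spec:
  serviceName: api-celery-rabbit    # must name a headless Service
  replicas: 3
  podManagementPolicy: OrderedReady # the default: start pods sequentially
  selector:
    matchLabels:
      app: api-celery-rabbit
  template:
    metadata:
      labels:
        app: api-celery-rabbit
    spec:
      containers:
      - name: rabbitmq
        image: docker.gambit/rabbitmq:3.7.3
        # same env, config, probes, and volumes as the Deployment above,
        # minus the hand-rolled random sleep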

michaelklishin commented 6 years ago

Your deployment also uses cluster_formation.node_cleanup.only_log_warning = false, which was accidentally included in the chart from an example that was never meant to be used as-is in production. Unless you understand what it does and its implications, consider not using it.
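
For context, only_log_warning = false tells the cleanup task to forcibly remove any node the Kubernetes API no longer reports, which can permanently eject a node that is merely restarting. A minimal sketch of the safer, log-only variant:

## sketch: keep node cleanup in log-only mode
cluster_formation.node_cleanup.interval = 30
cluster_formation.node_cleanup.only_log_warning = true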

gjcarneiro commented 6 years ago

Hm, I think the documentation has improved a lot since I last read it a few weeks ago. Thanks.