real-logic / aeron

Efficient reliable UDP unicast, UDP multicast, and IPC message transport
Apache License 2.0

[Cluster] Logchannel(MDC) destination not added when DNS name not resolved #1159

Closed: kamma-cc closed this issue 3 years ago

kamma-cc commented 3 years ago

In a Kubernetes environment, when a pod restarts, its DNS name temporarily cannot be resolved, and the pod is assigned a new IP after the restart. If the pod of the leader node restarts, the newly elected leader cannot resolve the name of the restarted node, so the log channel (MDC) destination for that node is not added. As a result, the restarted node cannot rejoin the cluster correctly.

Our workaround:

void onCanvassPosition(final long logLeadershipTermId, final long logPosition, final int followerMemberId)
{
    if (null != election)
    {
        election.onCanvassPosition(logLeadershipTermId, logPosition, followerMemberId);
    }
    else if (Cluster.Role.LEADER == role)
    {
        final ClusterMember follower = clusterMemberByIdMap.get(followerMemberId);
        if (null != follower && logLeadershipTermId <= leadershipTermId)
        {
            // Workaround: we added the two lines below to re-add the log channel
            // destination once the restarted pod is available again.
            logPublisher.removeDestination(ctx.isLogMdc(), follower.logEndpoint());
            logPublisher.addDestination(ctx.isLogMdc(), follower.logEndpoint());
            ...
        }
    }
}
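
To make the intent of the workaround easier to follow, here is a toy model of it. This is not Aeron code: `ToyDestinationTable` and its methods are hypothetical names, and the sketch assumes (as the workaround implies, but the thread does not confirm) that a destination's host name is resolved at the moment it is added. Under that assumption, an add fails while the name is unresolvable, and removing and re-adding the destination later picks up the pod's new IP.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

// Toy model only, NOT Aeron's implementation. All names here are hypothetical.
final class ToyDestinationTable
{
    // Stand-in for DNS: maps a host name to an IP, or empty if it cannot be resolved.
    private final Function<String, Optional<String>> dns;
    // Destinations that were added successfully, keyed by "host:port" endpoint.
    private final Map<String, String> resolvedByEndpoint = new HashMap<>();

    ToyDestinationTable(final Function<String, Optional<String>> dns)
    {
        this.dns = dns;
    }

    /** Resolves the host at add time; returns false (and stores nothing) if it does not resolve. */
    boolean addDestination(final String endpoint)
    {
        final int portSeparator = endpoint.lastIndexOf(':');
        if (portSeparator < 1)
        {
            throw new IllegalArgumentException("expected host:port but got: " + endpoint);
        }

        final String host = endpoint.substring(0, portSeparator);
        final String port = endpoint.substring(portSeparator + 1);
        final Optional<String> ip = dns.apply(host);
        if (!ip.isPresent())
        {
            return false;
        }

        resolvedByEndpoint.put(endpoint, ip.get() + ":" + port);
        return true;
    }

    /** Forgets the destination so that a later add resolves the host name again. */
    void removeDestination(final String endpoint)
    {
        resolvedByEndpoint.remove(endpoint);
    }

    /** The address captured when the destination was added, if it was added at all. */
    Optional<String> resolvedAddress(final String endpoint)
    {
        return Optional.ofNullable(resolvedByEndpoint.get(endpoint));
    }
}
```

In this model, calling `removeDestination` followed by `addDestination` on every canvass message is what keeps the stored address in step with DNS. The cost is that the destination is torn down and re-created even when nothing has changed.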

Is there another way to avoid this problem? Sorry for my bad English.

We use Kubernetes YAML like the following:

apiVersion: v1
kind: Service
metadata:
  name: paragon
  labels:
    app: paragon
spec:
  selector:
    app: paragon-node
  clusterIP: None
  ports:
    - port: 20000
      protocol: UDP
      targetPort: 20000
      name: ingress
    - port: 20001
      protocol: UDP
      targetPort: 20001
      name: consensus
    - port: 20002
      protocol: UDP
      targetPort: 20002
      name: log
    - port: 20003
      protocol: UDP
      targetPort: 20003
      name: catchup
    - port: 20004
      protocol: UDP
      targetPort: 20004
      name: log-control
    - port: 8010
      protocol: UDP
      targetPort: 8010
      name: archive
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: paragon-node
spec:
  selector:
    matchLabels:
      app: paragon-node
  serviceName: "paragon"
  replicas: 3
  template:
    metadata:
      labels:
        app: paragon-node
    spec:
      containers:
        - name: paragon-node
          image: paragon-node:d9fc1067bed6e0bc4a9c7c16ea60eb7529c1373e
          imagePullPolicy: Never
          env:
            - name: POD_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP
            - name: PARAGON_NODE_0_HOST
              value: paragon-node-0.paragon
            - name: PARAGON_NODE_1_HOST
              value: paragon-node-1.paragon
            - name: PARAGON_NODE_2_HOST
              value: paragon-node-2.paragon
            - name: INGRESS_PORT
              value: "20000"
            - name: CONSENSUS_PORT
              value: "20001"
            - name: LOG_PORT
              value: "20002"
            - name: CATCHUP_PORT
              value: "20003"
            - name: LOG_CONTROL_PORT
              value: "20004"
            - name: ARCHIVE_PORT
              value: "8010"
            - name: JAVA_TOOL_OPTIONS
              value: >-
                -javaagent:/lib/aeron-agent-1.32.0.jar
                -Daeron.event.log=admin
                -Daeron.event.cluster.log=all
                -Dokcoin.paragon.nodeId=auto
                -Daeron.cluster.members="0,$(PARAGON_NODE_0_HOST):$(INGRESS_PORT),$(PARAGON_NODE_0_HOST):$(CONSENSUS_PORT),$(PARAGON_NODE_0_HOST):$(LOG_PORT),$(PARAGON_NODE_0_HOST):$(CATCHUP_PORT),$(PARAGON_NODE_0_HOST):$(ARCHIVE_PORT)|1,$(PARAGON_NODE_1_HOST):$(INGRESS_PORT),$(PARAGON_NODE_1_HOST):$(CONSENSUS_PORT),$(PARAGON_NODE_1_HOST):$(LOG_PORT),$(PARAGON_NODE_1_HOST):$(CATCHUP_PORT),$(PARAGON_NODE_1_HOST):$(ARCHIVE_PORT)|2,$(PARAGON_NODE_2_HOST):$(INGRESS_PORT),$(PARAGON_NODE_2_HOST):$(CONSENSUS_PORT),$(PARAGON_NODE_2_HOST):$(LOG_PORT),$(PARAGON_NODE_2_HOST):$(CATCHUP_PORT),$(PARAGON_NODE_2_HOST):$(ARCHIVE_PORT)|"
                -Daeron.cluster.ingress.channel=aeron:udp?endpoint=0.0.0.0:$(INGRESS_PORT)
                -Daeron.archive.control.channel=aeron:udp?endpoint=$(POD_IP):$(ARCHIVE_PORT)
                -Daeron.cluster.log.channel=aeron:udp?control=$(POD_IP):$(LOG_CONTROL_PORT)
          ports:
            - containerPort: 20000
              name: ingress
              protocol: UDP
            - containerPort: 20001
              name: consensus
              protocol: UDP
            - containerPort: 20002
              name: log
              protocol: UDP
            - containerPort: 20003
              name: catchup
              protocol: UDP
            - containerPort: 20004
              name: log-control
              protocol: UDP
            - containerPort: 8010
              name: archive
              protocol: UDP
          volumeMounts:
            - name: data
              mountPath: /paragon-data
            - name: shm
              mountPath: /dev/shm
      volumes:
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 1Gi
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: [ "ReadWriteOnce" ]
        resources:
          requests:
            storage: 1Gi
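
The `aeron.cluster.members` value in the manifest above is assembled by hand from the host and port variables. As a minimal sketch, the same string could be built in code. The field order (member id, then ingress, consensus, log, catchup and archive endpoints, with entries separated by `|`) is taken from the manifest. The class and method names are hypothetical and not part of Aeron.

```java
import java.util.List;

// Hypothetical helper, not part of Aeron: builds a members string in the layout
// used by the manifest above.
final class ClusterMembersString
{
    static String build(
        final List<String> hosts,
        final int ingressPort,
        final int consensusPort,
        final int logPort,
        final int catchupPort,
        final int archivePort)
    {
        final StringBuilder sb = new StringBuilder();

        // Member ids are assigned by position in the list: 0, 1, 2, ...
        for (int memberId = 0; memberId < hosts.size(); memberId++)
        {
            final String host = hosts.get(memberId);
            sb.append(memberId)
                .append(',').append(host).append(':').append(ingressPort)
                .append(',').append(host).append(':').append(consensusPort)
                .append(',').append(host).append(':').append(logPort)
                .append(',').append(host).append(':').append(catchupPort)
                .append(',').append(host).append(':').append(archivePort)
                .append('|'); // the manifest also ends each entry, including the last, with '|'
        }

        return sb.toString();
    }
}
```

For example, passing the three `paragon-node-N.paragon` host names with ports 20000, 20001, 20002, 20003 and 8010 would produce the same value as the `-Daeron.cluster.members` option above after Kubernetes expands the `$(...)` references.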
mjpt777 commented 3 years ago

What Aeron and Java version are you using?

kamma-cc commented 3 years ago

What Aeron and Java version are you using?

Java: 11.0.10, Aeron: 1.32.0

mjpt777 commented 3 years ago

A different approach has been taken to resolve restarted or unavailable nodes and will be in the next release.