ibm-messaging / mq-helm

Apache License 2.0

Readiness probe failed: #4

Closed: aks3333 closed this issue 2 years ago.

aks3333 commented 2 years ago

Hi,

I am using the latest Helm chart to create a Native HA MQ deployment, but the readiness probe is failing.

Please help me resolve the issue, and let me know if you need more information.

Below is the output of "kubectl describe pod" and the logs of the pod.

Describe:

Name:           mq-poc1-ibm-mq-0
Namespace:      mq
Priority:       0
Node:           ip-
Start Time:     Thu, 16 Dec 2021 00:38:31 +0100
Labels:         app.kubernetes.io/instance=mq-poc1
                app.kubernetes.io/managed-by=Helm
                app.kubernetes.io/name=ibm-mq
                app.kubernetes.io/version=9.2.4.0
                controller-revision-hash=mq-poc1-ibm-mq-fcdc6dd87
                helm.sh/chart=ibm-mq-1.0.0
                statefulSetName=mq-poc1-ibm-mq
                statefulset.kubernetes.io/pod-name=mq-poc1-ibm-mq-0
Annotations:    kubernetes.io/psp: eks.privileged
Status:         Running
IP:
IPs:
  IP:
Controlled By:  StatefulSet/mq-poc1-ibm-mq
Containers:
  qmgr:
    Container ID:   docker://e7a1f1a200ccf98b9b6ac5a2b4dbae0d9a8576ef6e62f5f5fde4391fb7ab11b5
    Image:          ibmcom/mq:9.2.4.0-r1
    Image ID:       docker-pullable://ibmcom/mq@sha256:7590ea14750ecba7bd24b758dc9978d2280e880fcda6b4a996068966dea8c61d
    Ports:          1414/TCP, 9443/TCP, 9157/TCP, 9414/TCP
    Host Ports:     0/TCP, 0/TCP, 0/TCP, 0/TCP
    State:          Running
      Started:      Thu, 16 Dec 2021 00:38:41 +0100
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     2
      memory:  4Gi
    Requests:
      cpu:     2
      memory:  2Gi
    Liveness:   exec [chkmqhealthy] delay=0s timeout=5s period=10s #success=1 #failure=3
    Readiness:  exec [chkmqready] delay=0s timeout=3s period=5s #success=1 #failure=1
    Startup:    exec [chkmqstarted] delay=0s timeout=5s period=5s #success=1 #failure=24
    Environment:
      LICENSE:                                      accept
      MQ_QMGR_NAME:                                 mqpoc1
      MQ_NATIVE_HA:                                 true
      AMQ_CLOUD_PAK:                                true
      MQ_NATIVE_HA_INSTANCE_0_NAME:                 mq-poc1-ibm-mq-0
      MQ_NATIVE_HA_INSTANCE_0_REPLICATION_ADDRESS:  mq-poc1-ibm-mq-replica-0(9414)
      MQ_NATIVE_HA_INSTANCE_1_NAME:                 mq-poc1-ibm-mq-1
      MQ_NATIVE_HA_INSTANCE_1_REPLICATION_ADDRESS:  mq-poc1-ibm-mq-replica-1(9414)
      MQ_NATIVE_HA_INSTANCE_2_NAME:                 mq-poc1-ibm-mq-2
      MQ_NATIVE_HA_INSTANCE_2_REPLICATION_ADDRESS:  mq-poc1-ibm-mq-replica-2(9414)
      LOG_FORMAT:                                   basic
      MQ_ENABLE_METRICS:                            true
      DEBUG:                                        false
      MQ_ENABLE_TRACE_CRTMQDIR:                     false
      MQ_ENABLE_TRACE_CRTMQM:                       false
      MQ_EPHEMERAL_PREFIX:                          /run/mqm
      MQ_GRACE_PERIOD:                              29
    Mounts:
      /etc/mqm/mq.ini from ini-cm-helmsecurepoc1 (ro,path="mq.ini")
      /etc/mqm/mq.mqsc from mqsc-cm-helmsecurepoc1 (ro,path="mq.mqsc")
      /etc/mqm/pki/keys/ibmwebspheremqqmgr1 from ibmwebspheremqqmgr1 (ro)
      /mnt/mqm from qm (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-2s9z2 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  qm:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  qm-mq-poc1-ibm-mq-0
    ReadOnly:   false
  ibmwebspheremqqmgr1:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  qmgr1secret
    Optional:    false
  mqsc-cm-helmsecurepoc1:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      helmsecurepoc1
    Optional:  false
  ini-cm-helmsecurepoc1:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      helmsecurepoc1
    Optional:  false
  kube-api-access-2s9z2:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:
    DownwardAPI:             true
QoS Class:       Burstable
Node-Selectors:
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason                  Age                From                     Message
  Normal   Scheduled               56s                default-scheduler        Successfully assigned mq/mq-poc1-ibm-mq-0 to ip-**.eu-west-1.compute.internal
  Normal   SuccessfulAttachVolume  50s                attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-c5c0ccaa-6251-46fb-9407-6a21a9035cd2"
  Normal   Pulled                  46s                kubelet                  Container image "ibmcom/mq:9.2.4.0-r1" already present on machine
  Normal   Created                 45s                kubelet                  Created container qmgr
  Normal   Started                 45s                kubelet                  Started container qmgr
  Warning  Unhealthy               36s (x2 over 41s)  kubelet                  Startup probe failed:
  Warning  Unhealthy               5s (x5 over 25s)   kubelet                  Readiness probe failed:
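To read the probe settings out of this output, here is a minimal sketch (the helper names are hypothetical, not part of kubectl or the chart) that parses the exec-probe lines as kubectl prints them above and derives two numbers: how many consecutive failed checks each probe absorbs before acting (failureThreshold - 1), and the startup budget, which the Kubernetes documentation describes as failureThreshold × periodSeconds. It assumes the exact "exec [...] delay=... timeout=... period=... #success=... #failure=..." layout shown above; httpGet or tcpSocket probes print differently and are not handled.

```python
import re

# Matches exec-probe lines as printed by `kubectl describe pod`, e.g.
# "Readiness:  exec [chkmqready] delay=0s timeout=3s period=5s #success=1 #failure=1"
PROBE_LINE = re.compile(
    r"(?P<kind>Liveness|Readiness|Startup):\s+exec\s+\[(?P<command>[^\]]+)\]\s+"
    r"delay=(?P<delay>\d+)s\s+timeout=(?P<timeout>\d+)s\s+period=(?P<period>\d+)s\s+"
    r"#success=(?P<success>\d+)\s+#failure=(?P<failure>\d+)"
)


def parse_probe(line: str) -> dict:
    """Return the probe kind, command and integer settings from one describe line."""
    match = PROBE_LINE.search(line)
    if match is None:
        raise ValueError(f"not an exec probe line: {line!r}")
    fields = match.groupdict()
    probe = {"kind": fields["kind"], "command": fields["command"]}
    for key in ("delay", "timeout", "period", "success", "failure"):
        probe[key] = int(fields[key])
    return probe


def tolerated_failures(probe: dict) -> int:
    """Consecutive failed checks absorbed before the kubelet acts on the probe."""
    return probe["failure"] - 1


def startup_budget_seconds(probe: dict) -> int:
    """failureThreshold * periodSeconds: roughly how long startup may take."""
    return probe["failure"] * probe["period"]


if __name__ == "__main__":
    describe_lines = [
        "Liveness:   exec [chkmqhealthy] delay=0s timeout=5s period=10s #success=1 #failure=3",
        "Readiness:  exec [chkmqready] delay=0s timeout=3s period=5s #success=1 #failure=1",
        "Startup:    exec [chkmqstarted] delay=0s timeout=5s period=5s #success=1 #failure=24",
    ]
    for line in describe_lines:
        probe = parse_probe(line)
        print(probe["kind"], probe["command"], "timeout", probe["timeout"], "s,",
              "tolerates", tolerated_failures(probe), "consecutive failures")
    startup = parse_probe(describe_lines[2])
    print("startup budget:", startup_budget_seconds(startup), "s")  # 24 * 5 = 120
```

Applied to this Pod, the readiness probe tolerates zero failed checks, so a single check that fails or times out marks the Pod as not ready.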


Logs of the pod:

2021-12-15T23:38:44.180Z CPU architecture: amd64
2021-12-15T23:38:44.180Z Linux kernel version: 5.4.149-73.259.amzn2.x86_64
2021-12-15T23:38:44.180Z Container runtime: kube
2021-12-15T23:38:44.180Z Base image: Red Hat Enterprise Linux 8.5 (Ootpa)
2021-12-15T23:38:44.180Z Running as user ID 1001 with primary group 0, and supplementary groups 1000
2021-12-15T23:38:44.180Z Capabilities: none
2021-12-15T23:38:44.180Z seccomp enforcing mode: disabled
2021-12-15T23:38:44.180Z Process security attributes: none
2021-12-15T23:38:44.181Z Detected 'ext4' volume mounted to /mnt/mqm
2021-12-15T23:38:48.878Z Using queue manager name: mqpoc1
2021-12-15T23:38:48.892Z Created directory structure under /var/mqm
2021-12-15T23:38:48.892Z Image created: 2021-11-12T16:26:21+00:00
2021-12-15T23:38:48.892Z Image tag: ibm-mqadvanced-server-dev:9.2.4.0-r1.20211112161954.1f6d37a-amd64
2021-12-15T23:38:48.933Z MQ version: 9.2.4.0
2021-12-15T23:38:48.933Z MQ level: p924-L211105.DE
2021-12-15T23:38:48.933Z MQ license: Developer
2021-12-15T23:38:51.147Z Creating queue manager mqpoc1
2021-12-15T23:38:51.148Z Starting web server
2021-12-15T23:38:51.164Z Detected existing queue manager mqpoc1
2021-12-15T23:38:51.183Z Removing existing ServiceComponent configuration
2021-12-15T23:38:51.184Z Starting queue manager
2021-12-15T23:38:51.203Z AMQ6206I: Command strmqm was issued. [CommentInsert1(strmqm), CommentInsert2(strmqm -x mqpoc1)]
2021-12-15T23:38:51.275Z Initializing MQ Advanced for Developers custom authentication service
2021-12-15T23:38:51.275Z mqhtpass: MQStart options=Primary qmgr=mqpoc1
2021-12-15T23:38:51.350Z mqhtpass: MQStart options=Secondary qmgr=mqpoc1
2021-12-15T23:38:51.215Z AMQ5775I: Successfully applied automatic configuration INI definitions. [CommentInsert1(INI)]
2021-12-15T23:38:51.395Z AMQ5051I: The queue manager task 'LOG-FORMAT' has started. [ArithInsert2(1), CommentInsert1(LOG-FORMAT)]
2021-12-15T23:38:51.397Z AMQ5051I: The queue manager task 'LOGGER-IO' has started. [ArithInsert2(1), CommentInsert1(LOGGER-IO)]
2021-12-15T23:38:51.410Z AMQ5051I: The queue manager task 'NATIVE-HA' has started. [ArithInsert2(1), CommentInsert1(NATIVE-HA)]
2021-12-15T23:38:51.501Z AMQ7814I: IBM MQ queue manager running as replica instance 'mq-poc1-ibm-mq-1'. [CommentInsert2(mq-poc1-ibm-mq-1), CommentInsert3(mqpoc1)]
2021-12-15T23:38:51.516Z AMQ3208E: Native HA network connection to 'mq-poc1-ibm-mq-0' could not be established. [CommentInsert1(mq-poc1-ibm-mq-0), CommentInsert2(mq-poc1-ibm-mq-replica-0(9414)), CommentInsert3(rrcE_HOST_NOT_AVAILABLE - Remote host not available, retry later. (111) (0x6F) (mq-poc1-ibm-mq-replica-0 (9414)) (TCP/IP) (????))]
2021-12-15T23:38:51.522Z AMQ3211I: Native HA outbound connection established to 'mq-poc1-ibm-mq-2'. [CommentInsert1(mq-poc1-ibm-mq-2), CommentInsert2(mq-poc1-ibm-mq-replica-2(9414))]
2021-12-15T23:38:51.523Z AMQ3235I: Native HA instance 'mq-poc1-ibm-mq-1' is not connected to enough other instances to start the process of selecting the active instance. [ArithInsert2(1), CommentInsert1(mq-poc1-ibm-mq-1), CommentInsert2(mqpoc1), CommentInsert3(Full)]
2021-12-15T23:38:51.529Z AMQ3213I: Native HA inbound connection accepted from 'mq-poc1-ibm-mq-2'. [CommentInsert1(mq-poc1-ibm-mq-2), CommentInsert2(10.32.0.3)]
2021-12-15T23:38:51.543Z AMQ3215I: The local Native HA instance 'mq-poc1-ibm-mq-1' is now the active instance of queue manager 'mqpoc1'. [ArithInsert1(2), CommentInsert1(mq-poc1-ibm-mq-1), CommentInsert2(mqpoc1)]
2021-12-15T23:38:51.551Z AMQ7816I: IBM MQ queue manager 'mqpoc1' active instance 'mq-poc1-ibm-mq-1' has a quorum of synchronised replicas available. [CommentInsert2(mq-poc1-ibm-mq-1), CommentInsert3(mqpoc1)]
2021-12-15T23:38:51.663Z AMQ7229I: 207 log records accessed on queue manager 'mqpoc1' during the log replay phase. [ArithInsert1(207), CommentInsert1(mqpoc1)]
2021-12-15T23:38:51.664Z AMQ7230I: Log replay for queue manager 'mqpoc1' complete. [ArithInsert1(207), CommentInsert1(mqpoc1)]
2021-12-15T23:38:51.664Z AMQ5051I: The queue manager task 'CHECKPOINT' has started. [ArithInsert2(1), CommentInsert1(CHECKPOINT)]
2021-12-15T23:38:51.666Z AMQ7231I: 0 log records accessed on queue manager 'mqpoc1' during the recovery phase. [CommentInsert1(mqpoc1)]
2021-12-15T23:38:51.667Z AMQ7232I: Transaction manager state recovered for queue manager 'mqpoc1'. [CommentInsert1(mqpoc1)]
2021-12-15T23:38:51.997Z Started replica queue manager
2021-12-15T23:38:52.022Z Starting metrics gathering
2021-12-15T23:38:51.762Z mqhtpass: MQStart options=Secondary qmgr=mqpoc1
2021-12-15T23:38:51.775Z mqhtpass: MQStart options=Secondary qmgr=mqpoc1
2021-12-15T23:38:51.798Z mqhtpass: MQStart options=Secondary qmgr=mqpoc1
2021-12-15T23:38:51.849Z mqhtpass: MQStart options=Secondary qmgr=mqpoc1
2021-12-15T23:38:51.851Z mqhtpass: mqhtpass_authenticate_user without CSP user set. effectiveuid=mqm env=0, callertype=1, type=0, accttoken=102380180 applidentitydata=102380212
2021-12-15T23:38:51.857Z mqhtpass: mqhtpass_authenticate_user without CSP user set. effectiveuid=mqm env=3, callertype=1, type=0, accttoken=83583636 applidentitydata=83583668
2021-12-15T23:38:51.936Z mqhtpass: MQStart options=Secondary qmgr=mqpoc1
2021-12-15T23:38:51.938Z mqhtpass: Terminating secondary
2021-12-15T23:38:51.969Z mqhtpass: MQStart options=Secondary qmgr=mqpoc1
2021-12-15T23:38:51.971Z mqhtpass: Terminating secondary
2021-12-15T23:38:52.014Z mqhtpass: MQStart options=Secondary qmgr=mqpoc1
2021-12-15T23:38:52.027Z mqhtpass: mqhtpass_authenticate_user without CSP user set. effectiveuid=mqm env=0, callertype=1, type=0, accttoken=81986196 applidentitydata=81986228
2021-12-15T23:38:52.034Z mqhtpass: MQStart options=Secondary qmgr=mqpoc1
2021-12-15T23:38:51.702Z AMQ7467I: The oldest log file required to start queue manager mqpoc1 is S0000000.LOG. [CommentInsert1(mqpoc1), CommentInsert2(S0000000.LOG)]
2021-12-15T23:38:51.702Z AMQ7468I: The oldest log file required to perform media recovery of queue manager mqpoc1 is S0000000.LOG. [CommentInsert1(mqpoc1), CommentInsert2(S0000000.LOG)]
2021-12-15T23:38:51.702Z AMQ7233I: 0 out of 0 in-flight transactions resolved for queue manager 'mqpoc1'. [CommentInsert1(mqpoc1)]
2021-12-15T23:38:51.710Z AMQ7467I: The oldest log file required to start queue manager mqpoc1 is S0000000.LOG. [CommentInsert1(mqpoc1), CommentInsert2(S0000000.LOG)]
2021-12-15T23:38:51.710Z AMQ7468I: The oldest log file required to perform media recovery of queue manager mqpoc1 is S0000000.LOG. [CommentInsert1(mqpoc1), CommentInsert2(S0000000.LOG)]
2021-12-15T23:38:51.743Z AMQ5037I: The queue manager task 'APP-SIGNAL' has started. [ArithInsert2(3), CommentInsert1(APP-SIGNAL)]
2021-12-15T23:38:51.743Z AMQ5037I: The queue manager task 'APP-SIGNAL' has started. [ArithInsert2(1), CommentInsert1(APP-SIGNAL)]
2021-12-15T23:38:51.744Z AMQ5037I: The queue manager task 'ERROR-LOG' has started. [ArithInsert2(1), CommentInsert1(ERROR-LOG)]
2021-12-15T23:38:51.744Z AMQ5037I: The queue manager task 'APP-SIGNAL' has started. [ArithInsert2(2), CommentInsert1(APP-SIGNAL)]
2021-12-15T23:38:51.744Z AMQ5037I: The queue manager task 'APP-SIGNAL' has started. [ArithInsert2(4), CommentInsert1(APP-SIGNAL)]
2021-12-15T23:38:51.744Z AMQ5037I: The queue manager task 'APP-SIGNAL' has started. [ArithInsert2(5), CommentInsert1(APP-SIGNAL)]
2021-12-15T23:38:51.745Z AMQ5037I: The queue manager task 'APP-SIGNAL' has started. [ArithInsert2(6), CommentInsert1(APP-SIGNAL)]
2021-12-15T23:38:51.745Z AMQ5037I: The queue manager task 'APP-SIGNAL' has started. [ArithInsert2(7), CommentInsert1(APP-SIGNAL)]
2021-12-15T23:38:51.747Z AMQ5037I: The queue manager task 'APP-SIGNAL' has started. [ArithInsert2(8), CommentInsert1(APP-SIGNAL)]
2021-12-15T23:38:51.783Z AMQ8003I: IBM MQ queue manager 'mqpoc1' started using V9.2.4.0. [CommentInsert1(9.2.4.0), CommentInsert3(mqpoc1)]
2021-12-15T23:38:51.801Z AMQ5051I: The queue manager task 'DUR-SUBS-MGR' has started. [ArithInsert2(1), CommentInsert1(DUR-SUBS-MGR)]
2021-12-15T23:38:51.801Z AMQ9410I: Repository manager started.
2021-12-15T23:38:51.810Z AMQ5051I: The queue manager task 'TOPIC-TREE' has started. [ArithInsert2(1), CommentInsert1(TOPIC-TREE)]
2021-12-15T23:38:51.815Z AMQ5051I: The queue manager task 'IQM-COMMS-MANAGER' has started. [ArithInsert2(1), CommentInsert1(IQM-COMMS-MANAGER)]
2021-12-15T23:38:51.821Z AMQ5024I: The command server has started. ProcessId(246). [ArithInsert1(246), CommentInsert1(SYSTEM.CMDSERVER.1)]
2021-12-15T23:38:51.824Z AMQ5022I: The channel initiator has started. ProcessId(247). [ArithInsert1(247), CommentInsert1(SYSTEM.CHANNEL.INITQ)]
2021-12-15T23:38:51.829Z AMQ5051I: The queue manager task 'AUTOCONFIG' has started. [ArithInsert2(1), CommentInsert1(AUTOCONFIG)]
2021-12-15T23:38:51.852Z AMQ8942I: Starting to process automatic MQSC configuration script.
2021-12-15T23:38:51.859Z AMQ8024I: IBM MQ channel initiator started. [CommentInsert1(SYSTEM.CHANNEL.INITQ)]
2021-12-15T23:38:51.980Z AMQ8940E: An automatic MQSC command was not successful. [ArithInsert1(2), ArithInsert2(4001), CommentInsert1(define ql(MQPOC) usage(xmitq) trigger trigdata(MQPOC1.MQPOC) initq(SYSTEM.CHANNEL.INITQ)), CommentInsert2(AMQ8150E: IBM MQ object already exists.)]
2021-12-15T23:38:51.981Z AMQ8940E: An automatic MQSC command was not successful. [ArithInsert1(2), ArithInsert2(4092), CommentInsert1(define chl(MQPOC1.MQPOC) chltype(sdr) conname('mq-poc-ibm-mq-server(1414)') xmitq(MQPOC) SSLCIPH('TLS_RSA_WITH_AES_256_CBC_SHA')), CommentInsert2(AMQ8242E: SSLCIPH definition wrong.)]
2021-12-15T23:38:51.982Z AMQ8940E: An automatic MQSC command was not successful. [ArithInsert1(2), ArithInsert2(4001), CommentInsert1(define chl(MQPOC.MQPOC1) chltype(rcvr) SSLCIPH('ANY_TLS12_OR_HIGHER')), CommentInsert2(AMQ8150E: IBM MQ object already exists.)]
2021-12-15T23:38:51.983Z AMQ8939I: Automatic MQSC configuration script has completed, and contained 31 command(s), of which 3 had errors. [ArithInsert1(31), ArithInsert2(3), CommentInsert1(0)]
2021-12-15T23:38:51.984Z AMQ5037I: The queue manager task 'STATISTICS' has started. [ArithInsert2(1), CommentInsert1(STATISTICS)]
2021-12-15T23:38:51.984Z AMQ5037I: The queue manager task 'MARKINTSCAN' has started. [ArithInsert2(1), CommentInsert1(MARKINTSCAN)]
2021-12-15T23:38:51.985Z AMQ5037I: The queue manager task 'DEFERRED_DELIVERY' has started. [ArithInsert2(1), CommentInsert1(DEFERRED_DELIVERY)]
2021-12-15T23:38:51.985Z AMQ5037I: The queue manager task 'DEFERRED-MSG' has started. [ArithInsert2(1), CommentInsert1(DEFERRED-MSG)]
2021-12-15T23:38:51.985Z AMQ9722W: Plain text communication is enabled.
2021-12-15T23:38:51.986Z AMQ5026I: The listener 'SYSTEM.LISTENER.TCP.1' has started. ProcessId(281). [ArithInsert1(281), CommentInsert1(SYSTEM.LISTENER.TCP.1)]
2021-12-15T23:38:51.989Z AMQ5051I: The queue manager task 'MEDIA-IMAGES' has started. [ArithInsert2(1), CommentInsert1(MEDIA-IMAGES)]
2021-12-15T23:38:51.989Z AMQ5051I: The queue manager task 'RESOURCE_MONITOR' has started. [ArithInsert2(1), CommentInsert1(RESOURCE_MONITOR)]
2021-12-15T23:38:51.989Z AMQ5051I: The queue manager task 'ACTVTRC' has started. [ArithInsert2(1), CommentInsert1(ACTVTRC)]
2021-12-15T23:38:51.989Z AMQ5051I: The queue manager task 'LOGGEREV' has started. [ArithInsert2(1), CommentInsert1(LOGGEREV)]
2021-12-15T23:38:51.989Z AMQ5051I: The queue manager task 'EXPIRER' has started. [ArithInsert2(1), CommentInsert1(EXPIRER)]
2021-12-15T23:38:51.989Z AMQ5051I: The queue manager task 'Q-DELETION' has started. [ArithInsert2(1), CommentInsert1(Q-DELETION)]
2021-12-15T23:38:51.989Z AMQ5051I: The queue manager task 'PRESERVED-Q' has started. [ArithInsert2(1), CommentInsert1(PRESERVED-Q)]
2021-12-15T23:38:51.990Z AMQ5051I: The queue manager task 'ASYNCQ' has started. [ArithInsert2(1), CommentInsert1(ASYNCQ)]
2021-12-15T23:38:51.994Z AMQ5051I: The queue manager task 'MULTICAST' has started. [ArithInsert2(1), CommentInsert1(MULTICAST)]
2021-12-15T23:38:51.997Z AMQ5052I: The queue manager task 'QPUBSUB-CTRLR' has started. [ArithInsert2(1), CommentInsert1(QPUBSUB-CTRLR)]
2021-12-15T23:38:51.998Z AMQ5052I: The queue manager task 'QPUBSUB-QUEUE-NLCACHE' has started. [ArithInsert2(1), CommentInsert1(QPUBSUB-QUEUE-NLCACHE)]
2021-12-15T23:38:51.998Z AMQ5052I: The queue manager task 'QPUBSUB-SUBPT-NLCACHE' has started. [ArithInsert2(1), CommentInsert1(QPUBSUB-SUBPT-NLCACHE)]
2021-12-15T23:38:52.000Z AMQ5052I: The queue manager task 'PUBSUB-DAEMON' has started. [ArithInsert2(1), CommentInsert1(PUBSUB-DAEMON)]
2021-12-15T23:38:52.000Z AMQ5975I: 'IBM MQ Distributed Pub/Sub Controller' has started. [CommentInsert1(IBM MQ Distributed Pub/Sub Controller)]
2021-12-15T23:38:52.004Z AMQ5975I: 'IBM MQ Distributed Pub/Sub Fan Out Task' has started. [CommentInsert1(IBM MQ Distributed Pub/Sub Fan Out Task)]
2021-12-15T23:38:52.004Z AMQ5975I: 'IBM MQ Distributed Pub/Sub Command Task' has started. [CommentInsert1(IBM MQ Distributed Pub/Sub Command Task)]
2021-12-15T23:38:52.005Z AMQ5975I: 'IBM MQ Distributed Pub/Sub Publish Task' has started. [CommentInsert1(IBM MQ Distributed Pub/Sub Publish Task)]
2021-12-15T23:38:52.025Z AMQ5806I: Queued Publish/Subscribe Daemon started for queue manager mqpoc1. [CommentInsert1(mqpoc1)]
2021-12-15T23:38:52.061Z AMQ3213I: Native HA inbound connection accepted from 'mq-poc1-ibm-mq-0'. [CommentInsert1(mq-poc1-ibm-mq-0), CommentInsert2(10.38.0.1)]
2021-12-15T23:38:52.078Z AMQ3211I: Native HA outbound connection established to 'mq-poc1-ibm-mq-0'. [CommentInsert1(mq-poc1-ibm-mq-0), CommentInsert2(mq-poc1-ibm-mq-replica-0(9414))]
2021-12-15T23:39:20.616Z Started web server
2021-12-15T23:39:21.989Z AMQ5041I: The queue manager task 'AUTOCONFIG' has ended. [CommentInsert1(AUTOCONFIG)]

aks3333 commented 2 years ago

The issue was resolved by increasing the readiness probe timeout (timeoutSeconds) from 3 to 10 seconds.

callumpjackson commented 2 years ago

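To provide some additional context for future users:

In Kubernetes there are liveness, readiness and startup probes. A Native HA deployment uses all three, but this issue concerned the readiness probe. The probes share the following configuration options (from the Kubernetes documentation):

- initialDelaySeconds: number of seconds after the container has started before startup, liveness or readiness probes are initiated. Defaults to 0.
- periodSeconds: how often (in seconds) to perform the probe. Defaults to 10.
- timeoutSeconds: number of seconds after which the probe times out. Defaults to 1.
- successThreshold: minimum consecutive successes for the probe to be considered successful after having failed. Defaults to 1, and must be 1 for liveness and startup probes.
- failureThreshold: number of consecutive failures after which Kubernetes considers the check to have failed. Defaults to 3.

In this case, my understanding is that the Pod was repeatedly marked as not ready because the readiness probe's failureThreshold was reached. (A failing readiness probe removes the Pod from Service endpoints but does not restart it; the describe output above shows Restart Count: 0.) The readiness probe had failureThreshold set to 1 and timeoutSeconds set to 3, so a single check that did not complete within 3 seconds was enough to mark the Pod as not ready. With these settings we have seen probe failures when the container is under heavy load.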
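To make the failureThreshold behaviour concrete, here is a simplified model (the function `ready_states` is hypothetical, written for illustration, and is not a Kubernetes API) of how consecutive check results turn into the Pod's ready flag. It assumes the rule described above: the state only changes once a run of identical results reaches successThreshold or failureThreshold, and a readiness probe reports not ready until its first success.

```python
from typing import List, Optional


def ready_states(results: List[bool], success_threshold: int = 1,
                 failure_threshold: int = 3) -> List[bool]:
    """Simplified readiness model: return the ready flag after each check.

    results[i] is True if check i succeeded (exit code 0 within timeoutSeconds).
    """
    ready = False                      # not ready until the first success
    last_result: Optional[bool] = None  # result of the previous check
    run = 0                            # length of the current run of identical results
    states = []
    for ok in results:
        run = run + 1 if ok == last_result else 1
        last_result = ok
        if ok and run >= success_threshold:
            ready = True
        elif not ok and run >= failure_threshold:
            ready = False
        states.append(ready)
    return states


if __name__ == "__main__":
    checks = [True, True, False, True]  # one slow or failed check in the middle
    print(ready_states(checks, failure_threshold=1))  # [True, True, False, True]
    print(ready_states(checks, failure_threshold=3))  # [True, True, True, True]
```

With failureThreshold set to 1 the single failed check flips the Pod to not ready; with the default of 3 the same blip is absorbed.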

We have found that a failureThreshold of 3 is normally adequate (hence the default), but this needs to be tested and tuned for your environment. In the case above, the user instead increased timeoutSeconds, which is another valid approach.
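To compare the two remedies, the sketch below combines a timeout rule (a check that has not finished within timeoutSeconds counts as a failure, assuming the cluster enforces exec probe timeouts) with a simplified threshold model, and replays one series of chkmqready run times. The durations are hypothetical values invented for illustration, not measurements from this deployment, and the helper names are likewise hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def check_results(durations: List[float], timeout_seconds: float) -> List[bool]:
    """A check succeeds only if it finishes before timeoutSeconds elapses."""
    return [d < timeout_seconds for d in durations]


def ready_states(results: List[bool], success_threshold: int = 1,
                 failure_threshold: int = 3) -> List[bool]:
    """Simplified threshold model: state flips only after a long enough run."""
    ready, run, states = False, 0, []
    last_result: Optional[bool] = None
    for ok in results:
        run = run + 1 if ok == last_result else 1
        last_result = ok
        if ok and run >= success_threshold:
            ready = True
        elif not ok and run >= failure_threshold:
            ready = False
        states.append(ready)
    return states


def not_ready_checks(durations: List[float], timeout_seconds: float,
                     failure_threshold: int) -> int:
    """Count checks after which the Pod would be reported not ready."""
    states = ready_states(check_results(durations, timeout_seconds),
                          failure_threshold=failure_threshold)
    return states.count(False)


if __name__ == "__main__":
    # Hypothetical chkmqready run times (seconds) while the container is busy.
    durations = [1.0, 3.4, 1.1, 4.2, 3.8, 1.0]
    configs: Dict[str, Tuple[float, int]] = {
        "original (timeout=3, failureThreshold=1)": (3, 1),
        "raise failureThreshold (timeout=3, failureThreshold=3)": (3, 3),
        "raise timeout (timeout=10, failureThreshold=1)": (10, 1),
    }
    for name, (timeout, threshold) in configs.items():
        print(name, "->", not_ready_checks(durations, timeout, threshold), "not-ready checks")
```

Both changes keep the Pod ready for this trace, but they trade off differently: a higher failureThreshold delays detection of a genuinely unresponsive queue manager by roughly (failureThreshold - 1) × periodSeconds, while a longer timeoutSeconds lets each slow check finish but means a hung check takes longer to be declared failed.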