Running MQ HA on more then 3 replicas

schmiuwe commented 1 year ago

Hi Callum,

since we are not able to use PDB I simulated this by extending the cluster nodes to total 5 nodes. After this I scaled up the MQ replicas from 3 to 5. Looking at it it does not seem to work. Can you confirm that the whole MQ HA setup only works with 3 replicas? Or could we drive it with e.g. 5 replicas? If it would work I could set max-surge to 1 to only use 1 node to drain during a cluster upgrade. This would ensure that always 1 active MQ replic would be there.

Currently cluster upgrade with 3 pods and setting max-surge to 1, so total 4 nodes during cluster upgrade only draining 1 node at a time does not work. There will be a pending MQ pod and a container creating, the pending one will stay pending, resulting in only one pod left and this one cannot get active because of the quorum, means I was detecting a downtime of around 2 min or more and several times during a cluster upgrade.

Thank you, Uwe

callumpjackson commented 1 year ago

Hi Uwe - Native HA currently only supports 3 running containers, if you want to scale IBM MQ then you would deploy multiple queue managers. Where each could be a Native HA queue manager. I'm not sure I understand the logic above, but let me explain how a cluster upgrade would normally happen.

A Native HA queue manager has deployed three containers
Each container is running on a separate worker node
Two out of the three containers always need to be running
Regardless of the number of worker nodes if you upgrade one at a time, you will keep the two out of three containers will be running during the upgrade.
We the active/leader container is stopped there will be a short window when the queue manager is unavailable but this should be measured in seconds, not the minutes you mention above.

Thanks

schmiuwe commented 1 year ago

Hi Callum,

regarding the cluster upgrade:

I updated the cluster with this command: az aks nodepool update --name kaas --resource-group mqint-kaas --cluster-name mqint-cluster --min-count 3 --max-count 3 --max-surge 1 (this ensures 3 nodes where the 3 MQ pods are running on and it ensures that during the Azure cluster upgrade process only one node gets drained at a time)
Cluster upgrade I initiated like this: az aks upgrade --resource-group mqint-kaas --name mqint-cluster --kubernetes-version 1.25.5

The Azure cluster upgrade runs like it should be:

It created a fourth node with the new cluster version, grabs one of the three old nodes and drains it.
After this it deletes the just drained old node.
Then the process repeats until all three nodes have been swapped and are running on the new cluster version.

What does not work is MQ HA, one pod during this procedure is getting into pending mode. So when the next one node is getting swapped then there is only one pod left and the quorum is not given anymore. Therefore I have observed multiple time that no active MQ pod was available anymore during this upgrade process – what tells me that MQ does not behave properly. The topic now is not about whether another pod takes over anymore, it is more about that cluster upgrade with max-surge set to 1 is also not working with MQ.

The last test would be to drain the nodes manually but this is actually not what we want.

Did you test this case on your end already?

Thank you, Uwe

From: callumpjackson @.> Sent: Friday, June 2, 2023 3:47 PM To: ibm-messaging/mq-helm @.> Cc: Schmiedel Uwe, FG-232 @.>; Author @.> Subject: Re: [ibm-messaging/mq-helm] Running MQ HA on more then 3 replicas (Issue #46)

Sent from outside the BMW organization - be CAUTIOUS, particularly with links and attachments.

Absender außerhalb der BMW Organisation - Bitte VORSICHT beim Öffnen von Links und Anhängen.

Hi Uwe - Native HA currently only supports 3 running containers, if you want to scale IBM MQ then you would deploy multiple queue managers. Where each could be a Native HA queue manager. I'm not sure I understand the logic above, but let me explain how a cluster upgrade would normally happen.

A Native HA queue manager has deployed three containers
Each container is running on a separate worker node
Two out of the three containers always need to be running
Regardless of the number of worker nodes if you upgrade one at a time, you will keep the two out of three containers will be running during the upgrade.
We the active/leader container is stopped there will be a short window when the queue manager is unavailable but this should be measured in seconds, not the minutes you mention above.

Thanks

— Reply to this email directly, view it on GitHubhttps://github.com/ibm-messaging/mq-helm/issues/46#issuecomment-1573768510, or unsubscribehttps://github.com/notifications/unsubscribe-auth/A6UGXELD676DD2U6FWIPKQLXJHVG3ANCNFSM6AAAAAAYYKE4TI. You are receiving this because you authored the thread.Message ID: @.**@.>>

schmiuwe commented 1 year ago

Hi Callum,

I did a manual cluster upgrade today:

Created new node pool
Upgraded control plane
Upgraded new node pool to new version
Uncordened one old node where an inactive MQ instance is running on.
Drained this old node -> resulted in this pod getting and staying in pending state
Deleted old node

In this state it is not possible to continue. Did you really test this scenario already?

I am slowly coming to a point where we cannot use the whole setup since we do not achieve HA and we do not want to perform blue/green deployments. If the setup is not working for cluster upgrades we might consider to stop our whole MQ HA cloud initiative …

Thank you, Uwe

From: Schmiedel Uwe, FG-232 Sent: Monday, June 5, 2023 8:30 AM To: 'ibm-messaging/mq-helm' @.>; ibm-messaging/mq-helm @.> Cc: Author @.***> Subject: RE: [ibm-messaging/mq-helm] Running MQ HA on more then 3 replicas (Issue #46)

Hi Callum,

regarding the cluster upgrade:

I updated the cluster with this command: az aks nodepool update --name kaas --resource-group mqint-kaas --cluster-name mqint-cluster --min-count 3 --max-count 3 --max-surge 1 (this ensures 3 nodes where the 3 MQ pods are running on and it ensures that during the Azure cluster upgrade process only one node gets drained at a time)
Cluster upgrade I initiated like this: az aks upgrade --resource-group mqint-kaas --name mqint-cluster --kubernetes-version 1.25.5

The Azure cluster upgrade runs like it should be:

It created a fourth node with the new cluster version, grabs one of the three old nodes and drains it.
After this it deletes the just drained old node.
Then the process repeats until all three nodes have been swapped and are running on the new cluster version.

What does not work is MQ HA, one pod during this procedure is getting into pending mode. So when the next one node is getting swapped then there is only one pod left and the quorum is not given anymore. Therefore I have observed multiple time that no active MQ pod was available anymore during this upgrade process – what tells me that MQ does not behave properly. The topic now is not about whether another pod takes over anymore, it is more about that cluster upgrade with max-surge set to 1 is also not working with MQ.

The last test would be to drain the nodes manually but this is actually not what we want.

Did you test this case on your end already?

Thank you, Uwe

From: callumpjackson @.**@.>> Sent: Friday, June 2, 2023 3:47 PM To: ibm-messaging/mq-helm @.**@.>> Cc: Schmiedel Uwe, FG-232 @.**@.>>; Author @.**@.>> Subject: Re: [ibm-messaging/mq-helm] Running MQ HA on more then 3 replicas (Issue #46)

Sent from outside the BMW organization - be CAUTIOUS, particularly with links and attachments.

Absender außerhalb der BMW Organisation - Bitte VORSICHT beim Öffnen von Links und Anhängen.

Hi Uwe - Native HA currently only supports 3 running containers, if you want to scale IBM MQ then you would deploy multiple queue managers. Where each could be a Native HA queue manager. I'm not sure I understand the logic above, but let me explain how a cluster upgrade would normally happen.

A Native HA queue manager has deployed three containers
Each container is running on a separate worker node
Two out of the three containers always need to be running
Regardless of the number of worker nodes if you upgrade one at a time, you will keep the two out of three containers will be running during the upgrade.
We the active/leader container is stopped there will be a short window when the queue manager is unavailable but this should be measured in seconds, not the minutes you mention above.

Thanks

— Reply to this email directly, view it on GitHubhttps://github.com/ibm-messaging/mq-helm/issues/46#issuecomment-1573768510, or unsubscribehttps://github.com/notifications/unsubscribe-auth/A6UGXELD676DD2U6FWIPKQLXJHVG3ANCNFSM6AAAAAAYYKE4TI. You are receiving this because you authored the thread.Message ID: @.**@.>>

arthurbarr commented 1 year ago

I don't think I'm quite clear on the state of the queue manager in your most recent scenario.

However, I can say that having a single node pool across multiple zones has the side effect of creating a race condition, where if the node upgrade happens faster than MQ can regain quorum, then you can kill a second instance of the queue manager and have a failure. PodDisruptionBudget would in theory help here, but we can't currently use PodDisruptionBudget with MQ Native HA, without needing manual intervention during a cluster upgrade.

What I think might work, and match the OpenShift model which we have tested with, is to have a node pool (machine set in OpenShift) for each zone, and thus updating a node pool will cause disruption to a single zone, and there won't be a race condition.

callumpjackson commented 1 year ago

Hi Uwe – thanks for the additional context, the example you provided was helpful. The built-in Azure AKS nodepool upgrade process does appear to be rigid, and we may need to discuss the options available. To effectively do this I’m wondering if a call may assist and will ping you via email. We will update the issue with the conclusions to assist the community.

callumpjackson commented 1 year ago

Closing this issue as we have addressed with a sample and documentation here.

ibm-messaging / mq-helm

Running MQ HA on more then 3 replicas #46