vromero / activemq-artemis-helm

Helm chart for a cluster of ActiveMQ Artemis (Work in progress)

After failover, master and slave both alive #22

Open · adnklauser opened 6 years ago

adnklauser commented 6 years ago

**Scenario:** Restart one master node (`kubectl delete pod xxx`) to simulate a service interruption.

**Expected behaviour:** The slave becomes active immediately, and when the master is back up (restarted by k8s) and synchronized, there is still only one active ActiveMQ Artemis instance for that master/slave pair.

**Actual behaviour:** The slave becomes active immediately (✔️), but after k8s restarts the master pod, it, too, is considered active (❌), at least from the perspective of k8s (1/1 pods ready). The consequence is that k8s routes requests to both master and slave (via the service DNS).

**Additional information:** I haven't really tested much beyond this observation, so I don't know whether the master node would actually have responded to requests. But I find it a bit odd that the system doesn't return to its original state after a failover.

The Artemis HA documentation suggests using `<allow-failback>true</allow-failback>` on the slave and `<check-for-live-server>true</check-for-live-server>` on the master. I must confess I don't understand why the chart explicitly configures the opposite, but my experience with Artemis is very limited so far.
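For reference, a minimal broker.xml sketch of what the documentation describes (the `<ha-policy>`/`<replication>` structure is from the Artemis HA docs; how it maps onto this chart's ConfigMap layout is my assumption):

```xml
<!-- master broker.xml (inside <core>): before activating, ask the cluster
     whether another live server is already running for this node -->
<ha-policy>
  <replication>
    <master>
      <check-for-live-server>true</check-for-live-server>
    </master>
  </replication>
</ha-policy>

<!-- slave broker.xml (inside <core>): hand activation back to the
     original master once it returns and has synchronized -->
<ha-policy>
  <replication>
    <slave>
      <allow-failback>true</allow-failback>
    </slave>
  </replication>
</ha-policy>
```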

DanSalt commented 5 years ago

Hi @adnklauser

Yes, you're right. I've just checked the latest version of the charts we have locally, and we do indeed have the flags you mention set. The other main difference between our charts and the ones here is in the shared ConfigMap, where it sets `<address>jms</address>`, which only distributes messages whose addresses start with 'jms'; for us that didn't include everything. The default in these charts should probably be blank as the generic case (to include everything).
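Something like this in the shared broker.xml, assuming the chart uses a standard `<cluster-connection>` block (the connector and discovery-group names below are placeholders, not the chart's actual values):

```xml
<cluster-connections>
  <cluster-connection name="artemis-cluster">
    <!-- empty address filter: distribute messages for all addresses;
         <address>jms</address> would match only addresses prefixed with "jms" -->
    <address></address>
    <connector-ref>netty-connector</connector-ref>
    <discovery-group-ref discovery-group-name="artemis-discovery"/>
  </cluster-connection>
</cluster-connections>
```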

I'll work on a PR for these charts. Hope this helps!

andrusstrockiy commented 4 years ago

We faced the same issue when we tried to run the above chart without any persistence for the live (master) node.

Even with `<allow-failback>true</allow-failback>` on the slave and `<check-for-live-server>true</check-for-live-server>` on the masters: after a restart, the new master pod starts from scratch and forms a new cluster (apparently, without a persistent data dir, the broker.xml configuration is ignored completely). Hence you get a split brain with two running masters:

- the master from the old cluster formation (slave0 took over its role)
- the newly formed master0, which started its own cluster after the restart

Conclusions:

1. Don't try this chart without persistent storage in your cluster, even with the options above.
2. This is an Artemis problem; verified with 2.10.

To reproduce on a local setup: form a cluster, then remove the data dir for the live node 0 (leaving broker.xml in place) and start a new live server.

Workaround: in case of split brain, recreate the slave once again and keep an eye on your cluster formation.
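For anyone checking their own setup: these are the broker.xml data directories that must live on a PersistentVolume for a restarted master to recognize the existing cluster formation (the paths below are illustrative; the actual mount point depends on your deployment):

```xml
<!-- all four data directories should point at persistent storage -->
<paging-directory>/var/lib/artemis/data/paging</paging-directory>
<bindings-directory>/var/lib/artemis/data/bindings</bindings-directory>
<journal-directory>/var/lib/artemis/data/journal</journal-directory>
<large-messages-directory>/var/lib/artemis/data/large-messages</large-messages-directory>
```
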
chandras-xl commented 4 years ago

Are there any updates on this issue? I want to use this Helm chart in k8s production, but the aforementioned issue still exists; as a workaround I am deleting the slave pod whenever the master restarts. I also tried adding `<check-for-live-server>true</check-for-live-server>` and `<allow-failback>true</allow-failback>` to the respective master and slave ConfigMap files, but it still doesn't work. Can we expect an upgraded Helm chart with proper failover and failback?

andrusstrockiy commented 4 years ago

@chandras-xl The issue is not with the chart but with the Artemis cluster configuration itself. If you don't have any kind of persistent storage inside your k8s cluster, move your Artemis cluster formation to a virtual machine, i.e. run it as a Docker image (docker-compose) or as a daemon.

chandras-xl commented 4 years ago

@andrusstrockiy Thank you! Failover and failback worked after using persistent storage on my k8s cluster.