vert-x3 / vertx-hazelcast

Hazelcast Cluster Manager for Vert.x
Apache License 2.0

Cluster manager is corrupted after merging of hazelcast partition #90

Closed michalsida closed 6 years ago

michalsida commented 6 years ago

I think the problem is that Vert.x uses the UUID of the Hazelcast local endpoint as a constant, unique identifier of the node. See io.vertx.spi.cluster.hazelcast.HazelcastClusterManager#getNodeID and the initialization of the nodeID field: nodeID = hazelcast.getLocalEndpoint().getUuid() [in io.vertx.spi.cluster.hazelcast.HazelcastClusterManager#join, called from io.vertx.core.impl.VertxImpl#VertxImpl()]

This nodeID is used to register the node in the multimap of topic subscribers, "__vertx.subs", and it is also used, for example, to remove the subscribers of disconnected nodes (the lambda in io.vertx.core.eventbus.impl.clustered.ClusteredEventBus#setClusterViewChangedHandler).

But it looks like this UUID is regenerated in some situations, e.g. during the merging of Hazelcast partitions; see com.hazelcast.instance.Node#setNewLocalMember.

After that, Hazelcast knows the new node UUID, but Vert.x still registers topics under the old value; I did not find any place where the nodeID would be updated. The lambda from io.vertx.core.eventbus.impl.clustered.ClusteredEventBus#setClusterViewChangedHandler will then remove the subscribers of this node from the subscriber multimap.

I added a logging mechanism that compares the nodeID from Hazelcast with the one from the Vert.x HazelcastClusterManager every 30 s:

final ClusterManager clusterManager = ((VertxImpl) vertx).getClusterManager();
final String currentNodeID = ((VertxImpl) vertx).getNodeID();

if (clusterManager instanceof HazelcastClusterManager) {
    // Ask Hazelcast directly for the UUID it currently assigns to this member
    final HazelcastInstance hazelcast = ((HazelcastClusterManager) clusterManager).getHazelcastInstance();
    final String currentHazelcastNodeID = hazelcast.getLocalEndpoint().getUuid();
    // A mismatch means Vert.x still holds the UUID captured when it joined
    if (!currentNodeID.equals(currentHazelcastNodeID)) {
        getLogger().error("Hazelcast local endpoint {} UUID {} differs from Vertx NodeId {}",
                hazelcast.getLocalEndpoint().getSocketAddress().toString(),
                currentHazelcastNodeID, currentNodeID);
    }
}

After the Hazelcast cluster merge, this appears in the log:

TID: [2018-06-21 15:01:39,947] WARN [c.h.i.c.i.DiscoveryJoiner] (hz.MCI_SERVICE_CAMPAIGN.cached.thread-11) [] - [10.148.250.33]:5703 [hazelcast-consul-discovery-spi] [3.8.2] [10.148.250.33]:5703 is merging [tcp/ip] to [10.148.250.34]:5702
TID: [2018-06-21 15:01:39,973] WARN [c.h.i.c.i.o.MergeClustersOperation] (hz.MCI_SERVICE_CAMPAIGN.cached.thread-11) [] - [10.148.250.33]:5703 [hazelcast-consul-discovery-spi] [3.8.2] [10.148.250.33]:5703 is merging to [10.148.250.34]:5702, because: instructed by master [10.148.250.33]:5703
TID: [2018-06-21 15:01:39,977] INFO [c.c.m.l.c.m.h.l.NodeLifecycleListener] (hz.MCI_SERVICE_CAMPAIGN.cached.thread-17) [] - Hazelcast state changed: LifecycleEvent [state=MERGING]
TID: [2018-06-21 15:01:39,978] WARN [c.hazelcast.instance.Node] (hz.MCI_SERVICE_CAMPAIGN.cached.thread-17) [] - [10.148.250.33]:5703 [hazelcast-consul-discovery-spi] [3.8.2] Setting new local member. old uuid: 82ffa5f9-f059-48be-be16-7528c547fdd8 new uuid: 2446732d-70df-4201-bccb-7bec82f384fd
TID: [2018-06-21 15:01:46,082] DEBUG [c.c.m.l.c.m.h.l.ClusterMembershipListener] (hz.MCI_SERVICE_CAMPAIGN.event-7) [] - Hazelcast member added: MembershipEvent {member=Member [10.148.250.34]:5702 - 455989de-a9bc-4964-83d1-ec463bdda952,type=added}
TID: [2018-06-21 15:01:46,082] DEBUG [c.c.m.l.c.m.h.l.ClusterMembershipListener] (hz.MCI_SERVICE_CAMPAIGN.event-7) [] - Hazelcast member added: MembershipEvent {member=Member [10.148.250.37]:5701 - efef4cfe-8463-4e3b-aa34-eca29b0b6157,type=added}
TID: [2018-06-21 15:01:46,082] DEBUG [c.c.m.l.c.m.h.l.ClusterMembershipListener] (hz.MCI_SERVICE_CAMPAIGN.event-7) [] - Hazelcast member added: MembershipEvent {member=Member [10.148.238.196]:5701 - 7efdcfcb-5460-4e8d-ac61-1ac1a8eaba8b,type=added}
TID: [2018-06-21 15:01:46,082] DEBUG [c.c.m.l.c.m.h.l.ClusterMembershipListener] (hz.MCI_SERVICE_CAMPAIGN.event-7) [] - Hazelcast member added: MembershipEvent {member=Member [10.148.250.34]:5706 - 26cf5948-8718-4230-a5bc-1b9ee0ed6015,type=added}
TID: [2018-06-21 15:01:46,082] DEBUG [c.c.m.l.c.m.h.l.ClusterMembershipListener] (hz.MCI_SERVICE_CAMPAIGN.event-7) [] - Hazelcast member added: MembershipEvent {member=Member [10.148.250.34]:5707 - dcfdeb3d-6ebb-474d-80bc-9bfece2d771a,type=added}
TID: [2018-06-21 15:01:46,082] DEBUG [c.c.m.l.c.m.h.l.ClusterMembershipListener] (hz.MCI_SERVICE_CAMPAIGN.event-7) [] - Hazelcast member added: MembershipEvent {member=Member [10.148.238.196]:5702 - aaf66b6b-4026-452c-80d4-cd6cd15fa3a9,type=added}
TID: [2018-06-21 15:01:46,083] DEBUG [c.c.m.l.c.m.h.l.ClusterMembershipListener] (hz.MCI_SERVICE_CAMPAIGN.event-7) [] - Hazelcast member added: MembershipEvent {member=Member [10.148.250.34]:5708 - 72ee2e95-f237-4687-9b1e-973c9cd427b6,type=added}
TID: [2018-06-21 15:01:46,083] DEBUG [c.c.m.l.c.m.h.l.ClusterMembershipListener] (hz.MCI_SERVICE_CAMPAIGN.event-7) [] - Hazelcast member added: MembershipEvent {member=Member [10.148.250.34]:5709 - 170ab501-1b1a-48b1-ad7f-e4cbe12fa5dc,type=added}
TID: [2018-06-21 15:01:46,083] DEBUG [c.c.m.l.c.m.h.l.ClusterMembershipListener] (hz.MCI_SERVICE_CAMPAIGN.event-7) [] - Hazelcast member added: MembershipEvent {member=Member [10.148.238.196]:5703 - ee16f3e5-88f8-4fa4-9efb-a06749ee0996,type=added}
TID: [2018-06-21 15:01:46,083] DEBUG [c.c.m.l.c.m.h.l.ClusterMembershipListener] (hz.MCI_SERVICE_CAMPAIGN.event-7) [] - Hazelcast member added: MembershipEvent {member=Member [10.148.250.34]:5710 - 9f9b3669-4d8b-42c8-b724-52286908f6e0,type=added}
TID: [2018-06-21 15:01:46,083] DEBUG [c.c.m.l.c.m.h.l.ClusterMembershipListener] (hz.MCI_SERVICE_CAMPAIGN.event-7) [] - Hazelcast member added: MembershipEvent {member=Member [10.148.238.196]:5706 - e14bd98a-060a-460f-b352-2fb39399101a,type=added}
TID: [2018-06-21 15:01:46,083] DEBUG [c.c.m.l.c.m.h.l.ClusterMembershipListener] (hz.MCI_SERVICE_CAMPAIGN.event-7) [] - Hazelcast member added: MembershipEvent {member=Member [10.148.250.34]:5711 - eaeeca1d-5c20-4847-8d34-388ed2167f4c,type=added}
TID: [2018-06-21 15:01:46,087] DEBUG [c.c.m.l.c.m.h.l.ClusterMembershipListener] (hz.MCI_SERVICE_CAMPAIGN.event-7) [] - Hazelcast member added: MembershipEvent {member=Member [10.148.238.196]:5704 - 800be4ec-4921-46c8-b20e-067fe4ac3f84,type=added}
TID: [2018-06-21 15:01:46,088] DEBUG [c.c.m.l.c.m.h.l.ClusterMembershipListener] (hz.MCI_SERVICE_CAMPAIGN.event-7) [] - Hazelcast member added: MembershipEvent {member=Member [10.148.238.196]:5705 - 80e49d13-6de8-409e-85ac-59f17deb8f9e,type=added}
TID: [2018-06-21 15:01:46,088] DEBUG [c.c.m.l.c.m.h.l.ClusterMembershipListener] (hz.MCI_SERVICE_CAMPAIGN.event-7) [] - Hazelcast member added: MembershipEvent {member=Member [10.148.250.34]:5703 - fa85141d-b02c-4078-91c6-ed66cd176452,type=added}
TID: [2018-06-21 15:01:46,088] DEBUG [c.c.m.l.c.m.h.l.ClusterMembershipListener] (hz.MCI_SERVICE_CAMPAIGN.event-7) [] - Hazelcast member added: MembershipEvent {member=Member [10.148.250.34]:5704 - b82add35-1d7e-43c8-9388-fdfd997f4121,type=added}
TID: [2018-06-21 15:01:46,088] DEBUG [c.c.m.l.c.m.h.l.ClusterMembershipListener] (hz.MCI_SERVICE_CAMPAIGN.event-7) [] - Hazelcast member added: MembershipEvent {member=Member [10.148.250.34]:5701 - 39e7ae3e-bb62-47c1-8da7-190669c058ef,type=added}
TID: [2018-06-21 15:01:46,088] DEBUG [c.c.m.l.c.m.h.l.ClusterMembershipListener] (hz.MCI_SERVICE_CAMPAIGN.event-7) [] - Hazelcast member added: MembershipEvent {member=Member [10.148.250.34]:5705 - 69b6f627-3840-4b6e-9705-dfd158e64dc3,type=added}
TID: [2018-06-21 15:01:46,088] DEBUG [c.c.m.l.c.m.h.l.ClusterMembershipListener] (hz.MCI_SERVICE_CAMPAIGN.event-7) [] - Hazelcast member added: MembershipEvent {member=Member [10.148.238.196]:5707 - 92d52042-b967-4075-b4b9-f9023bba2d49,type=added}
TID: [2018-06-21 15:01:46,088] DEBUG [c.c.m.l.c.m.h.l.ClusterMembershipListener] (hz.MCI_SERVICE_CAMPAIGN.event-7) [] - Hazelcast member added: MembershipEvent {member=Member [10.148.238.196]:5708 - 09771edf-56fd-496e-a958-725fd4120357,type=added}
TID: [2018-06-21 15:01:46,088] DEBUG [c.c.m.l.c.m.h.l.ClusterMembershipListener] (hz.MCI_SERVICE_CAMPAIGN.event-7) [] - Hazelcast member added: MembershipEvent {member=Member [10.148.238.196]:5709 - 4cd1905b-2e71-4bbd-8733-5f7e5268f30d,type=added}
TID: [2018-06-21 15:01:46,088] DEBUG [c.c.m.l.c.m.h.l.ClusterMembershipListener] (hz.MCI_SERVICE_CAMPAIGN.event-7) [] - Hazelcast member added: MembershipEvent {member=Member [10.148.238.196]:5710 - 366ebd54-714c-4ed4-8637-03cdadac87fe,type=added}
TID: [2018-06-21 15:01:46,088] DEBUG [c.c.m.l.c.m.h.l.ClusterMembershipListener] (hz.MCI_SERVICE_CAMPAIGN.event-7) [] - Hazelcast member added: MembershipEvent {member=Member [10.148.250.33]:5701 - 764ba439-4812-418d-975c-0c3ad4a84b0f,type=added}
TID: [2018-06-21 15:01:46,089] DEBUG [c.c.m.l.c.m.h.l.ClusterMembershipListener] (hz.MCI_SERVICE_CAMPAIGN.event-7) [] - Hazelcast member added: MembershipEvent {member=Member [10.148.250.33]:5709 - 23f95919-4682-4003-be2e-cac876b47f70,type=added}
TID: [2018-06-21 15:01:46,089] DEBUG [c.c.m.l.c.m.h.l.ClusterMembershipListener] (hz.MCI_SERVICE_CAMPAIGN.event-7) [] - Hazelcast member added: MembershipEvent {member=Member [10.148.250.33]:5702 - 2991f9e2-72d8-49e2-9135-b3dd964fe53d,type=added}
TID: [2018-06-21 15:01:46,325] DEBUG [c.c.m.l.c.m.h.l.ClusterMigrationListener] (hz.MCI_SERVICE_CAMPAIGN.event-6) [] - Hazelcast migration started: MigrationEvent{partitionId=0, status=STARTED, oldOwner=Member [10.148.250.33]:5701 - 764ba439-4812-418d-975c-0c3ad4a84b0f, newOwner=Member [10.148.250.33]:5703 - 2446732d-70df-4201-bccb-7bec82f384fd this}
TID: [2018-06-21 15:01:46,359] INFO [c.c.m.l.c.m.h.l.ClusterMigrationListener] (hz.MCI_SERVICE_CAMPAIGN.event-6) [] - Hazelcast migration completed: MigrationEvent{partitionId=0, status=COMPLETED, oldOwner=Member [10.148.250.33]:5701 - 764ba439-4812-418d-975c-0c3ad4a84b0f, newOwner=Member [10.148.250.33]:5703 - 2446732d-70df-4201-bccb-7bec82f384fd this}
TID: [2018-06-21 15:01:46,646] DEBUG [c.c.m.l.c.m.h.l.ClusterMigrationListener] (hz.MCI_SERVICE_CAMPAIGN.event-8) [] - Hazelcast migration started: MigrationEvent{partitionId=52, status=STARTED, oldOwner=Member [10.148.238.196]:5710 - 366ebd54-714c-4ed4-8637-03cdadac87fe, newOwner=Member [10.148.250.33]:5703 - 2446732d-70df-4201-bccb-7bec82f384fd this}
TID: [2018-06-21 15:01:46,656] INFO [c.c.m.l.c.m.h.l.ClusterMigrationListener] (hz.MCI_SERVICE_CAMPAIGN.event-8) [] - Hazelcast migration completed: MigrationEvent{partitionId=52, status=COMPLETED, oldOwner=Member [10.148.238.196]:5710 - 366ebd54-714c-4ed4-8637-03cdadac87fe, newOwner=Member [10.148.250.33]:5703 - 2446732d-70df-4201-bccb-7bec82f384fd this}
TID: [2018-06-21 15:01:46,685] DEBUG [c.c.m.l.c.m.h.l.ClusterMigrationListener] (hz.MCI_SERVICE_CAMPAIGN.event-10) [] - Hazelcast migration started: MigrationEvent{partitionId=64, status=STARTED, oldOwner=Member [10.148.250.34]:5701 - 39e7ae3e-bb62-47c1-8da7-190669c058ef, newOwner=Member [10.148.250.33]:5703 - 2446732d-70df-4201-bccb-7bec82f384fd this}
TID: [2018-06-21 15:01:46,693] INFO [c.c.m.l.c.m.h.l.ClusterMigrationListener] (hz.MCI_SERVICE_CAMPAIGN.event-10) [] - Hazelcast migration completed: MigrationEvent{partitionId=64, status=COMPLETED, oldOwner=Member [10.148.250.34]:5701 - 39e7ae3e-bb62-47c1-8da7-190669c058ef, newOwner=Member [10.148.250.33]:5703 - 2446732d-70df-4201-bccb-7bec82f384fd this}
TID: [2018-06-21 15:01:46,755] DEBUG [c.c.m.l.c.m.h.l.ClusterMigrationListener] (hz.MCI_SERVICE_CAMPAIGN.event-9) [] - Hazelcast migration started: MigrationEvent{partitionId=33, status=STARTED, oldOwner=Member [10.148.238.196]:5703 - ee16f3e5-88f8-4fa4-9efb-a06749ee0996, newOwner=Member [10.148.250.33]:5703 - 2446732d-70df-4201-bccb-7bec82f384fd this}
TID: [2018-06-21 15:01:46,755] INFO [c.c.m.l.c.m.h.l.ClusterMigrationListener] (hz.MCI_SERVICE_CAMPAIGN.event-9) [] - Hazelcast migration completed: MigrationEvent{partitionId=33, status=COMPLETED, oldOwner=Member [10.148.238.196]:5703 - ee16f3e5-88f8-4fa4-9efb-a06749ee0996, newOwner=Member [10.148.250.33]:5703 - 2446732d-70df-4201-bccb-7bec82f384fd this}
TID: [2018-06-21 15:01:46,755] DEBUG [c.c.m.l.c.m.h.l.ClusterMigrationListener] (hz.MCI_SERVICE_CAMPAIGN.event-9) [] - Hazelcast migration started: MigrationEvent{partitionId=58, status=STARTED, oldOwner=Member [10.148.250.34]:5706 - 26cf5948-8718-4230-a5bc-1b9ee0ed6015, newOwner=Member [10.148.250.33]:5703 - 2446732d-70df-4201-bccb-7bec82f384fd this}
TID: [2018-06-21 15:01:46,755] INFO [c.c.m.l.c.m.h.l.ClusterMigrationListener] (hz.MCI_SERVICE_CAMPAIGN.event-9) [] - Hazelcast migration completed: MigrationEvent{partitionId=58, status=COMPLETED, oldOwner=Member [10.148.250.34]:5706 - 26cf5948-8718-4230-a5bc-1b9ee0ed6015, newOwner=Member [10.148.250.33]:5703 - 2446732d-70df-4201-bccb-7bec82f384fd this}
TID: [2018-06-21 15:01:46,779] DEBUG [c.c.m.l.c.m.h.l.ClusterMigrationListener] (hz.MCI_SERVICE_CAMPAIGN.event-9) [] - Hazelcast migration started: MigrationEvent{partitionId=103, status=STARTED, oldOwner=Member [10.148.250.34]:5703 - fa85141d-b02c-4078-91c6-ed66cd176452, newOwner=Member [10.148.250.33]:5703 - 2446732d-70df-4201-bccb-7bec82f384fd this}
TID: [2018-06-21 15:01:46,788] INFO [c.c.m.l.c.m.h.l.ClusterMigrationListener] (hz.MCI_SERVICE_CAMPAIGN.event-9) [] - Hazelcast migration completed: MigrationEvent{partitionId=103, status=COMPLETED, oldOwner=Member [10.148.250.34]:5703 - fa85141d-b02c-4078-91c6-ed66cd176452, newOwner=Member [10.148.250.33]:5703 - 2446732d-70df-4201-bccb-7bec82f384fd this}
TID: [2018-06-21 15:01:46,804] DEBUG [c.c.m.l.c.m.h.l.ClusterMigrationListener] (hz.MCI_SERVICE_CAMPAIGN.event-10) [] - Hazelcast migration started: MigrationEvent{partitionId=124, status=STARTED, oldOwner=Member [10.148.250.37]:5701 - efef4cfe-8463-4e3b-aa34-eca29b0b6157, newOwner=Member [10.148.250.33]:5703 - 2446732d-70df-4201-bccb-7bec82f384fd this}
TID: [2018-06-21 15:01:46,822] INFO [c.c.m.l.c.m.h.l.ClusterMigrationListener] (hz.MCI_SERVICE_CAMPAIGN.event-10) [] - Hazelcast migration completed: MigrationEvent{partitionId=124, status=COMPLETED, oldOwner=Member [10.148.250.37]:5701 - efef4cfe-8463-4e3b-aa34-eca29b0b6157, newOwner=Member [10.148.250.33]:5703 - 2446732d-70df-4201-bccb-7bec82f384fd this}
TID: [2018-06-21 15:01:46,875] DEBUG [c.c.m.l.c.m.h.l.ClusterMigrationListener] (hz.MCI_SERVICE_CAMPAIGN.event-10) [] - Hazelcast migration started: MigrationEvent{partitionId=169, status=STARTED, oldOwner=Member [10.148.238.196]:5701 - 7efdcfcb-5460-4e8d-ac61-1ac1a8eaba8b, newOwner=Member [10.148.250.33]:5703 - 2446732d-70df-4201-bccb-7bec82f384fd this}
TID: [2018-06-21 15:01:46,892] INFO [c.c.m.l.c.m.h.l.ClusterMigrationListener] (hz.MCI_SERVICE_CAMPAIGN.event-10) [] - Hazelcast migration completed: MigrationEvent{partitionId=169, status=COMPLETED, oldOwner=Member [10.148.238.196]:5701 - 7efdcfcb-5460-4e8d-ac61-1ac1a8eaba8b, newOwner=Member [10.148.250.33]:5703 - 2446732d-70df-4201-bccb-7bec82f384fd this}
TID: [2018-06-21 15:01:46,892] DEBUG [c.c.m.l.c.m.h.l.ClusterMigrationListener] (hz.MCI_SERVICE_CAMPAIGN.event-8) [] - Hazelcast migration started: MigrationEvent{partitionId=187, status=STARTED, oldOwner=Member [10.148.250.34]:5708 - 72ee2e95-f237-4687-9b1e-973c9cd427b6, newOwner=Member [10.148.250.33]:5703 - 2446732d-70df-4201-bccb-7bec82f384fd this}
TID: [2018-06-21 15:01:46,907] INFO [c.c.m.l.c.m.h.l.ClusterMigrationListener] (hz.MCI_SERVICE_CAMPAIGN.event-8) [] - Hazelcast migration completed: MigrationEvent{partitionId=187, status=COMPLETED, oldOwner=Member [10.148.250.34]:5708 - 72ee2e95-f237-4687-9b1e-973c9cd427b6, newOwner=Member [10.148.250.33]:5703 - 2446732d-70df-4201-bccb-7bec82f384fd this}
TID: [2018-06-21 15:01:46,928] DEBUG [c.c.m.l.c.m.h.l.ClusterMigrationListener] (hz.MCI_SERVICE_CAMPAIGN.event-6) [] - Hazelcast migration started: MigrationEvent{partitionId=200, status=STARTED, oldOwner=Member [10.148.238.196]:5705 - 80e49d13-6de8-409e-85ac-59f17deb8f9e, newOwner=Member [10.148.250.33]:5703 - 2446732d-70df-4201-bccb-7bec82f384fd this}
TID: [2018-06-21 15:01:46,944] INFO [c.c.m.l.c.m.h.l.ClusterMigrationListener] (hz.MCI_SERVICE_CAMPAIGN.event-6) [] - Hazelcast migration completed: MigrationEvent{partitionId=200, status=COMPLETED, oldOwner=Member [10.148.238.196]:5705 - 80e49d13-6de8-409e-85ac-59f17deb8f9e, newOwner=Member [10.148.250.33]:5703 - 2446732d-70df-4201-bccb-7bec82f384fd this}
TID: [2018-06-21 15:01:47,279] ERROR [c.c.m.s.c.Application] (vert.x-eventloop-thread-0) [] - Hazelcast local endpoint /10.148.250.33:5703 UUID 2446732d-70df-4201-bccb-7bec82f384fd differs from Vertx NodeId 82ffa5f9-f059-48be-be16-7528c547fdd8

But newly registered subscribers are still registered under 82ffa5f9-f059-48be-be16-7528c547fdd8. I registered some subscribers after this operation, and the MultiMap then contains, for example:

{
  "key": "topic-getCampaignMaterials",
  "values": [
    {
      "serverId": "10.148.250.34:15702", -- subscriber from another node
      "nodeId": "455989de-a9bc-4964-83d1-ec463bdda952"
    },
    {
      "serverId": "10.148.250.33:15703", -- Hazelcast port + 10000
      "nodeId": "82ffa5f9-f059-48be-be16-7528c547fdd8"
    }
  ]
}

But 82ffa5f9-f059-48be-be16-7528c547fdd8 is not in the list of Hazelcast members; the only UUID listed for [10.148.250.33]:5703 is 2446732d-70df-4201-bccb-7bec82f384fd. And if nodes are added to or removed from the cluster, I think the lambda in io.vertx.core.eventbus.impl.clustered.ClusteredEventBus#setClusterViewChangedHandler will remove these subscribers.
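To make the suspected failure mode concrete, here is a minimal sketch in plain Java. It is only a model: ordinary collections stand in for the "__vertx.subs" MultiMap and for the Hazelcast member list, and the names StaleNodeIdModel and pruneUnknownNodes are hypothetical, not Vert.x or Hazelcast API. It assumes the cleanup simply drops every entry whose nodeId is not a current member.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Simplified model only: plain collections replace the Hazelcast MultiMap
// and member list. None of these names exist in Vert.x or Hazelcast.
class StaleNodeIdModel {

    // Removes every subscriber nodeId that is not a current cluster member
    // (the assumed effect of the cluster-view-changed cleanup).
    // Returns the number of removed entries.
    static int pruneUnknownNodes(Map<String, List<String>> subs, Set<String> memberUuids) {
        int removed = 0;
        for (List<String> nodeIds : subs.values()) {
            int before = nodeIds.size();
            nodeIds.removeIf(id -> !memberUuids.contains(id));
            removed += before - nodeIds.size();
        }
        return removed;
    }

    public static void main(String[] args) {
        // Topic registered by another node and by this node under its OLD UUID.
        Map<String, List<String>> subs = new HashMap<>();
        subs.put("topic-getCampaignMaterials", new ArrayList<>(Arrays.asList(
                "455989de-a9bc-4964-83d1-ec463bdda952",
                "82ffa5f9-f059-48be-be16-7528c547fdd8")));

        // After the merge, Hazelcast lists this node under its NEW UUID only.
        Set<String> members = new HashSet<>(Arrays.asList(
                "455989de-a9bc-4964-83d1-ec463bdda952",
                "2446732d-70df-4201-bccb-7bec82f384fd"));

        int removed = pruneUnknownNodes(subs, members);
        System.out.println("removed=" + removed + " remaining=" + subs);
    }
}
```

In this model the entry registered under the old UUID is the one that gets dropped, even though the node itself is still alive in the cluster under its new UUID.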

The subscribers registered earlier from this node are lost as well, because MultiMap recovery after a merge is implemented only in recent Hazelcast releases. I tried the latest Hazelcast release: MultiMap recovery is possibly solved there, but the problem with the stale nodeID/UUID remains, so the subscribers are removed from the map by Vert.x anyway.

Shouldn't the nodeID be updated after a Hazelcast merge notification?

Versions used: Vert.x 3.5.2, Hazelcast 3.8.2 (and 3.10.2)

Link to original discussion topic.

tsegismont commented 6 years ago

@michalsida thank you, great report!

rvega-arg commented 6 years ago

I can confirm I'm having the same issue with Vert.x 3.5.3 and Hazelcast 3.8.2 (and 3.10.4).

@michalsida did you find any workaround?

michalsida commented 6 years ago

@rvega-arg My workaround is a timer that checks the internal state of the topic multimap once per minute. If it detects that any of the node's own registered topics is missing from the multimap, it unregisters all of the member's topics and registers them again.
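The check described here could be sketched roughly as follows. This is not the actual snippet, only a plain-Java model under stated assumptions: a Map of Sets stands in for the topic multimap, and SubscriptionWatchdogModel, missingTopics and checkAndRepair are hypothetical names. In a real application, one tick of checkAndRepair would presumably be driven by a periodic timer (for example a Vert.x periodic timer firing once per minute), and the unregister/register steps would go through the event bus consumers rather than touching the map directly.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Plain-Java model of the watchdog idea. The map stands in for the topic
// multimap (topic -> nodeIds). All names here are hypothetical.
class SubscriptionWatchdogModel {

    // Returns the node's own topics that no longer list this nodeId.
    static Set<String> missingTopics(Set<String> ownTopics,
                                     Map<String, Set<String>> subs,
                                     String nodeId) {
        Set<String> missing = new TreeSet<>();
        for (String topic : ownTopics) {
            Set<String> nodeIds = subs.get(topic);
            if (nodeIds == null || !nodeIds.contains(nodeId)) {
                missing.add(topic);
            }
        }
        return missing;
    }

    // One watchdog tick: if any own topic is missing, unregister all own
    // topics and register them again. Returns true if a repair happened.
    static boolean checkAndRepair(Set<String> ownTopics,
                                  Map<String, Set<String>> subs,
                                  String nodeId) {
        if (missingTopics(ownTopics, subs, nodeId).isEmpty()) {
            return false;
        }
        for (String topic : ownTopics) {          // unregister everything
            Set<String> nodeIds = subs.get(topic);
            if (nodeIds != null) {
                nodeIds.remove(nodeId);
            }
        }
        for (String topic : ownTopics) {          // register everything again
            subs.computeIfAbsent(topic, t -> new HashSet<>()).add(nodeId);
        }
        return true;
    }
}
```

Re-registering everything rather than only the missing topics mirrors the description above; it is cruder, but it does not depend on knowing which entries survived the merge.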

rvega-arg commented 6 years ago

Another issue related to multimaps https://github.com/hazelcast/hazelcast/issues/13559

Birmania commented 6 years ago

@michalsida First, I want to thank you for this analysis. We are currently encountering the same problem in our project.

Question: your workaround would be really useful in our context. If I understand correctly, your watchdog iterates over the subs multimap every minute to check that each topic owned and registered by the current verticle still exists and, if one is missing, you unregister and re-register the whole verticle?

However, would it be sufficient to base the watchdog solely on comparing the Vert.x UUID with the Hazelcast UUID? Can we assume that differing UUIDs are always the result of a split-brain merge? If so, it seems simpler than checking the subs multimap, but I could be wrong. The only drawback I see is that it could redeploy your verticle even if it does not consume any topic on the clustered event bus...

Thanks for your answer/help!

tsegismont commented 6 years ago

@Birmania see #95 , this should be fixed in 3.6

michalsida commented 6 years ago

@Birmania Yes, that is exactly how I did it. I can send a code snippet that covers this. Maybe checking the node UUID would be sufficient, but I was glad to find a working solution, so I am keeping it that way.

@tsegismont Great, I am looking forward to this version

tsegismont commented 6 years ago

@michalsida it would be great if you could give the snapshot version a try. In any case, thanks again for the thorough analysis, it was a great contribution!

Birmania commented 6 years ago

@tsegismont Excellent news about the fix! However, we need a workaround in order to deliver to a client in 4 weeks.

@michalsida Thanks for the answer, I am really interested in your snippet. How can we exchange it?

tsegismont commented 6 years ago

@Birmania the fix has been backported to the 3.5 branch. Vert.x 3.5.4 should be out in the next couple of weeks.

michalsida commented 6 years ago

@tsegismont I hope to try it soon and give feedback.

@Birmania Look at this snippet. It's a little ugly and there are some references to our other code (and some references were removed before posting), but I hope it illustrates my approach; it works for our purposes.

Birmania commented 6 years ago

@michalsida Thanks a lot for this snippet, it will be very useful for us!

@tsegismont Cool! Thanks for the tip about the upcoming backport. Edit: I checked the pom and it does not use the 3.10 version of Hazelcast (the one that handles MultiMap in a merge). Will the subscribers map be OK after the split-brain merge?