Closed: michalsida closed this issue 6 years ago
@michalsida thank you, great report!
I can confirm I'm having the same issue with Vert.x 3.5.3 and Hazelcast 3.8.2 (and 3.10.4).
@michalsida did you find any workaround?
@rvega-arg My workaround is a timer that checks the internal state of the topic multimap once per minute. If it detects that any of the node's own registered topics is missing from the multimap, it unregisters all of the member's topics and registers them again.
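The core check of such a watchdog can be sketched as follows. This is a simplified model and not the code used in the thread: the subscriber multimap is represented by a plain `Map` from topic to node IDs, and the class name `SubscriptionWatchdogSketch`, its methods, and the `reRegisterAll` callback are all hypothetical. In a real deployment the map would be the Hazelcast multimap and the callback would unregister and re-register the event bus consumers.

```java
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/**
 * Sketch of the watchdog check. The subscriber multimap is modelled as a
 * plain Map (topic -> node IDs); nothing here is Vert.x or Hazelcast API.
 */
final class SubscriptionWatchdogSketch {

    /** Returns the own topics that no longer list this node in the cluster-wide map. */
    static Set<String> missingTopics(Map<String, ? extends Collection<String>> clusterSubs,
                                     Set<String> ownTopics,
                                     String nodeId) {
        Set<String> missing = new LinkedHashSet<>();
        for (String topic : ownTopics) {
            Collection<String> nodes = clusterSubs.get(topic);
            if (nodes == null || !nodes.contains(nodeId)) {
                missing.add(topic);
            }
        }
        return missing;
    }

    /**
     * One watchdog tick: if any own topic is missing, run the (hypothetical)
     * re-registration action. Returns true when a repair was triggered.
     */
    static boolean checkAndRepair(Map<String, ? extends Collection<String>> clusterSubs,
                                  Set<String> ownTopics,
                                  String nodeId,
                                  Runnable reRegisterAll) {
        if (missingTopics(clusterSubs, ownTopics, nodeId).isEmpty()) {
            return false;
        }
        reRegisterAll.run();
        return true;
    }
}
```

A tick like `checkAndRepair` would typically be driven by a periodic timer, for example `vertx.setPeriodic(60_000, id -> ...)`, to match the once-per-minute interval mentioned above.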
Another issue related to multimaps: https://github.com/hazelcast/hazelcast/issues/13559
@michalsida First, I want to thank you for this analysis. We are currently encountering the same problem in our project.
Question: your workaround would be really useful in our context. If I understand correctly, your watchdog iterates over the subs multimap every minute to check that the topics owned and registered by the current verticle still exist and, if one is missing, you unregister and re-register the whole verticle?
However, would it be sufficient to base the watchdog solely on comparing the Vert.x UUID with the Hazelcast UUID? Can we assume that differing UUIDs are always the result of a split-brain merge? If so, it seems simpler than checking the subs multimap, but I could be wrong. The only drawback I see is that it could redeploy your verticle even if you do not consume any topic on the clustered event bus...
Thanks for your answer and help!
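The simpler UUID comparison proposed in this comment could look roughly like the sketch below. It assumes the live UUID is read through a `Supplier<String>`; in a real setup that supplier would wrap `hazelcast.getLocalEndpoint().getUuid()`, and the remembered value would correspond to what `HazelcastClusterManager#getNodeID` returns. The class name `NodeIdDriftCheck` and its method are hypothetical.

```java
import java.util.function.Supplier;

/**
 * Sketch: remembers the node UUID seen at join time and reports whether the
 * live UUID has since changed. The Supplier stands in for the Hazelcast call.
 */
final class NodeIdDriftCheck {

    private final String idCapturedAtJoin;
    private final Supplier<String> liveId;

    NodeIdDriftCheck(Supplier<String> liveId) {
        this.liveId = liveId;
        // Mirrors the one-time assignment of nodeID during join.
        this.idCapturedAtJoin = liveId.get();
    }

    /** True when the live UUID differs from the one captured at join time. */
    boolean hasDrifted() {
        return !idCapturedAtJoin.equals(liveId.get());
    }
}
```

This check is cheaper than inspecting the multimap, but, as the comment notes, it fires on any UUID change, whether or not the node has consumers registered on the clustered event bus.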
@Birmania see #95; this should be fixed in 3.6.
@Birmania Yes, I did it exactly that way. I can send a code snippet that covers this. Checking the node UUID might be sufficient, but I was lucky to find a working solution, so I am keeping it that way.
@tsegismont Great, I am looking forward to this version
@michalsida it would be great if you could give the snapshot version a try. In any case, thanks again for the thorough analysis, it was a great contribution!
@tsegismont Excellent news about the fix! However, we need a workaround to deliver a client in 4 weeks.
@michalsida Thanks for the answer, I am really interested in your snippet. How can we exchange it?
@Birmania the fix has been backported to the 3.5 branch. Vert.x 3.5.4 should be out in the next couple of weeks.
@tsegismont I hope to try it soon and give feedback.
@Birmania Look at this snippet. It's a little ugly, and there are some references to our other code (and some references were removed before posting), but I hope it illustrates my approach; it works for our purposes.
@michalsida Thanks a lot for this snippet, it will be very useful for us!
@tsegismont Cool! Thanks for the tip about the upcoming backport. Edit: I checked the pom and it does not use the 3.10 version of Hazelcast (which handles MultiMap in merges). Will the subscribers map be OK after the split-brain merge?
I think the problem is that Vert.x uses the UUID of Hazelcast's local endpoint as a constant unique identifier of the node; see io.vertx.spi.cluster.hazelcast.HazelcastClusterManager#getNodeID and the initialization of the nodeID field: nodeID = hazelcast.getLocalEndpoint().getUuid() (in io.vertx.spi.cluster.hazelcast.HazelcastClusterManager#join, called from io.vertx.core.impl.VertxImpl#VertxImpl()).
This nodeID is used to register the node in the multimap of topic subscribers, "__vertx.subs", and it is also used, for example, to remove the subscribers of disconnected nodes (the lambda in io.vertx.core.eventbus.impl.clustered.ClusteredEventBus#setClusterViewChangedHandler).
But it looks like this UUID is regenerated in some situations, e.g. during the merging of Hazelcast partitions; see com.hazelcast.instance.Node#setNewLocalMember.
After that, Hazelcast knows the new node UUID, but Vert.x still registers topics under the old value; I did not find any place where the nodeID would be updated. The lambda from io.vertx.core.eventbus.impl.clustered.ClusteredEventBus#setClusterViewChangedHandler will then remove the subscribers for this node from the subscriber multimap.
I added a logging mechanism that compares, every 30 s, the nodeId from Hazelcast with the one from the Vert.x HazelcastClusterManager:
After the Hazelcast cluster merge, this is in the log:
But newly registered subscribers are still registered under 82ffa5f9-f059-48be-be16-7528c547fdd8; I registered some subscribers after this operation, and the MultiMap contains, for example, this:
but 82ffa5f9-f059-48be-be16-7528c547fdd8 is not in the list of Hazelcast members; there is only the UUID 2446732d-70df-4201-bccb-7bec82f384fd for [10.148.250.33]:5703. And if some nodes are removed from or added to the cluster, the lambda in io.vertx.core.eventbus.impl.clustered.ClusteredEventBus#setClusterViewChangedHandler will remove these subscribers, I think.
The subscribers registered earlier from this node are lost as well, because multimap recovery is implemented in Hazelcast only in the latest releases. I tried the latest release of Hazelcast; multimap recovery is possibly solved there, but the problem with the stale nodeId UUID remains, so the subscribers are removed from the map by Vert.x anyway.
Shouldn't there be some nodeId update after the Hazelcast merge notification?
Versions used: Vert.x 3.5.2, Hazelcast 3.8.2 (and 3.10.2)
Link to original discussion topic.
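The failure mode described in the report, and the effect of refreshing the node ID after a merge, can be illustrated with a small stand-alone model. It is only a sketch under stated assumptions: the subscriber multimap is a plain `Map`, `onClusterViewChanged` imitates the cleanup the report attributes to `ClusteredEventBus#setClusterViewChangedHandler`, and `onMerged` is a hypothetical hook of the kind the report asks about (Hazelcast reports a `MERGED` lifecycle state that could drive such a hook). It is not the fix that was applied in Vert.x.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Stand-alone model of subscriptions keyed by a node ID that can go stale. */
final class StaleNodeIdSketch {

    /** topic -> IDs of subscribed nodes (stand-in for the "__vertx.subs" multimap). */
    final Map<String, Set<String>> subs = new HashMap<>();

    /** Node ID used for registrations; captured once unless onMerged is called. */
    String nodeId;

    StaleNodeIdSketch(String initialUuid) {
        this.nodeId = initialUuid;
    }

    void register(String topic) {
        subs.computeIfAbsent(topic, t -> new HashSet<>()).add(nodeId);
    }

    /** Imitates the cleanup on membership change: drop IDs that are not cluster members. */
    void onClusterViewChanged(Set<String> memberUuids) {
        subs.values().forEach(ids -> ids.retainAll(memberUuids));
        subs.values().removeIf(Set::isEmpty);
    }

    /** Hypothetical merge hook: adopt the new UUID and move existing registrations to it. */
    void onMerged(String newUuid) {
        String oldId = nodeId;
        nodeId = newUuid;
        for (Set<String> ids : subs.values()) {
            if (ids.remove(oldId)) {
                ids.add(newUuid);
            }
        }
    }
}
```

Without the `onMerged` call, a membership change that lists only the new UUID empties the map for this node, which matches the lost subscribers described above; with it, the registrations survive the cleanup.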