vert-x3 / vertx-ignite

Apache License 2.0

Ignite Cluster Manager randomly fails to clear cache entries from __vertx.subs, leading to half of the events failing. #94

Closed arad1994 closed 3 years ago

arad1994 commented 3 years ago

Version

Vertx-Ignite -> 4.0.0-milestone5
Vertx-Core -> 4.0.0-milestone5
Ignite -> 2.8.1
Zookeeper -> k8s.gcr.io/kubernetes-zookeeper:1.0-3.4.10

Context

We are using Vert.x with a clustered event bus, using the Ignite cluster manager with Zookeeper service discovery in a Kubernetes cluster. Occasionally, when a Kubernetes pod is restarted, the old entries in the Ignite caches __vertx.subs and __vertx.nodes are not cleaned up. This leads to round-robin delivery over a stale entry, causing only about 50% of events to work. Cause of the round-robin behavior: when the pod restarts, it is assigned a new nodeId, which creates a new entry in the __vertx.subs cache (the new nodeId pointing to the event bus address on that node). If the old nodeId-to-address entry is not removed from the cache, only half of the requests succeed.
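The ~50% failure rate follows directly from round-robin selection over two registrations, one live and one stale. A minimal stdlib sketch (all names hypothetical, standing in for the event bus's subscriber selection):

```java
import java.util.List;

// Hypothetical illustration: the event bus round-robins point-to-point
// messages over all registrations for an address. If a stale entry from a
// restarted pod is never removed, every second delivery targets a dead node.
public class RoundRobinIllustration {
  // returns how many of `total` round-robined deliveries reach a live node
  static int deliver(List<String> subs, int total) {
    int delivered = 0;
    for (int i = 0; i < total; i++) {
      String target = subs.get(i % subs.size()); // round-robin selection
      if (target.contains("alive")) {
        delivered++;
      }
    }
    return delivered;
  }

  public static void main(String[] args) {
    // two registrations for the same address: the live node and a stale one
    List<String> subs = List.of("nodeId-new (alive)", "nodeId-old (dead)");
    System.out.println(deliver(subs, 10) + "/10 delivered"); // prints "5/10 delivered"
  }
}
```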

Steps to reproduce

In the Kubernetes cluster, if we restart a pod multiple times, we randomly see that entries in __vertx.subs are not cleared (the old nodeId still points to the event bus address).

tsegismont commented 3 years ago

@zyclonite would you mind looking into this?

zhoudingyun commented 3 years ago

I also encountered the same problem. The Hazelcast and Ignite cluster managers both have it, because the entries cannot be cleaned from __vertx.nodeInfo and __vertx.subs.

My temporary method: [screenshot attached]

zyclonite commented 3 years ago

@arad1994 is it guaranteed that the shutdown is always graceful and not forced?

arad1994 commented 3 years ago

@zyclonite Yes, I think the shutdown is graceful; we call vertx.close() from a shutdown hook as well.
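For reference, the shutdown-hook pattern mentioned above has one subtlety: vertx.close() is asynchronous, so the hook must block until the close completes or the JVM may exit before the cluster manager deregisters the node. A pure-stdlib sketch, where closeAsync() is a hypothetical stand-in for vertx.close(handler):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

public class ShutdownHookSketch {
  // hypothetical stand-in for vertx.close(handler): an async teardown that
  // would remove this node's entries from __vertx.subs / __vertx.nodeInfo
  static CompletableFuture<Void> closeAsync() {
    return CompletableFuture.runAsync(() -> { /* deregister from cluster caches */ });
  }

  public static void main(String[] args) {
    Runtime.getRuntime().addShutdownHook(new Thread(() -> {
      try {
        // block the hook until the asynchronous close finishes,
        // otherwise the process may die with stale entries left behind
        closeAsync().get(10, TimeUnit.SECONDS);
      } catch (Exception e) {
        e.printStackTrace();
      }
    }));
    System.out.println("shutdown hook registered");
  }
}
```

Note that a kill -9 skips shutdown hooks entirely, which is why the graceful/forced distinction matters in this thread.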

arad1994 commented 3 years ago

@zhoudingyun Yes, exactly. I'm using a similar workaround with a periodic poller to unblock us.
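The periodic-poller workaround is not spelled out in the thread; a minimal sketch of the idea, using a plain ConcurrentHashMap in place of the Ignite __vertx.subs cache and a hypothetical liveNodeIds set for the current cluster view (all names are assumptions):

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class StaleSubsPoller {
  // remove subscription entries whose owning node is no longer in the cluster;
  // the real workaround would scan the Ignite cache the same way
  static void purgeStale(Map<String, String> subs, Set<String> liveNodeIds) {
    subs.entrySet().removeIf(e -> !liveNodeIds.contains(e.getValue()));
  }

  public static void main(String[] args) {
    Map<String, String> subs = new ConcurrentHashMap<>();
    subs.put("my.address#old", "nodeId-old"); // left behind by a restarted pod
    subs.put("my.address#new", "nodeId-new");
    Set<String> live = Set.of("nodeId-new");

    // poll periodically; after a run, only entries for live nodes remain
    ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    scheduler.scheduleAtFixedRate(() -> purgeStale(subs, live), 0, 30, TimeUnit.SECONDS);
    scheduler.shutdown();
  }
}
```

This is a mitigation, not a fix: it races with legitimate registrations and only papers over the missing cleanup in the cluster manager.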

zyclonite commented 3 years ago

@arad1994 and/or @zhoudingyun can you tell me if this PR https://github.com/vert-x3/vertx-ignite/pull/96 would fix your issues?

zhoudingyun commented 3 years ago

@zyclonite this PR #96 only solves part of the problem. If you kill -9 the Vert.x process, the problem still exists. The same is true for the Hazelcast cluster manager.

zyclonite commented 3 years ago

@zhoudingyun did you test that, or is it just an assumption? There was definitely a bug where only one member was allowed to write (in this implementation, the oldest member of the cluster), so if that member left ungracefully, none of the remaining nodes would update the Vert.x nodes and subs when they received the fail event.

Also worth mentioning: killing a process should not impact the cluster as long as enough members remain to keep a consistent replicated map of nodes and subs. The removal is now triggered and attempted from all remaining nodes.

zhoudingyun commented 3 years ago

@zyclonite Unfortunately, this problem occurred in my production environment. My business system runs in client mode. When the process was forcibly stopped (kill -9 pid), the cache entries could not be cleaned from __vertx.nodeInfo and __vertx.subs. The same problem exists with the Hazelcast cluster manager. So I switched to Hazelcast and used a temporary workaround:

[screenshot attached]

tsegismont commented 3 years ago

@zhoudingyun can you please file an issue in vertx-hazelcast?

tsegismont commented 3 years ago

@zhoudingyun it would be great if you could provide a minimal reproducer as well. Thank you

zhoudingyun commented 3 years ago

@tsegismont Are you referring to this? [screenshot attached]

tsegismont commented 3 years ago

@zhoudingyun in a previous comment you said you found the same issue with the Hazelcast cluster manager. I'm asking if you would mind creating an issue in the vertx-hazelcast repo and providing a minimal reproducer (a simple program with your k8s config that demonstrates the issue).

zhoudingyun commented 3 years ago

@tsegismont ok

arad1994 commented 3 years ago

@zyclonite This PR looks good and might help us. Unfortunately I can't test it on the cluster right now; it might take a few weeks before we can roll it out. But we should merge this PR in.

zyclonite commented 3 years ago

I did quite a bit of testing over the last few days but could no longer reproduce the inconsistent subs with the latest fixes.

I also added better handling of Ignite node segmentation to the open PR.

zyclonite commented 3 years ago

@arad1994 did you have the chance to test the 4.0.0 release?

arad1994 commented 3 years ago

@zyclonite Hey, sorry, I did not get a chance to test 4.0.0 yet; it might take some time due to other commitments.

arad1994 commented 3 years ago

@zyclonite Finally got a chance to test 4.0.0. Things seemed stable for a while, then a day later some of the pods ended up blocking threads indefinitely in IgniteClusterManager.cleanNodeInfos and SubsMapHelper.removeAllForNode. Looking further at the code, these are done in a blocking fashion, which is causing issues for us. After updating the plugin code to do them asynchronously, the issue goes away. A few examples. In IgniteClusterManager:

private void cleanNodeInfos(String nid) {
  nodeInfoMap.removeAsync(nid)
    .listen((IgniteInClosure<IgniteFuture<Boolean>>) booleanIgniteFuture -> {
      boolean result = false;
      try {
        result = booleanIgniteFuture.get();
      } catch (IllegalStateException | CacheException e) {
        log.error("Failed to remove node info", e);
      }
      if (result) {
        cleanSubs(nid);
      }
    });
}

In SubsMapHelper:

public void removeAllForNode(String nodeId) {
  TreeSet<IgniteRegistrationInfo> toRemove = map.query(new ScanQuery<IgniteRegistrationInfo, Boolean>((k, v) -> k.registrationInfo().nodeId().equals(nodeId)))
    .getAll().stream()
    .map(Cache.Entry::getKey)
    .collect(Collectors.toCollection(TreeSet::new));
  map.removeAllAsync(toRemove).listen(fut -> {
    try {
      fut.get();
    } catch (IllegalStateException | CacheException e) {
      log.error("Failed to remove all subscribers", e);
    }
  });
}

public void remove(String address, RegistrationInfo registrationInfo, Promise<Void> promise) {
  map.removeAsync(new IgniteRegistrationInfo(address, registrationInfo)).listen(fut -> {
    try {
      Boolean value = fut.get();
      if (value != null && value) {
        promise.complete();
      } else {
        promise.fail(new VertxException("Failed to remove Node"));
      }
    } catch (IllegalStateException | CacheException e) {
      promise.fail(e);
    }
  });
}

Is there any reason this was done in a blocking fashion? I can create a PR if you agree with this new approach.

zyclonite commented 3 years ago

Did you test 4.0.0 or already 4.0.2? Between these versions, quite a few issues were fixed in that regard.

Switching to async methods for all operations would be worth a try. Keep in mind that you will end up on an Ignite thread, so you need to switch back to the previous context when completing or failing the promise.

Maybe it would make sense to create a Vert.x listener to be used with the listen method, one that handles the promise and switches the context...
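The pattern being described can be sketched with the stdlib: an async callback fires on a foreign (Ignite) thread, so the result must be handed back to the caller's context before the promise is completed. Here a single-thread executor stands in for the Vert.x context and a CompletableFuture for the Ignite future and the promise; all names are hypothetical:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ContextSwitchingListener {
  public static void main(String[] args) throws Exception {
    ExecutorService context = Executors.newSingleThreadExecutor(); // stand-in for the Vert.x context
    CompletableFuture<Boolean> promise = new CompletableFuture<>();

    // stand-in for cache.removeAsync(...).listen(fut -> ...):
    // this future completes on a foreign thread
    CompletableFuture<Boolean> igniteFuture = CompletableFuture.supplyAsync(() -> true);

    igniteFuture.whenComplete((removed, err) ->
      // hop back onto the caller's context before touching the promise
      context.execute(() -> {
        if (err != null) {
          promise.completeExceptionally(err);
        } else {
          promise.complete(removed);
        }
      }));

    System.out.println("removed: " + promise.get()); // prints "removed: true"
    context.shutdown();
  }
}
```

In real Vert.x code the context hop would be done with the caller's Context (e.g. runOnContext) rather than a raw executor, but the shape of the listener is the same.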

arad1994 commented 3 years ago

@zyclonite Just tested with 4.0.0; I'll try 4.0.2 later this week and let you know if that works out of the box. Hmm, the listener might make sense.

arad1994 commented 3 years ago

@zyclonite I tried 4.0.2 and got the same result; eventually threads start blocking indefinitely. Is there an example of setting up the Vert.x listeners? I've never done that before and wanted to try it. You were right about the async calls: they eventually end up on an Ignite thread.

zyclonite commented 3 years ago

Maybe a good example is this one from the rabbitmq client.

vietj commented 3 years ago

@zyclonite should we close this issue ?

zyclonite commented 3 years ago

@vietj yes, I'll close it for now due to inactivity.