igniterealtime / openfire-hazelcast-plugin

Adds support for running multiple redundant Openfire servers together in a cluster
10 stars 12 forks source link

Block the execution of cluster tasks on non existing nodes #55

Open cpetzka opened 3 years ago

cpetzka commented 3 years ago

Also catch the IllegalArgumentException which occurs if the node is not part of the cluster. This can happen even if the isClusterMember check before is true.

guusdk commented 3 years ago

Also, we should have a changelog item (possibly linking to an issue or this PR that contains a rationale for the change) if we do choose to merge this.

GregDThomas commented 3 years ago

I too am fairly unsure about this. Which is partly why I've left it alone apart from a few early comment (that and a whole load of other pain, work and non-work on my plate right now).

GregDThomas commented 3 years ago

FWIW, I think there is definitely a case for supressing the exception that occurs in the following sequence of events;

  1. Client code sends a ClusterTask to all other nodes in the cluster (A, B) - CacheFactory#doSynchronousClusterTask(ClusterTask<T> task, boolean includeLocalMember) or CacheFactory#doClusterTask(final ClusterTask<?> task)
  2. Openfire sends ClusterTask to node A
  3. Node B leaves the cluster
  4. Openfire sends ClusterTask to node B
  5. Hazelcast throws exception that node B is not in the cluster.
cpetzka commented 3 years ago

That's the exception, I forgot to link the corresponding post from the forum. https://discourse.igniterealtime.org/t/null-pointer-exception-joining-a-room/88995/7?u=chp and https://discourse.igniterealtime.org/t/null-pointer-exception-joining-a-room/88995/10?u=chp

2020.11.03 17:07:45.965 ERROR [hz.openfire.cached.thread-4] (org.jivesoftware.openfire.spi.RoutingTableImpl:279) - Primary packet routing failed
java.lang.IllegalArgumentException: Requested node 5ed5407e-e7fb-4b87-8ef0-9aff16dc07a0 not found in cluster
    at org.jivesoftware.openfire.plugin.util.cache.ClusteredCacheFactory.doClusterTask(ClusteredCacheFactory.java:397) ~[hazelcast-2.5.1-SNAPSHOT.jar!/:?]
    at org.jivesoftware.util.cache.CacheFactory.doClusterTask(CacheFactory.java:701) ~[xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
    at org.jivesoftware.openfire.plugin.util.cluster.ClusterPacketRouter.routePacket(ClusterPacketRouter.java:45) ~[hazelcast-2.5.1-SNAPSHOT.jar!/:?]
    at org.jivesoftware.openfire.spi.RoutingTableImpl.routeToLocalDomain(RoutingTableImpl.java:359) ~[xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
    at org.jivesoftware.openfire.spi.RoutingTableImpl.routePacket(RoutingTableImpl.java:262) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
    at org.jivesoftware.openfire.PresenceRouter.handle(PresenceRouter.java:165) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
    at org.jivesoftware.openfire.PresenceRouter.route(PresenceRouter.java:79) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
    at org.jivesoftware.openfire.spi.PacketRouterImpl.route(PacketRouterImpl.java:84) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
    at org.jivesoftware.openfire.muc.spi.LocalMUCRoom.broadcast(LocalMUCRoom.java:1585) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
    at org.jivesoftware.openfire.muc.cluster.BroadcastPresenceRequest$1.run(BroadcastPresenceRequest.java:87) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
    at org.jivesoftware.openfire.muc.cluster.MUCRoomTask.execute(MUCRoomTask.java:81) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
    at org.jivesoftware.openfire.muc.cluster.BroadcastPresenceRequest.run(BroadcastPresenceRequest.java:82) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
    at org.jivesoftware.openfire.plugin.util.cache.ClusteredCacheFactory$CallableTask.call(ClusteredCacheFactory.java:591) [hazelcast-2.5.1-SNAPSHOT.jar!/:?]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_265]
    at com.hazelcast.executor.impl.DistributedExecutorService$CallableProcessor.run(DistributedExecutorService.java:270) [hazelcast-3.12.5.jar!/:?]
    at com.hazelcast.util.executor.CachedExecutorServiceDelegate$Worker.run(CachedExecutorServiceDelegate.java:227) [hazelcast-3.12.5.jar!/:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_265]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_265]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_265]
    at com.hazelcast.util.executor.HazelcastManagedThread.executeRun(HazelcastManagedThread.java:64) [hazelcast-3.12.5.jar!/:?]
    at com.hazelcast.util.executor.HazelcastManagedThread.run(HazelcastManagedThread.java:80) [hazelcast-3.12.5.jar!/:?]
2020.11.03 17:07:45.990 ERROR [hz.openfire.cached.thread-14] (org.jivesoftware.openfire.spi.RoutingTableImpl:279) - Primary packet routing failed
java.lang.IllegalArgumentException: Requested node 5ed5407e-e7fb-4b87-8ef0-9aff16dc07a0 not found in cluster
    at org.jivesoftware.openfire.plugin.util.cache.ClusteredCacheFactory.doClusterTask(ClusteredCacheFactory.java:397) ~[hazelcast-2.5.1-SNAPSHOT.jar!/:?]
    at org.jivesoftware.util.cache.CacheFactory.doClusterTask(CacheFactory.java:701) ~[xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
    at org.jivesoftware.openfire.plugin.util.cluster.ClusterPacketRouter.routePacket(ClusterPacketRouter.java:45) ~[hazelcast-2.5.1-SNAPSHOT.jar!/:?]
    at org.jivesoftware.openfire.spi.RoutingTableImpl.routeToLocalDomain(RoutingTableImpl.java:359) ~[xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
    at org.jivesoftware.openfire.spi.RoutingTableImpl.routePacket(RoutingTableImpl.java:262) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
    at org.jivesoftware.openfire.PresenceRouter.handle(PresenceRouter.java:165) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
    at org.jivesoftware.openfire.PresenceRouter.route(PresenceRouter.java:79) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
    at org.jivesoftware.openfire.spi.PacketRouterImpl.route(PacketRouterImpl.java:84) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
    at org.jivesoftware.openfire.muc.spi.LocalMUCRoom.broadcast(LocalMUCRoom.java:1585) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
    at org.jivesoftware.openfire.muc.cluster.BroadcastPresenceRequest$1.run(BroadcastPresenceRequest.java:87) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
    at org.jivesoftware.openfire.muc.cluster.MUCRoomTask.execute(MUCRoomTask.java:81) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
    at org.jivesoftware.openfire.muc.cluster.BroadcastPresenceRequest.run(BroadcastPresenceRequest.java:82) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
    at org.jivesoftware.openfire.plugin.util.cache.ClusteredCacheFactory$CallableTask.call(ClusteredCacheFactory.java:591) [hazelcast-2.5.1-SNAPSHOT.jar!/:?]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_265]
    at com.hazelcast.executor.impl.DistributedExecutorService$CallableProcessor.run(DistributedExecutorService.java:270) [hazelcast-3.12.5.jar!/:?]
    at com.hazelcast.util.executor.CachedExecutorServiceDelegate$Worker.run(CachedExecutorServiceDelegate.java:227) [hazelcast-3.12.5.jar!/:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_265]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_265]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_265]
    at com.hazelcast.util.executor.HazelcastManagedThread.executeRun(HazelcastManagedThread.java:64) [hazelcast-3.12.5.jar!/:?]
    at com.hazelcast.util.executor.HazelcastManagedThread.run(HazelcastManagedThread.java:80) [hazelcast-3.12.5.jar!/:?]

As you can see, the exception is thrown several times within a few milliseconds.

I looked at it again and found that catching the IllegalArgumentException (and maybe the IllgegalStateException as before) is probably sufficient. In https://github.com/igniterealtime/openfire-hazelcast-plugin/blob/master/src/java/org/jivesoftware/openfire/plugin/util/cache/ClusteredCacheFactory.java#L384, the check whether the node is in the cluster or not is already done and just this IllegalArgumentException is thrown. I.e. the check whether the node is still in the cluster is redundant if the corresponding exception is caught.

https://github.com/igniterealtime/Openfire/pull/1749 is to fix one trigger of the problem. Each cluster node has a local or remote user in its maps for all MUC members. If a node goes down, an OccupantLeftEvent is sent for each member (AND ON EACH NODE) that was online on the node (one after the other). I.e. the presence that the user is offline is also sent to ALL remote users, also to the ones that were online on the offline node and for which no OccupantLeftEvent was sent yet. The exception is thrown if the cache of the routing table has not yet been updated and the route to the user still exists there.

All exceptions are caught in RotuingTableImpl and only logged, so that this special exception can be suppressed here without any problems.