Open cpetzka opened 3 years ago
Also, we should have a changelog item (possibly linking to an issue or this PR that contains a rationale for the change) if we do choose to merge this.
I too am fairly unsure about this. Which is partly why I've left it alone apart from a few early comment (that and a whole load of other pain, work and non-work on my plate right now).
FWIW, I think there is definitely a case for supressing the exception that occurs in the following sequence of events;
CacheFactory#doSynchronousClusterTask(ClusterTask<T> task, boolean includeLocalMember)
or CacheFactory#doClusterTask(final ClusterTask<?> task)
That's the exception, I forgot to link the corresponding post from the forum. https://discourse.igniterealtime.org/t/null-pointer-exception-joining-a-room/88995/7?u=chp and https://discourse.igniterealtime.org/t/null-pointer-exception-joining-a-room/88995/10?u=chp
2020.11.03 17:07:45.965 ERROR [hz.openfire.cached.thread-4] (org.jivesoftware.openfire.spi.RoutingTableImpl:279) - Primary packet routing failed
java.lang.IllegalArgumentException: Requested node 5ed5407e-e7fb-4b87-8ef0-9aff16dc07a0 not found in cluster
at org.jivesoftware.openfire.plugin.util.cache.ClusteredCacheFactory.doClusterTask(ClusteredCacheFactory.java:397) ~[hazelcast-2.5.1-SNAPSHOT.jar!/:?]
at org.jivesoftware.util.cache.CacheFactory.doClusterTask(CacheFactory.java:701) ~[xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
at org.jivesoftware.openfire.plugin.util.cluster.ClusterPacketRouter.routePacket(ClusterPacketRouter.java:45) ~[hazelcast-2.5.1-SNAPSHOT.jar!/:?]
at org.jivesoftware.openfire.spi.RoutingTableImpl.routeToLocalDomain(RoutingTableImpl.java:359) ~[xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
at org.jivesoftware.openfire.spi.RoutingTableImpl.routePacket(RoutingTableImpl.java:262) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
at org.jivesoftware.openfire.PresenceRouter.handle(PresenceRouter.java:165) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
at org.jivesoftware.openfire.PresenceRouter.route(PresenceRouter.java:79) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
at org.jivesoftware.openfire.spi.PacketRouterImpl.route(PacketRouterImpl.java:84) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
at org.jivesoftware.openfire.muc.spi.LocalMUCRoom.broadcast(LocalMUCRoom.java:1585) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
at org.jivesoftware.openfire.muc.cluster.BroadcastPresenceRequest$1.run(BroadcastPresenceRequest.java:87) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
at org.jivesoftware.openfire.muc.cluster.MUCRoomTask.execute(MUCRoomTask.java:81) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
at org.jivesoftware.openfire.muc.cluster.BroadcastPresenceRequest.run(BroadcastPresenceRequest.java:82) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
at org.jivesoftware.openfire.plugin.util.cache.ClusteredCacheFactory$CallableTask.call(ClusteredCacheFactory.java:591) [hazelcast-2.5.1-SNAPSHOT.jar!/:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_265]
at com.hazelcast.executor.impl.DistributedExecutorService$CallableProcessor.run(DistributedExecutorService.java:270) [hazelcast-3.12.5.jar!/:?]
at com.hazelcast.util.executor.CachedExecutorServiceDelegate$Worker.run(CachedExecutorServiceDelegate.java:227) [hazelcast-3.12.5.jar!/:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_265]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_265]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_265]
at com.hazelcast.util.executor.HazelcastManagedThread.executeRun(HazelcastManagedThread.java:64) [hazelcast-3.12.5.jar!/:?]
at com.hazelcast.util.executor.HazelcastManagedThread.run(HazelcastManagedThread.java:80) [hazelcast-3.12.5.jar!/:?]
2020.11.03 17:07:45.990 ERROR [hz.openfire.cached.thread-14] (org.jivesoftware.openfire.spi.RoutingTableImpl:279) - Primary packet routing failed
java.lang.IllegalArgumentException: Requested node 5ed5407e-e7fb-4b87-8ef0-9aff16dc07a0 not found in cluster
at org.jivesoftware.openfire.plugin.util.cache.ClusteredCacheFactory.doClusterTask(ClusteredCacheFactory.java:397) ~[hazelcast-2.5.1-SNAPSHOT.jar!/:?]
at org.jivesoftware.util.cache.CacheFactory.doClusterTask(CacheFactory.java:701) ~[xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
at org.jivesoftware.openfire.plugin.util.cluster.ClusterPacketRouter.routePacket(ClusterPacketRouter.java:45) ~[hazelcast-2.5.1-SNAPSHOT.jar!/:?]
at org.jivesoftware.openfire.spi.RoutingTableImpl.routeToLocalDomain(RoutingTableImpl.java:359) ~[xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
at org.jivesoftware.openfire.spi.RoutingTableImpl.routePacket(RoutingTableImpl.java:262) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
at org.jivesoftware.openfire.PresenceRouter.handle(PresenceRouter.java:165) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
at org.jivesoftware.openfire.PresenceRouter.route(PresenceRouter.java:79) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
at org.jivesoftware.openfire.spi.PacketRouterImpl.route(PacketRouterImpl.java:84) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
at org.jivesoftware.openfire.muc.spi.LocalMUCRoom.broadcast(LocalMUCRoom.java:1585) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
at org.jivesoftware.openfire.muc.cluster.BroadcastPresenceRequest$1.run(BroadcastPresenceRequest.java:87) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
at org.jivesoftware.openfire.muc.cluster.MUCRoomTask.execute(MUCRoomTask.java:81) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
at org.jivesoftware.openfire.muc.cluster.BroadcastPresenceRequest.run(BroadcastPresenceRequest.java:82) [xmppserver-4.6.0-SNAPSHOT.jar:4.6.0-SNAPSHOT]
at org.jivesoftware.openfire.plugin.util.cache.ClusteredCacheFactory$CallableTask.call(ClusteredCacheFactory.java:591) [hazelcast-2.5.1-SNAPSHOT.jar!/:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_265]
at com.hazelcast.executor.impl.DistributedExecutorService$CallableProcessor.run(DistributedExecutorService.java:270) [hazelcast-3.12.5.jar!/:?]
at com.hazelcast.util.executor.CachedExecutorServiceDelegate$Worker.run(CachedExecutorServiceDelegate.java:227) [hazelcast-3.12.5.jar!/:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_265]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_265]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_265]
at com.hazelcast.util.executor.HazelcastManagedThread.executeRun(HazelcastManagedThread.java:64) [hazelcast-3.12.5.jar!/:?]
at com.hazelcast.util.executor.HazelcastManagedThread.run(HazelcastManagedThread.java:80) [hazelcast-3.12.5.jar!/:?]
As you can see, the exception is thrown several times within a few milliseconds.
I looked at it again and found that catching the IllegalArgumentException (and maybe the IllgegalStateException as before) is probably sufficient. In https://github.com/igniterealtime/openfire-hazelcast-plugin/blob/master/src/java/org/jivesoftware/openfire/plugin/util/cache/ClusteredCacheFactory.java#L384, the check whether the node is in the cluster or not is already done and just this IllegalArgumentException is thrown. I.e. the check whether the node is still in the cluster is redundant if the corresponding exception is caught.
https://github.com/igniterealtime/Openfire/pull/1749 is to fix one trigger of the problem. Each cluster node has a local or remote user in its maps for all MUC members. If a node goes down, an OccupantLeftEvent is sent for each member (AND ON EACH NODE) that was online on the node (one after the other). I.e. the presence that the user is offline is also sent to ALL remote users, also to the ones that were online on the offline node and for which no OccupantLeftEvent was sent yet. The exception is thrown if the cache of the routing table has not yet been updated and the route to the user still exists there.
All exceptions are caught in RotuingTableImpl and only logged, so that this special exception can be suppressed here without any problems.
Also catch the IllegalArgumentException which occurs if the node is not part of the cluster. This can happen even if the isClusterMember check before is true.