initiating connection migration with tmb?

javafanboy commented 1 year ago

We are getting quite frequent warnings in our Coherence log on storage enabled nodes mentioning "initiating connection migration with tmb" and I would like some info about what it means on a more technical level and perhaps suggestions on most common causes?

I am guessing some problem on the physical network as well as long GC pauses could result in more or less any network related warning in Coherence but are there also other possible reasons and are there any tips on how to further debug the problem?

2023-03-13 20:17:56.310/95085.948 Oracle Coherence CE 14.1.1.0.12 (thread=SelectionService(channels=112, selector=MultiplexedSelector(sun.nio.ch.EPollSelectorImpl@3676ac27), id=1660451908), member=17): tmb://138.106.96.41:9001.49982 initiating connection migration with tmb://138.106.96.25:33316.40573 after 15s ack timeout health(read=false, write=true), receiptWait=Message "PartialValueResponse" { FromMember=Member(Id=17, Timestamp=2023-03-12 17:53:14.974, Address=138.106.96.41:9001, MachineId=46694, Location=site:sss.se.xxxxyyyy.com,machine:l4041p.sss.se.xxxxyyyy.com,process:391,member:l4041p-2, Role=storagenode) FromMessageId=0 MessagePartCount=0 PendingCount=0 BufferCounter=1 MessageType=70 ToPollId=19827300 Poll=null Packets { } Service=PartitionedCache{Name=DistributedCache, State=(SERVICE_STARTED), Id=3, OldestMemberId=1, LocalStorage=enabled, PartitionCount=601, BackupCount=1, AssignedPartitions=18, BackupPartitions=19, CoordinatorId=1} ToMemberSet=MemberSet(Size=1 Member(Id=179, Timestamp=2023-03-12 18:43:36.847, Address=138.106.96.25:33316, MachineId=48549, Location=site:sss.se.xxxxyyyy.com,machine:l4025p,process:3473,member:l4025p_11990, Role=scex) ) NotifyDelivery=false }: peer=tmb://138.106.96.25:33316.40573, state=ACTIVE, socket=MultiplexedSocket{Socket[addr=/138.106.96.25,port=38114,localport=9001]}, migrations=17, bytes(in=104371345, out=101784244), flushlock false, bufferedOut=0B, unflushed=0B, delivered(in=203177, out=197772), timeout(ack=0ns), interestOps=1, unflushed receipt=0, receiptReturn 0, isReceiptFlushRequired false, bufferedIn(), msgs(in=95922, out=99203/99206) 2023-03-13 20:17:56.310/95085.948 Oracle Coherence CE 14.1.1.0.12 (thread=SelectionService(channels=112, selector=MultiplexedSelector(sun.nio.ch.EPollSelectorImpl@3676ac27), id=1660451908), member=17): tmb://138.106.96.41:9001.49982 initiating connection migration with tmb://138.106.96.32:41070.40752 after 15s ack timeout health(read=true, write=false), receiptWait=null: peer=tmb://138.106.96.32:41070.40752, state=ACTIVE, socket=MultiplexedSocket{Socket[addr=/138.106.96.32,port=41070,localport=36388]}, migrations=5, bytes(in=95752773, out=99458811), flushlock false, bufferedOut=1.54KB, unflushed=0B, delivered(in=192506, out=187239), timeout(ack=0ns), interestOps=1, unflushed receipt=0, receiptReturn 0, isReceiptFlushRequired false, bufferedIn(), msgs(in=90667, out=93950/93953)

javafanboy commented 1 year ago

We continue seeing this message semi frequently for several days so I do not think we have any underlying network problem (we run on an enterprise class DC network where any disturbances are resolved quickly) and I can't see any excessive GC pauses in our log...

Is this a message we need to act on or not critical?

javafanboy commented 1 year ago

Seem to be triggered in class com.oracle.coherence.common.internal.net.socketbus.BufferedSocketBus method checkHealth(long) but it is still not clear to me exactly what ack it is that is timing out or what the code is trying to "do about it" i.e. the "migration":

        else if (ldtNow >= ldtAckTimeout)
            {
            // timeout expired
            final int cMultCap = 10;
            long      cMillisTimeout = f_driver.getDependencies().getAckTimeoutMillis();
            Duration  dur            = new Duration(cMillisTimeout * Math.min(cMultCap, m_cUnackLast + 1), Duration.Magnitude.MILLI);

            getLogger().log(makeRecord(Level.WARNING,
                    "{0} initiating connection migration with {1} after {2} ack timeout health(read={3}, write={4}), receiptWait={5}: {6}",
                    getLocalEndPoint(), getPeer(), dur, fReadHealthy, fWriteHealthy, oReceiptUnacked, BufferedConnection.this));

mgamanho commented 1 year ago

Hi, these messages can be quite common depending on system size, load, context (application busy, GC, ...) and as you can see they are harmless.

The heartbeat (health check) is exchanged regularly by cluster members to ensure the cluster is whole. Everyone looks after each other, so that generates a fair amount of ancillary activity for which once in a while we detect issues. At that point we decide to "migrate" the connection, which is merely opening a new connection and ditching the old one. In some cases a "fresh" connection may resolve OS or JVM congestion issues. In most cases it is just a precaution that has no consequences.

These heartbeat messages are purely for housekeeping and they are timed relatively aggressively to weed out potential issues. Application/data traffic is not impacted. If you do see them frequently, however, they are an indication that some tuning may be necessary: heap and GC pressure, network health, native memory (OS/hardware) for example. Running network performance tests can show you if network settings are ok, they can be eye opening.

Let us know how it goes.

javafanboy commented 1 year ago

Thanks a lot for the info - will continue investigating!

On Fri, Mar 17, 2023, 17:49 Maurice Gamanho @.***> wrote:

Hi, these messages can be quite common depending on system size, load, context (application busy, GC, ...) and as you can see they are harmless.

The heartbeat (health check) is exchanged regularly by cluster members to ensure the cluster is whole. Everyone looks after each other, so that generates a fair amount of ancillary activity for which once in a while we detect issues. At that point we decide to "migrate" the connection, which is merely opening a new connection and ditching the old one. In some cases a "fresh" connection may resolve OS or JVM congestion issues. In most cases it is just a precaution that has no consequences.

These heartbeat messages are purely for housekeeping and they are timed relatively aggressively to weed out potential issues. Application/data traffic is not impacted. If you do see them frequently, however, they are an indication that some tuning may be necessary: heap and GC pressure, network health, native memory (OS/hardware) for example. Running network performance tests https://docs.oracle.com/en/middleware/standalone/coherence/14.1.1.0/administer/performing-network-performance-test.html can show you if network settings are ok, they can be eye opening.

Let us know how it goes.

— Reply to this email directly, view it on GitHub https://github.com/oracle/coherence/issues/98#issuecomment-1474120654, or unsubscribe https://github.com/notifications/unsubscribe-auth/AADXQF42BBGGT42MGFDNXMTW4SI2DANCNFSM6AAAAAAV2EBS3E . You are receiving this because you authored the thread.Message ID: @.***>

oracle / coherence

initiating connection migration with tmb? #98