Closed aadant closed 2 weeks ago
Hi, so I'm wondering, based on looking at the JDK11 bug history, if this may not be related to some past issues with JDK11 and SSL connections hitting a deadlock scenario where you have this unique combination of the socket shutting down:
https://bugs.openjdk.org/browse/JDK-8207004 https://bugs.openjdk.org/browse/JDK-8207009
Can you please share what JDK vendor you're using? Would it be possible to upgrade to 11.0.20 and see whether this problem goes away, or better yet move to JDK17 as JDK11 is being EoL in October 2024?
@Naros thanks, looks like it, indeed we use this
# java --version
openjdk 11.0.16 2022-07-19
OpenJDK Runtime Environment 18.9 (build 11.0.16+8)
OpenJDK 64-Bit Server VM 18.9 (build 11.0.16+8, mixed mode, sharing)
@Naros unfortunately the issue continues to happen with OpenJDK 17 ...
bash-4.4# java --version
openjdk 17.0.2 2022-01-18
OpenJDK Runtime Environment (build 17.0.2+8-86)
OpenJDK 64-Bit Server VM (build 17.0.2+8-86, mixed mode, sharing)
stack trace involving shyiko :
"blc-fpif1-txnrepodbl1:3306" #77 prio=5 os_prio=0 cpu=3406817.79ms elapsed=71952.46s tid=0x00007feb4423f7a0 nid=0x11a runnable [0x00007feb5affd000]
java.lang.Thread.State: RUNNABLE
at sun.nio.ch.FileDispatcherImpl.write0(java.base@17.0.2/Native Method)
at sun.nio.ch.SocketDispatcher.write(java.base@17.0.2/SocketDispatcher.java:62)
at sun.nio.ch.NioSocketImpl.tryWrite(java.base@17.0.2/NioSocketImpl.java:398)
at sun.nio.ch.NioSocketImpl.implWrite(java.base@17.0.2/NioSocketImpl.java:413)
at sun.nio.ch.NioSocketImpl.write(java.base@17.0.2/NioSocketImpl.java:440)
at sun.nio.ch.NioSocketImpl$2.write(java.base@17.0.2/NioSocketImpl.java:826)
at java.net.Socket$SocketOutputStream.write(java.base@17.0.2/Socket.java:1035)
at sun.security.ssl.SSLSocketOutputRecord.flush(java.base@17.0.2/SSLSocketOutputRecord.java:271)
at sun.security.ssl.OutputRecord.changeWriteCiphers(java.base@17.0.2/OutputRecord.java:224)
at sun.security.ssl.KeyUpdate$KeyUpdateProducer.produce(java.base@17.0.2/KeyUpdate.java:321)
at sun.security.ssl.KeyUpdate$KeyUpdateKickstartProducer.produce(java.base@17.0.2/KeyUpdate.java:171)
at sun.security.ssl.SSLHandshake.kickstart(java.base@17.0.2/SSLHandshake.java:542)
at sun.security.ssl.PostHandshakeContext.kickstart(java.base@17.0.2/PostHandshakeContext.java:69)
at sun.security.ssl.TransportContext.kickstart(java.base@17.0.2/TransportContext.java:251)
at sun.security.ssl.SSLSocketImpl.startHandshake(java.base@17.0.2/SSLSocketImpl.java:443)
at sun.security.ssl.SSLSocketImpl.startHandshake(java.base@17.0.2/SSLSocketImpl.java:421)
at sun.security.ssl.SSLSocketImpl.tryKeyUpdate(java.base@17.0.2/SSLSocketImpl.java:1543)
at sun.security.ssl.SSLSocketImpl.decode(java.base@17.0.2/SSLSocketImpl.java:1515)
at sun.security.ssl.SSLSocketImpl.readApplicationRecord(java.base@17.0.2/SSLSocketImpl.java:1474)
at sun.security.ssl.SSLSocketImpl$AppInputStream.read(java.base@17.0.2/SSLSocketImpl.java:1059)
at com.github.shyiko.mysql.binlog.io.ByteArrayInputStream.readWithinBlockBoundaries(ByteArrayInputStream.java:266)
at com.github.shyiko.mysql.binlog.io.ByteArrayInputStream.read(ByteArrayInputStream.java:245)
at com.github.shyiko.mysql.binlog.io.ByteArrayInputStream.fill(ByteArrayInputStream.java:112)
at com.github.shyiko.mysql.binlog.io.ByteArrayInputStream.read(ByteArrayInputStream.java:105)
at com.github.shyiko.mysql.binlog.event.deserialization.AbstractRowsEventDataDeserializer.deserializeNewDecimal(AbstractRowsEventDataDeserializer.java:258)
at com.github.shyiko.mysql.binlog.event.deserialization.AbstractRowsEventDataDeserializer.deserializeCell(AbstractRowsEventDataDeserializer.java:169)
at com.github.shyiko.mysql.binlog.event.deserialization.AbstractRowsEventDataDeserializer.deserializeRow(AbstractRowsEventDataDeserializer.java:143)
at com.github.shyiko.mysql.binlog.event.deserialization.UpdateRowsEventDataDeserializer.deserializeRows(UpdateRowsEventDataDeserializer.java:72)
at com.github.shyiko.mysql.binlog.event.deserialization.UpdateRowsEventDataDeserializer.deserialize(UpdateRowsEventDataDeserializer.java:58)
at com.github.shyiko.mysql.binlog.event.deserialization.UpdateRowsEventDataDeserializer.deserialize(UpdateRowsEventDataDeserializer.java:33)
at com.github.shyiko.mysql.binlog.event.deserialization.EventDeserializer.deserializeEventData(EventDeserializer.java:337)
at com.github.shyiko.mysql.binlog.event.deserialization.EventDeserializer.nextEvent(EventDeserializer.java:246)
at io.debezium.connector.mysql.MySqlStreamingChangeEventSource$1.nextEvent(MySqlStreamingChangeEventSource.java:238)
at com.github.shyiko.mysql.binlog.BinaryLogClient.listenForEventPackets(BinaryLogClient.java:1068)
at com.github.shyiko.mysql.binlog.BinaryLogClient.connect(BinaryLogClient.java:648)
at com.github.shyiko.mysql.binlog.BinaryLogClient$7.run(BinaryLogClient.java:949)
at java.lang.Thread.run(java.base@17.0.2/Thread.java:833)
"blc-keepalive-fpif1-txnrepodbl1:3306" #79 prio=5 os_prio=0 cpu=30.29ms elapsed=71952.43s tid=0x00007feb484dd850 nid=0x11c waiting on condition [0x00007feb5aefd000]
java.lang.Thread.State: WAITING (parking)
at jdk.internal.misc.Unsafe.park(java.base@17.0.2/Native Method)
- parking to wait for <0x00007fefd423e348> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
at java.util.concurrent.locks.LockSupport.park(java.base@17.0.2/LockSupport.java:211)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base@17.0.2/AbstractQueuedSynchronizer.java:715)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base@17.0.2/AbstractQueuedSynchronizer.java:938)
at java.util.concurrent.locks.ReentrantLock$Sync.lock(java.base@17.0.2/ReentrantLock.java:153)
at java.util.concurrent.locks.ReentrantLock.lock(java.base@17.0.2/ReentrantLock.java:322)
at sun.security.ssl.SSLSocketImpl.closeNotify(java.base@17.0.2/SSLSocketImpl.java:726)
at sun.security.ssl.TransportContext.closeNotify(java.base@17.0.2/TransportContext.java:280)
at sun.security.ssl.TransportContext.initiateOutboundClose(java.base@17.0.2/TransportContext.java:591)
at sun.security.ssl.TransportContext.closeOutbound(java.base@17.0.2/TransportContext.java:575)
at sun.security.ssl.SSLSocketImpl.shutdownOutput(java.base@17.0.2/SSLSocketImpl.java:865)
at com.github.shyiko.mysql.binlog.network.protocol.PacketChannel.close(PacketChannel.java:125)
at com.github.shyiko.mysql.binlog.BinaryLogClient.disconnectChannel(BinaryLogClient.java:1361)
at com.github.shyiko.mysql.binlog.BinaryLogClient.terminateConnect(BinaryLogClient.java:1345)
at com.github.shyiko.mysql.binlog.BinaryLogClient.access$800(BinaryLogClient.java:96)
at com.github.shyiko.mysql.binlog.BinaryLogClient$5.run(BinaryLogClient.java:894)
at java.util.concurrent.Executors$RunnableAdapter.call(java.base@17.0.2/Executors.java:539)
at java.util.concurrent.FutureTask.run(java.base@17.0.2/FutureTask.java:264)
at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@17.0.2/ThreadPoolExecutor.java:1136)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@17.0.2/ThreadPoolExecutor.java:635)
at java.lang.Thread.run(java.base@17.0.2/Thread.java:833)
Hi @aadant, can you describe the steps on how to reproduce it reliably?
It happens in the context of debezium embedded (Altinity sink connector). The problem occurs several times a week. There is an open issue there. It is a random issue though.
Hi, bringing this topic up again because we have encountered it in our company.
Given: AWS Aurora 3, AWS MSK, Confluent Kafka Connect 7.6.1, Debezium 2.6.1. openjdk version "11.0.22" 2024-01-16 LTS OpenJDK Runtime Environment Zulu11.70+15-CA (build 11.0.22+7-LTS) OpenJDK 64-Bit Server VM Zulu11.70+15-CA (build 11.0.22+7-LTS, mixed mode)
Problems started after MySQL upgrade from version 5.7 to 8 i.e. Aurora 2 -> 3. In the logs we see the message Keepalive: Trying to restore lost connection to xxxx-xxxx-xxx.rds.amazonaws.com:3306 and silence. First of all I tried setting the use.nongraceful.disconnect flag, but that had no effect. Then I started digging deeper. I modified your connector code a bit, added logs to it after each step and found two things:
About reliably reproducing the bug. Based on what I described above, namely that the hang happens on half-close, I'm not sure how it can be reproduced. In our case, Aurora DB sometimes has such extreme loads that it just dies for a few minutes, at which point Debezium stops getting heartbeats in response and tries to restart BinaryLogClient. In the case of a read socket close, an SSLException occurs:
javax.net.ssl.SSLException: closing inbound before receiving peer's close_notify
at java.base/sun.security.ssl.SSLSocketImpl.shutdownInput(SSLSocketImpl.java:848)
at java.base/sun.security.ssl.SSLSocketImpl.shutdownInput(SSLSocketImpl.java:826)
at com.github.shyiko.mysql.binlog.network.protocol.PacketChannel.close(PacketChannel.java:141)
at com.github.shyiko.mysql.binlog.BinaryLogClient.disconnectChannel(BinaryLogClient.java:1386)
at com.github.shyiko.mysql.binlog.BinaryLogClient.terminateConnect(BinaryLogClient.java:1361)
at com.github.shyiko.mysql.binlog.BinaryLogClient.terminateConnect(BinaryLogClient.java:1356)
at com.github.shyiko.mysql.binlog.BinaryLogClient.disconnect(BinaryLogClient.java:1327)
at io.debezium.connector.mysql.MySqlStreamingChangeEventSource$ReaderThreadLifecycleListener.onCommunicationFailure(MySqlStreamingChangeEventSource.java:1132)
at com.github.shyiko.mysql.binlog.BinaryLogClient.listenForEventPackets(BinaryLogClient.java:1106)
at com.github.shyiko.mysql.binlog.BinaryLogClient.connect(BinaryLogClient.java:655)
at com.github.shyiko.mysql.binlog.BinaryLogClient$7.run(BinaryLogClient.java:958)
at java.base/java.lang.Thread.run(Thread.java:829)
it is ignored and proceeds to a write socket close, where the endless waiting takes place, because in our case the server is in a coma and not responding. How to reproduce this, I don't know. I hope my experiment will give someone useful insights.
P.S. Recommendation from me: please add more logs, at least with DEBUG level.
@comrada Debezium 2.6.1 contains 0.29.1 of binlog client. The issue should be fixed in 0.29.2. Could you please try to use it in yor deployment if it helps with the problem?
Hi @comrada , does the 0.29.2 work for you? I'm facing a similar issue after upgrading to Aurora v3.
@jpechane no, 0.29.2 does not fix this problem, because as I wrote in my comment, the infinite wait in my case happens while trying to close the socket for writing, and the 0.29.2 fix sets the SO_LINGER property after this call.
In my fix (based on 0.29.2), I moved the property setting to the beginning of the close() method and it works for me.
Hi @comrada We're also on Aurora v3 and running into the same deadlock issue, although I'm not sure if the disconnect happens because of aurora dying sometimes like you mentioned in your case. For this change you mention though, closing the socket without linger is generally discouraged from what I've read online. So I guess this should be directed towards the people that implemented this initially to address this issue.
But I wanted to ask if this can lead to any data loss? Has this fix been working fine for you ?
message in the log file
waiting forever
Given the code, it looks like this affects master ([0.28.3])