osheroff / mysql-binlog-connector-java

MySQL Binary Log connector
641 stars 161 forks source link

Deadlock when connection try to disconnect #133

Closed aadant closed 2 weeks ago

aadant commented 7 months ago
"blc-fpif1-txnrepodbl1:3306" #31833 prio=5 os_prio=0 cpu=3676437.42ms elapsed=146955.63s tid=0x00007fe9bc001800 nid=0x7e58 runnable  [0x00007feb830d3000]
   java.lang.Thread.State: RUNNABLE
        at java.net.SocketOutputStream.socketWrite0(java.base@11.0.16/Native Method)
        at java.net.SocketOutputStream.socketWrite(java.base@11.0.16/SocketOutputStream.java:110)
        at java.net.SocketOutputStream.write(java.base@11.0.16/SocketOutputStream.java:150)
        at sun.security.ssl.SSLSocketOutputRecord.flush(java.base@11.0.16/SSLSocketOutputRecord.java:271)
        at sun.security.ssl.OutputRecord.changeWriteCiphers(java.base@11.0.16/OutputRecord.java:224)
        at sun.security.ssl.KeyUpdate$KeyUpdateProducer.produce(java.base@11.0.16/KeyUpdate.java:321)
        at sun.security.ssl.KeyUpdate$KeyUpdateKickstartProducer.produce(java.base@11.0.16/KeyUpdate.java:171)
        at sun.security.ssl.SSLHandshake.kickstart(java.base@11.0.16/SSLHandshake.java:538)
        at sun.security.ssl.PostHandshakeContext.kickstart(java.base@11.0.16/PostHandshakeContext.java:68)
        at sun.security.ssl.TransportContext.kickstart(java.base@11.0.16/TransportContext.java:233)
        at sun.security.ssl.SSLSocketImpl.startHandshake(java.base@11.0.16/SSLSocketImpl.java:449)
        at sun.security.ssl.SSLSocketImpl.startHandshake(java.base@11.0.16/SSLSocketImpl.java:427)
        at sun.security.ssl.SSLSocketImpl.tryKeyUpdate(java.base@11.0.16/SSLSocketImpl.java:1549)
        at sun.security.ssl.SSLSocketImpl.decode(java.base@11.0.16/SSLSocketImpl.java:1521)
        at sun.security.ssl.SSLSocketImpl.readApplicationRecord(java.base@11.0.16/SSLSocketImpl.java:1476)
        at sun.security.ssl.SSLSocketImpl$AppInputStream.read(java.base@11.0.16/SSLSocketImpl.java:1065)
        at sun.security.ssl.SSLSocketImpl$AppInputStream.read(java.base@11.0.16/SSLSocketImpl.java:972)
        at com.github.shyiko.mysql.binlog.io.ByteArrayInputStream.readWithinBlockBoundaries(ByteArrayInputStream.java:226)
        at com.github.shyiko.mysql.binlog.io.ByteArrayInputStream.read(ByteArrayInputStream.java:207)
        at com.github.shyiko.mysql.binlog.io.ByteArrayInputStream.readInteger(ByteArrayInputStream.java:52)
        at io.debezium.connector.mysql.RowDeserializers.deserializeDate(RowDeserializers.java:298)
        at io.debezium.connector.mysql.RowDeserializers$WriteRowsDeserializer.deserializeDate(RowDeserializers.java:215)
        at com.github.shyiko.mysql.binlog.event.deserialization.AbstractRowsEventDataDeserializer.deserializeCell(AbstractRowsEventDataDeserializer.java:171)
        at com.github.shyiko.mysql.binlog.event.deserialization.AbstractRowsEventDataDeserializer.deserializeRow(AbstractRowsEventDataDeserializer.java:143)
        at com.github.shyiko.mysql.binlog.event.deserialization.WriteRowsEventDataDeserializer.deserializeRows(WriteRowsEventDataDeserializer.java:64)
        at com.github.shyiko.mysql.binlog.event.deserialization.WriteRowsEventDataDeserializer.deserialize(WriteRowsEventDataDeserializer.java:56)
        at com.github.shyiko.mysql.binlog.event.deserialization.WriteRowsEventDataDeserializer.deserialize(WriteRowsEventDataDeserializer.java:32)
        at com.github.shyiko.mysql.binlog.event.deserialization.EventDeserializer.deserializeEventData(EventDeserializer.java:337)
        at com.github.shyiko.mysql.binlog.event.deserialization.EventDeserializer.nextEvent(EventDeserializer.java:246)
        at io.debezium.connector.mysql.MySqlStreamingChangeEventSource$1.nextEvent(MySqlStreamingChangeEventSource.java:238)
        at com.github.shyiko.mysql.binlog.BinaryLogClient.listenForEventPackets(BinaryLogClient.java:1068)
        at com.github.shyiko.mysql.binlog.BinaryLogClient.connect(BinaryLogClient.java:648)
        at com.github.shyiko.mysql.binlog.BinaryLogClient$7.run(BinaryLogClient.java:949)
        at java.lang.Thread.run(java.base@11.0.16/Thread.java:829)

"blc-keepalive-fpif1-txnrepodbl1:3306" #31835 prio=5 os_prio=0 cpu=53.87ms elapsed=146955.62s tid=0x00007fe9b8001000 nid=0x7e5a waiting on condition  [0x00007feb8190c000]
   java.lang.Thread.State: WAITING (parking)
        at jdk.internal.misc.Unsafe.park(java.base@11.0.16/Native Method)
        - parking to wait for  <0x00007ff27751cfa0> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
        at java.util.concurrent.locks.LockSupport.park(java.base@11.0.16/LockSupport.java:194)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(java.base@11.0.16/AbstractQueuedSynchronizer.java:885)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(java.base@11.0.16/AbstractQueuedSynchronizer.java:917)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base@11.0.16/AbstractQueuedSynchronizer.java:1240)
        at java.util.concurrent.locks.ReentrantLock.lock(java.base@11.0.16/ReentrantLock.java:267)
        at sun.security.ssl.SSLSocketImpl.closeNotify(java.base@11.0.16/SSLSocketImpl.java:732)
        at sun.security.ssl.TransportContext.closeNotify(java.base@11.0.16/TransportContext.java:262)
        at sun.security.ssl.TransportContext.initiateOutboundClose(java.base@11.0.16/TransportContext.java:573)
        at sun.security.ssl.TransportContext.closeOutbound(java.base@11.0.16/TransportContext.java:557)
        at sun.security.ssl.SSLSocketImpl.shutdownOutput(java.base@11.0.16/SSLSocketImpl.java:871)
        at com.github.shyiko.mysql.binlog.network.protocol.PacketChannel.close(PacketChannel.java:125)
        at com.github.shyiko.mysql.binlog.BinaryLogClient.disconnectChannel(BinaryLogClient.java:1361)
        at com.github.shyiko.mysql.binlog.BinaryLogClient.terminateConnect(BinaryLogClient.java:1345)
        at com.github.shyiko.mysql.binlog.BinaryLogClient.access$800(BinaryLogClient.java:96)
        at com.github.shyiko.mysql.binlog.BinaryLogClient$5.run(BinaryLogClient.java:894)
        at java.util.concurrent.Executors$RunnableAdapter.call(java.base@11.0.16/Executors.java:515)
        at java.util.concurrent.FutureTask.run(java.base@11.0.16/FutureTask.java:264)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.16/ThreadPoolExecutor.java:1128)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.16/ThreadPoolExecutor.java:628)
        at java.lang.Thread.run(java.base@11.0.16/Thread.java:829)

message in the log file

INFO: Keepalive: Trying to restore lost connection to fpif1-txnrepodbl1:3306

waiting forever

Given the code, it looks like this affects master ([0.28.3])

Naros commented 7 months ago

Hi, so I'm wondering, based on looking at the JDK11 bug history, if this may not be related to some past issues with JDK11 and SSL connections hitting a deadlock scenario where you have this unique combination of the socket shutting down:

https://bugs.openjdk.org/browse/JDK-8207004 https://bugs.openjdk.org/browse/JDK-8207009

Can you please share what JDK vendor you're using? Would it be possible to upgrade to 11.0.20 and see whether this problem goes away, or better yet move to JDK17 as JDK11 is being EoL in October 2024?

aadant commented 7 months ago

@Naros thanks, looks like it, indeed we use this

# java --version
openjdk 11.0.16 2022-07-19
OpenJDK Runtime Environment 18.9 (build 11.0.16+8)
OpenJDK 64-Bit Server VM 18.9 (build 11.0.16+8, mixed mode, sharing)
aadant commented 7 months ago

@Naros unfortunately the issue continues to happen with OpenJDK 17 ...

bash-4.4# java --version
openjdk 17.0.2 2022-01-18
OpenJDK Runtime Environment (build 17.0.2+8-86)
OpenJDK 64-Bit Server VM (build 17.0.2+8-86, mixed mode, sharing)
aadant commented 7 months ago

stack trace involving shyiko :

"blc-fpif1-txnrepodbl1:3306" #77 prio=5 os_prio=0 cpu=3406817.79ms elapsed=71952.46s tid=0x00007feb4423f7a0 nid=0x11a runnable  [0x00007feb5affd000]
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.FileDispatcherImpl.write0(java.base@17.0.2/Native Method)
        at sun.nio.ch.SocketDispatcher.write(java.base@17.0.2/SocketDispatcher.java:62)
        at sun.nio.ch.NioSocketImpl.tryWrite(java.base@17.0.2/NioSocketImpl.java:398)
        at sun.nio.ch.NioSocketImpl.implWrite(java.base@17.0.2/NioSocketImpl.java:413)
        at sun.nio.ch.NioSocketImpl.write(java.base@17.0.2/NioSocketImpl.java:440)
        at sun.nio.ch.NioSocketImpl$2.write(java.base@17.0.2/NioSocketImpl.java:826)
        at java.net.Socket$SocketOutputStream.write(java.base@17.0.2/Socket.java:1035)
        at sun.security.ssl.SSLSocketOutputRecord.flush(java.base@17.0.2/SSLSocketOutputRecord.java:271)
        at sun.security.ssl.OutputRecord.changeWriteCiphers(java.base@17.0.2/OutputRecord.java:224)
        at sun.security.ssl.KeyUpdate$KeyUpdateProducer.produce(java.base@17.0.2/KeyUpdate.java:321)
        at sun.security.ssl.KeyUpdate$KeyUpdateKickstartProducer.produce(java.base@17.0.2/KeyUpdate.java:171)
        at sun.security.ssl.SSLHandshake.kickstart(java.base@17.0.2/SSLHandshake.java:542)
        at sun.security.ssl.PostHandshakeContext.kickstart(java.base@17.0.2/PostHandshakeContext.java:69)
        at sun.security.ssl.TransportContext.kickstart(java.base@17.0.2/TransportContext.java:251)
        at sun.security.ssl.SSLSocketImpl.startHandshake(java.base@17.0.2/SSLSocketImpl.java:443)
        at sun.security.ssl.SSLSocketImpl.startHandshake(java.base@17.0.2/SSLSocketImpl.java:421)
        at sun.security.ssl.SSLSocketImpl.tryKeyUpdate(java.base@17.0.2/SSLSocketImpl.java:1543)
        at sun.security.ssl.SSLSocketImpl.decode(java.base@17.0.2/SSLSocketImpl.java:1515)
        at sun.security.ssl.SSLSocketImpl.readApplicationRecord(java.base@17.0.2/SSLSocketImpl.java:1474)
        at sun.security.ssl.SSLSocketImpl$AppInputStream.read(java.base@17.0.2/SSLSocketImpl.java:1059)
        at com.github.shyiko.mysql.binlog.io.ByteArrayInputStream.readWithinBlockBoundaries(ByteArrayInputStream.java:266)
        at com.github.shyiko.mysql.binlog.io.ByteArrayInputStream.read(ByteArrayInputStream.java:245)
        at com.github.shyiko.mysql.binlog.io.ByteArrayInputStream.fill(ByteArrayInputStream.java:112)
        at com.github.shyiko.mysql.binlog.io.ByteArrayInputStream.read(ByteArrayInputStream.java:105)
        at com.github.shyiko.mysql.binlog.event.deserialization.AbstractRowsEventDataDeserializer.deserializeNewDecimal(AbstractRowsEventDataDeserializer.java:258)
        at com.github.shyiko.mysql.binlog.event.deserialization.AbstractRowsEventDataDeserializer.deserializeCell(AbstractRowsEventDataDeserializer.java:169)
        at com.github.shyiko.mysql.binlog.event.deserialization.AbstractRowsEventDataDeserializer.deserializeRow(AbstractRowsEventDataDeserializer.java:143)
        at com.github.shyiko.mysql.binlog.event.deserialization.UpdateRowsEventDataDeserializer.deserializeRows(UpdateRowsEventDataDeserializer.java:72)
        at com.github.shyiko.mysql.binlog.event.deserialization.UpdateRowsEventDataDeserializer.deserialize(UpdateRowsEventDataDeserializer.java:58)
        at com.github.shyiko.mysql.binlog.event.deserialization.UpdateRowsEventDataDeserializer.deserialize(UpdateRowsEventDataDeserializer.java:33)
        at com.github.shyiko.mysql.binlog.event.deserialization.EventDeserializer.deserializeEventData(EventDeserializer.java:337)
        at com.github.shyiko.mysql.binlog.event.deserialization.EventDeserializer.nextEvent(EventDeserializer.java:246)
        at io.debezium.connector.mysql.MySqlStreamingChangeEventSource$1.nextEvent(MySqlStreamingChangeEventSource.java:238)
        at com.github.shyiko.mysql.binlog.BinaryLogClient.listenForEventPackets(BinaryLogClient.java:1068)
        at com.github.shyiko.mysql.binlog.BinaryLogClient.connect(BinaryLogClient.java:648)
        at com.github.shyiko.mysql.binlog.BinaryLogClient$7.run(BinaryLogClient.java:949)
        at java.lang.Thread.run(java.base@17.0.2/Thread.java:833)

"blc-keepalive-fpif1-txnrepodbl1:3306" #79 prio=5 os_prio=0 cpu=30.29ms elapsed=71952.43s tid=0x00007feb484dd850 nid=0x11c waiting on condition  [0x00007feb5aefd000]
   java.lang.Thread.State: WAITING (parking)
        at jdk.internal.misc.Unsafe.park(java.base@17.0.2/Native Method)
        - parking to wait for  <0x00007fefd423e348> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
        at java.util.concurrent.locks.LockSupport.park(java.base@17.0.2/LockSupport.java:211)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base@17.0.2/AbstractQueuedSynchronizer.java:715)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base@17.0.2/AbstractQueuedSynchronizer.java:938)
        at java.util.concurrent.locks.ReentrantLock$Sync.lock(java.base@17.0.2/ReentrantLock.java:153)
        at java.util.concurrent.locks.ReentrantLock.lock(java.base@17.0.2/ReentrantLock.java:322)
        at sun.security.ssl.SSLSocketImpl.closeNotify(java.base@17.0.2/SSLSocketImpl.java:726)
        at sun.security.ssl.TransportContext.closeNotify(java.base@17.0.2/TransportContext.java:280)
        at sun.security.ssl.TransportContext.initiateOutboundClose(java.base@17.0.2/TransportContext.java:591)
        at sun.security.ssl.TransportContext.closeOutbound(java.base@17.0.2/TransportContext.java:575)
        at sun.security.ssl.SSLSocketImpl.shutdownOutput(java.base@17.0.2/SSLSocketImpl.java:865)
        at com.github.shyiko.mysql.binlog.network.protocol.PacketChannel.close(PacketChannel.java:125)
        at com.github.shyiko.mysql.binlog.BinaryLogClient.disconnectChannel(BinaryLogClient.java:1361)
        at com.github.shyiko.mysql.binlog.BinaryLogClient.terminateConnect(BinaryLogClient.java:1345)
        at com.github.shyiko.mysql.binlog.BinaryLogClient.access$800(BinaryLogClient.java:96)
        at com.github.shyiko.mysql.binlog.BinaryLogClient$5.run(BinaryLogClient.java:894)
        at java.util.concurrent.Executors$RunnableAdapter.call(java.base@17.0.2/Executors.java:539)
        at java.util.concurrent.FutureTask.run(java.base@17.0.2/FutureTask.java:264)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@17.0.2/ThreadPoolExecutor.java:1136)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@17.0.2/ThreadPoolExecutor.java:635)
        at java.lang.Thread.run(java.base@17.0.2/Thread.java:833)
Naros commented 7 months ago

Hi @aadant, can you describe the steps on how to reproduce it reliably?

aadant commented 7 months ago

It happens in the context of debezium embedded (Altinity sink connector). The problem occurs several times a week. There is an open issue there. It is a random issue though.

comrada commented 1 month ago

Hi, bringing this topic up again because we have encountered it in our company.

Given: AWS Aurora 3, AWS MSK, Confluent Kafka Connect 7.6.1, Debezium 2.6.1. openjdk version "11.0.22" 2024-01-16 LTS OpenJDK Runtime Environment Zulu11.70+15-CA (build 11.0.22+7-LTS) OpenJDK 64-Bit Server VM Zulu11.70+15-CA (build 11.0.22+7-LTS, mixed mode)

Problems started after MySQL upgrade from version 5.7 to 8 i.e. Aurora 2 -> 3. In the logs we see the message Keepalive: Trying to restore lost connection to xxxx-xxxx-xxx.rds.amazonaws.com:3306 and silence. First of all I tried setting the use.nongraceful.disconnect flag, but that had no effect. Then I started digging deeper. I modified your connector code a bit, added logs to it after each step and found two things:

  1. For some reason, setting the use.nongraceful.disconnect flag in the Debezium connector does not cause the useNonGracefulDisconnect field to be set to the appropriate value. I verified this by adding the output of the field values to the log here. So I just hardcoded true wherever that flag is used.
  2. The hang occurs when attempting a half-close here, and the workaround implemented here added setting SO_LINGER after the half-close attempt but before closing. I'm not an expert in TCP, but I assumed that this option had something to do with the half-close, so I moved this block up to the beginning of the close() method and it worked, the connection immediately started to close and the connector re-established the connection.

About reliably reproducing the bug. Based on what I described above, namely that the hang happens on half-close, I'm not sure how it can be reproduced. In our case, Aurora DB sometimes has such extreme loads that it just dies for a few minutes, at which point Debezium stops getting heartbeats in response and tries to restart BinaryLogClient. In the case of a read socket close, an SSLException occurs:

javax.net.ssl.SSLException: closing inbound before receiving peer's close_notify
    at java.base/sun.security.ssl.SSLSocketImpl.shutdownInput(SSLSocketImpl.java:848)
    at java.base/sun.security.ssl.SSLSocketImpl.shutdownInput(SSLSocketImpl.java:826)
    at com.github.shyiko.mysql.binlog.network.protocol.PacketChannel.close(PacketChannel.java:141)
    at com.github.shyiko.mysql.binlog.BinaryLogClient.disconnectChannel(BinaryLogClient.java:1386)
    at com.github.shyiko.mysql.binlog.BinaryLogClient.terminateConnect(BinaryLogClient.java:1361)
    at com.github.shyiko.mysql.binlog.BinaryLogClient.terminateConnect(BinaryLogClient.java:1356)
    at com.github.shyiko.mysql.binlog.BinaryLogClient.disconnect(BinaryLogClient.java:1327)
    at io.debezium.connector.mysql.MySqlStreamingChangeEventSource$ReaderThreadLifecycleListener.onCommunicationFailure(MySqlStreamingChangeEventSource.java:1132)
    at com.github.shyiko.mysql.binlog.BinaryLogClient.listenForEventPackets(BinaryLogClient.java:1106)
    at com.github.shyiko.mysql.binlog.BinaryLogClient.connect(BinaryLogClient.java:655)
    at com.github.shyiko.mysql.binlog.BinaryLogClient$7.run(BinaryLogClient.java:958)
    at java.base/java.lang.Thread.run(Thread.java:829)

it is ignored and proceeds to a write socket close, where the endless waiting takes place, because in our case the server is in a coma and not responding. How to reproduce this, I don't know. I hope my experiment will give someone useful insights.

P.S. Recommendation from me: please add more logs, at least with DEBUG level.

jpechane commented 1 month ago

@comrada Debezium 2.6.1 contains 0.29.1 of binlog client. The issue should be fixed in 0.29.2. Could you please try to use it in yor deployment if it helps with the problem?

phamlehieu commented 1 month ago

Hi @comrada , does the 0.29.2 work for you? I'm facing a similar issue after upgrading to Aurora v3.

comrada commented 1 month ago

@jpechane no, 0.29.2 does not fix this problem, because as I wrote in my comment, the infinite wait in my case happens while trying to close the socket for writing, and the 0.29.2 fix sets the SO_LINGER property after this call. In my fix (based on 0.29.2), I moved the property setting to the beginning of the close() method and it works for me. image

Vittalgc commented 2 weeks ago

Hi @comrada We're also on Aurora v3 and running into the same deadlock issue, although I'm not sure if the disconnect happens because of aurora dying sometimes like you mentioned in your case. For this change you mention though, closing the socket without linger is generally discouraged from what I've read online. So I guess this should be directed towards the people that implemented this initially to address this issue.

But I wanted to ask if this can lead to any data loss? Has this fix been working fine for you ?