spotify / scio

A Scala API for Apache Beam and Google Cloud Dataflow.
https://spotify.github.io/scio
Apache License 2.0

JDBC IO: pipeline gets stuck on attempt to write to Postgres #4047

Open stormy-ua opened 3 years ago

stormy-ua commented 3 years ago

A pipeline has been consistently getting stuck when attempting to write to Postgres via JDBC. A thread dump on one worker revealed a number of threads waiting for a connection to be allocated from the pool:

   java.lang.Thread.State: WAITING
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for <45ac842e> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
        at org.apache.commons.pool2.impl.LinkedBlockingDeque.takeFirst(LinkedBlockingDeque.java:581)
        at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:437)
        at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:354)
        at org.apache.commons.dbcp2.PoolingDataSource.getConnection(PoolingDataSource.java:134)
        at org.apache.commons.dbcp2.BasicDataSource.getConnection(BasicDataSource.java:734)
        at org.apache.beam.sdk.io.jdbc.JdbcIO$WriteVoid$WriteFn.executeBatch(JdbcIO.java:1449)
        at org.apache.beam.sdk.io.jdbc.JdbcIO$WriteVoid$WriteFn.processElement(JdbcIO.java:1398)
        at org.apache.beam.sdk.io.jdbc.JdbcIO$WriteVoid$WriteFn$DoFnInvoker.invokeProcessElement(Unknown Source)

There were 11 such threads waiting for a connection from the pool, while the other workers were idle. It looks like this single worker was holding the watermark back, so the pipeline stopped making progress and appeared stuck. The default maximum number of pooled connections is 8 according to this, and Beam neither overrides it nor exposes it as a separate config for bumping it. In turn, Scio doesn't support this either. There is a break-glass approach to configuring it, referenced in BEAM-9629.
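To make the break-glass approach concrete, here is a minimal sketch of how one could build a commons-dbcp2 `BasicDataSource` with a larger pool cap than the commons-pool2 default of 8, for use with a custom DataSource provider. The URL, credentials, and the chosen cap of 32 are placeholders, not values from this issue:

```scala
import javax.sql.DataSource
import org.apache.commons.dbcp2.BasicDataSource

// Sketch only: construct a pooled DataSource whose maxTotal is raised
// above the commons-pool2 default of 8, so 11 concurrent writers do
// not block in borrowObject() waiting for a free connection.
def makeDataSource(): DataSource = {
  val ds = new BasicDataSource()
  ds.setDriverClassName("org.postgresql.Driver")
  ds.setUrl("jdbc:postgresql://host:5432/db") // placeholder URL
  ds.setUsername("user")                      // placeholder credentials
  ds.setPassword("password")
  ds.setMaxTotal(32)                          // raise the pool cap (default is 8)
  ds
}
```

A DataSource built this way could then be handed to Beam's JdbcIO in place of the one it derives from its connection configuration, sidestepping the hard-coded default until the setting is exposed upstream.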

This work should also be done together with an investigation into why DB connections aren't reused. Does a failed batch leak a DB connection that is never returned to the pool?
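The suspected leak pattern can be illustrated with a hypothetical sketch (not Beam's actual code): if an exception escapes before `close()` is called, a pooled connection is borrowed but never returned, and the pool eventually runs dry. A `try`/`finally` guarantees the return:

```scala
import javax.sql.DataSource

// Hypothetical illustration of the suspected leak: without the
// finally blocks, an exception thrown by executeBatch() would leave
// the borrowed connection checked out of the pool forever.
def writeBatch(ds: DataSource, sql: String): Unit = {
  val conn = ds.getConnection() // borrows from the pool
  try {
    val stmt = conn.prepareStatement(sql)
    try stmt.executeBatch()
    finally stmt.close()
  } finally conn.close() // for a pooled connection, returns it to the pool
}
```

If JdbcIO's batch-failure path skips such a cleanup, repeated failures would exhaust the 8-connection pool and produce exactly the parked `borrowObject` threads seen in the dump above.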

stormy-ua commented 2 years ago

Beam issue to expose max connections in pool as a setting - https://issues.apache.org/jira/browse/BEAM-13261

kellen commented 2 years ago

Moving to 0.12.1 as this is not yet fixed in Beam.