oracle / oracle-r2dbc

R2DBC Driver for Oracle Database
https://oracle.com
Other
193 stars 40 forks source link

Issue with microservice using r2dbc oracle inside kubernetes cluster. (Request timeouts) #129

Closed manuelgdlvh closed 8 months ago

manuelgdlvh commented 11 months ago

When this app is deployed in a local Kubernetes cluster with a basic configuration, when in our service execute a query pointing to a Oracle database using r2dbc oracle driver it's stuck with no response on a infinite wait without errors (session is not created on DB)

If we test exactly same but running in local or container (without Kubernetes), it works.

This is all tests done:

Michael-A-McMahon commented 10 months ago

Hi @ManuelGDLVH. Thanks for reporting this.

If no database session is created, then it sounds like the driver is stuck at some step when opening a connection. Would it be possible to share a thread dump, such as one from a jstack command? This might tell us where the driver has become stuck.

manuelgdlvh commented 10 months ago

Hi @Michael-A-McMahon this is the thread dump: threaddump.txt For all you need, im here!

Michael-A-McMahon commented 10 months ago

Thanks for the thread dump. I can't pin point a root cause just yet, but we do have some information. I have some questions which I think will help us find a resolution:

Have you tried opening a plain JDBC connection in this environment? I wonder if Oracle R2DBC is sending a login request to the database, and simply not receiving a response? If this is the case, then we should be able to see the same behavior with Oracle JDBC.

Are there any errors appearing in logs or stderr? It may be the case that the connection is failing, and the Exception is getting dropped somewhere along the way.

Is it possible that -Djava.util.concurrent.ForkJoinPool.common.parallelism=0? If this system property is set, then Oracle R2DBC/JDBC will have no threads to execute asynchronous callbacks. The driver will submit a task to the common ForkJoinPool, which will then silently fail to execute it.

Below are just my notes about the thread dump:

A thread named "oracle.jdbc.driver.BlockSource.ThreadedCachingBlockSource.BlockReleaser" is present. This thread is used by Oracle JDBC to manage a memory pool for bind data and row data. It is only started when a connection is requested from the driver. So we know that some part of this application, possibly Oracle R2DBC, requested a java.sql.Connection from the Oracle JDBC Driver.

A thread named "oracle.net.nt.TcpMultiplexer" is present. This thread is used by Oracle JDBC to poll for network readiness. It is only started when a method of Oracle JDBC's Reactive Extensions is called. So we know that some part of this application, almost certainly Oracle R2DBC, called a reactive method of Oracle JDBC.

No thread appears to be stuck at oracle.jdbc.driver.RestrictedLock.lock(RestrictedLock.java:247). This was a symptom of deadlocks we saw in earlier releases of Oracle R2DBC. (https://github.com/oracle/oracle-r2dbc/issues/48#issuecomment-914304356). We can rule this out as a root cause.

No thread having a name like "ForkJoinPool.commonPool-worker-" is present. By default, Oracle R2DBC/JDBC use the common ForkJoinPool to execute callbacks for reactive methods. If a callback has ever executed, then we should see at least one common FJP thread. I did note that the JavaDoc of FJP states "... threads are slowly reclaimed during periods of non-use ...". So could this thread dump have occured after an period of non-use, when all common FJP threads had been stopped? This might explain why we don't see them in the dump. By default, the number of common FJP threads appears to be processors minus 1. However, if there is just one processor, then there is 1 common FJP thread. A non-default number of common FJP threads can be configured with -Djava.util.concurrent.ForkJoinPool.common.parallelism. If this system property is set to 0, then tasks submitted to the common FJP are never executed (at least never executed asynchronously). Another possibility is that Oracle JDBC simply never received a network response from the database. If no response is received, then no async callback is executed, and no common FJP thread ever needs to be created.

manuelgdlvh commented 10 months ago

Hi @Michael-A-McMahon , the most strange thing is that with jdbc inside k8s cluster works well and with the service dockerized and running outside cluster works with jdbc and r2dbc.

Michael-A-McMahon commented 10 months ago

If Oracle JDBC can connect synchronously, then something specific to its async/reactive code must be causing the failure. Could be something about using java.nio.channels.Selector, or something about using the common ForkJoinPool. As we don't see any FJP threads in the dump, I can think of two possibilities:

  1. The Selector never sees read-readiness on the socket, meaning that the database never responded to JDBC's connection request.
  2. The common FJP is configured to use 0 threads. I believe that can only happen if java.util.concurrent.ForkJoinPool.common.parallelism=0.

The first possibility seems more likely at this point, as the second would require a non-default system property setting which I think you've already checked for. We could completely rule out the second possibility by testing with: -Djava.util.concurrent.ForkJoinPool.common.parallelism=1

As an experiment, we could also try updating the ojdbc11 dependency:

      <groupId>com.oracle.database.jdbc</groupId>
      <artifactId>ojdbc11</artifactId>
      <version>23.2.0.0</version>

Oracle R2DBC currently depends on the 21.7.0.0 build. But code for reactive extensions has been significantly overhauled in the 23 version. It's possible we would get a different result with 23.

*Edited the artifactId

manuelgdlvh commented 10 months ago

Hi @Michael-A-McMahon , I did the lasts tests using the jvm flag that you said me and now its working properly. I will investigate the interaction between kubernetes - jvm because using the same image in docker works. All information found will be shared with you and community. Thanks for all!

Michael-A-McMahon commented 10 months ago

That is interesting! I was convinced that FJP should have at least 1 thread, unless explicitly configured otherwise. But it sounds like you had to explicitly configure parallelism to be 1 or greater. Looking forward to learning more.

manuelgdlvh commented 10 months ago

Perhaps that is a bug related with the containerized environments and the JVM. Link to Bug here: https://bugs.openjdk.org/browse/JDK-8279484

Michael-A-McMahon commented 10 months ago

Fascinating. I think Oracle R2DBC could provide a safe guard for this situation, something rooted in this pseudo-ish code:

static final Executor DEFAULT_EXECUTOR;
static {
  int commonFjpParallelism = ForkJoinPool.commonPool().parallelism();
  if (commonFjpParallelism < 1)
    DEFAULT_EXECUTOR = Executors.newSingleThreadExecutor();
  else
    DEFAULT_EXECUTOR = ForkJoinPool.commonPool();
}

Let's leave this issue open so I can look further into a fix. This is a nasty bug, and I don't want other users to hit it.

Michael-A-McMahon commented 10 months ago

Hi again @ManuelGDLVH. Could you share the JDK version where this issue occurred? I'm wondering if it may have been caused by https://bugs.openjdk.org/browse/JDK-8274349

manuelgdlvh commented 10 months ago

Hi @Michael-A-McMahon,

The JDK version where the issue occurs for me is the one provided by the openjdk17-alpine Docker image.

image

In addition to the solution of adding the JVM flag for parallelism, I also tested setting the minimum CPU and RAM resources in the Kubernetes deployment, and it worked well too. In my test, I set the minimum CPU to 1 and RAM to 1GB.

I believe that when using only Docker, the JVM can effectively read the resources of the host machine. However, in Kubernetes, if we don't provide minimum resource settings, it may not be able to correctly determine the available machine resources.

Anything I can help you with, feel free to contact me.

Michael-A-McMahon commented 9 months ago

Hi @ManuelGDLVH. Very sorry I left you hanging on this. I got wrapped up in preparations for CloudWorld 2023.

Your JDK version matches that of the afflicted version in https://bugs.openjdk.org/browse/JDK-8274349

If you update to the latest JDK 17, I think you'll find the issue resolves itself without having to set "java.util.concurrent.ForkJoinPool.common.parallelism", and without having to configure resources of Kubernetes. I could be wrong about this, so it would be good if you have a chance to test and verify.

At this point, I think the only case Oracle R2DBC needs to check for is "java.util.concurrent.ForkJoinPool.common.parallelism=0". That should be the only way we get an FJP with zero threads. The other way, which I think you had, was a bug in the JDK. If the container has one CPU core, the common FJP would incorrectly configure itself with zero threads. The correct behavior, which is present in later builds of JDK 17, is to have 1 thread.

volcanic-tephra commented 8 months ago

I've just spent days troubleshooting an AWS Fargate deployment where the request to get a connection would hang. This was because the ForkJoinPool.commonPool (default) had no free threads.

To work around this I supplied the connection factory with a custom executor:

        var executor = Executors.newFixedThreadPool(threads);

        var factory = ConnectionFactories.get(
                ConnectionFactoryOptions.builder()
                  .option(OracleR2dbcOptions.EXECUTOR, executor)
                  .option(ConnectionFactoryOptions.DRIVER, "pool")
                  .option(ConnectionFactoryOptions.PROTOCOL, "oracle")
                  .option(ConnectionFactoryOptions.HOST, properties.getHost())
                  .option(ConnectionFactoryOptions.PORT, properties.getPort())
                  .option(ConnectionFactoryOptions.DATABASE, properties.getDatabase())
                  .option(ConnectionFactoryOptions.USER, properties.getUser())
                  .option(ConnectionFactoryOptions.CONNECT_TIMEOUT, properties.getConnectionTimeout() == null 
                                                                    ? Duration.ofSeconds(10) : properties.getConnectionTimeout())
                  .option(ConnectionFactoryOptions.STATEMENT_TIMEOUT, properties.getStatementTimeout() == null 
                    ? Duration.ofSeconds(10) : properties.getStatementTimeout())                  
                  .option(ConnectionFactoryOptions.PASSWORD, properties.getPassword())
                  .build());

I raised this issue in https://github.com/oracle/oracle-r2dbc/issues/136