neo4j / neo4j-java-driver

Neo4j Bolt driver for Java
Apache License 2.0

Connection timeout & connection liveness check ignored or not working correctly #653

Closed haris-zynka closed 4 years ago

haris-zynka commented 4 years ago

I got "java.io.IOException: An existing connection was forcibly closed by the remote host" after at least 19 seconds locally and on Azure it probably takes over 15 minutes (as I stopped after 15 and didn't check if it actually resets ever). I tried to customise connection acquisition timeout, and connection timeout, connection liveness check timeout. Whatever numbers I put in it still produces the same problem.

The latest config looked like this

Config config = Config.builder()
        .withConnectionAcquisitionTimeout(1, TimeUnit.SECONDS)
        .withConnectionTimeout(1, TimeUnit.SECONDS)
        .withConnectionLivenessCheckTimeout(500, TimeUnit.MILLISECONDS)
        .withMaxConnectionPoolSize(3)
        .withMaxTransactionRetryTime(2, TimeUnit.SECONDS)
        .build();

And the log looks like this after it manages to reset

Nov 25, 2019 1:51:40 PM org.neo4j.driver.internal.logging.JULogger warn
WARNING: [0xe38c169a][94.245.90.10:7687][bolt-745] Fatal error occurred in the pipeline
java.io.IOException: An existing connection was forcibly closed by the remote host
    at java.base/sun.nio.ch.SocketDispatcher.read0(Native Method)
    at java.base/sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:43)
    at java.base/sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:276)
    at java.base/sun.nio.ch.IOUtil.read(IOUtil.java:233)
    at java.base/sun.nio.ch.IOUtil.read(IOUtil.java:223)
    at java.base/sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:358)
    at org.neo4j.driver.internal.shaded.io.netty.buffer.PooledByteBuf.setBytes(PooledByteBuf.java:247)
    at org.neo4j.driver.internal.shaded.io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1147)
    at org.neo4j.driver.internal.shaded.io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:347)
    at org.neo4j.driver.internal.shaded.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:148)
    at org.neo4j.driver.internal.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:700)
    at org.neo4j.driver.internal.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:635)
    at org.neo4j.driver.internal.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:552)
    at org.neo4j.driver.internal.shaded.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:514)
    at org.neo4j.driver.internal.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$6.run(SingleThreadEventExecutor.java:1044)
    at org.neo4j.driver.internal.shaded.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at org.neo4j.driver.internal.shaded.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.base/java.lang.Thread.run(Thread.java:834)

Neo4j Version: 3.5.12 Enterprise
Neo4j Mode: Single instance
Driver version: Java driver 4.0.0
Operating System: Windows 10 / Ubuntu 18 on Azure
OGM: 3.2.2
Other: Micronaut 1.2.6

Steps to reproduce

  1. Make an HTTP call that queries the DB through the driver (a minimal sketch of such a call follows after the behaviour notes below)
  2. Wait for the connections to expire (roughly 20 minutes)
  3. Make another HTTP call that forces the app to query the DB

Expected behavior

Wait at most 500 ms to check the connection, then a little longer to create a new one and execute the query.

Actual behavior

Waits longer than 10 seconds to realise that all connections are dead and a new one has to be created.
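
For illustration (not part of the original report), a minimal sketch of the kind of query the HTTP handler triggers; the URI, credentials, and query are placeholders:

    import java.util.List;
    import org.neo4j.driver.AuthTokens;
    import org.neo4j.driver.Driver;
    import org.neo4j.driver.GraphDatabase;
    import org.neo4j.driver.Session;

    public class ReproSketch {
        // Placeholder URI and credentials.
        private static final Driver driver =
                GraphDatabase.driver("bolt://localhost:7687", AuthTokens.basic("neo4j", "password"));

        // Called from the HTTP handler; after the pooled connections expire, the next
        // call stalls until the driver finally notices the connection is dead.
        static List<String> labels() {
            try (Session session = driver.session()) {
                return session.run("CALL db.labels()").list(r -> r.get(0).asString());
            }
        }
    }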

zhenlineo commented 4 years ago

Hi @haris-zynka,

Try withMaxConnectionLifetime=15mins. This ensures the driver closes connections once they are 15 minutes old, before Azure marks them as timed out.

Your other connection settings are not directly related to the problem you are trying to solve. You can read more about connection settings here. The documentation currently targets the 1.7 drivers, but the configs are almost the same in 4.0.
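
For illustration (not part of Zhen's original comment), a minimal sketch of this setting with the 4.0 driver; the URI and credentials are placeholders:

    import java.util.concurrent.TimeUnit;
    import org.neo4j.driver.AuthTokens;
    import org.neo4j.driver.Config;
    import org.neo4j.driver.Driver;
    import org.neo4j.driver.GraphDatabase;

    // Retire pooled connections after 15 minutes, before Azure silently drops them.
    Config config = Config.builder()
            .withMaxConnectionLifetime(15, TimeUnit.MINUTES)
            .build();

    // Placeholder URI and credentials.
    Driver driver = GraphDatabase.driver("bolt://localhost:7687",
            AuthTokens.basic("neo4j", "password"), config);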

Let us know if this solves your problem. Cheers, Zhen

haris-zynka commented 4 years ago

Thanks for the answer @zhenlineo . I'll try this one and report back or close the issue if it's fixed.

haris-zynka commented 4 years ago

@zhenlineo it appears that helped with the problem to some extent. Now I get exceptions that fail the request and tell me to "check if the database is available", along with the IP and port of the database. For now this is considerably better and I suppose it can be handled this way. I'm still wondering, though, why it initially took so long for the driver to realise that the connection was not available and to open a new one.

zhenlineo commented 4 years ago

Hi @haris-zynka,

If you run your queries inside a transaction function, i.e. session.readTransaction or session.writeTransaction, then your transactions will be retried automatically when a connection error happens.
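
For illustration (not from the original comment), a minimal sketch of a transaction function with the 4.0 driver; the query, label, and property names are placeholders:

    import java.util.List;
    import org.neo4j.driver.Driver;
    import org.neo4j.driver.Session;

    // The unit of work passed to readTransaction is retried automatically on
    // connection errors, up to maxTransactionRetryTime.
    static List<String> personNames(Driver driver) {
        try (Session session = driver.session()) {
            return session.readTransaction(tx ->
                    tx.run("MATCH (p:Person) RETURN p.name AS name")
                      .list(record -> record.get("name").asString()));
        }
    }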

To answer your question about why it initially takes so long for the driver to detect that the connection is not available: when Azure times out a connection, it does not shut down TCP properly (no FIN/RST teardown); it probably just drops any incoming data on that port. As a result, the driver can only rely on the OS-level TCP retransmission timeout to detect that the connection is gone. We have received many reports of this issue from Azure users. The connection lifetime setting was added precisely to solve it.

Cheers, Zhen

haris-zynka commented 4 years ago

Thank you so much. I was wondering why the server takes so much longer than my local PC. That said, I have run into too many problems with Azure for now, so I'll look at other solutions and also check the pricing of the Neo4j cloud offering. Our development Neo4j instance is itself deployed on Azure as a VM, where I can't even change the password (a Linux connection plugin issue) or log in to the machine after creating it unless I'm a portal admin (even though the user I created the machine with is the one blocked from logging in).