neo4j / neo4j-dotnet-driver

Neo4j Bolt driver for .NET
Apache License 2.0
232 stars 68 forks source link

Excessive Open Connections #141

Closed roadbikemike closed 7 years ago

roadbikemike commented 7 years ago

We are having some major issues with the .NET Neo driver (v1.1). When running tests from a console app, the connections are managed as expected. However, the same code on IIS and exercised via JMeter is completely different. When the code hits an exception, in our case "Neo4j.Driver.V1.ClientException: Failed to connect to the server [some ip]:7687 within connection timeout 5000ms" the connection is not closed and they keep on rising and rising up in to the 1000's.

I have tried setting MaxIdleSessionPoolSize anywhere between 10 and 100 with no difference.

Here is my code:

string username = ConfigurationManager.AppSettings["NeoUser"]; ; string password = ConfigurationManager.AppSettings["NeoPass"]; ; string boltUri = ConfigurationManager.AppSettings["boltUri"]; string PooledConns = ConfigurationManager.AppSettings["PooledCons"];

var config = new Config { }; config.MaxIdleSessionPoolSize = int.Parse(PooledConns);

try
{
    using (IDriver driver = GraphDatabase.Driver(boltUri, AuthTokens.Basic(username, password), config))
    {
        using (var session = driver.Session())
        {

            var setGUID = Guid.NewGuid().ToString();

            IDictionary<string, object> param = new Dictionary<string, object>();
            param.Add("0", "User-" + setGUID);
            param.Add("1", "Display-" + setGUID);
            param.Add("2", "Name=" + setGUID);

            session.Run("CREATE(a: User { id: {0}, displayName: {1}, identity: {2}})", param).Consume();

            return Request.CreateResponse(HttpStatusCode.OK, "User successfully added!");

        }
    }
}
catch (Exception ex)
{
    Logger.Write("---------------------------------------");
    Logger.Write(ex.ToString());

    return Request.CreateResponse(HttpStatusCode.InternalServerError, ex.ToString());
}

}

zhenlineo commented 7 years ago

Hi, I think the problem that you describe indicates that we did not close the connection when failed to connect. A quick look into the code indicates that we actually will aways close the connection on failed connection initialization before re-throw the error.

Setting the maxIdleSessionPoolSize will not help too much on this problem as this config will not affect how many connections could be created but how many could be buffered.

I will look into you problem and give you a fix asap.

Zhen

roadbikemike commented 7 years ago

We made some slight modifications to the test code which appears to have alleviated the issue. Basically we made the driver static along with the class constructor. Now IDriver is only being instantiated once. This may have actually solved the issue. Continuous testing appears the connection count is no longer rising. So, I think we are in good shape, but we are continuing the tests.

zhenlineo commented 7 years ago

Okay, good to know that you solved the problem.

Let us know if you find anything else outstanding.

roadbikemike commented 7 years ago

Alright, we have it. We had two JMeter clients running against a simple web service inserting a single record in to Neo. One client ran 100% without an error. The second client ran for a good portion of the time, but then started encountering errors. The connection count on the server was stable UNTIL the second client started to receive errors. Once that initial error started to occur, all successive calls start to fail and also start opening new connections. We were solid at 50 connections which quickly rose to well over 1000 once that client started in the error state.

roadbikemike commented 7 years ago

This is the stack trace on the server once the client starts to get errors:

2017-02-02 11:31:18,982 [93] INFO Neo4jLogger - Neo4j.Driver.V1.ClientException: Failed to connect to the server [OUR IP]:7687 within connection timeout 5000ms at Neo4j.Driver.Internal.Connector.SocketConnection.Init() at Neo4j.Driver.Internal.ConnectionPool.CreateNewPooledConnection() at Neo4j.Driver.Internal.ConnectionPool.b__18_0() at Neo4j.Driver.Internal.LoggerBase.TryExecute[T](Func`1 func) at Neo4j.Driver.Internal.DirectDriver.NewSession(AccessMode mode) at Neo4j.Driver.Internal.Routing.BaseDriver.Session() at Neo4jWebService.Controllers.Neo4jController.CreateNeoUser()

zhenlineo commented 7 years ago

What about the log in neo4j? do you get any error there too when the client failed to connect?

roadbikemike commented 7 years ago

Server side, we see:

2017-02-02 16:50:35.898+0000 ERROR [o.n.b.t.SocketTransportHandler] Fatal error occurred when handling a client connection: Connection reset by peer Connection reset by peer java.io.IOException: Connection reset by peer

roadbikemike commented 7 years ago

We attempted catching the exception, setting the driver to null and then re-initializing. Unfortunately, that appears to abandon the previous connection and simply open a new one. This effectively puts us back in the same position. Connections climbed well over 2500 in a very short period of time.

roadbikemike commented 7 years ago

Something else we noticed, the connections on the Neo side persist. They never seem to go away which is interesting. I let the 2300+ connections sit on the Neo server and after an hour, they were still there.

zhenlineo commented 7 years ago

Thanks for reporting the bug.

Sounds like a leak in the connections. It seems, the client opened a connection without registering it to connection pool due to the exception. So the connection was not closed or monitored by the driver. Then the server think the client is still using it while the client does not know this connection. Therefore when you close the driver, the connection was not closed. I will check if there is such a path to make this leak possible.

But do you have any clue why your system initial got this error: Neo4j.Driver.V1.ClientException: Failed to connect to the server [OUR IP]:7687 within connection timeout 5000ms?

roadbikemike commented 7 years ago

Good question. If we have two separate clients running JMeter, one being slightly throttled and the other running at full speed, the full speed JMeter will start to fail with timeouts while the other will continue to run without error. So, again it seems as if the driver is becoming unstable if pushed too hard. Once in this state, the driver is not healthy from there on out.

roadbikemike commented 7 years ago

zhenlineo, any luck on diagnosing this issue? This is becoming more and more critical for us. Is there another method to escalate the issue I should take?

zhenlineo commented 7 years ago

Sorry I was tired up by other things. I will give you a update tomorrow. I get some ideas that I would like to explore.

zhenlineo commented 7 years ago

Hi, I found the problem with connection timeout was caused by tls connections. if you could turn off tls, you should be able to establish connections really fast.

If you still receive any connection timeout 5000ms, then enlarge the timeout or set it to -1 to enforce waiting there. I tried with 5000 threads with -1 connection timeout (infinite waiting), it finished without any error.

I will try to figure out the tls and connection problems.

roadbikemike commented 7 years ago

I will try this today zhenlineo. Keep in mind though, establishing connections was not an issue. The driver would get in to a state where errors were encountered and could seemingly not get out of that state. I did not realize TLS was enabled by default, so we will disable and test. Originally, TLS was our desired state, but that is no longer the case. Either way, figuring out the issues with TLS enabled is ideal as you noted. Really appreciate everyone's help on this!

roadbikemike commented 7 years ago

Disabling encryption completely alleviated the issue! We were up over 1000 posts per second without a single issue via our web API and connections were stable using bolt! We enabled bolt+routing which was good and our preferred method. However, when bolt+routing is enabled and we point to the leader, there is no issue. If we point to one of the read nodes, we get:

"Neo4j.Driver.V1.SessionExpiredException: Server at bolt+routing://[some_ip]:7687/ no longer accepts writes\r\n at Neo4j.Driver.Internal.Connector.SocketConnection.AssertNoServerFailure()\r\n at Neo4j.Driver.Internal.Result.ResultBuilder.NextRecord()\r\n at Neo4j.Driver.Internal.Result.RecordSet.d__7.MoveNext()\r\n at Neo4j.Driver.Internal.Result.StatementResult.Consume()\r\n at Neo4jWebService.Controllers.Neo4jController.CreateNeoUser()"

This is odd, because the IP it is attempting to route to is the leader. Likewise, if we point to the load balancer in front of our Neo nodes, we get the same error. Any ideas?

roadbikemike commented 7 years ago

After doing some routing tests with the driver, it appears to NOT be routing properly. We are formulating our URL with bolt+routing:\ which should be correct. Doing a packet capture, it appears the driver continues to use the load balancer and not get handed off to the LEADER. We can see in the initial call where it gets the list back:

.)...$CALL dbms.cluster.routing.getServers......?...0.p...result_available_after..fields..ttl.servers.....q...,...addresses..10.105.2.4:7687.role.WRITE..addresses..10.105.2.5:7687.10.105.2.6:7687.role.READ..addresses..10.105.2.4:7687.10.105.2.6:7687.10.105.2.5:7687.role.ROUTE...".p...result_consumed_after..type.r...........p........;CREATE(removed).....?...$.p...result_available_after..fields....V.p..stats..labels-added..nodes-created..properties-set...result_consumed_after..type.w...........p...

neo_driver

In the uploaded screen shot from Wireshark, you can see where the client (10.5.0.34) initiates with the load balancer (10.105.3.230) and never gets handed off to the leader (there are two tests there, one in white, the other in yellow). It continues to communicate with the load balancer over and over. When we tested with the Python driver, initial communication was made to the LB to determine the leader. This was then passed back to the client and further communication would take place between the client and leader directly. I am assuming the .NET driver should do the same?

zhenlineo commented 7 years ago

Oh, make sure that you are using the 1.1.1 release, I fixed a bug might be the problem that you are describing.

roadbikemike commented 7 years ago

We are using .NET driver 1.1. I am upgrading to 1.1.1 through Visual Studio now...

Your assumption is exactly correct. We have three CORE nodes with one being the leader (obviously). The load balancer sits in front of these three nodes. If we point the client directly at the leader, we get no failures. If we point the client at one of the other non-leader CORE nodes directly, we get the error above. If we point at the load balancer, we occasionally get the "Server at bolt+routing://10.105.2.4:7687/ no longer accepts writes" on the client. In this case, 10.105.2.4 is indeed the leader, so that should not be an issue.

roadbikemike commented 7 years ago

We upgraded to 1.1.1 and YES! Completely different behavior on the network side and we see the routing working as intended. I have been running tests and no failures going through the load balancer. Going to continue testing. You guys rock!

AviAvni commented 7 years ago

@zhenlineo

I'm getting the same exception with the 1.1.1 package when working with neo4j community edition 3.1

the code:

Parallel.ForEach(followers, follower =>
{
    using (var session = driver.Session()) // exception
    {
        session.Run(...).Consume()
    }
});

the exception

Failed to connect to the server 10.0.0.212:7687 within connection timeout 5000ms
   at Neo4j.Driver.Internal.Connector.SocketConnection.Init() in Z:\BuildAgent\work\9ae188c903d07190\Neo4j.Driver\Neo4j.Driver\Internal\Connector\SocketConnection.cs:line 82
   at Neo4j.Driver.Internal.ConnectionPool.CreateNewPooledConnection() in Z:\BuildAgent\work\9ae188c903d07190\Neo4j.Driver\Neo4j.Driver\Internal\ConnectionPool.cs:line 99
   at Neo4j.Driver.Internal.ConnectionPool.<Acquire>b__19_0() in Z:\BuildAgent\work\9ae188c903d07190\Neo4j.Driver\Neo4j.Driver\Internal\ConnectionPool.cs:line 126
   at Neo4j.Driver.Internal.LoggerBase.TryExecute[T](Func`1 func) in Z:\BuildAgent\work\9ae188c903d07190\Neo4j.Driver\Neo4j.Driver\Internal\LoggerBase.cs:line 54
   at Neo4j.Driver.Internal.ConnectionPool.Acquire() in Z:\BuildAgent\work\9ae188c903d07190\Neo4j.Driver\Neo4j.Driver\Internal\ConnectionPool.cs:line 111
   at Neo4j.Driver.Internal.DirectDriver.NewSession(AccessMode mode) in Z:\BuildAgent\work\9ae188c903d07190\Neo4j.Driver\Neo4j.Driver\Internal\DirectDriver.cs:line 41
   at Neo4j.Driver.Internal.Routing.BaseDriver.Session() in Z:\BuildAgent\work\9ae188c903d07190\Neo4j.Driver\Neo4j.Driver\Internal\Routing\BaseDriver.cs:line 49
   at ConsoleApplication111.Program.<>c__DisplayClass2_2.<Main>b__5(Tuple`2 follower) in C:\Users\Avi\Documents\Visual Studio 2015\Projects\ConsoleApplication111\ConsoleApplication111\Program.cs:line 46
   at System.Threading.Tasks.Parallel.<>c__DisplayClass30_0`2.<ForEachWorker>b__0(Int32 i)
   at System.Threading.Tasks.Parallel.<>c__DisplayClass17_0`1.<ForWorker>b__1()
zhenlineo commented 7 years ago

@AviAvni How many concurrent thread you have to use the driver? Are you also use this driver in a .net ASP based app? Do you have ssl turned on? Will you still get the error if you turn off ssl encryption?

zhenlineo commented 7 years ago

@AviAvni Can you also try this snapshot Neo4j.Driver.1.1.2-SNAPSHOT-1702151233.nupkg in Mygut to see if you still have the error too?

Do not use any snapshot driver in real production

AviAvni commented 7 years ago

@zhenlineo concurrent thread I think 4 now i'm using console app for poc after that i'll use this in asp.net ssl is truned on (also we have another problem other laptop can't connect when ssl is on only off) I'll try the snapshot

zhenlineo commented 7 years ago

@AviAvni Hi, in my console test environment, I have no issue to connect server with ~1000 threads and ssl on. I only see this error when the thread count goes to 5000. So I am wondering if your server/client setting is wrong.

AviAvni commented 7 years ago

@zhenlineo I moved to the snapshot package and with encryption the same error without encryption no error

AviAvni commented 7 years ago

@zhenlineo akso with no encryption it happend I set the WithConnectionTimeout to 30 sec and still there is connection timeout of 30 sec

zhenlineo commented 7 years ago

Could you try to see if this you could finish this raw socket connection in 30s in your APP env? https://github.com/neo4j/neo4j-dotnet-driver/blob/1.2/Neo4j.Driver/Neo4j.Driver/Internal/Connector/TcpSocketClient.cs#L48

I can hardly understand why it failed to connect as the code path is just raw system socket conn call.

AviAvni commented 7 years ago

@zhenlineo I downloaded the 1.2 branch code and run this code and I changed the method you linked to

public async Task ConnectAsync(Uri uri, bool useTls)
        {
            Console.WriteLine("---");
            await _client.ConnectAsync(uri.Host, uri.Port).ConfigureAwait(false);

            if (!useTls)
            {
                _stream = _client.GetStream();
            }
            else
            {
                try
                {
                    _stream = new SslStream(_client.GetStream(), true,
                        (sender, certificate, chain, errors) =>
                            _encryptionManager.TrustStrategy.ValidateServerCertificate(uri, certificate, errors));

                    await ((SslStream) _stream)
                        .AuthenticateAsClientAsync(uri.Host, null, Tls12, false).ConfigureAwait(false);
                }
                catch (Exception e)
                {
                    throw new SecurityException($"Failed to establish encrypted connection with server {uri}.", e);
                }
            }

            Console.WriteLine("+++");
        }
using Neo4j.Driver.V1;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            var driver = GraphDatabase.Driver("bolt://10.0.0.212", AuthTokens.Basic("neo4j", "123456"), Config.Builder.WithConnectionTimeout(TimeSpan.FromSeconds(30)).ToConfig());

            Parallel.For(0, 20000, i =>
            {
                using (var session = driver.Session())
                {
                    session.Run("MATCH (u:User) RETURN u").Consume();
                }
            });
        }
    }
}
---
---
---
---
+++
+++
+++
+++
---
+++
---
+++
---
---
---
+++
+++
+++
---
---
---
---
---
---
+++
+++
+++
+++
+++
+++

Unhandled Exception: System.AggregateException: One or more errors occurred. ---> Neo4j.Driver.V1.ServiceUnavailableException: Connection with the server breaks due to IOException: Failed to connect to the server 10.0.0.212:7687 within connection timeout 30000ms ---> System.IO.IOException: Failed to connect to the server 10.0.0.212:7687 within connection timeout 30000ms
   at Neo4j.Driver.Internal.Connector.SocketConnection.Init() in C:\Users\Avi\Desktop\neo4j-dotnet-driver-1.2\Neo4j.Driver\Neo4j.Driver\Internal\Connector\SocketConnection.cs:line 78
   --- End of inner exception stack trace ---
   at Neo4j.Driver.Internal.Connector.SocketConnection.Init() in C:\Users\Avi\Desktop\neo4j-dotnet-driver-1.2\Neo4j.Driver\Neo4j.Driver\Internal\Connector\SocketConnection.cs:line 83
   at Neo4j.Driver.Internal.Connector.PooledConnection.Init() in C:\Users\Avi\Desktop\neo4j-dotnet-driver-1.2\Neo4j.Driver\Neo4j.Driver\Internal\Connector\PooledConnection.cs:line 44
   at Neo4j.Driver.Internal.ConnectionPool.CreateNewPooledConnection() in C:\Users\Avi\Desktop\neo4j-dotnet-driver-1.2\Neo4j.Driver\Neo4j.Driver\Internal\ConnectionPool.cs:line 99
   at Neo4j.Driver.Internal.ConnectionPool.<Acquire>b__19_0() in C:\Users\Avi\Desktop\neo4j-dotnet-driver-1.2\Neo4j.Driver\Neo4j.Driver\Internal\ConnectionPool.cs:line 126
   at Neo4j.Driver.Internal.LoggerBase.TryExecute[T](Func`1 func) in C:\Users\Avi\Desktop\neo4j-dotnet-driver-1.2\Neo4j.Driver\Neo4j.Driver\Internal\LoggerBase.cs:line 54
   at Neo4j.Driver.Internal.ConnectionPool.Acquire() in C:\Users\Avi\Desktop\neo4j-dotnet-driver-1.2\Neo4j.Driver\Neo4j.Driver\Internal\ConnectionPool.cs:line 111
   at Neo4j.Driver.Internal.DirectDriver.NewSession(AccessMode mode) in C:\Users\Avi\Desktop\neo4j-dotnet-driver-1.2\Neo4j.Driver\Neo4j.Driver\Internal\DirectDriver.cs:line 41
   at Neo4j.Driver.Internal.Routing.BaseDriver.Session(AccessMode mode) in C:\Users\Avi\Desktop\neo4j-dotnet-driver-1.2\Neo4j.Driver\Neo4j.Driver\Internal\Routing\BaseDriver.cs:line 59
   at Neo4j.Driver.Internal.Routing.BaseDriver.Session() in C:\Users\Avi\Desktop\neo4j-dotnet-driver-1.2\Neo4j.Driver\Neo4j.Driver\Internal\Routing\BaseDriver.cs:line 49
   at ConsoleApplication1.Program.<>c__DisplayClass0_0.<Main>b__0(Int32 i) in C:\Users\Avi\Desktop\neo4j-dotnet-driver-1.2\Neo4j.Driver\ConsoleApplication1\Program.cs:line 18
   at System.Threading.Tasks.Parallel.<>c__DisplayClass17_0`1.<ForWorker>b__1()
   at System.Threading.Tasks.Task.InnerInvoke()
   at System.Threading.Tasks.Task.InnerInvokeWithArg(Task childTask)
   at System.Threading.Tasks.Task.<>c__DisplayClass176_0.<ExecuteSelfReplicating>b__0(Object )
   --- End of inner exception stack trace ---
   at System.Threading.Tasks.Task.ThrowIfExceptional(Boolean includeTaskCanceledExceptions)
   at System.Threading.Tasks.Task.Wait(Int32 millisecondsTimeout, CancellationToken cancellationToken)
   at System.Threading.Tasks.Task.Wait()
   at System.Threading.Tasks.Parallel.ForWorker[TLocal](Int32 fromInclusive, Int32 toExclusive, ParallelOptions parallelOptions, Action`1 body, Action`2 bodyWithState, Func`4 bodyWithLocal, Func`1 localInit, Action`1 localFinally)
   at System.Threading.Tasks.Parallel.For(Int32 fromInclusive, Int32 toExclusive, Action`1 body)
   at ConsoleApplication1.Program.Main(String[] args) in C:\Users\Avi\Desktop\neo4j-dotnet-driver-1.2\Neo4j.Driver\ConsoleApplication1\Program.cs:line 16
zhenlineo commented 7 years ago

@AviAvni Hi, I did a similar test on the driver and it shows me that it is just slow to establish socket connections. I think it might related to how .Net manages async resources.

Try to create the driver with new Config{ ConnectionTimeout=TimeSpan.FromMilliseconds(-1)} to wait infinitely when establishing connections. I get no error if I wait infinitely.

AviAvni commented 7 years ago

@zhenlineo why you not expose async api?

zhenlineo commented 7 years ago

@AviAvni haven't got the chance to build the async API and sadly the .net core only provides a async tcp client :(

zhenlineo commented 7 years ago

I am closing this issue as the original issue (too many open connections) was already addressed.

Regarding timing out on connection establishment, a new async API which we are currently working on should have the timing out problem solved. An alpha release with full async API support for direct driver has been released a few weeks ago. You are most welcome to try it out and give us your valuable feedback. Read more about the new alpha release here

For any further issues, please be free to open new issues.

Thanks, Zhen

AviAvni commented 7 years ago

@zhenlineo I just do a check for the new 1.5 package in a side console app in the sync api the error still occure in the async api its work properly

when we can use the 1.5 async api in production?

zhenlineo commented 7 years ago

@AviAvni Glad to know that async solved your problem. The 1.5 GA release is scheduled for this fall.

AviAvni commented 7 years ago

@zhenlineo fall is little late for us what we can do to have workaround this with current driver?

zhenlineo commented 7 years ago

@AviAvni Could you restate your issue in a new issue? It would also help if you have some code to reproduce your problem.

It is a bit unclear right now if your problem was Failed to establish conn in 5000ms or too many open connections or takes too long to establish connections?

Thanks a lot. Zhen

g3rd commented 7 years ago

Hello. I'm having a connection issue with the Python driver. How can I see the number of open connections from the server side?