roadbikemike closed this issue 7 years ago
Hi, I think the problem you describe indicates that we did not close the connection when we failed to connect. A quick look into the code indicates that we actually always close the connection on failed connection initialization before re-throwing the error.
Setting maxIdleSessionPoolSize will not help much with this problem, as this config does not affect how many connections can be created, only how many can be buffered for reuse.
I will look into your problem and give you a fix asap.
Zhen
We made some slight modifications to the test code which appear to have alleviated the issue. Basically we made the driver static along with the class constructor, so IDriver is only instantiated once. This may have actually solved the issue; under continuous testing the connection count is no longer rising. So, I think we are in good shape, but we are continuing the tests.
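Roughly, the change looks like this (a minimal sketch; the class name, URI and credentials are placeholders, not our exact code):

using Neo4j.Driver.V1;

public class NeoClient
{
    // Single static IDriver shared for the lifetime of the process,
    // created once in the static constructor instead of per request.
    private static readonly IDriver Driver;

    static NeoClient()
    {
        Driver = GraphDatabase.Driver(
            "bolt://[OUR IP]:7687",
            AuthTokens.Basic("neo4j", "password"));
    }

    public void InsertRecord()
    {
        // Sessions remain cheap, per-request objects; only the driver is shared.
        using (var session = Driver.Session())
        {
            session.Run("CREATE (:Record)").Consume();
        }
    }
}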
Okay, good to know that you solved the problem.
Let us know if you find anything else outstanding.
Alright, we have it. We had two JMeter clients running against a simple web service inserting a single record into Neo. One client ran 100% without an error. The second client ran for a good portion of the time, but then started encountering errors. The connection count on the server was stable UNTIL the second client started to receive errors. Once that initial error occurred, all successive calls started to fail and also started opening new connections. We were solid at 50 connections, which quickly rose to well over 1000 once that client entered the error state.
This is the stack trace on the server once the client starts to get errors:
2017-02-02 11:31:18,982 [93] INFO Neo4jLogger - Neo4j.Driver.V1.ClientException: Failed to connect to the server [OUR IP]:7687 within connection timeout 5000ms
at Neo4j.Driver.Internal.Connector.SocketConnection.Init()
at Neo4j.Driver.Internal.ConnectionPool.CreateNewPooledConnection()
at Neo4j.Driver.Internal.ConnectionPool.
What about the log in Neo4j? Do you get any errors there too when the client fails to connect?
Server side, we see:
2017-02-02 16:50:35.898+0000 ERROR [o.n.b.t.SocketTransportHandler] Fatal error occurred when handling a client connection: Connection reset by peer Connection reset by peer java.io.IOException: Connection reset by peer
We attempted catching the exception, setting the driver to null and then re-initializing. Unfortunately, that appears to abandon the previous connections and simply open new ones. This effectively puts us back in the same position: connections climbed well over 2500 in a very short period of time.
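For clarity, the recovery pattern we tried looks roughly like this (a sketch; names, URI and the query are illustrative), which is what ends up abandoning the old connections:

using Neo4j.Driver.V1;

public class NeoWriter
{
    // Illustrative only: the URI and credentials stand in for our config values.
    private IDriver _driver = GraphDatabase.Driver(
        "bolt://[OUR IP]:7687", AuthTokens.Basic("neo4j", "password"));

    public void Insert()
    {
        try
        {
            using (var session = _driver.Session())
            {
                session.Run("CREATE (:Record)").Consume();
            }
        }
        catch (ClientException)
        {
            // This is where the leak happens: the old driver is never disposed,
            // so its existing connections stay open on the server while a brand
            // new driver starts opening more.
            _driver = GraphDatabase.Driver(
                "bolt://[OUR IP]:7687", AuthTokens.Basic("neo4j", "password"));
        }
    }
}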
Something else we noticed, the connections on the Neo side persist. They never seem to go away which is interesting. I let the 2300+ connections sit on the Neo server and after an hour, they were still there.
Thanks for reporting the bug.
Sounds like a leak in the connections. It seems the client opened a connection without registering it to the connection pool due to the exception, so the connection was not closed or monitored by the driver. The server then thinks the client is still using it, while the client does not know about this connection. Therefore when you close the driver, the connection is not closed. I will check if there is such a path to make this leak possible.
But do you have any clue why your system initially got this error: Neo4j.Driver.V1.ClientException: Failed to connect to the server [OUR IP]:7687 within connection timeout 5000ms?
Good question. If we have two separate clients running JMeter, one slightly throttled and the other running at full speed, the full-speed JMeter will start to fail with timeouts while the other continues to run without error. So, again, it seems as if the driver becomes unstable if pushed too hard. Once in this state, the driver is not healthy from there on out.
zhenlineo, any luck diagnosing this issue? This is becoming more and more critical for us. Is there another channel I should use to escalate the issue?
Sorry, I was tied up by other things. I will give you an update tomorrow. I have some ideas that I would like to explore.
Hi, I found that the connection timeout problem was caused by TLS connections. If you can turn off TLS, you should be able to establish connections really fast.
If you still receive the connection timeout 5000ms error, then enlarge the timeout or set it to -1 to force waiting indefinitely. I tried with 5000 threads and a -1 connection timeout (infinite waiting), and it finished without any error.
I will try to figure out the TLS and connection problems.
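In code, those two suggestions would look roughly like this (a sketch; the builder method names assume the 1.1 Config API, and the URI/credentials are placeholders):

using System;
using Neo4j.Driver.V1;

class Program
{
    static void Main(string[] args)
    {
        // Turn off TLS and enlarge the connection timeout; use
        // TimeSpan.FromMilliseconds(-1) instead to wait indefinitely.
        var config = Config.Builder
            .WithEncryptionLevel(EncryptionLevel.None)
            .WithConnectionTimeout(TimeSpan.FromSeconds(30))
            .ToConfig();

        var driver = GraphDatabase.Driver(
            "bolt://[OUR IP]:7687", AuthTokens.Basic("neo4j", "password"), config);
    }
}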
I will try this today zhenlineo. Keep in mind though, establishing connections was not an issue. The driver would get into a state where errors were encountered and could seemingly not get out of that state. I did not realize TLS was enabled by default, so we will disable it and test. Originally, TLS was our desired state, but that is no longer the case. Either way, figuring out the issues with TLS enabled is ideal, as you noted. Really appreciate everyone's help on this!
Disabling encryption completely alleviated the issue! We were up over 1000 posts per second without a single issue via our web API and connections were stable using bolt! We enabled bolt+routing which was good and our preferred method. However, when bolt+routing is enabled and we point to the leader, there is no issue. If we point to one of the read nodes, we get:
"Neo4j.Driver.V1.SessionExpiredException: Server at bolt+routing://[some_ip]:7687/ no longer accepts writes\r\n at Neo4j.Driver.Internal.Connector.SocketConnection.AssertNoServerFailure()\r\n at Neo4j.Driver.Internal.Result.ResultBuilder.NextRecord()\r\n at Neo4j.Driver.Internal.Result.RecordSet.
This is odd, because the IP it is attempting to route to is the leader. Likewise, if we point to the load balancer in front of our Neo nodes, we get the same error. Any ideas?
After doing some routing tests with the driver, it appears to NOT be routing properly. We are formulating our URL with bolt+routing://, which should be correct. Doing a packet capture, it appears the driver continues to use the load balancer and never gets handed off to the LEADER. We can see in the initial call where it gets the list back:
.)...$CALL dbms.cluster.routing.getServers......?...0.p...result_available_after..fields..ttl.servers.....q...,...addresses..10.105.2.4:7687.role.WRITE..addresses..10.105.2.5:7687.10.105.2.6:7687.role.READ..addresses..10.105.2.4:7687.10.105.2.6:7687.10.105.2.5:7687.role.ROUTE...".p...result_consumed_after..type.r...........p........;CREATE(removed).....?...$.p...result_available_after..fields....V.p..stats..labels-added..nodes-created..properties-set...result_consumed_after..type.w...........p...
In the uploaded screenshot from Wireshark, you can see where the client (10.5.0.34) initiates with the load balancer (10.105.3.230) and never gets handed off to the leader (there are two tests there, one in white, the other in yellow). It continues to communicate with the load balancer over and over. When we tested with the Python driver, initial communication was made to the LB to determine the leader. This was then passed back to the client and further communication took place between the client and the leader directly. I am assuming the .NET driver should do the same?
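For reference, the driver on our side is created roughly like this (a sketch; the address is the load balancer in front of the cluster and the credentials are placeholders):

using Neo4j.Driver.V1;

class Program
{
    static void Main(string[] args)
    {
        // Routing driver pointed at the load balancer (10.105.3.230); the driver
        // is expected to fetch the routing table from it and then talk to the
        // leader directly for writes.
        var driver = GraphDatabase.Driver(
            "bolt+routing://10.105.3.230:7687",
            AuthTokens.Basic("neo4j", "password"));

        using (var session = driver.Session())
        {
            session.Run("CREATE (:Record)").Consume();
        }
    }
}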
Oh, make sure that you are using the 1.1.1 release; I fixed a bug that might be the problem you are describing.
We are using .NET driver 1.1. I am upgrading to 1.1.1 through Visual Studio now...
Your assumption is exactly correct. We have three CORE nodes with one being the leader (obviously). The load balancer sits in front of these three nodes. If we point the client directly at the leader, we get no failures. If we point the client at one of the other non-leader CORE nodes directly, we get the error above. If we point at the load balancer, we occasionally get the "Server at bolt+routing://10.105.2.4:7687/ no longer accepts writes" on the client. In this case, 10.105.2.4 is indeed the leader, so that should not be an issue.
We upgraded to 1.1.1 and YES! Completely different behavior on the network side and we see the routing working as intended. I have been running tests and no failures going through the load balancer. Going to continue testing. You guys rock!
@zhenlineo
I'm getting the same exception with the 1.1.1 package when working with neo4j community edition 3.1
the code:
Parallel.ForEach(followers, follower =>
{
    using (var session = driver.Session()) // exception
    {
        session.Run(...).Consume();
    }
});
the exception
Failed to connect to the server 10.0.0.212:7687 within connection timeout 5000ms
at Neo4j.Driver.Internal.Connector.SocketConnection.Init() in Z:\BuildAgent\work\9ae188c903d07190\Neo4j.Driver\Neo4j.Driver\Internal\Connector\SocketConnection.cs:line 82
at Neo4j.Driver.Internal.ConnectionPool.CreateNewPooledConnection() in Z:\BuildAgent\work\9ae188c903d07190\Neo4j.Driver\Neo4j.Driver\Internal\ConnectionPool.cs:line 99
at Neo4j.Driver.Internal.ConnectionPool.<Acquire>b__19_0() in Z:\BuildAgent\work\9ae188c903d07190\Neo4j.Driver\Neo4j.Driver\Internal\ConnectionPool.cs:line 126
at Neo4j.Driver.Internal.LoggerBase.TryExecute[T](Func`1 func) in Z:\BuildAgent\work\9ae188c903d07190\Neo4j.Driver\Neo4j.Driver\Internal\LoggerBase.cs:line 54
at Neo4j.Driver.Internal.ConnectionPool.Acquire() in Z:\BuildAgent\work\9ae188c903d07190\Neo4j.Driver\Neo4j.Driver\Internal\ConnectionPool.cs:line 111
at Neo4j.Driver.Internal.DirectDriver.NewSession(AccessMode mode) in Z:\BuildAgent\work\9ae188c903d07190\Neo4j.Driver\Neo4j.Driver\Internal\DirectDriver.cs:line 41
at Neo4j.Driver.Internal.Routing.BaseDriver.Session() in Z:\BuildAgent\work\9ae188c903d07190\Neo4j.Driver\Neo4j.Driver\Internal\Routing\BaseDriver.cs:line 49
at ConsoleApplication111.Program.<>c__DisplayClass2_2.<Main>b__5(Tuple`2 follower) in C:\Users\Avi\Documents\Visual Studio 2015\Projects\ConsoleApplication111\ConsoleApplication111\Program.cs:line 46
at System.Threading.Tasks.Parallel.<>c__DisplayClass30_0`2.<ForEachWorker>b__0(Int32 i)
at System.Threading.Tasks.Parallel.<>c__DisplayClass17_0`1.<ForWorker>b__1()
@AviAvni How many concurrent threads do you have using the driver? Are you also using this driver in a .NET ASP based app? Do you have SSL turned on? Will you still get the error if you turn off SSL encryption?
@AviAvni Can you also try this snapshot, Neo4j.Driver.1.1.2-SNAPSHOT-1702151233.nupkg, from MyGet to see if you still get the error?
Do not use any snapshot driver in real production.
@zhenlineo Concurrent threads: I think 4. Right now I'm using a console app for a PoC; after that I'll use this in ASP.NET. SSL is turned on (we also have another problem: another laptop can't connect when SSL is on, only when it's off). I'll try the snapshot.
@AviAvni Hi, in my console test environment I have no issue connecting to the server with ~1000 threads and SSL on. I only see this error when the thread count goes to 5000. So I am wondering if your server/client setting is wrong.
@zhenlineo I moved to the snapshot package: with encryption I get the same error, without encryption there is no error.
@zhenlineo Also, it happened with no encryption too. I set WithConnectionTimeout to 30 sec and there is still a connection timeout of 30 sec.
Could you check whether you can finish this raw socket connection within 30s in your app environment? https://github.com/neo4j/neo4j-dotnet-driver/blob/1.2/Neo4j.Driver/Neo4j.Driver/Internal/Connector/TcpSocketClient.cs#L48
I can hardly understand why it fails to connect, as the code path is just a raw system socket connect call.
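Something like the standalone check below would time that same raw connect path outside the driver (a sketch; the host and port come from the error above):

using System;
using System.Diagnostics;
using System.Net.Sockets;

class RawConnectCheck
{
    static void Main(string[] args)
    {
        // Time a plain TcpClient connect to the Bolt port, with no driver,
        // pooling or TLS involved, to see whether the raw connect itself
        // can finish within the 30s budget.
        var sw = Stopwatch.StartNew();
        using (var client = new TcpClient())
        {
            client.ConnectAsync("10.0.0.212", 7687).Wait();
        }
        Console.WriteLine($"Raw connect took {sw.ElapsedMilliseconds} ms");
    }
}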
@zhenlineo I downloaded the 1.2 branch code and ran the program below, after changing the method you linked to:
public async Task ConnectAsync(Uri uri, bool useTls)
{
    Console.WriteLine("---");
    await _client.ConnectAsync(uri.Host, uri.Port).ConfigureAwait(false);
    if (!useTls)
    {
        _stream = _client.GetStream();
    }
    else
    {
        try
        {
            _stream = new SslStream(_client.GetStream(), true,
                (sender, certificate, chain, errors) =>
                    _encryptionManager.TrustStrategy.ValidateServerCertificate(uri, certificate, errors));
            await ((SslStream) _stream)
                .AuthenticateAsClientAsync(uri.Host, null, Tls12, false).ConfigureAwait(false);
        }
        catch (Exception e)
        {
            throw new SecurityException($"Failed to establish encrypted connection with server {uri}.", e);
        }
    }
    Console.WriteLine("+++");
}
using Neo4j.Driver.V1;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            var driver = GraphDatabase.Driver("bolt://10.0.0.212", AuthTokens.Basic("neo4j", "123456"),
                Config.Builder.WithConnectionTimeout(TimeSpan.FromSeconds(30)).ToConfig());
            Parallel.For(0, 20000, i =>
            {
                using (var session = driver.Session())
                {
                    session.Run("MATCH (u:User) RETURN u").Consume();
                }
            });
        }
    }
}
---
---
---
---
+++
+++
+++
+++
---
+++
---
+++
---
---
---
+++
+++
+++
---
---
---
---
---
---
+++
+++
+++
+++
+++
+++
Unhandled Exception: System.AggregateException: One or more errors occurred. ---> Neo4j.Driver.V1.ServiceUnavailableException: Connection with the server breaks due to IOException: Failed to connect to the server 10.0.0.212:7687 within connection timeout 30000ms ---> System.IO.IOException: Failed to connect to the server 10.0.0.212:7687 within connection timeout 30000ms
at Neo4j.Driver.Internal.Connector.SocketConnection.Init() in C:\Users\Avi\Desktop\neo4j-dotnet-driver-1.2\Neo4j.Driver\Neo4j.Driver\Internal\Connector\SocketConnection.cs:line 78
--- End of inner exception stack trace ---
at Neo4j.Driver.Internal.Connector.SocketConnection.Init() in C:\Users\Avi\Desktop\neo4j-dotnet-driver-1.2\Neo4j.Driver\Neo4j.Driver\Internal\Connector\SocketConnection.cs:line 83
at Neo4j.Driver.Internal.Connector.PooledConnection.Init() in C:\Users\Avi\Desktop\neo4j-dotnet-driver-1.2\Neo4j.Driver\Neo4j.Driver\Internal\Connector\PooledConnection.cs:line 44
at Neo4j.Driver.Internal.ConnectionPool.CreateNewPooledConnection() in C:\Users\Avi\Desktop\neo4j-dotnet-driver-1.2\Neo4j.Driver\Neo4j.Driver\Internal\ConnectionPool.cs:line 99
at Neo4j.Driver.Internal.ConnectionPool.<Acquire>b__19_0() in C:\Users\Avi\Desktop\neo4j-dotnet-driver-1.2\Neo4j.Driver\Neo4j.Driver\Internal\ConnectionPool.cs:line 126
at Neo4j.Driver.Internal.LoggerBase.TryExecute[T](Func`1 func) in C:\Users\Avi\Desktop\neo4j-dotnet-driver-1.2\Neo4j.Driver\Neo4j.Driver\Internal\LoggerBase.cs:line 54
at Neo4j.Driver.Internal.ConnectionPool.Acquire() in C:\Users\Avi\Desktop\neo4j-dotnet-driver-1.2\Neo4j.Driver\Neo4j.Driver\Internal\ConnectionPool.cs:line 111
at Neo4j.Driver.Internal.DirectDriver.NewSession(AccessMode mode) in C:\Users\Avi\Desktop\neo4j-dotnet-driver-1.2\Neo4j.Driver\Neo4j.Driver\Internal\DirectDriver.cs:line 41
at Neo4j.Driver.Internal.Routing.BaseDriver.Session(AccessMode mode) in C:\Users\Avi\Desktop\neo4j-dotnet-driver-1.2\Neo4j.Driver\Neo4j.Driver\Internal\Routing\BaseDriver.cs:line 59
at Neo4j.Driver.Internal.Routing.BaseDriver.Session() in C:\Users\Avi\Desktop\neo4j-dotnet-driver-1.2\Neo4j.Driver\Neo4j.Driver\Internal\Routing\BaseDriver.cs:line 49
at ConsoleApplication1.Program.<>c__DisplayClass0_0.<Main>b__0(Int32 i) in C:\Users\Avi\Desktop\neo4j-dotnet-driver-1.2\Neo4j.Driver\ConsoleApplication1\Program.cs:line 18
at System.Threading.Tasks.Parallel.<>c__DisplayClass17_0`1.<ForWorker>b__1()
at System.Threading.Tasks.Task.InnerInvoke()
at System.Threading.Tasks.Task.InnerInvokeWithArg(Task childTask)
at System.Threading.Tasks.Task.<>c__DisplayClass176_0.<ExecuteSelfReplicating>b__0(Object )
--- End of inner exception stack trace ---
at System.Threading.Tasks.Task.ThrowIfExceptional(Boolean includeTaskCanceledExceptions)
at System.Threading.Tasks.Task.Wait(Int32 millisecondsTimeout, CancellationToken cancellationToken)
at System.Threading.Tasks.Task.Wait()
at System.Threading.Tasks.Parallel.ForWorker[TLocal](Int32 fromInclusive, Int32 toExclusive, ParallelOptions parallelOptions, Action`1 body, Action`2 bodyWithState, Func`4 bodyWithLocal, Func`1 localInit, Action`1 localFinally)
at System.Threading.Tasks.Parallel.For(Int32 fromInclusive, Int32 toExclusive, Action`1 body)
at ConsoleApplication1.Program.Main(String[] args) in C:\Users\Avi\Desktop\neo4j-dotnet-driver-1.2\Neo4j.Driver\ConsoleApplication1\Program.cs:line 16
@AviAvni Hi, I did a similar test on the driver and it shows me that it is just slow to establish socket connections. I think it might be related to how .NET manages async resources.
Try to create the driver with new Config{ ConnectionTimeout=TimeSpan.FromMilliseconds(-1)}
to wait infinitely when establishing connections. I get no error if I wait infinitely.
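In full, assuming the same placeholders as your earlier test program, that would be something like:

// Sketch: driver configured to wait indefinitely when establishing connections.
var config = new Config { ConnectionTimeout = TimeSpan.FromMilliseconds(-1) };
var driver = GraphDatabase.Driver(
    "bolt://10.0.0.212", AuthTokens.Basic("neo4j", "123456"), config);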
@zhenlineo Why don't you expose an async API?
@AviAvni We haven't had the chance to build the async API yet, and sadly .NET Core only provides an async TCP client :(
I am closing this issue as the original issue (too many open connections) was already addressed.
Regarding timing out on connection establishment, the new async API which we are currently working on should solve the timeout problem. An alpha release with full async API support for the direct driver was released a few weeks ago. You are most welcome to try it out and give us your valuable feedback. Read more about the new alpha release here.
For any further issues, please feel free to open new issues.
Thanks, Zhen
@zhenlineo I just did a check with the new 1.5 package in a side console app: with the sync API the error still occurs, with the async API it works properly.
When can we use the 1.5 async API in production?
@AviAvni Glad to know that async solved your problem. The 1.5 GA release is scheduled for this fall.
@zhenlineo Fall is a little late for us. What can we do to work around this with the current driver?
@AviAvni Could you restate your issue in a new issue? It would also help if you have some code to reproduce your problem.
It is a bit unclear right now whether your problem is "Failed to establish conn in 5000ms", too many open connections, or that it takes too long to establish connections.
Thanks a lot. Zhen
Hello. I'm having a connection issue with the Python driver. How can I see the number of open connections from the server side?
We are having some major issues with the .NET Neo driver (v1.1). When running tests from a console app, the connections are managed as expected. However, the same code on IIS, exercised via JMeter, behaves completely differently. When the code hits an exception, in our case "Neo4j.Driver.V1.ClientException: Failed to connect to the server [some ip]:7687 within connection timeout 5000ms", the connections are not closed and they keep rising and rising into the thousands.
I have tried setting MaxIdleSessionPoolSize anywhere between 10 and 100 with no difference.
Here is my code:
string username = ConfigurationManager.AppSettings["NeoUser"];
string password = ConfigurationManager.AppSettings["NeoPass"];
string boltUri = ConfigurationManager.AppSettings["boltUri"];
string PooledConns = ConfigurationManager.AppSettings["PooledCons"];

var config = new Config { };
config.MaxIdleSessionPoolSize = int.Parse(PooledConns);