npgsql / npgsql

Npgsql is the .NET data provider for PostgreSQL.
http://www.npgsql.org
PostgreSQL License
3.29k stars 822 forks source link

Npgsql connection not closed properly when existing connection forcibly closed by remote host #1370

Closed tcima7 closed 7 years ago

tcima7 commented 7 years ago

We are using Npgsql to connect to a Redshift database. We've had a problem lately where some queries are taking a very long time to run (performance issue, unrelated to Npgsql) and are eventually failing with the following error:

Npgsql.NpgsqlException: Exception while reading from stream ---> System.IO.IOException: Unable to read data from the transport connection: **An existing connection was forcibly closed by the remote host.** ---> System.Net.Sockets.SocketException: An existing connection was forcibly closed by the remote host
   at System.Net.Sockets.NetworkStream.Read(Byte[] buffer, Int32 offset, Int32 size)
   --- End of inner exception stack trace ---
   at System.Net.Sockets.NetworkStream.Read(Byte[] buffer, Int32 offset, Int32 size)
   at System.Net.FixedSizeReader.ReadPacket(Byte[] buffer, Int32 offset, Int32 count)
   at System.Net.Security._SslStream.StartFrameHeader(Byte[] buffer, Int32 offset, Int32 count, AsyncProtocolRequest asyncRequest)
   at System.Net.Security._SslStream.StartReading(Byte[] buffer, Int32 offset, Int32 count, AsyncProtocolRequest asyncRequest)
   at System.Net.Security._SslStream.ProcessRead(Byte[] buffer, Int32 offset, Int32 count, AsyncProtocolRequest asyncRequest)
   at System.Net.Security.SslStream.Read(Byte[] buffer, Int32 offset, Int32 count)
   at Npgsql.ReadBuffer.Ensure(Int32 count, Boolean dontBreakOnTimeouts)
   --- End of inner exception stack trace ---
   at Npgsql.ReadBuffer.Ensure(Int32 count, Boolean dontBreakOnTimeouts)
   at Npgsql.NpgsqlConnector.DoReadMessage(DataRowLoadingMode dataRowLoadingMode, Boolean isPrependedMessage)
   at Npgsql.NpgsqlConnector.ReadMessageWithPrepended(DataRowLoadingMode dataRowLoadingMode)
   at Npgsql.NpgsqlConnector.ReadExpecting[T]()
   at Npgsql.NpgsqlDataReader.NextResultInternal()
   at Npgsql.NpgsqlDataReader.NextResult()
   at Npgsql.NpgsqlCommand.Execute(CommandBehavior behavior)
   at Npgsql.NpgsqlCommand.ExecuteDbDataReaderInternal(CommandBehavior behavior)
   at Npgsql.NpgsqlCommand.ExecuteDbDataReader(CommandBehavior behavior)
   at System.Data.Common.DbCommand.System.Data.IDbCommand.ExecuteReader(CommandBehavior behavior)
   at System.Data.Common.DbDataAdapter.FillInternal(DataSet dataset, DataTable[] datatables, Int32 startRecord, Int32 maxRecords, String srcTable, IDbCommand command, CommandBehavior behavior)
   at System.Data.Common.DbDataAdapter.Fill(DataSet dataSet, Int32 startRecord, Int32 maxRecords, String srcTable, IDbCommand command, CommandBehavior behavior)
   at System.Data.Common.DbDataAdapter.Fill(DataSet dataSet)

A short while later, when Npgsql attempts to close the connection, the following error occurs:

Exception while writing to stream,"innerexceptionmessage":Unable to write data to the transport connection: **An existing connection was forcibly closed by the remote host**.,"exceptionstacktrace":   at Npgsql.WriteBuffer.Flush()
   at Npgsql.NpgsqlConnector.SendBuffer()
   at Npgsql.NpgsqlConnector.SendMessage(FrontendMessage msg)
   at Npgsql.NpgsqlConnector.Close()

As Npgsql is unable to close connections due to the above sequence of errors, the connection pool eventually becomes exhausted and we have to restart our application in order to recover it.

As a short term measure, we've disabled the queries that lead to these issues but we'd expect Npgsql to be able to release broken connections properly in the above scenario.

Please let me know if you require any further information to diagnose this issue. Thank you!

roji commented 7 years ago

Thanks for reporting. Looking at the code, NpgsqlConnector.Close() calls SendMessage inside a try/catch, and properly handles any exceptions by logging and continuing. So a failure during close definitely should cause any sort of connection pool leak.

Which version of Npgsql are you using? Where do you see the Close() exception, is it in the logs or is your application seeing it? It seems that the stack trace is partial - I can't see the caller of NpgsqlConnector.Close() (presumably called by the pool), can you please send the entire exception including any inner exceptions?

I'll definitely look into this but I need a few more details about exactly what's happening.

tcima7 commented 7 years ago

The issue was observed in Npgsql 3.1.8. We recently upgraded to Npgsql 3.1.9 and haven't seen the issue since but we also disabled the problem queries around the same time.

The stack traces pasted above are complete (i.e. there is nothing else I'm able to find) and were extracted from our log files. The stack trace ending with Npgsql.NpgsqlConnector.Close() was logged by the Npgsql.NpgsqlConnector logger.

I'm not sure if this is helpful but here is a bit more information about our Redshift connection:

Hopefully this helps. If the issue comes up again and/or I find a way of pulling more detailed logs, I'll post a followup comment with details. Thanks for investigating!

roji commented 7 years ago

@tcima7 thanks for the extra info, but something here doesn't completely make sense...

The first exception occurs when executing a command - it appears that Redshift forcibly closes the connection, which is strange on its own (I'm not aware of a long-running query causing such a remote close, but I don't know Redshift very well). At that point, since an I/O exception occurs, Npgsql internally "breaks" the connection, which closes the socket and puts the connector in state Broken.

The second exception stack trace shows NpgsqlConnector.Close() (although who calls isn't shown in your stack trace). Now, this method checks the state of the connector before attempting to send the PostgreSQL Terminate message; if the connector is in state Broken, the method does nothing at all. So it's hard to see how the two exceptions could happen on the same connection.

Another odd thing is that if an I/O exception happens while trying to send the Terminate message in NpgsqlConnector.Close(), you should be seeing the log message Exception while closing connector.

One way to understand what's going would be to turn on Npgsql logging with level debug, and track the connector ID of each message. This would allow us to understand exactly what is happening by connector. Another thing that would ideal is if you could provide a code sample, against PostgreSQL (rather than Redshift), which reproduces the issue - but I realize that could be some work.

tcima7 commented 7 years ago

Thanks for following up. I did a little bit more digging and I noticed something interesting. Ever since we upgraded from Npgsql 2.0.14.3 to Npgsql 3.1.7 a few months ago, we've been getting the following error consistently:

fullmessage: Exception while closing connector
logger: Npgsql.NpgsqlConnector
detail: {"connectorid":14274,"exceptiontype":Npgsql.NpgsqlException,"exceptionmessage":Exception while writing to stream,"innerexceptionmessage":Unable to write data to the transport connection: An existing connection was forcibly closed by the remote host.,"exceptionstacktrace":   at Npgsql.WriteBuffer.Flush()
   at Npgsql.NpgsqlConnector.SendBuffer()
   at Npgsql.NpgsqlConnector.SendMessage(FrontendMessage msg)
   at Npgsql.NpgsqlConnector.Close(),"log4nethostname":na1-cpmweb03}

The error usually happens within a minute of executing a query. The important thing to note is that the closing connection error occurs after the server forcibly closes the connection. Perhaps we can focus the investigation around this particular server error and what could have changed in the Npgsql code from earlier versions to cause this.

I'll post another comment if I'm able to find out more. Thanks again for looking into this.

roji commented 7 years ago

Thanks. I can now see that the log message is Exception while closing connector as expected, that makes more sense.

This error by itself isn't necessarily a problem... You connection pool may be pruning idle connections (i.e. physically closing connections not in use), and while trying to do so encounters some I/O error. Since you seem to be getting some unexplained/random I/O errors from Redshift, this may just be another instance of that.

However, none of this should lead to connection pool leaks/exhaustion. I'd carefully track down exactly what happened with a given connector - once you see such an exception, search the logs for that connectorId and see what happened with it. Is it possible that the leak is happening because of something completely unrelated? In other words, if you disable the queries triggering the exceptions, does the leak go away?

tcima7 commented 7 years ago

We had a chance to turn on DEBUG logging for Npgsql and traced some of the connections. The particular queries that lead to the broken connection seem to vary, but the last two messages are consistent. Here's an example of all the log messages for a specific connectorid:

timestamp: December 9th 2016, 22:48:35.395 fullmessage: Opened connection to na1-icm-chr.cdmlw7sdfptl.us-west-2.redshift.amazonaws.com:5439 logger: Npgsql.NpgsqlConnector detail: {"connectorid":3889,"log4nethostname":na1-cpmweb01}

timestamp: December 9th 2016, 22:48:35.404 fullmessage: Executing statement(s): SELECT TOP 1 1 FROM memberevents logger: Npgsql.NpgsqlCommand detail: {"connectorid":3889,"log4nethostname":na1-cpmweb01}

timestamp: December 9th 2016, 22:48:35.462 fullmessage: Executing statement(s): SELECT VERSION() logger: Npgsql.NpgsqlCommand detail: {"connectorid":3889,"log4nethostname":na1-cpmweb01}

timestamp: December 9th 2016, 22:48:35.502 fullmessage: Connection closed logger: Npgsql.NpgsqlConnection detail: {"connectorid":3889,"log4nethostname":na1-cpmweb01}

timestamp: December 9th 2016, 22:53:44.000 fullmessage: Exception while closing connector detail: {"connectorid":3889,"exceptiontype":Npgsql.NpgsqlException,"exceptionmessage":Exception while writing to stream,"innerexceptionmessage":Unable to write data to the transport connection: An existing connection was forcibly closed by the remote host.,"exceptionstacktrace": at Npgsql.WriteBuffer.Flush() at Npgsql.NpgsqlConnector.SendBuffer() at Npgsql.NpgsqlConnector.SendMessage(FrontendMessage msg) at Npgsql.NpgsqlConnector.Close(),"log4nethostname":na1-cpmweb01}

Hope this helps.

tcima7 commented 7 years ago

Did a little bit more digging around and noticed that an additional exception is written into the Windows event log following the "Exception while closing connector" message. I've copied this below. Could it be that Npgsql is not resetting the connector's "state" correctly in the scenario where the server forcibly closes the connection? Hope this helps.

Exception information: 
    Exception type: InvalidOperationException 
    Exception message: Reset() called on connector with state Executing
   at Npgsql.NpgsqlConnector.Reset()
   at Npgsql.ConnectorPool.Release(NpgsqlConnector connector)
   at Npgsql.NpgsqlConnection.ReallyClose(Boolean wasBroken)
   at Npgsql.NpgsqlConnection.Dispose(Boolean disposing)
   at System.ComponentModel.Component.Dispose()
   at VisionCritical.Chr.Data.Repositories.Redshift.RedshiftSession.Dispose(Boolean disposing) in c:\GoAgent1\pipelines\build.chr.cpm.icm\VisionCritical.Chr.Data\Repositories\Redshift\RedshiftSession.cs:line 63
   at VisionCritical.Chr.Data.Repositories.Redshift.RedshiftSession.Dispose() in c:\GoAgent1\pipelines\build.chr.cpm.icm\VisionCritical.Chr.Data\Repositories\Redshift\RedshiftSession.cs:line 51
   at Castle.MicroKernel.ComponentActivator.AbstractComponentActivator.ApplyConcerns(IEnumerable`1 steps, Object instance)
   at Castle.MicroKernel.ComponentActivator.AbstractComponentActivator.Destroy(Object instance)
   at Castle.MicroKernel.Lifestyle.AbstractLifestyleManager.Release(Object instance)
   at Castle.MicroKernel.Burden.Release()
   at Castle.Core.Internal.CollectionExtensions.ForEach[T](IEnumerable`1 items, Action`1 action)
   at Castle.MicroKernel.Lifestyle.Scoped.ScopeCache.Dispose()
   at System.Web.HttpApplication.SyncEventExecutionStep.System.Web.HttpApplication.IExecutionStep.Execute()
   at System.Web.HttpApplication.ExecuteStep(IExecutionStep step, Boolean& completedSynchronously)
roji commented 7 years ago

Thanks for the added info @tcima7.

First, regarding the root problem - the server forcibly closing the connection. I of course have no idea why this is happening, and I really hope Redshift provides some logging which could shed some light on this. In regular PostgreSQL a forcible connection close is always accompanied with some message in the logs which helps understand what's going on. I suspect your queries are triggering some sort of mechanism in Redshift (maybe they're taking too long or something?).

The second error Reset() called on connector with state Executing is definitely an Npgsql issue, which could also be related to the connection pool exhaustion which you've experienced. However, can you please confirm for me that this second error is happening to the same connector as the first error (you need to check the connector ID that's logged)? Ideally, you would post the log messages only for the affected connector, with both errors.

The two errors happening to the same connector would be quite odd, since the first error is about closing a physical connection, while the second error occurs when returning a connector to the pool (i.e. not physically closing it).

roji commented 7 years ago

@tcima7, will you have time to provide the missing info from above? Note that I've also released Npgsql 3.2, it's worth testing if you see the same behavior there (I don't think the issue will magically go away as I don't think I've touched related code, but it's worth making sure).

tcima7 commented 7 years ago

Thanks for following up on this issue. We actually haven't seen the problem again recently so I haven't been able to capture the requested missing info.

With regards to Npgsql 3.2, I attempted the upgrade today and am seeing a new error come up when running one of our migration scripts.

We are using the FluentMigrator library to manage DB migrations and the failing migration has the following script:

CREATE TABLE Events (id INT8 IDENTITY(1,1) NOT NULL, eventTime TIMESTAMP NOT NULL ) sortkey(eventTime);

The error that occurs is: !!! Npgsql.PostgresException (0x80004005): 42601: syntax error at or near "DISCARD" at Npgsql.NpgsqlConnector.d148.MoveNext() --- End of stack trace from previous location where exception was thrown --- at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task) at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task) at System.Runtime.CompilerServices.ValueTaskAwaiter`1.GetResult() at Npgsql.NpgsqlConnector.d150.MoveNext() --- End of stack trace from previous location where exception was thrown --- at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task) at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task) at Npgsql.NpgsqlConnector.d147.MoveNext() --- End of stack trace from previous location where exception was thrown --- at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task) at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task) at System.Runtime.CompilerServices.ValueTaskAwaiter`1.GetResult() at Npgsql.NpgsqlConnector.d1541.MoveNext() --- End of stack trace from previous location where exception was thrown --- at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task) at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task) at System.Runtime.CompilerServices.ValueTaskAwaiter1.GetResult() at Npgsql.NpgsqlDataReader.d31.MoveNext() --- End of stack trace from previous location where exception was thrown --- at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task) at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task) at Npgsql.NpgsqlDataReader.NextResult() at Npgsql.NpgsqlCommand.d69.MoveNext() --- End of stack trace from previous location where exception was thrown --- at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task) at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task) at System.Runtime.CompilerServices.ValueTaskAwaiter`1.GetResult() at Npgsql.NpgsqlCommand.d__82.MoveNext() --- End of stack trace from previous location where exception was thrown --- at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task) at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task) at Npgsql.NpgsqlCommand.ExecuteNonQuery() at VisionCritical.Chr.DB.Migrations.Utilities.RedshiftUtils.RunCommand(String query, String connStr) in d:\Dev\RedshiftUtils.cs:line 23 at FluentMigrator.MigrationBase.GetUpExpressions(IMigrationContext context) at FluentMigrator.Runner.MigrationRunner.Up(IMigration migration) at FluentMigrator.Runner.MigrationRunner.ApplyMigrationUp(Int64 version) at FluentMigrator.Runner.MigrationRunner.MigrateUp(Boolean useAutomaticTransactionManagement) at FluentMigrator.Runner.Initialization.TaskExecutor.Execute() at FluentMigrator.Console.MigratorConsole..ctor(String[] args)

Any initial thoughts on this new error? Also, let me know if I should create a new issue thread for this error. Thanks!

roji commented 7 years ago

@tcima7, I'm happy to hear the original error isn't appearing - hopefully it's gone for good.

Can you please close this issue and open a new one for the new bug? The best way to work on it is for you to turn on PostgreSQL statement logging (log_statement=all in your postgresql.conf) and post the relevant SQL in the new issue. This would allow me to understand exactly what Npgsql is sending to the database. Don't forget to remove the statement logging after you're done since it affects performance.