vapor / postgres-nio

🐘 Non-blocking, event-driven Swift client for PostgreSQL.
https://api.vapor.codes/postgresnio/documentation/postgresnio/
MIT License
317 stars 72 forks source link

Attempted re-use of already closed connection #374

Closed GNMoseke closed 1 year ago

GNMoseke commented 1 year ago

Describe the bug

Bug has been previously described here

We are seeing a crasher from PostgresNIO.ConnectionStateMachine.closeAndCleanup(_:) where postgres-nio attempts to re-use an already closed connection. The failure happens about 80% of the time in a given unit test, and seems to stem from a quick sequence of Failed Insert -> Select -> Delete requests.

To Reproduce

The reproducer linked in the swift forum post above is a good place to start. I've also gathered logs from the (rare) success case and much more frequent case below:

Success Logs

**2023-06-05T10:27:20.488-06:00 [TRACE] StorageInterface PostgresChannelHandler.swift#L.103 ["psql_message": .error(PostgresNIO.PostgresBackendMessage.ErrorResponse(fields: [Code: "23503", Schema name: "public", Localized Severity: "ERROR", Line: "2465", Detail: "... violates foreign key constraint \"...\"", Constraint name: "...", Severity: "ERROR", File: "ri_triggers.c", Routine: "ri_ReportViolation", Table name: "..."])), "psql_connection_id": 3] Backend message received**

**2023-06-05T10:27:20.490-06:00 [TRACE] StorageInterface PostgresChannelHandler.swift#L.217 ["psql_connection_action": closeConnectionAndCleanup(PostgresNIO.ConnectionStateMachine.ConnectionAction.CleanUpContext(action: PostgresNIO.ConnectionStateMachine.ConnectionAction.CleanUpContext.Action.close, tasks: [PostgresNIO.PSQLTask.extendedQuery(PostgresNIO.ExtendedQueryContext)], error: PostgresNIO.PSQLError(backing: PostgresNIO.PSQLError.(unknown context at $129798b04).Backing), closePromise: nil)), "psql_connection_id": 3] Run action**

**2023-06-05T10:27:20.490-06:00 [DEBUG] StorageInterface PostgresChannelHandler.swift#L.480 ["psql_error": PSQLError(backing: PostgresNIO.PSQLError.(unknown context at $129798b04).Backing), "psql_connection_id": 3] Cleaning up and closing connection.**

**2023-06-05T10:27:20.490-06:00 [TRACE] StorageInterface EventLoopConnectionPool.swift#L.261  Releasing connection**

**2023-06-05T10:27:20.490-06:00 [TRACE] StorageInterface PostgresChannelHandler.swift#L.87 ["psql_connection_id": 3] Channel inactive.**

**2023-06-05T10:27:20.490-06:00 [TRACE] StorageInterface PostgresChannelHandler.swift#L.217 ["psql_connection_action": fireChannelInactive, "psql_connection_id": 3] Run action**

**2023-06-05T10:27:20.490-06:00 [DEBUG] SQL SELECT Statement of internal schema**

**2023-06-05T10:27:20.490-06:00 [DEBUG] StorageInterface EventLoopConnectionPool.swift#L.194  Pruning available connection that has closed**

**2023-06-05T10:27:20.490-06:00 [DEBUG] StorageInterface EventLoopConnectionPool.swift#L.201  No available connections on this event loop, creating a new one**

**2023-06-05T10:27:20.492-06:00 [TRACE] StorageInterface PostgresChannelHandler.swift#L.217 ["psql_connection_action": provideAuthenticationContext, "psql_connection_id": 4] Run action**

^ As you can see in this example, the failed insert query properly dispatches an error message to the state machine, which closes the connection, releases it, sets the channel to inactive, and then properly prunes it. The next statement, the SELECT, is then properly dispatched onto a new psql_connection_id of 4.

Failure Logs

**2023-06-05T10:39:32.522-06:00 [TRACE] StorageInterface PostgresChannelHandler.swift#L.103 ["psql_connection_id": 3, "psql_message": .error(PostgresNIO.PostgresBackendMessage.ErrorResponse(fields: [Detail: "Key ... is not present in table \"...\".", Severity: "ERROR", Code: "23503", Message: "insert or update on table \"...\" violates foreign key constraint \"...\"", Table name: "History", Constraint name: "...", File: "ri_triggers.c", Line: "2465", Localized Severity: "ERROR", Schema name: "public", Routine: "ri_ReportViolation"]))] Backend message received**

**2023-06-05T10:39:32.523-06:00 [TRACE] StorageInterface PostgresChannelHandler.swift#L.217 ["psql_connection_action": closeConnectionAndCleanup(PostgresNIO.ConnectionStateMachine.ConnectionAction.CleanUpContext(action: PostgresNIO.ConnectionStateMachine.ConnectionAction.CleanUpContext.Action.close, tasks: [PostgresNIO.PSQLTask.extendedQuery(PostgresNIO.ExtendedQueryContext)], error: PostgresNIO.PSQLError(backing: PostgresNIO.PSQLError.(unknown context at $129798b04).Backing), closePromise: nil)), "psql_connection_id": 3] Run action**

**2023-06-05T10:39:32.523-06:00 [DEBUG] StorageInterface PostgresChannelHandler.swift#L.480 ["psql_error": PSQLError(backing: PostgresNIO.PSQLError.(unknown context at $129798b04).Backing), "psql_connection_id": 3] Cleaning up and closing connection.**

**2023-06-05T10:39:32.523-06:00 [TRACE] StorageInterface EventLoopConnectionPool.swift#L.261  Releasing connection**

**2023-06-05T10:39:32.523-06:00 [TRACE] StorageInterface PostgresChannelHandler.swift#L.87 ["psql_connection_id": 3] Channel inactive.**

**2023-06-05T10:39:32.523-06:00 [TRACE] StorageInterface PostgresChannelHandler.swift#L.217 ["psql_connection_action": fireChannelInactive, "psql_connection_id": 3] Run action**

**2023-06-05T10:39:32.524-06:00 [TRACE] StorageInterface PostgresChannelHandler.swift#L.103 ["psql_message": .readyForQuery(.idle), "psql_connection_id": 3] Backend message received**

**PostgresNIO/ConnectionStateMachine.swift:924: Fatal error: How can an error occur if the connection is already closed?**

^ Here, instead, we see that the psql_connection_id 3 that received an error is released and set to inactive, but then is immediately given a .readyForQuery(.idle) on the same connection ID, which results in the crash.

I have a suspicion and a workaround:

  1. I suspect that this may be due to the fact that the state of the ConnectionStateMachine is not placed in a read-write lock, and there is a race condition where the state is set to readyForQuery, receives a message, and then some other thread sets it to closed when an error is being handled, resulting in the assertion failure.
  2. My workaround was to change lines 985-986 of ConnectionStateMachine.swift from
        case .closed:
            preconditionFailure("How can an error occur if the connection is already closed?")

    to

        case .closed:
            let cleanupContext = self.setErrorAndCreateCleanupContext(error)
            return .closeConnectionAndCleanup(cleanupContext)

    Which is the same as in the closing case. This seemed to get rid of the problem and allow the channels to catch up and resolve themselves, and further queries executed without issue. However, we were still able to enter an "unreachable" state, which seems to be a bug.

Expected behavior

Connection error handling should fully clean up and close its connection without additional messages being allowed on that connection.

Environment

Seen in both Xcode 14.0.1 using swift 5.7+ and on ubuntu 20.04 using swift 5.7+ postgres-nio 1.15.0

Additional context

Again, my suspicion is the lack of a lock around the ConnectionStateMachine's State, but I am unsure if that was done intentionally. I also wonder if it may just be that the bad state can happen occasionally and the solution should be to handle it gracefully rather than with an assertion failure.

Austinpayne commented 1 year ago

@gwynne @fabianfett Does the fix (workaround?) described seem to be a useful change to make?

gwynne commented 1 year ago

I'll let @fabianfett take this one, that's all his code 😅

fabianfett commented 1 year ago

@GNMoseke Looking into this. There is multiple things going on here:

  1. I'm not sure, we should even close the connection here based on the error.
  2. We run into a reentrancy issue here. Based on the info at hand, we want to kill the info, but we still have data in our inbound buffer remaining.
fabianfett commented 1 year ago

@Austinpayne, @GNMoseke #379 should address the crash that you are seeing. Would you mind validating this fix?

fabianfett commented 1 year ago

Please reopen this issue, if above fix doesn't solve your problem.