vapor / postgres-nio

🐘 Non-blocking, event-driven Swift client for PostgreSQL.
https://api.vapor.codes/postgresnio/documentation/postgresnio/
MIT License
317 stars 72 forks source link

Miscompile causing segfault within ConnectionStateMachine.avoidingStateMachineCoW() in nightly and 5.9 pre-release toolchains #340

Closed BradLarson closed 1 year ago

BradLarson commented 1 year ago

Describe the bug

Starting with nightly Swift snapshots in early January (and with the 5.9 Swift release branch), we've been experiencing segmentation faults in release builds of a project using PostgresNIO. We've been able to minimize this to what appears to be a miscompile of the avoidingStateMachineCoW() function on ConnectionStateMachine due to something going wrong in an optimization pass.

This seems to be a compiler issue, and we've reported it on apple/swift here. Until fixed within the compiler, people using PostgresNIO in a project built for release may start experiencing random segfaults starting with the 5.9 toolchain. We're examining the compiler-side optimization issue further.

Removing the avoidingStateMachineCoW() function and operating directly on the ConnectionStateMachine prevents this miscompile. A question we have is whether avoidingStateMachineCoW() provided a significant performance benefit, and if so whether that was still the case under current versions of the compiler. If it does not have a noticeable performance impact today, would it be possible to remove this function to evade the miscompile?

To Reproduce

We've provided steps to set up a sample project that exhibits this behavior in our issue on apple/swift. Running a swift run -c release should lead to a segfault within PostgresNIO using Swift nightly toolchains from the start of January onward, or toolchains based on the 5.9 release branch.

Additionally, you can see the premature release of a context object occurs within the avoidingStateMachineCoW() function by changing the @inline(__always) to @inline(never) before the definition of avoidingStateMachineCoW() and running address sanitizer using swift run -c release --sanitize=address. You'll get a report that looks like the following:

=================================================================
==20790==ERROR: AddressSanitizer: heap-use-after-free on address 0x00010bb05a20 at pc 0x000105ce3664 bp 0x00016b024d30 sp 0x00016b024d28
READ of size 8 at 0x00010bb05a20 thread T1
    #0 0x105ce3660 in PostgresChannelHandler.run(_:with:) PostgresChannelHandler.swift:253
    #1 0x105ce893c in closure #1 in PostgresChannelHandler.channelRead(context:data:) PostgresChannelHandler.swift:147
    #2 0x105cfdaac in specialized NIOSingleStepByteToMessageProcessor._decodeLoop(decodeMode:seenEOF:_:) PostgresBackendMessageDecoder.swift
    #3 0x105ce4864 in PostgresChannelHandler.channelRead(context:data:) PostgresChannelHandler.swift
    #4 0x1056990c4 in ChannelHandlerContext.invokeChannelRead(_:) ChannelPipeline.swift:1702
    #5 0x10569adb0 in ChannelPipeline.SynchronousOperations.fireChannelRead(_:) ChannelPipeline.swift:1162
    #6 0x1059a7f08 in BaseStreamSocketChannel.readFromSocket() BaseStreamSocketChannel.swift:133
    #7 0x105a1f9f8 in specialized BaseSocketChannel.readable0()+0xbc (PostgresNIOSegfault:arm64+0x100bc39f8) (BuildId: 5a8de09d30bb3b3cbc347ad7250208ff32000000280000000100000000000b00)
    #8 0x105a30704 in specialized SelectableEventLoop.handleEvent<A>(_:channel:) SelectableEventLoop.swift:394
    #9 0x105a25e38 in closure #2 in closure #2 in SelectableEventLoop.run() SelectableEventLoop.swift:469
    #10 0x105a32064 in partial apply for closure #2 in closure #2 in SelectableEventLoop.run() <compiler-generated>
    #11 0x105a292a8 in specialized Selector.whenReady0(strategy:onLoopBegin:_:) SelectorKqueue.swift:258
    #12 0x105a2068c in SelectableEventLoop.run() SelectableEventLoop.swift:460
    #13 0x1059e14a0 in closure #1 in static MultiThreadedEventLoopGroup.setupThreadAndEventLoop(name:parentGroup:selectorFactory:initializer:) MultiThreadedEventLoopGroup.swift:116
    #14 0x1059e8fa8 in partial apply for closure #1 in static MultiThreadedEventLoopGroup.setupThreadAndEventLoop(name:parentGroup:selectorFactory:initializer:) <compiler-generated>
    #15 0x1059edf08 in thunk for @escaping @callee_guaranteed (@guaranteed NIOThread) -> () <compiler-generated>
    #16 0x105a5807c in closure #1 in static ThreadOpsPosix.run(handle:args:detachThread:) ThreadPosix.swift:105
    #17 0x19c52a068 in _pthread_start+0x90 (libsystem_pthread.dylib:arm64+0x7068) (BuildId: 9f3b729aed043e65adacd75ad06ebbdc32000000200000000100000000020d00)
    #18 0x9d7080019c524e28  (<unknown module>)

0x00010bb05a20 is located 128 bytes inside of 136-byte region [0x00010bb059a0,0x00010bb05a28)
freed by thread T1 here:
    #0 0x108bc83ac in __sanitizer_mz_free+0xdc (libclang_rt.asan_osx_dynamic.dylib:arm64+0x403ac) (BuildId: b712dddb50d83e2ab18213d85b3dacef32000000200000000100000000000b00)
    #1 0x1aa68f950 in _swift_release_dealloc+0x34 (libswiftCore.dylib:arm64+0x3be950) (BuildId: 5034430d70ab33459d7df5e05067eeb232000000200000000100000000000b00)
    #2 0x5b08800105c738dc  (<unknown module>)
    #3 0x105c5db80 in specialized ConnectionStateMachine.avoidingStateMachineCoW<A>(_:)+0x2bc (PostgresNIOSegfault:arm64+0x100e01b80) (BuildId: 5a8de09d30bb3b3cbc347ad7250208ff32000000280000000100000000000b00)
    #4 0x105c593f4 in specialized ConnectionStateMachine.avoidingStateMachineCoW<A>(_:)+0x28 (PostgresNIOSegfault:arm64+0x100dfd3f4) (BuildId: 5a8de09d30bb3b3cbc347ad7250208ff32000000280000000100000000000b00)
    #5 0x105c7a63c in ConnectionStateMachine.parseCompleteReceived()+0x4c0 (PostgresNIOSegfault:arm64+0x100e1e63c) (BuildId: 5a8de09d30bb3b3cbc347ad7250208ff32000000280000000100000000000b00)
    #6 0x105ce8dfc in closure #1 in PostgresChannelHandler.channelRead(context:data:) PostgresChannelHandler.swift:134
    #7 0x105cfdaac in specialized NIOSingleStepByteToMessageProcessor._decodeLoop(decodeMode:seenEOF:_:) PostgresBackendMessageDecoder.swift
    #8 0x105ce4864 in PostgresChannelHandler.channelRead(context:data:) PostgresChannelHandler.swift
    #9 0x1056990c4 in ChannelHandlerContext.invokeChannelRead(_:) ChannelPipeline.swift:1702
    #10 0x10569adb0 in ChannelPipeline.SynchronousOperations.fireChannelRead(_:) ChannelPipeline.swift:1162
    #11 0x1059a7f08 in BaseStreamSocketChannel.readFromSocket() BaseStreamSocketChannel.swift:133
    #12 0x105a1f9f8 in specialized BaseSocketChannel.readable0()+0xbc (PostgresNIOSegfault:arm64+0x100bc39f8) (BuildId: 5a8de09d30bb3b3cbc347ad7250208ff32000000280000000100000000000b00)
    #13 0x105a30704 in specialized SelectableEventLoop.handleEvent<A>(_:channel:) SelectableEventLoop.swift:394
    #14 0x105a25e38 in closure #2 in closure #2 in SelectableEventLoop.run() SelectableEventLoop.swift:469
    #15 0x105a32064 in partial apply for closure #2 in closure #2 in SelectableEventLoop.run() <compiler-generated>
    #16 0x105a292a8 in specialized Selector.whenReady0(strategy:onLoopBegin:_:) SelectorKqueue.swift:258
    #17 0x105a2068c in SelectableEventLoop.run() SelectableEventLoop.swift:460
    #18 0x1059e14a0 in closure #1 in static MultiThreadedEventLoopGroup.setupThreadAndEventLoop(name:parentGroup:selectorFactory:initializer:) MultiThreadedEventLoopGroup.swift:116
    #19 0x1059e8fa8 in partial apply for closure #1 in static MultiThreadedEventLoopGroup.setupThreadAndEventLoop(name:parentGroup:selectorFactory:initializer:) <compiler-generated>
    #20 0x1059edf08 in thunk for @escaping @callee_guaranteed (@guaranteed NIOThread) -> () <compiler-generated>
    #21 0x105a5807c in closure #1 in static ThreadOpsPosix.run(handle:args:detachThread:) ThreadPosix.swift:105
    #22 0x19c52a068 in _pthread_start+0x90 (libsystem_pthread.dylib:arm64+0x7068) (BuildId: 9f3b729aed043e65adacd75ad06ebbdc32000000200000000100000000020d00)
    #23 0x9d7080019c524e28  (<unknown module>)

previously allocated by thread T1 here:
    #0 0x108bc7f2c in __sanitizer_mz_malloc+0x88 (libclang_rt.asan_osx_dynamic.dylib:arm64+0x3ff2c) (BuildId: b712dddb50d83e2ab18213d85b3dacef32000000200000000100000000000b00)
    #1 0x19c392c1c in _malloc_zone_malloc_instrumented_or_legacy+0x7c (libsystem_malloc.dylib:arm64+0x20c1c) (BuildId: fa535b0545933a7893d77bcff7431df632000000200000000100000000020d00)
    #2 0x2f6b0001aa68d6f4  (<unknown module>)
    #3 0xda2c8001aa68d8f0  (<unknown module>)
    #4 0xa44d800105bba1d4  (<unknown module>)
    #5 0x105bbf8d8 in PostgresConnection.send(_:logger:) PostgresConnection.swift:590
    #6 0x105b8cdc4 in closure #1 in _EventLoopConnectionPoolPostgresDatabase.send(_:logger:) ConnectionPool+Postgres.swift:49
    #7 0x1050d5684 in closure #1 in EventLoopConnectionPool.withConnection<A>(logger:_:) EventLoopConnectionPool.swift:125
    #8 0x1056d3e1c in closure #1 in EventLoopFuture._flatMap<A>(_:) EventLoopFuture.swift:525
    #9 0x1056d0008 in CallbackList._run() EventLoopFuture.swift:94
    #10 0x104f74bb4 in specialized EventLoopPromise._resolve(value:) <compiler-generated>
    #11 0x105cbacac in specialized PSQLEventsHandler.userInboundEventTriggered(context:event:) <compiler-generated>
    #12 0x10569937c in ChannelHandlerContext.invokeUserInboundEventTriggered(_:) ChannelPipeline.swift:1742
    #13 0x105693f94 in ChannelHandlerContext.fireUserInboundEventTriggered(_:) ChannelPipeline.swift:1537
    #14 0x105ce2788 in PostgresChannelHandler.run(_:with:) PostgresChannelHandler.swift:297
    #15 0x105ce893c in closure #1 in PostgresChannelHandler.channelRead(context:data:) PostgresChannelHandler.swift:147
    #16 0x105cfdaac in specialized NIOSingleStepByteToMessageProcessor._decodeLoop(decodeMode:seenEOF:_:) PostgresBackendMessageDecoder.swift
    #17 0x105ce4864 in PostgresChannelHandler.channelRead(context:data:) PostgresChannelHandler.swift
    #18 0x1056990c4 in ChannelHandlerContext.invokeChannelRead(_:) ChannelPipeline.swift:1702
    #19 0x10569adb0 in ChannelPipeline.SynchronousOperations.fireChannelRead(_:) ChannelPipeline.swift:1162
    #20 0x1059a7f08 in BaseStreamSocketChannel.readFromSocket() BaseStreamSocketChannel.swift:133
    #21 0x105a1f9f8 in specialized BaseSocketChannel.readable0()+0xbc (PostgresNIOSegfault:arm64+0x100bc39f8) (BuildId: 5a8de09d30bb3b3cbc347ad7250208ff32000000280000000100000000000b00)
    #22 0x105a30704 in specialized SelectableEventLoop.handleEvent<A>(_:channel:) SelectableEventLoop.swift:394
    #23 0x105a25e38 in closure #2 in closure #2 in SelectableEventLoop.run() SelectableEventLoop.swift:469
    #24 0x105a32064 in partial apply for closure #2 in closure #2 in SelectableEventLoop.run() <compiler-generated>
    #25 0x105a292a8 in specialized Selector.whenReady0(strategy:onLoopBegin:_:) SelectorKqueue.swift:258
    #26 0x105a2068c in SelectableEventLoop.run() SelectableEventLoop.swift:460
    #27 0x1059e14a0 in closure #1 in static MultiThreadedEventLoopGroup.setupThreadAndEventLoop(name:parentGroup:selectorFactory:initializer:) MultiThreadedEventLoopGroup.swift:116
    #28 0x1059e8fa8 in partial apply for closure #1 in static MultiThreadedEventLoopGroup.setupThreadAndEventLoop(name:parentGroup:selectorFactory:initializer:) <compiler-generated>
    #29 0x1059edf08 in thunk for @escaping @callee_guaranteed (@guaranteed NIOThread) -> () <compiler-generated>

Thread T1 created by T0 here:
    #0 0x108bc19b0 in wrap_pthread_create+0x50 (libclang_rt.asan_osx_dynamic.dylib:arm64+0x399b0) (BuildId: b712dddb50d83e2ab18213d85b3dacef32000000200000000100000000000b00)
    #1 0x1059e7d0c in specialized static MultiThreadedEventLoopGroup.setupThreadAndEventLoop(name:parentGroup:selectorFactory:initializer:) MultiThreadedEventLoopGroup.swift:115
    #2 0x1059e8164 in specialized Collection.map<A>(_:) <compiler-generated>
    #3 0x1059e1d40 in MultiThreadedEventLoopGroup.init(threadInitializers:selectorFactory:) MultiThreadedEventLoopGroup.swift:161
    #4 0x1059e18b4 in MultiThreadedEventLoopGroup.__allocating_init(numberOfThreads:) MultiThreadedEventLoopGroup.swift:141
    #5 0x105d31d64 in main main.swift:2
    #6 0x19c1ffe4c  (<unknown module>)
    #7 0xca07fffffffffffc  (<unknown module>)

SUMMARY: AddressSanitizer: heap-use-after-free PostgresChannelHandler.swift:253 in PostgresChannelHandler.run(_:with:)
Shadow bytes around the buggy address:
  0x007021780af0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x007021780b00: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x007021780b10: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x007021780b20: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x007021780b30: fa fa fa fa fd fd fd fd fd fd fd fd fd fd fd fd
=>0x007021780b40: fd fd fd fd[fd]fa fa fa fa fa fa fa fa fa fd fd
  0x007021780b50: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fa
  0x007021780b60: fa fa fa fa fa fa fa fa fd fd fd fd fd fd fd fd
  0x007021780b70: fd fd fd fd fd fd fd fd fd fa fa fa fa fa fa fa
  0x007021780b80: fa fa fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x007021780b90: fd fd fd fa fa fa fa fa fa fa fa fa fd fd fd fd
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07 
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
==20790==ABORTING

Expected behavior

The example project should run to completion without a segfault under optimized builds, as it does for debug builds or when avoidingStateMachineCoW() is removed in favor of direct access to the ConnectionStateMachine.

Environment

fabianfett commented 1 year ago

Looking at CI this seems to be resolved. Can we close this?

BradLarson commented 1 year ago

I just tested the above-described reproducer against the latest PostgresNIO commits (tagged version 1.14.0) and the latest nightly Swift toolchain release (2023-04-20) and the segmentation fault and related address sanitizer warnings are still there in the same places. It's possible that CI is not experiencing the same conditions, running an optimized build such that parseCompleteReceived() and therefore avoidingStateMachineCoW() are called in the same way. Debug or unoptimized builds will not exhibit this same behavior.

BradLarson commented 1 year ago

This miscompile should be resolved at the compiler level as of https://github.com/apple/swift/pull/66221