Assigning an array element to another is sometimes abysmally slow

za-creature commented 1 year ago

Description In certain, tricky to reproduce, circumstances,

data[2] = data[next]

is (at least) an order of magnitude slower than

let wtf = data[next]
data[2] = wtf

Steps to reproduce Apologies for the long report, I did my best to replicate this using a simpler example but was unable to.

I've implemented a bounded priority queue using an interval heap (tldr: an inline min-max heap a.k.a. the unholy mixture of a min heap with a max heap) and this code is derived from the fuzzer I used to test it (please note that this is work in progress and not available for redistribution)

The actual issue is around line 77 in the following code:

class BoundedPriorityQueue<Item: Comparable> {
    // a bounded priority queue backed by an interval heap (inline min-max heap)
    // reserves two nil sentries as the parent of root to avoid negative range checks
    private var data: [Item?]
    public init(capacity: Int) {
        assert(capacity % 4 == 2, "capacity must be an odd multiple of 2")
        data = [Item?](repeating: nil, count: capacity + 2)
        limit = data.count>>1
    }

    @inline(__always) private func swap(_ i: Int, _ j: Int) {
        guard let a = data[i], let b = data[j], a > b else { return }
        data[j] = a
        data[i] = b
    }

    @inline(__always) private func compare(_ i: Int, _ j: Int) -> Bool {
        guard let a = data[i], let b = data[j] else { return false }
        return a > b
    }

    @inline(__always) private func compareAndSwap(_ i: Int, _ j: Int) -> Bool {
        guard let a = data[i], let b = data[j], a > b else { return false }
        data[j] = a
        data[i] = b
        return true
    }

    // heap up: used when inserting new elements
    private func upmin(_ p: Int) {
        var i: Int, p = p; repeat {
            i = p
            p = ((i>>2)<<1)|1
        } while compareAndSwap(p, i)
    }

    private func upmax(_ p: Int) {
        var i: Int, p = p; repeat {
            i = p
            p = (i>>2)<<1
        } while compareAndSwap(i, p)
    }

    private let limit: Int
    private func downmin() {
        // used when evicting the lowest priority message in favor of a new one
        // the queue is always full when this is called
        var l = 3, i, r: Int; repeat {
            swap(l, l-1)
            guard l < limit else { break }
            i = l
            r = (l<<1)|1
            l = r-2
            if compare(l, r) { l = r }
        } while compareAndSwap(i, l)
    }

    private func downmax() {
        // used when consuming the highest priority message
        var l = 2, i, r: Int; repeat {
            swap(l+1, l)
            guard l < limit else { break }
            i = l
            l = l<<1
            r = l+2
            if compare(r, l) { l = r }
        } while compareAndSwap(l, i)
    }

    private var next = 2 // index of next inserted element
    public func pop() -> Item? {
        guard next > 2 else { return nil }
        let max = data[2]
        next -= 1

        // MARK: START OF BUG REPORT
        //data[2] = data[next]
        let wtf = data[next]
        data[2] = wtf
        // MARK: END OF BUG REPORT

        data[next] = nil
        downmax()
        return max
    }

    public func append<T>(_ entry: Item, _ evict: ((Item) throws -> T?)? = nil) rethrows -> T? {
        // adds `entry` to the queue if possible
        // if this requires another item to be evicted, returns the result
        // (or exception) of calling `evict(item)` where `item` has the
        // lowest priority of all known items and _MAY_ be `entry` itself
        // if `evict` is not called, the return value is `nil`
        if next < data.count {
            data[next] = entry
            if next & 1 != 0 {
                // next is min, max exists at next-1
                let max = next - 1
                if compareAndSwap(next, max) {
                    upmax(max)
                } else {
                    upmin(next)
                }
            } else {
                // next is max, closest min is in the parent
                let min = ((next>>2)<<1)|1
                if compareAndSwap(min, next) {
                    upmin(min)
                } else {
                    upmax(next)
                }
            }
            next += 1
            return nil
        } else {
            // replace lowest prio; next is at least 4 and remains unchanged
            let min = data[3]!
            guard min < entry else {
                return try evict?(entry)
            }

            data[3] = entry
            downmin()
            return try evict?(min)
        }
    }
}

// MARK: test harness
import Foundation
import SwiftUI

func measure<Result>(block: () throws -> Result) rethrows -> Result {
    func now() -> UInt64 { clock_gettime_nsec_np(CLOCK_MONOTONIC_RAW) }
    let start = now()
    defer { print("\((now() - start) / 1_000_000)ms") }
    return try block()
}

func benchmark() { measure {
    var evicted = [Int]()
    @inline(__always) func evict(_ val: Int) { evicted.append(val) }

    let samples = 100_000 // number of samples in test
    let capacity = 50_002  // size of queue (odd multiple of 2)
    let range = 10_000_000 // max start range
    var output = Array(0..<capacity)

    let queue = BoundedPriorityQueue<Int>(capacity: capacity)

    let lower = Int.random(in: 0..<range)
    var input = Array(lower..<lower + samples)

    input.shuffle()
    evicted.removeAll(keepingCapacity: true)

    //startMeasuring()
    let logStep = samples / 10
    var count = 0
    input.forEach {
        queue.append($0, evict)
        count += 1
        if count % logStep == 0 { print("step") }
    }
    for i in 0..<capacity {
        output[i] = queue.pop()!
        count += 1
        if count % logStep == 0 { print("pop step") }
    }
    //stopMeasuring()

    evicted.sort()
    //let first = lower + samples - capacity
    //XCTAssertEqual(output.reversed(), Array((first..<lower + samples)))
    //XCTAssertEqual(evicted, Array(lower..<first))
}}

@main
struct BenchmarkApp: App {
    var body: some Scene {
        WindowGroup {
            Text("Hello, world!").onAppear(perform: benchmark)
        }
    }
}

Expected behavior These are the timings I measured on my setup (macx86 compiler, iphone 13 mini target):

For samples = 100k, capacity = 50_002 (default included in report, sorry for asking you to (un)comment some code to compare)

without local: 3.918s
with    local: 0.678s

For samples = 1M, capacity = 500_002 (heads up, this takes about 7 minutes)

without local: 372.236s
with    local:   8.457s

Heaps are nominally n log n (assuming my implementation is at least somewhat correct), so the with local timings check out. Not sure where the slowdown comes from, but it's a pretty big one:

To my knowledge, there are no (non-stack) allocations involved in the provided sample.
I came about this while pausing the debugger at random and it always stopped at the same line.

Environment

1.75.2 Apple Swift version 5.8.1 (swiftlang-5.8.0.124.5 clang-1403.0.22.11.100)
Xcode 14.3.1 (14E300c)
Deployment target: 14.7, 15.0

tbkka commented 1 year ago

When you stop the debugger and it shows on this line, what is the full backtrace? (If you're using Xcode, you may need to go into the LLDB window and type bt.) I'm curious whether there is some Swift runtime code getting invoked for that line.

P.S. Pausing the debugger at random is one of my favorite performance-testing tricks. ;-)

za-creature commented 1 year ago

Good call on the stack trace, I noticed that the top was busy doing a memmove but didn't spot the ArrayBuffer.consumeAndCreateNew from the next frame which indicates that a new copy is made for every assignment (though, that would explain the n^2 observed complexity)

thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x0000000216c6caa8 libsystem_platform.dylib`_platform_memmove + 88
    frame #1: 0x00000001aa525568 libswiftCore.dylib`Swift._ArrayBuffer._consumeAndCreateNew(bufferIsUnique: Swift.Bool, minimumCapacity: Swift.Int, growForAppend: Swift.Bool) -> Swift._ArrayBuffer<τ_0_0> + 668
    frame #2: 0x00000001aa52c208 libswiftCore.dylib`Swift.Array._makeMutableAndUnique() -> () + 100
    frame #3: 0x00000001aa52cd38 libswiftCore.dylib`Swift.Array.subscript.modify : (Swift.Int) -> τ_0_0 + 48
    frame #4: 0x000000010263e918 Benchmark`BoundedPriorityQueue.pop(self=0x0000000281320c90) at BenchmarkApp.swift:77:17
    frame #5: 0x0000000102640540 Benchmark`closure #1 in benchmark() at BenchmarkApp.swift:167:27
    frame #6: 0x000000010263fce8 Benchmark`measure<Result>(block=0x10263ff78) at BenchmarkApp.swift:137:16
    frame #7: 0x000000010263ff54 Benchmark`benchmark() at BenchmarkApp.swift:141:20

tbkka commented 1 year ago

Ah. That was @jckarter's guess, and yes it would certainly explain the slowdown. Somehow, the assignment

data[2] = data[next]

is holding an access to the array (for the read) while performing the write, thus forcing a copy.

Maybe @atrick, @nate-chandler, or @meg-gupta have ideas here?

tbkka commented 1 year ago

@jckarter @atrick is this really a SILGen problem? Should the RHS and LHS of an assignment be non-overlapping access scopes?

atrick commented 1 year ago

Should the RHS and LHS of an assignment be non-overlapping access scopes?

Yes, the access on the RHS needs to complete before the access on the LHS starts. If that's not the case, then we have a serious problem.

za-creature commented 1 year ago

Looks like this is a struct vs class issue:

class TestClass<Item> {
    private var data: [Item?]
    public init(capacity: Int) {
        data = Array(repeating: nil, count: capacity)
    }

    func copy(from: Int, to: Int) {
        data[to] = data[from]
    }
}

struct TestStruct<Item> {
    private var data: [Item?]
    public init(capacity: Int) {
        data = Array(repeating: nil, count: capacity)
    }

    mutating func copy(from: Int, to: Int) {
        data[to] = data[from]
    }
}

import Foundation
func measure<Result>(_ label:String, block: () throws -> Result) rethrows -> Result {
    func now() -> UInt64 { clock_gettime_nsec_np(CLOCK_MONOTONIC_RAW) }
    let start = now()
    defer { print("\(label) \((now() - start) / 1_000_000)ms") }
    return try block()
}

func benchmark() {
    let count = 100_000
    let accessPattern = Array(0..<count).shuffled()
    measure("class") {
        var test = TestClass<Int>(capacity: count)
        for i in 0..<count {
            test.copy(from: i, to: accessPattern[i])
        }
    }

    measure("struct") {
        var test = TestStruct<Int>(capacity: count)
        for i in 0..<count {
            test.copy(from: i, to: accessPattern[i])
        }
    }
}

import SwiftUI
@main
struct BenchmarkApp: App {
    var body: some Scene {
        WindowGroup {
            Text("Hello, world!").onAppear(perform: benchmark)
        }
    }
}

Bench:

class 12993ms
struct 34ms

za-creature commented 1 year ago

After some R&R with the profiler, my key takeaway from all of this is that once you're certain that your accesses are synchronized and within bounds, swap out Array[T] for UnsafeMutablePointer[T] and spare a few electrons. It's not really a new problem but, accidentally quadratic issues aside, Array[T] is rather slow which is likely why this wasn't spotted before.

tbkka commented 10 months ago

@za-creature Sorry for losing the thread here.

Does this still reproduce with a newer compiler?
If so, is there some way we can get a copy of your source code so we can study what's going on here? (We promise not to share it!)

za-creature commented 10 months ago

Happy to share my code with Apple, but that's a "for-future-reference" sort of thing:

I've re-ran the code from my prevous comment on my phone (13 mini) and got the same result with Xcode Version 15.0.1 (15A507):

class 12303ms
struct 32ms

I'll send you the full project if you need it, but iirc, I just created a new swiftui app, deleted all files but the one that contained @main and replaced it with the contents of that comment. It should show up as an empty screen, with output on console (after around 15 seconds for count = 100_000).

tbkka commented 10 months ago

Let me try that...

tbkka commented 10 months ago

Tried that and found something interesting: If I set Xcode to Debug build, I do see the discrepancy you describe. But in a Release build, I see very similar performance for these two cases. Do you also see that?

tbkka commented 10 months ago

@meg-gupta @eeckstein: Are we missing some borrow scope fixup pass in -Onone builds?

za-creature commented 10 months ago

Can somewhat confirm (haven't figured out how to do a release build but that's probably cause I deleted too much stuff from my project):

The previous timings (12s vs 30ms) hold for -Onone

For the other options, I had to bump let count = 30_000_000 cause it was faster than my shabby perf tool could handle and ended up with (any higher than that and my phone OOMs, didn't do the math but I'll take its word for it):

For -Osize, the timings are:

class 5074ms
struct 353ms

With '-O', they're

class 367ms
struct 347ms

I consider the struct timings to be well within tolerance, and can attribute the slight speed bump for class-O to object indirection but the -Osize almost-but-not-quite-there behavior is still weird.

It might be that the root cause is still there, but the optimizer is smart enough to at least partially fix it in this simple case.

It's good that this isn't an issue for prod builds, but the fact that this appears to be resolved by the optimizer and not the code generator is still making me feel uneasy somehow. It's been about a decade since I last wrote low level code so I'm not up to date with compiler etiquette but by my (possibly dated) knowledge, I'm willing to pay a 50% performance penalty for range checks and the ability to attach a debugger, but not an order of magnitude more.

tbkka commented 10 months ago

Thank you. I agree that Debug builds should not show a discrepancy this big; I just wanted to make sure I was actually seeing the same issue. It looks like there might be some optimization passes that should be enabled in -Onone and -Osize but are not currently. (A number of our "optimization passes" handle what you might consider "code generation." It's just a peculiarity of how we've organized the compiler internals.)

swiftlang / swift

Assigning an array element to another is sometimes abysmally slow #67655