optimizely / swift-sdk

Swift SDK for Optimizely Feature Experimentation and Optimizely Full Stack (legacy)
https://www.optimizely.com/products/full-stack/
Apache License 2.0
21 stars 30 forks source link

[BUG] `AtomicProperty` deadlocks in heavily concurrent settings #517

Closed stephanecopin closed 8 months ago

stephanecopin commented 1 year ago

Is there an existing issue for this?

SDK Version

master

iOS Version

Any

Current Behavior

When a lot of work is done concurrently, AtomicQueue's getter deadlocks as its internal queue is waiting for be freed up. If the queues doing work are waiting on AtomicQueue's property value, then they both wait on each other, creating a deadlock.

In terms of behavior from the user perspective, if the deadlock happens on the main thread, the app will freeze and the user will have to force-kill the app. This happens mostly on older devices (iPhone 6S/iPhone 7/iPad 7th Gen).

Expected Behavior

AtomicProperty should never deadlock.

Steps To Reproduce

Run this test:

func test_atomicPropertyDeadlocks() async {
    let operationQueue = OperationQueue()
    let expectations = (0..<80).map { id in
        let expectation = expectation(description: "Queue Test \(id)")
        operationQueue.addOperation {
            self.subject.property = (self.subject.property ?? 0) + 1
            Thread.sleep(forTimeInterval: 0.1)
            self.subject.property = (self.subject.property ?? 0) + 1
            expectation.fulfill()
        }
        return expectation
    }

    let finalExpectation = expectation(description: "Final")
    operationQueue.addBarrierBlock {
        finalExpectation.fulfill()
    }

    await fulfillment(of: expectations + CollectionOfOne(finalExpectation), timeout: 10.0)

    self.subject.property = 0
}

The test never finishes.

The less CPU cores the target device has, the easier is it to reproduce.

Link

See above.

Logs / Stacktraces

No response

Severity

Affecting users

Workaround/Solution

Use a lock that doesn't leverage a queue would fix the issue. For example, using NSRecursiveLock:


class AtomicProperty<T> {
    private var _property: T?
    var property: T? {
        get {
            lock.withLock {
                _property
            }
        }
        set {
            lock.withLock {
                _property = newValue
            }
        }
    }
    private let lock = NSRecursiveLock()

    init(property: T?) {
        self._property = property
    }

    convenience init() {
        self.init(property: nil)
    }

    // perform an atomic operation on the atomic property
    // the operation will not run if the property is nil.
    func performAtomic(atomicOperation: (_ prop:inout T) -> Void) {
        lock.withLock {
            if var prop = _property {
                atomicOperation(&prop)
                _property = prop
            }
        }
    }
}

Fixes the deadlock.

Recent Change

No response

Conflicts

No response

muzahidul-opti commented 1 year ago

Acknowledged, will get back to you.

muzahidul-opti commented 1 year ago

Hi @stephanecopin thanks for your effort. I went through your test case. It looks like not a dead lock issue. You submit too many tasks concurrently, which leads to the system running out of threads. Does our current code cause your app to freeze?

stephanecopin commented 1 year ago

Hi @muzahidul-opti, indeed, this is basically what is happening. AtomicProperty waits for a queue to be available, but none are, resulting in a deadlock (the pending queues are waiting to read an AtomicProperty value, but they can't as there's no more queues, so all the queues are waiting for another queue to free up, which cannot happen). The test case is just an exaggerated way to consistently reproduce it on performant devices. This issue is currently happening in production and affecting some of our users with lower-end devices, making the app freeze.

In the meantime, we've resorted to forking the repository and implementing the fix as mentioned in the PR, but ideally it should be fixed here directly.

Thanks for looking into it!

muzahidul-opti commented 1 year ago

Hi @stephanecopin thanks for the insight. Will get back to you.

Tamara-Barum commented 11 months ago

Hello, I know it has been 1 month. This is an update that a ticket is created on our side and we're continuing to investigate. We are digging in to try to determine the true root cause. As you have a work around, we have this on our backlog and in investigation with our swift Eng, but it will be deprioritized to future maintenance effort. We will continue to update here as we have progress.

sergiofresneda commented 8 months ago

Hi, some news about this topic? Thanks!

muzahidul-opti commented 8 months ago

Hi @sergiofresneda, thank you for your patience and for sharing your proposed locking approach with our team. We've thoroughly reviewed the solution, and while it appears to be effective in your sample test case, we are proceeding with caution, as we recognize the potential significant impact it could have on our broader production user base.

Considering the impact of the changes, our team has decided not to merge the solution at this time. We want to ensure that any modifications align seamlessly with our user and don't introduce unexpected impacts.

We appreciate your understanding in this matter and encourage you to proceed with the forking process, as you mentioned. Rest assured; we are actively exploring potential workarounds.

Thank you once again for your collaboration and valuable contribution to our project.