swiftlang / swift

The Swift Programming Language

CPU scaling degrades Actor ping-pong performance on Linux #77947

Open vsarunas opened 13 hours ago

vsarunas commented 13 hours ago

Description

To follow up on the discussion in https://github.com/swiftlang/swift-corelibs-libdispatch/issues/760, as mentioned by @ktoso here and @rjmccall here, I wanted to demonstrate how not controlling the thread count can significantly impact dispatch performance on Linux.

Using a Mac Mini M4 Pro (14-core variant) running Ubuntu 24.04 LTS in a Multipass VM, I tested performance by varying the number of cores available to the VM (as I couldn't find a way to control the thread count via environment variables) and running the minimal actor ping-pong test below.
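For reference, the core count the process actually observes inside the VM can be checked with a one-liner. Whether the default global executor sizes its thread pool from this value on Linux is an assumption on my part, not something verified here:

```swift
import Foundation

// Prints the number of processors the process sees inside the VM.
// Assumption (not verified here): the default global executor on Linux
// sizes its thread pool from this value.
print("activeProcessorCount:", ProcessInfo.processInfo.activeProcessorCount)
```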

The results show that while macOS performance remains constant, the same code on Linux degrades as more CPU cores are added. The Linux VM with 14 cores performs several times worse than when run with 2 cores:

(Chart: linux-core-count-actor)

Raw data:

| Pingers | macOS | linux14c | linux12c | linux8c | linux4c | linux2c |
| ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| 1 | 11,846,331 | 11,421,579 | 11,364,196 | 11,869,799 | 11,347,104 | 11,556,880 |
| 2 | 3,299,910 | 6,445,558 | 6,123,505 | 3,953,612 | 3,700,211 | 4,939,033 |
| 8 | 2,979,551 | 3,019,615 | 2,465,688 | 2,670,892 | 3,381,066 | 5,511,328 |
| 512 | 3,028,266 | 1,330,248 | 1,366,468 | 1,587,204 | 3,161,267 | 4,942,333 |
| 4,096 | 3,202,390 | 1,264,782 | 1,308,264 | 1,640,846 | 3,294,145 | 5,304,749 |
| 8,192 | 3,017,156 | 1,288,134 | 1,277,824 | 1,581,985 | 3,317,190 | 5,398,518 |
| 16,384 | 2,877,146 | 1,312,917 | 1,307,864 | 1,752,558 | 3,150,326 | 4,416,344 |
| 32,768 | 2,660,722 | 1,273,489 | 1,375,541 | 1,670,825 | 2,765,453 | 3,656,408 |

Performance on x86 processors also degrades as the number of CPU cores in the system increases.

Stack

The hot path goes through DefaultActorImpl::unlock(bool) and then into dispatch and futex calls in the kernel:

```
actor-latency-s   33229  3404.987861:     250000 task-clock:ppp: 
        ffff800080132b5c try_to_wake_up+0x28c ([kernel.kallsyms])
        ffff8000801330e4 wake_up_q+0x6c ([kernel.kallsyms])
        ffff8000802064e4 futex_wake+0x1ac ([kernel.kallsyms])
        ffff800080202a24 do_futex+0x144 ([kernel.kallsyms])
        ffff800080202c40 __arm64_sys_futex+0xf8 ([kernel.kallsyms])
        ffff800080032994 invoke_syscall+0x7c ([kernel.kallsyms])
        ffff800080032a94 el0_svc_common.constprop.0+0x4c ([kernel.kallsyms])
        ffff800080032bb0 do_el0_svc+0x28 ([kernel.kallsyms])
        ffff80008165bfc4 el0_svc+0x44 ([kernel.kallsyms])
        ffff80008165c798 el0t_64_sync_handler+0x148 ([kernel.kallsyms])
        ffff800080011648 el0t_64_sync+0x1b0 ([kernel.kallsyms])
            e1c8590bdaf0 sem_post+0x80 (/usr/lib/aarch64-linux-gnu/libc.so.6)
            e1c858f3d550 _dispatch_sema4_signal+0x1c (/usr/local/share/toolchains/6.0.2/usr/lib/swift/linux/libdispatch.so)
            e1c858f36290 _dispatch_semaphore_signal_slow+0x14 (/usr/local/share/toolchains/6.0.2/usr/lib/swift/linux/libdispatch.so)
            e1c858f31f5c _dispatch_root_queue_poke_slow+0x54 (/usr/local/share/toolchains/6.0.2/usr/lib/swift/linux/libdispatch.so)
            e1c859b83a70 (anonymous namespace)::DefaultActorImpl::unlock(bool)+0x164 (/usr/local/share/toolchains/6.0.2/usr/lib/swift/linux/libswift_Concurrency.so)
            e1c859b833e8 swift_task_switchImpl(swift::AsyncContext*, void ( swiftasynccall*)(swift::AsyncContext* swift_async_context), swift::SerialExecutorRef)+0x19c (/usr/local/share/toolchains/6.0.2/usr/lib/swift/linux/libswift_Concurrency.so)
            e1c859b8240c swift::runJobInEstablishedExecutorContext(swift::Job*)+0x19c (/usr/local/share/toolchains/6.0.2/usr/lib/swift/linux/libswift_Concurrency.so)
            e1c859b83728 (anonymous namespace)::ProcessOutOfLineJob::process(swift::Job*)+0x1f0 (/usr/local/share/toolchains/6.0.2/usr/lib/swift/linux/libswift_Concurrency.so)
            e1c859b82398 swift::runJobInEstablishedExecutorContext(swift::Job*)+0x128 (/usr/local/share/toolchains/6.0.2/usr/lib/swift/linux/libswift_Concurrency.so)
            e1c859b82f0c swift_job_run+0x9c (/usr/local/share/toolchains/6.0.2/usr/lib/swift/linux/libswift_Concurrency.so)
            e1c858f2a744 _dispatch_continuation_pop+0xec (/usr/local/share/toolchains/6.0.2/usr/lib/swift/linux/libdispatch.so)
            e1c858f2a574 _dispatch_async_redirect_invoke+0xb8 (/usr/local/share/toolchains/6.0.2/usr/lib/swift/linux/libdispatch.so)
            e1c858f3562c _dispatch_worker_thread+0x1b0 (/usr/local/share/toolchains/6.0.2/usr/lib/swift/linux/libdispatch.so)
```

Strangely, DefaultActorImpl::unlock() is filtered out here, but the rest of the stack is intact:

(Flame graph: sample-actor-latency-s-linux-dev-2024-12-04_153310)

Reproduction

Steps to reproduce

The example from https://github.com/swiftlang/swift/issues/68299 is suitable for running on Linux with different CPU core counts.

actor-latency-swift.swift

```swift
// https://github.com/snaury/coroactors/blob/main/src/comparisons/actor-latency-swift.swift
let clock = ContinuousClock.continuous

actor Pingable {
    private var counter: Int = 0

    func ping() -> Int {
        counter += 1
        return counter
    }

    func getCounter() -> Int {
        return counter
    }
}

actor Pinger {
    private var target: Pingable

    init(_ target: Pingable) {
        self.target = target
    }

    func run(_ n: Int, start: ContinuousClock.Instant, withLatencies: Bool = true) async -> Duration {
        var maxLatency: Duration = .seconds(0)
        if withLatencies {
            var end = clock.now
            maxLatency = end - start
            for _ in 0..
```

(Listing truncated; the full source is at the coroactors URL in the first comment line.)
actor-latency-swift.sh

```
#!/bin/bash
#export LIBDISPATCH_COOPERATIVE_POOL_STRICT=1
swiftc -swift-version 6 -O actor-latency-swift.swift
for p in 1 2 8 512 4096 8192 16384 32768; do
    echo -n "$p pingers..."
    ./actor-latency-swift --pingables 1 --pingers $p -c 10000000
done
```

Expected behavior

Performance should not degrade when running on a host with more CPU cores; it should remain consistent, as it does on macOS.

Environment

$ swift --version
Swift version 6.0.2 (swift-6.0.2-RELEASE)
Target: aarch64-unknown-linux-gnu

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 24.04.1 LTS
Release:        24.04
Codename:       noble

Additional information

No response

rjmccall commented 8 hours ago

Okay, in the graph we have n tasks, each repeatedly jumping back and forth between a task-specific actor and an actor that's shared across all tasks. So it's a baseline expectation that this is going to scale badly because it's heavily contended on the shared actor, and we're trying to decide if this is surprisingly bad, even beyond that baseline. This is difficult, because there are a bunch of different reasons this could be scaling badly.
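To make that structure concrete, here is a stripped-down sketch of the access pattern (the `Shared`/`Worker` names are just for illustration, not the benchmark's own types): every round trip from any of the n tasks has to take a turn on the single shared actor.

```swift
// Illustration of the contended pattern: n tasks, each with its own actor,
// all bouncing off one shared actor, so every round trip contends on `shared`.
actor Shared {
    private var counter = 0
    func ping() -> Int { counter += 1; return counter }
}

actor Worker {
    private let shared: Shared
    init(_ shared: Shared) { self.shared = shared }

    func run(rounds: Int) async {
        for _ in 0..<rounds {
            _ = await shared.ping()   // hop to the shared actor and back
        }
    }
}

@main
struct ContentionSketch {
    static func main() async {
        let shared = Shared()
        await withTaskGroup(of: Void.self) { group in
            for _ in 0..<8 {                       // the "n" pinger tasks
                let worker = Worker(shared)
                group.addTask { await worker.run(rounds: 100_000) }
            }
        }
    }
}
```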

Since this benchmark is fundamentally heavily contended on a single resource, as the number of actual cores trying to do work increases, we do expect the costs of contention to increase, i.e. more time to be wasted trying to perform atomic sequences. So we always need to consider that part of the difference between OSes may just be that macOS Dispatch is more conservative about bringing up new threads and/or that the macOS kernel is more conservative about scheduling those threads onto cores, and that results in less contention because there are fewer cores involved.
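One way to probe that on Linux would be to sample the process's thread count while the benchmark runs, for example by counting entries under /proc/self/task. The `currentThreadCount` helper below is just a sketch; it counts every thread in the process, not only Dispatch workers, so it's a rough indicator at best.

```swift
import Foundation

// Rough probe: count the process's OS threads on Linux by listing /proc/self/task.
// This includes every thread in the process, not just Dispatch pool workers.
func currentThreadCount() -> Int {
    (try? FileManager.default.contentsOfDirectory(atPath: "/proc/self/task").count) ?? -1
}

// Sampled once a second alongside the pingers, e.g.:
// Task.detached {
//     while true {
//         print("threads:", currentThreadCount())
//         try? await Task.sleep(for: .seconds(1))
//     }
// }
```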

We can also see from the stack trace that we're spending a lot of time in Linux's sem_post, mostly in the underlying futex implementation in the kernel. When a Swift task leaves an actor that's got more jobs to do, it schedules a job on the global thread pool to keep processing the actor. It's unsurprising that a thread pool would use a condition variable to manage idle threads, but it might be surprising that we're spending quite this much time in the condition variable. This could just be an artifact of the benchmark: we're doing relatively trivial amounts of work, so the heavy contention might just mean that all our jobs are very short outside of the contended sections. I suppose it's also possible that Linux's arm64 futex isn't very well tuned, at least in the kernel used by Ubuntu 24.04. More likely, the thread pool in Dispatch should just be doing a better job of avoiding the condition variable, e.g. by briefly polling the job queue before sleeping.
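For illustration, the poll-before-sleeping idea looks roughly like the sketch below. Dispatch's actual pool is written in C and is considerably more involved, so treat this Swift toy (with `PollingWorkerPool` and the `spinLimit` knob both made up here) as nothing more than the shape of the change: the semaphore, and hence the futex, is only touched once polling has come up empty.

```swift
import Dispatch
import Foundation

// Toy sketch of "poll briefly before sleeping"; Dispatch's real pool is not
// structured like this. Workers spin on the queue a bounded number of times
// before parking on the semaphore (which is where sem_post/futex shows up).
final class PollingWorkerPool {
    private let lock = NSLock()
    private var jobs: [() -> Void] = []
    private let wakeup = DispatchSemaphore(value: 0)
    private let spinLimit = 64            // made-up tuning knob

    func submit(_ job: @escaping () -> Void) {
        lock.lock(); jobs.append(job); lock.unlock()
        wakeup.signal()                   // may wake a parked worker
    }

    private func dequeue() -> (() -> Void)? {
        lock.lock(); defer { lock.unlock() }
        return jobs.isEmpty ? nil : jobs.removeFirst()
    }

    func workerLoop() {
        while true {
            if let job = dequeue() { job(); continue }

            // Poll for a bit first: under a ping-pong load the next job tends
            // to arrive almost immediately, so most iterations avoid the futex.
            var ranJob = false
            for _ in 0..<spinLimit where !ranJob {
                if let job = dequeue() { job(); ranJob = true }
            }
            if !ranJob {
                wakeup.wait()             // park until submit() signals
            }
        }
    }
}
```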

We could also try to improve throughput at a higher level by having processing jobs stay with the actor (when there are jobs to run there) instead of following the task like they normally do.