Open vsarunas opened 13 hours ago
Okay, in the graph we have n tasks, each repeatedly jumping back and forth between a task-specific actor and an actor that's shared across all tasks. So it's a baseline expectation that this is going to scale badly because it's heavily contended on the shared actor, and we're trying to decide if this is surprisingly bad, even beyond that baseline. This is difficult, because there are a bunch of different reasons it could be scaling badly.
Since this benchmark is fundamentally heavily contended on a single resource, as the number of actual cores trying to do work increases, we do expect the costs of contention to increase, i.e. more time to be wasted trying to perform atomic sequences. So we always need to consider that part of the difference between OSes may just be that macOS Dispatch is more conservative about bringing up new threads and/or that the macOS kernel is more conservative about scheduling those threads onto cores, and that results in less contention because there are fewer cores involved.
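To make that contention-cost point concrete, here's a rough standalone sketch (not part of the benchmark; every name in it is mine): a single lock-protected counter hammered from a varying number of threads. The per-increment cost typically rises as threads are added even though each thread does a fixed amount of work, which is the same shape of degradation we'd expect from the shared actor here.

```swift
import Foundation
import Dispatch

// Shared state guarded by one lock; every thread fights for it, much like every
// Pinger in the reproduction below has to go through the one shared Pingable actor.
final class SharedCounter: @unchecked Sendable {
    private let lock = NSLock()
    private var value = 0

    func increment() {
        lock.lock()
        value += 1
        lock.unlock()
    }
}

for threadCount in [1, 2, 4, 8, 14] {
    let shared = SharedCounter()
    let group = DispatchGroup()
    let start = ContinuousClock.continuous.now

    for _ in 0..<threadCount {
        group.enter()
        Thread.detachNewThread {
            // Fixed work per thread: 1M increments through the contended lock.
            for _ in 0..<1_000_000 { shared.increment() }
            group.leave()
        }
    }
    group.wait()

    print("\(threadCount) threads: \(ContinuousClock.continuous.now - start)")
}
```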
We can also see from the stack trace that we're spending a lot of time in Linux's sem_post, mostly in the underlying futex implementation in the kernel. When a Swift task leaves an actor that's got more jobs to do, it schedules a job on the global thread pool to keep processing the actor. It's unsurprising that a thread pool would use a condition variable to manage idle threads, but it might be surprising that we're spending quite this much time in the condition variable. This could just be an artifact of the benchmark: we're doing relatively trivial amounts of work, so the heavy contention might just mean that all our jobs are very short outside of the contended sections. I suppose it's also possible that Linux's arm64 futex isn't very well tuned, at least in the kernel used by Ubuntu 24.04. More likely, the thread pool in Dispatch should just be doing a better job of avoiding the condition variable, e.g. by briefly polling the job queue before sleeping.
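To illustrate that last idea, here's a rough sketch of a worker-side "poll briefly before sleeping" strategy. This is not how libdispatch's pool actually works; the `JobQueue` type, the spin count, and everything else in it are invented for the example. The point is just that when jobs arrive every few microseconds, a short non-blocking poll usually finds work and skips the futex sleep/wake round trip entirely.

```swift
import Foundation

// Toy job queue for illustration only; none of this is libdispatch internals.
final class JobQueue: @unchecked Sendable {
    private let condition = NSCondition()
    private var jobs: [() -> Void] = []

    func push(_ job: @escaping () -> Void) {
        condition.lock()
        jobs.append(job)
        condition.signal()   // the analogue of the sem_post / futex wake seen in the trace
        condition.unlock()
    }

    func pop() -> () -> Void {
        // Phase 1: poll without blocking. If producers enqueue work every few
        // microseconds, this usually succeeds and we never touch the futex.
        for _ in 0..<1_000 {
            condition.lock()
            if !jobs.isEmpty {
                let job = jobs.removeFirst()
                condition.unlock()
                return job
            }
            condition.unlock()
            // A real pool would issue a cpu-relax / yield hint here.
        }

        // Phase 2: give up and sleep on the condition variable (futex wait).
        condition.lock()
        while jobs.isEmpty {
            condition.wait()
        }
        let job = jobs.removeFirst()
        condition.unlock()
        return job
    }
}
```

The obvious trade-off is that spinning burns CPU when the queue really is idle, so a real pool would bound the spin and back off; in a benchmark like this one, though, the queue is almost never empty for long.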
We could also try to improve throughput at a higher level by having processing jobs stay with the actor (when there are jobs to run there) instead of following the task like they normally do.
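That change would be a runtime scheduling policy, so it isn't something you can opt into from user code today. A rough user-level analogue, just to show why doing more work per hop through the shared actor helps, is to batch several pings into one actor call. The `BatchedPingable` variant below is hypothetical and is not part of the reproduction.

```swift
// Hypothetical variant of the Pingable actor from the reproduction below:
// one actor hop (and one potential thread-pool wake-up) covers `count` pings
// instead of exactly one.
actor BatchedPingable {
    private var counter = 0

    func ping(count: Int) -> Int {
        counter += count
        return counter
    }
}

func run(pings: Int, batch: Int, target: BatchedPingable) async -> Int {
    var last = 0
    var remaining = pings
    while remaining > 0 {
        let n = min(batch, remaining)
        last = await target.ping(count: n)   // a single suspension covers n pings
        remaining -= n
    }
    return last
}
```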
Description
To follow up on the discussion in https://github.com/swiftlang/swift-corelibs-libdispatch/issues/760, as mentioned by @ktoso here and @rjmccall here, I wanted to demonstrate how not controlling the thread count can significantly impact dispatch performance on Linux.
Using a Mac Mini M4 Pro (14-core variant) running Ubuntu 24.04 LTS in a Multipass VM, I tested performance by varying the number of cores available to the VM (as I couldn't find a way to control this via environment variables) and running the minimal actor test shown below.
The results show that while macOS performance remains constant, the same code on Linux degrades as more CPU cores are added. The Linux VM with 14 cores performs several times worse than when run with 2 cores:
Performance on x86 processors also degrades as the number of CPU cores in the system increases.
Stack
The trace goes through DefaultActorImpl::unlock(bool), then into dispatch and futex calls in the kernel:

Strangely, DefaultActorImpl::unlock() is filtered out here but the rest is intact:

Reproduction
Steps to reproduce
The example from https://github.com/swiftlang/swift/issues/68299 is suitable for running on Linux with different CPU core counts.
actor-latency-swift.swift
```swift
// https://github.com/snaury/coroactors/blob/main/src/comparisons/actor-latency-swift.swift

let clock = ContinuousClock.continuous

actor Pingable {
    private var counter: Int = 0

    func ping() -> Int {
        counter += 1
        return counter
    }

    func getCounter() -> Int {
        return counter
    }
}

actor Pinger {
    private var target: Pingable

    init(_ target: Pingable) {
        self.target = target
    }

    func run(_ n: Int, start: ContinuousClock.Instant, withLatencies: Bool = true) async -> Duration {
        var maxLatency: Duration = .seconds(0)
        if withLatencies {
            var end = clock.now
            maxLatency = end - start
            for _ in 0..<n {
                // … (the rest of the listing was lost when this issue was captured;
                // see the linked source file for the full ping loop and entry point)
```

actor-latency-swift.sh
```
#!/bin/bash

#export LIBDISPATCH_COOPERATIVE_POOL_STRICT=1

swiftc -swift-version 6 -O actor-latency-swift.swift

for p in 1 2 8 512 4096 8192 16384 32768; do
    echo -n "$p pingers..."
    ./actor-latency-swift --pingables 1 --pingers $p -c 10000000
done
```

Expected behavior
Performance should not degrade when running on a host with more CPU cores; it should remain consistent, as it does on macOS.
Environment
Additional information
No response