rust-lang / rust

Empowering everyone to build reliable and efficient software.
https://www.rust-lang.org

Tracking Issue for `#![feature(available_parallelism)]` #74479

Closed yoshuawuyts closed 2 years ago

yoshuawuyts commented 4 years ago

The feature gate for the issue is #![feature(available_parallelism)].

This is a tracking issue for std::thread::available_parallelism; a portable API for determining how many threads to spawn so that a program can make use of all the parallelism available on a machine.

Public API

#![feature(available_parallelism)]
use std::thread;

fn main() -> std::io::Result<()> {
    let count = thread::available_parallelism()?.get();
    assert!(count >= 1);
    Ok(())
}

Tasks

Notes on Platform-specific behavior

available_parallelism is (for now) only considered a hint. The following platform limitations exist:

Implementation history

yoshuawuyts commented 4 years ago

A fix for the Linux undercounting exists in num_cpus, which is license-compatible with the compiler. I believe the fix boils down to adding this conditional to the Linux detection code:

let mut set: libc::cpu_set_t = unsafe { mem::zeroed() };
if unsafe { libc::sched_getaffinity(0, mem::size_of::<libc::cpu_set_t>(), &mut set) } == 0 {
    // Count the CPUs present in this process's affinity mask.
    let mut count: u32 = 0;
    for i in 0..libc::CPU_SETSIZE as usize {
        if unsafe { libc::CPU_ISSET(i, &set) } {
            count += 1;
        }
    }
    count as usize
} else {
    // fall back to the `libc::_SC_NPROCESSORS_ONLN` code we're currently using
}

cc/ @luser

the8472 commented 4 years ago

@yoshuawuyts that would be insufficient due to cgroup CPU quotas, which get your entire cgroup frozen until the next time slice once the quota is used up. Docker and other containerization services make heavy use of that for resource allocation. One can calculate the equivalent of available full-time CPU cores per time slice from the quotas, and other runtimes (e.g. Java) already do.

see https://www.kernel.org/doc/Documentation/scheduler/sched-bwc.txt
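To make the quota mechanism concrete: under cgroup v1 the CFS bandwidth configuration described in sched-bwc.txt lives in `cpu.cfs_quota_us` and `cpu.cfs_period_us`. A rough sketch of turning that into a thread count follows; the mount path, helper names, and round-up policy are illustrative assumptions, not num_cpus's actual code.

```rust
use std::fs;

/// Convert a CFS quota/period pair (microseconds) into an effective CPU
/// count. Rounds up so a 1.5-CPU quota still yields 2 worker threads
/// (the rounding policy is an assumption, not a documented guarantee).
fn cpus_from_quota(quota_us: i64, period_us: i64) -> Option<usize> {
    if quota_us <= 0 || period_us <= 0 {
        return None; // a quota of -1 means "no limit configured"
    }
    Some((((quota_us + period_us - 1) / period_us).max(1)) as usize)
}

/// Read the cgroup v1 CFS bandwidth configuration, if any
/// (assumes the conventional /sys/fs/cgroup/cpu mount point).
fn cfs_quota_cpus() -> Option<usize> {
    let read = |name: &str| -> Option<i64> {
        fs::read_to_string(format!("/sys/fs/cgroup/cpu/{name}"))
            .ok()?
            .trim()
            .parse()
            .ok()
    };
    cpus_from_quota(read("cpu.cfs_quota_us")?, read("cpu.cfs_period_us")?)
}
```

On machines without a configured quota (or without that cgroup mount) the sketch falls back to `None`, at which point an affinity- or sysconf-based count would apply.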

strohel commented 4 years ago

I would suggest that taking into account Linux a) cpusets (https://github.com/rust-lang/rust/issues/74479#issuecomment-693438368) and b) cgroup CPU bandwidth limits (https://github.com/rust-lang/rust/issues/74479#issuecomment-693672164) be considered blockers for the stabilization of available_concurrency.

Rationale: change b) in num_cpus was a source of significant CPU efficiency gains and potential memory savings in cloud-based environments. If available_concurrency is stabilized, it is expected that people will migrate to it from num_cpus. That would currently cause a regression.

This is complicated by the fact that associating "Kubernetes CPU limits" with "cgroups" and https://www.kernel.org/doc/Documentation/scheduler/sched-bwc.txt is non-trivial, as there are abstraction layers between them (so people may not be aware of this even in the presence of documentation on available_concurrency).

The fix can also be to include some num_cpus code, but the cgroups part is more substantial than the cpusets one.

yoshuawuyts commented 4 years ago

@strohel I definitely agree we should support cpusets before stabilizing: we may currently over-report concurrency in containers which is something that should be fixed.

But I'm less sure about cgroup CPU bandwidth limits. If I'm understanding https://github.com/rust-lang/rust/issues/74479#issuecomment-693672164 correctly, this would cause the reported number of CPUs to fluctuate during the runtime of the program. I believe this may cause issues when used in examples like the following:

use std::thread;

fn main() -> std::io::Result<()> {
    // perform init work here (connect to db, warm caches, etc.)

    let max_threads = thread::available_concurrency().map(|n| n.get()).unwrap_or(1);
    let pool = ThreadPool::init(max_threads);

    // use pool for the duration of the program
    Ok(())
}

If the boot sequence of a program is CPU-intensive, by the time the thread pool is initialized the CPU quota may have been expended. Which means that the thread pool may set a limit lower than the max supported concurrency for the program, causing an overall under-utilization of resources.

It would seem that the right use of a bandwidth-aware API would be to periodically update the concurrency, not to statically initialize it once, since that risks under-utilization if the quota happens to be low at that moment. Did I interpret this accurately?

edit: On closer reading I think I may have indeed misinterpreted. The quota value is statically set, and not constantly updated during a program's runtime. Which means it can indeed be used to calculate the max concurrency and does not need to be updated in a loop. If this is indeed the case, I agree we should account for this.

strohel commented 4 years ago

@yoshuawuyts I see you've just edited the comment; the clarification is correct.

It is true that one can change the CPU quota of a running program, but AFAIK the same holds for cpusets. Linux even supports physical CPU hotplugging at runtime. That's a non-issue; the API can do nothing about it (beyond mentioning it in the docs). Caching the value is problematic for this reason.

the8472 commented 4 years ago

If the boot sequence of a program is CPU-intensive, by the time the thread pool is initialized the CPU quota may have been expended. Which means that the thread pool may set a limit lower than the max supported concurrency for the program, causing an overall under-utilization of resources.

Perhaps my post was not clear. There is a quota configuration, which you can read from sysfs (as quota per interval and interval length), and the runtime depletion of that quota. The depletion is dynamic, but the configuration does not change unless someone explicitly reconfigures it. available_concurrency() would be based on the configuration, not on the remaining quota available in the current time slice.
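For reference, under cgroup v2 this static configuration lives in a single `cpu.max` file, whose format is "&lt;quota&gt; &lt;period&gt;" in microseconds, or the literal "max" for unlimited. A hedged sketch of parsing it (the round-up policy is my assumption, not what std implements):

```rust
/// Parse a cgroup v2 `cpu.max` value into an effective CPU count.
/// Returns None when no bandwidth limit is configured.
fn parse_cpu_max(contents: &str) -> Option<usize> {
    let mut parts = contents.split_whitespace();
    let quota = parts.next()?;
    if quota == "max" {
        return None; // unlimited
    }
    let quota: u64 = quota.parse().ok()?;
    // The period field defaults to 100000 (100ms) when omitted.
    let period: u64 = parts.next().unwrap_or("100000").parse().ok()?;
    if period == 0 {
        return None;
    }
    // Round up: a 1.5-CPU quota should still report 2.
    Some((((quota + period - 1) / period).max(1)) as usize)
}
```

A process would read this from e.g. `/sys/fs/cgroup/<group>/cpu.max` once at startup, matching the point above that the configuration, not the depletion, is what the API should reflect.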

raphaelcohn commented 3 years ago

I'm deeply against including this in Rust's core API. Whilst conceptually it seems helpful and even useful, the technique used for Linux isn't just broken with regard to containers; it's broken in a number of other situations as well. It turns out doing this well is extremely hard, and I've gone through several iterations of my own to get it 'right'.

First up, sysconf(_SC_NPROCESSORS_CONF) cannot be relied upon, e.g. in musl libc, as it uses the sched_getaffinity syscall. As my notes in my linux-support crate make clear, it becomes unreliable after the sched_setaffinity syscall has been made. As well as taking into account current cgroup2 settings (e.g. effective cgroup hyper threads), one needs to be aware that hyper-threading may have been disabled and that some threads may be isolated from the OS (eg https://github.com/lemonrock/linux-support/blob/master/workspace/linux-support/src/cpu/HyperThread.rs#L171). Whether hyper-threading matters to the count is also important; how threads share CPU caches and resources (eg https://github.com/lemonrock/linux-support/blob/master/workspace/linux-support/src/cpu/HyperThread.rs#L302), which NUMA nodes they are associated with, whether they are online, etc. all have a significant impact.

Whilst wanting a count of CPUs is a common request, naively spinning up threads for all CPUs (or all CPUs minus one, say) is never going to work unless one can ensure a dedicated OS instance to run on.

raphaelcohn commented 3 years ago

@strohel - the API can do something about CPU hotplugging. It's quite discoverable.

ibraheemdev commented 3 years ago

Is there any reason not to call this num_cpus?

yoshuawuyts commented 3 years ago

@ibraheemdev you can find a conversation that lead to the current naming here: https://github.com/rust-lang/rust/pull/74480#issuecomment-662045841.

The short version of "why not name it num_cpus?" is that this API does not return the number of CPUs in a computer (neither logical nor physical). Instead it returns the amount of "concurrency" [1] available to a process at that time.


[1]: We borrowed the word "concurrency" from C++'s hardware_concurrency API.

joshtriplett commented 3 years ago

I submitted https://github.com/rust-lang/rust/pull/89310 to implement available_concurrency using sched_getaffinity on Linux, which handles the case where the process doesn't have access to all CPUs, such as when limited via taskset or similar.

raphaelcohn commented 3 years ago

@joshtriplett I still think stabilizing this is a really bad idea; see my comment above on sched_setaffinity. If you really want to optimize concurrency, then I think folks should use specialized crates; for example, my linux-support crate. It's far richer and lets one handle a lot of confusing situations.

joshtriplett commented 3 years ago

@raphaelcohn wrote:

If you really want to optimize concurrency, then I think folks should use specialized crates

Rust projects need some portable mechanism to determine a default number of threads to run, even if that's an approximation. (Projects currently use the num_cpus crate for this.) Crates that run on many platforms and need to use the full capabilities of the system's CPUs are not necessarily interested in adding various non-portable code paths to detect further nuances of the system. Some crates may wish to do a more in-depth adaptation to the exact capabilities of the system, but many crates just want a single number for "how many threads should I spawn", and it makes sense to have a portable API to return that number, without making an application have to directly deal with the system-specific nuances that might go into such a number.

The purpose of the available_concurrency() call is not to determine the number of CPUs on the system; it's to determine a reasonable default for the amount of concurrency to use. I did see your comment above about sched_setaffinity; however, if someone uses sched_setaffinity to limit the number of CPUs the process can run on, it's correct for available_concurrency() to return a correspondingly smaller number of CPUs, so that the process defaults to running a thread for each CPU that its affinity allows it to run on.

If you feel there's some additional nuance this function could be detecting to provide a better default, I'd be happy to review further pull requests that detect (for instance) additional aspects of cgroup configuration. But I do believe we should encourage people to use an API abstraction like this, to provide a reasonable default.
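The "one portable number, with a fallback" pattern described above looks roughly like this, using the name that was eventually stabilized (available_parallelism; at the time of this comment the API was still called available_concurrency):

```rust
use std::thread;

/// A portable default thread count with a conservative fallback: if the
/// platform can't answer the query, use a single thread rather than failing.
fn default_thread_count() -> usize {
    thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1)
}
```

An application would then size its pool with something like `ThreadPool::new(default_thread_count())`, where `ThreadPool` stands in for whatever pool type it uses; the point is that the system-specific nuances stay behind the single std call.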

raphaelcohn commented 3 years ago

@joshtriplett I couldn't disagree more. Portability by compromise is a poor goal; most Rust developers trust the standard library to produce performant code (something I myself used to believe). However, I'm not going to hold this up. If this change goes in, then at the very least, could we please caution those that use it that it is a simplistic solution, and highlight that the underlying implementation on Linux is brittle.

Lastly: "if I feel". This is clumsily expressed and reads as a demonstration of power, not equality; clearly I do feel this way. It's clear that you want to make this change, that you have the power to make it, and that you have the majority view. I have no interest in engaging further.

yoshuawuyts commented 3 years ago

Follow up to https://github.com/rust-lang/rust/pull/74480#issuecomment-660557465

I had a chat with @sivadeilra today on how to approach the 64-thread limit on Windows, and whether we could detect more than 64 threads on a system by default. They mentioned that in order to do that we would need to change the ambient state of a program, which a lookup function probably shouldn't do.

One option @sivadeilra mentioned we could explore is adding an API via std::os::windows::* to explicitly change the ambient state. We should probably prototype this on crates.io first though.


Follow up to https://github.com/rust-lang/rust/pull/74480#issuecomment-662129901

While looking through the discussion on the original PR I also realized that we weren't tracking https://github.com/rust-lang/rust/pull/74480#issuecomment-662129901. It mentions that Windows might be able to limit the amount of concurrency available to a program via process affinity masks / job objects. @sivadeilra, do you have a sense of how we could support that? Would getting those values also interact with the ambient state?

joshtriplett commented 3 years ago

@raphaelcohn FWIW, my intent was precisely the reverse. I was trying to explicitly invite further refinement to the implementation, to avoid the implication that it shouldn't account for target-specific complexity when it can.

yoshuawuyts commented 2 years ago

The API, documentation, and naming have all undergone thorough review in the past few months. And with https://github.com/rust-lang/rust/pull/89310 merged, we are now returning accurate numbers for the majority of uses.

The only known limitation at the moment is that we're not detecting externally-set thread count limits on Windows. While not unimportant, there is nothing in the API or documented guarantees that prevents us from improving accuracy later on.

It's been well over a year since work on this started, and I think we've reached a point where we are ready to propose stabilization. @joshtriplett, would you be able to start a libs FCP?

the8472 commented 2 years ago

The only known limitation at the moment is that we're not detecting externally-set thread count limits on Windows.

CFS bandwidth limits (commonly used to limit containers) have been mentioned several times. I consider that a significant limitation and #89310 does not address that since bandwidth limits and affinities are independent ways to allocate CPU resources.

yoshuawuyts commented 2 years ago

@the8472 Thanks for raising that; that makes sense! I didn't realize Linux had two distinct ways of restricting CPU resources. Just to make sure we're entirely on the same page: what you're saying we're still missing is an equivalent to num_cpus's cgroups_num_cpus logic?

the8472 commented 2 years ago

Yes, although that particular implementation is debatable since it caches the calculation even though quotas can be changed at runtime.

joshtriplett commented 2 years ago

@the8472

CFS bandwidth limits (commonly used to limit containers) have been mentioned several times. I consider that a significant limitation and #89310 does not address that since bandwidth limits and affinities are independent ways to allocate CPU resources.

That's absolutely true. However, looking into it further, I'm not actually sure the degree to which either CPU shares or CFS bandwidth limits should affect available_parallelism, and in particular I'm not sure the semantic implemented by num_cpus is exactly what we want.

(By contrast, I do think we should make sure to account for cgroup CPU sets.)

Talking it through for each of the types of cgroup restriction:

There's also a general principle that I would tend to favor here: in the absence of a clear answer that's a winner much more often than it isn't, having simpler and more predictable behavior seems preferable. When people set CFS bandwidth limits, they expect that to affect the scheduler; it's not entirely obvious that they would also expect random applications to read those limits and run fewer threads than they otherwise would have. In particular, if you want an application to run on fewer threads, you might want cpusets or affinity, rather than CFS bandwidth limits; if you've set CFS bandwidth limits, perhaps you want to run on all CPUs for a fraction of bandwidth, rather than a fraction of CPUs for all their bandwidth.

Given all of the above, I personally would favor not taking CFS bandwidth quotas into account by default, in the interests of optimizing for the "finish fast" case or the "handle bursts of requests in parallel" case, and in the interests of simplicity (especially in the face of hierarchical quotas).

Note: I'm happy to start the FCP on this once it settles, but I don't want to do so after just having presented a new argument for consideration.

the8472 commented 2 years ago

That's absolutely true. However, looking into it further, I'm not actually sure the degree to which either CPU shares or CFS bandwidth limits should affect available_parallelism, and in particular I'm not sure the semantic implemented by num_cpus is exactly what we want.

Prior art:

cpusets are directly a limitation on affinity, and I think it's appropriate for available_parallelism to respect them.

But they require static allocation of individual cores, that's why the default strategy of container runtimes is to allocate based on quotas or shares and let the kernel scheduler hash out the details.

CPU shares limit what percentage of the CPU you get, and effectively throttle your process down proportionally if other processes are also using the CPU, while allowing you to burst past that if the CPU isn't busy. Given that if the CPU isn't busy you're allowed to saturate it, I don't think this should affect available_parallelism.

Agreed, shares shouldn't factor into it.

On the other hand, if you expect to do a CPU-bound computation for a relatively short amount of time, potentially less than the quota, then running on all 8 CPUs may let you finish your computation four times faster. And if you're doing non-CPU-bound work, such as responding to HTTP requests, running on all CPUs may give a burst of requests lower overall latency because there are more CPUs to respond to requests concurrently, as long as you don't saturate your CPU bandwidth.

That seems like a fragile situation. "Finishing fast" only works as long as the combined workload, including the increased synchronization overhead from running more threads and the decreased throughput from reduced turbo clocks, stays below the bandwidth limit. As soon as you hit it you fall off a latency cliff instead of degrading gently. And if you care enough about latency that you distribute the work of individual requests over multiple threads, you're probably not going to blindly apply CFS bandwidth limits, or you might configure your thread pools manually based on benchmarks.

And if you're doing non-CPU-bound work, such as responding to HTTP requests, running on all CPUs may give a burst of requests lower overall latency because there are more CPUs to respond to requests concurrently, as long as you don't saturate your CPU bandwidth.

If you have a threadpool for blocking IO tasks then setting it to the available CPU cores is rarely a good metric anyway since IOPS, external services or whatever will be the limiting factor, not available cores. Even on a single-core machine you'll want more threads to keep io queues full. Imo that's not a good argument against factoring quotas into available_parallelism.

CFS bandwidth limits don't have bursting,

They optionally do now with cfs_burst_us. Granted, this hasn't trickled down to containers yet.

joshtriplett commented 2 years ago

@the8472 The future availability of bursting makes me more inclined to think we shouldn't limit available_parallelism based on CFS bandwidth.

I'm less concerned about the case of many HTTP request/response threads, and more concerned about the case of CPU-bound computation that can finish within one quota period (potentially including bursting).

the8472 commented 2 years ago

I'm less concerned about the case of many HTTP request/response threads, and more concerned about the case of CPU-bound computation that can finish within one quota period (potentially including bursting).

That concern seems quite theoretical to me. It falls apart in several related ways.

(1) On a system where CPU quotas are enforced to the degree that a service gets a much smaller slice than the available hardware threads, that is usually because other services are doing their own work, which means your service may not even benefit from spreading its work over many threads: it may end up competing with those other services.

(2) When you optimize for latency you cannot just optimize for the minimum latency. Even the 99th-percentile latency of individual work items can end up determining the user-perceived overall latency. Now consider what happens when you distribute work over all hardware threads. The minimum latency can go down, assuming the work is parallelizable. But once the pool begins working on the Nth item and the quota runs out, all the threads will be stuck and that request will be stalled for the rest of the quota period, which drives up tail latencies. This happens even when your request queues are otherwise empty (e.g. when the system auto-scales based on queue size), because by the time the threads get frozen the item has already been popped off the work queue. To lower the variance of latencies one has to make sure that threads don't get frozen, and one does that by only starting as many threads as there is CPU quota available.

(3) Even ignoring the increase in tail latencies in (2), median latencies will also go up under load, because keeping more cores active increases IPC overhead and decreases available turbo clocks. Competing with other services (1) will also lead to more involuntary context switches and cache pressure. Lower throughput for the same amount of spent core-seconds means higher latencies.

(4) The scenario only makes sense if the whole system is almost always under-utilized AND work arrives at a very predictable rate (if it's a Poisson process then random queue-depth spikes can screw you over via (2)). But in such a situation you wouldn't need quotas in the first place.
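To make the throttling cliff in (2) concrete, here is a toy model with invented numbers (a 200 ms quota per 100 ms period, i.e. 2 effective CPUs, on an 8-thread machine); it is purely illustrative arithmetic, not a real scheduler model:

```rust
/// Toy model of CFS throttling for a fully CPU-bound thread pool.
/// Returns (microseconds of work until the quota is exhausted, stall time
/// for the remainder of the period). All inputs are invented for illustration.
fn throttle_model(quota_us: u64, period_us: u64, threads: u64) -> (u64, u64) {
    let busy = quota_us / threads; // N threads burn the whole quota this fast
    let stall = period_us.saturating_sub(busy); // frozen for the remainder
    (busy, stall)
}
```

With these numbers, `throttle_model(200_000, 100_000, 8)` yields 25 ms of work followed by a 75 ms freeze every period, stalling whatever items were already in flight; a quota-sized pool, `throttle_model(200_000, 100_000, 2)`, stays busy for the full period with no stall.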

yoshuawuyts commented 2 years ago

It seems like the conversation has reached a standstill once more. There appear to be two opposing, well-argued for positions, and it doesn't appear we'll automatically come to a resolution. The way I see it, we now have several options to choose from:

  1. We punt the discussion on the implementation detail, and move forward to FCP with the implementation as-is. What's currently being discussed does not affect the end-user API, and can freely be adjusted post-stabilization as well.
  2. @the8472 files a PR implementing their proposed changes, and once we've merged that we move to FCP. Similar to option 1, we're at liberty to change the underlying implementation as long as the API and documented guarantees don't change.
  3. If neither of these options seems acceptable, the libs team schedules this as a discussion topic on the agenda of one of the sync meetings, with the explicit goal of figuring out how to proceed.

@joshtriplett @the8472 I'd love to hear from you!

the8472 commented 2 years ago

Well, I'm fine with 1 or 2, as long as adding the cgroup quotas is an acceptable change in itself. But that acceptability is where the disagreement with Josh lies, I believe. If we need a more complex API that distinguishes between steady-state and burstable parallelism, or perhaps fractional CPU resources, then maybe the current API shouldn't be stabilized.

joshtriplett commented 2 years ago

@the8472 That's a reasonably convincing argument, thank you. I do think there are workloads (especially with bursting) where it makes sense to use all available CPUs, if you expect the load on services on the same machine to be mostly uncorrelated. However, I find your point about the 99% latency sufficiently compelling to suggest that that should be the default behavior. (We should document that behavior clearly, though.)

So, I'd be on board with the plan of:

joshtriplett commented 2 years ago

@rust-lang/libs-api, shall we stabilize std::thread::available_parallelism(), and also plan to merge future enhancements to it to support cgroup bandwidth limits?

@rfcbot merge

rfcbot commented 2 years ago

Team member @joshtriplett has proposed to merge this. The next step is review by the rest of the tagged team members:

No concerns currently listed.

Once a majority of reviewers approve (and at most 2 approvals are outstanding), this will enter its final comment period. If you spot a major issue that hasn't been raised at any point in this process, please speak up!

See this document for info about what commands tagged team members can give me.

Amanieu commented 2 years ago

The main question this API should answer is: how big should I make my thread pool for CPU-bound tasks? In that case it's better to overestimate (costs memory) than to underestimate (missing out on potential parallelism). Essentially, this function should return the maximum potentially available parallelism.

the8472 commented 2 years ago

What kind of underestimate are you worried about?

cuviper commented 2 years ago

In this case it's better to overestimate (costs memory) than underestimate (missing out on potential parallelism).

Overestimating also costs performance, like cache contention and kernel scheduling overhead.

strohel commented 2 years ago

Overestimating also costs performance, like cache contention and kernel scheduling overhead.

To support this with a data point: earlier I benchmarked a simple actix-web based microservice. Overestimating the number of worker threads (from 1 to 4) resulted in ~50% increase of CPU time per request.

I can re-run the benchmark with up-to-date versions and different parameters if requested.

rfcbot commented 2 years ago

:bell: This is now entering its final comment period, as per the review above. :bell:

the8472 commented 2 years ago

#91057 updates the documentation in anticipation of cgroups support.

rfcbot commented 2 years ago

The final comment period, with a disposition to merge, as per the review above, is now complete.

As the automated representative of the governance process, I would like to thank the author for their work and everyone else who contributed.

This will be merged soon.

jethrogb commented 2 years ago

How was I, as a platform maintainer, supposed to know that this platform-specific functionality was added and an implementation would be needed for the platform I maintain, prior to reading about this in the release notes?

yoshuawuyts commented 2 years ago

@jethrogb I'm sorry to hear you feel you should've been notified about this, but weren't. Perhaps there's indeed room for improvement here. The best place to discuss this would likely be on the libs team stream on the rust-lang Zulip. Additionally, some existing ways of tracking project changes such as these are:

I hope you find this helpful!

yaahc commented 2 years ago

How was I, as a platform maintainer, supposed to know that this platform-specific functionality was added and an implementation would be needed for the platform I maintain, prior to reading about this in the release notes?

We've been discussing this in the libs team meetings recently, specifically the relationship between the libs team and various platform maintainers and what kind of support we should be providing. The most recent discussion was around specific platforms and making sure that, among other things, every platform we support has a ping group so we can notify its maintainers of relevant PRs. We didn't discuss issues that affect all platforms, but this seems like a natural extension of that conversation; we'd likely also need a ping group to notify all maintainers of tier 2 and tier 3 platforms whenever new APIs are introduced that platform maintainers need to implement.

I've gone ahead and nominated this for discussion in our meeting next week. If you have any other suggestions or feedback on these plans please note them in the zulip stream for our upcoming meeting: https://rust-lang.zulipchat.com/#narrow/stream/259402-t-libs.2Fmeetings/topic/Meeting-2022-03-02, and if you're able to please feel welcome to attend the meeting and participate in the discussion when it comes up.