rust-lang / wg-async

Working group dedicated to improving the foundations of Async I/O in Rust
https://rust-lang.github.io/wg-async/
Apache License 2.0

Detect and prevent blocking functions in async code #19

Open betamos opened 3 years ago

betamos commented 3 years ago

In order to write reliable and performant async code today, the user needs to be aware of which functions are blocking, and either:

Determining if a function may block is non-trivial: there are no compiler errors, warnings or lints, and blocking functions are not isolated to special crates but rather unpredictably interspersed with non-blocking, async-safe synchronous code. As an example, most of std can be used in async code, but much of std::fs and std::net cannot. To make matters worse, this failure mode is notoriously hard to detect: it often compiles and runs fine when the executor is under a small load (such as in unit tests), but can cause severe application-wide bottlenecks when load is increased (such as in production).
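A tiny illustration of this failure mode, using a hand-rolled single-future executor as a stand-in for a real runtime (names like `sneaky_blocking` are made up for the example):

```rust
use std::future::Future;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::time::{Duration, Instant};

// Minimal single-future executor, just enough to drive the example.
struct NoopWaker;
impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(fut);
    loop {
        if let Poll::Ready(v) = fut.as_mut().poll(&mut cx) {
            return v;
        }
    }
}

// Compiles with no error, warning, or lint, yet it stalls the whole
// executor thread for 50 ms when polled. Blocking std::fs / std::net
// calls behave the same way.
async fn sneaky_blocking() -> u32 {
    std::thread::sleep(Duration::from_millis(50));
    7
}

fn main() {
    let start = Instant::now();
    let v = block_on(sneaky_blocking());
    assert_eq!(v, 7);
    // The "async" task held the executor thread hostage the entire time:
    assert!(start.elapsed() >= Duration::from_millis(50));
}
```

Under light load (one task, as here) nothing visibly goes wrong, which is exactly why the problem tends to surface only in production.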

For the time being, we tell our users that if a sync function uses IO, IPC, timers or synchronization it may block, but such advice adds mental overhead and is prone to human error. I believe it's feasible to find an automated solution to this problem, and that such a solution delivers tangible value.

Proposed goal

Possible solutions

I am not qualified to say, and I would like your thoughts! A couple of possibilities:

Challenge 1: Blocking is an ambiguous term

Typically, blocking is either the result of a blocking syscall or an expensive compute operation. This leaves some ambiguous cases:

Ambiguity aside, there are many clear-cut cases (e.g. std::thread::sleep, std::fs::File::write and so on) which can benefit from a path forward without the need for bike-shedding.

Challenge 2: Transitivity

If fn foo() { bar() } and bar() is blocking, foo() is also blocking. In other words, blocking is transitive. If we can apply the annotation transitively, more cases can be detected. OTOH, this can be punted for later.
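A minimal sketch of the transitivity problem (function names are illustrative):

```rust
use std::time::{Duration, Instant};

// bar() blocks: it parks the calling thread.
fn bar() {
    std::thread::sleep(Duration::from_millis(10));
}

// foo() never sleeps itself, but it calls bar(), so it blocks too.
fn foo() {
    bar();
}

async fn handler() {
    // Neither call is flagged today. A transitive lint would need to
    // propagate "blocking" from bar() up through foo() to this call site.
    foo();
}

fn main() {
    let start = Instant::now();
    foo();
    assert!(start.elapsed() >= Duration::from_millis(10));
    let _ = handler(); // constructing the future is fine; polling it would block
}
```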

Challenge 3: Traits

When dynamic dispatch is used, the concrete method is erased. Should annotations be applied to trait method declaration or in the implementation?
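A sketch of the dilemma, with a hypothetical `#[blocking]` annotation referenced only in comments (the attribute does not exist today):

```rust
use std::time::Duration;

trait Storage {
    // Should a hypothetical #[blocking] annotation live here, on the
    // trait declaration...
    fn load(&self) -> Vec<u8>;
}

struct DiskStorage;
impl Storage for DiskStorage {
    // ...or here, on the one implementation that actually blocks?
    fn load(&self) -> Vec<u8> {
        std::thread::sleep(Duration::from_millis(5)); // stand-in for real disk I/O
        vec![1, 2, 3]
    }
}

struct MemStorage;
impl Storage for MemStorage {
    fn load(&self) -> Vec<u8> {
        vec![4, 5, 6] // never blocks
    }
}

async fn handler(store: &dyn Storage) -> Vec<u8> {
    // At this call site the concrete type is erased: a lint cannot tell
    // whether this dispatches to the blocking or the non-blocking impl.
    store.load()
}

fn main() {
    let stores: Vec<Box<dyn Storage>> = vec![Box::new(DiskStorage), Box::new(MemStorage)];
    assert_eq!(stores[0].load(), vec![1, 2, 3]);
    assert_eq!(stores[1].load(), vec![4, 5, 6]);
    let _ = handler(&*stores[0]); // construct the future; not polled here
}
```

Annotating the trait declaration taxes every implementation (even in-memory ones); annotating only the impl is invisible through `dyn Storage`.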

Challenge 4: What is an "async context"?

The detection should only apply in "async contexts". Does the compiler already have a strict definition for that? Examples from the top of my head:

Background reading

LucioFranco commented 3 years ago

While I think this idea has some merit, I don't think it's the right path to take. It forces us to encode things like what is blocking and what an async context is, when in reality we have no abstract way to really do this.

Also note the challenge of things like TcpStream::write, which sometimes blocks and sometimes is non-blocking depending on a socket flag. I'm not sure how we could encode this into the type system.
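For reference, the flag in question is the socket's non-blocking mode, settable via the real std::net::TcpStream::set_nonblocking API. A small sketch showing the very same method switching behavior at runtime (using read for brevity; write behaves analogously once the send buffer fills):

```rust
use std::io::{ErrorKind, Read};
use std::net::{TcpListener, TcpStream};

fn main() -> std::io::Result<()> {
    // Loopback listener on an OS-assigned port, so the example is self-contained.
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let addr = listener.local_addr()?;
    let mut stream = TcpStream::connect(addr)?;
    let _server = listener.accept()?; // complete the connection

    // By default the socket is blocking: a read would park the thread until
    // data arrives. After this call, the same read returns WouldBlock instead.
    stream.set_nonblocking(true)?;

    let mut buf = [0u8; 16];
    match stream.read(&mut buf) {
        Err(e) if e.kind() == ErrorKind::WouldBlock => {
            println!("read returned WouldBlock instead of parking the thread");
        }
        other => panic!("unexpected result: {:?}", other),
    }
    Ok(())
}
```

Because the flag is pure runtime state, nothing in the type of `TcpStream` distinguishes the two modes, which is the encoding problem being pointed out.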

This is a big reason why #[must_not_await] is a lint and not a hard error. We mostly want to make the user aware and let them make the choice based on the situation they are in.

I think before we can move ahead with something like this we need to as a community define:

I think these questions are going to be quite hard to answer, since each runtime kind of does its own thing.

I personally think a better approach is to tackle this not from a language perspective but from an education perspective. Why do users make this mistake? How can we teach them properly about blocking? Maybe that includes a very simple lint that links to docs?

betamos commented 3 years ago

Thanks for the feedback! I'll clarify my perspective briefly, and leave the rest for further discussion.

Also note the challenge of things like TcpStream::write, which sometimes blocks and sometimes is non-blocking depending on a socket flag.

Thanks, added to the list of ambiguous cases.

I'm not sure how we could encode this into the type system. [...] This is a big reason why #[must_not_await] is a lint and not a hard error. We mostly want to make the user aware and let them make the choice based on the situation they are in.

To be clear, I'm not suggesting type system changes. A lint was my initial thought too. Another idea is at-runtime executor and debugging tools that can report latency spikes.

It forces us to encode things like what is blocking [...] I think before we can move ahead with something like this we need to as a community define [...]

Emphasis added to highlight the following point: Currently, users have to make this distinction on a case-by-case basis. Our (Fuchsia's) internal research shows that async alone adds significant complexity, even more so than e.g. lifetimes and the type system. Unlike those, making mistakes with blocking has an almost non-existent feedback system - no compiler errors, no test failures. In short, users are on their own.

I personally, think a better approach is to not try to tackle this from a language perspective but an education perspective.

Education is certainly part of the solution (it's something I'm actively working on), but I argue education and automated tooling are not mutually exclusive: lifetimes, for example, are thoroughly covered in learning material and also have highly actionable, precise compiler errors.

ids1024 commented 3 years ago

Ambiguity aside, there are many clear-cut cases (e.g. std::thread::sleep, std::fs::File::write and so on) which can benefit from a path forward without the need for bike-shedding.

Disk IO seems to be a clearly blocking operation. But... what if it's a read/write in somewhere like /sys or /proc? I don't think that should be "blocking" unless all system calls are "blocking".

It probably is good to have tooling to catch potential issues like this, perhaps in something like clippy rather than the compiler. But it is fairly complicated to determine what should trigger the lint.

tmandry commented 3 years ago

I think this is pretty much guaranteed to be a lint, since it will be best effort: catching all cases of a blocking function being called would be impractical. The type system is not sufficient for tracking something like this, as @LucioFranco pointed out.

As for what an "async context" is, I think the simplest answer is probably the correct one. I suggest we focus on calls that happen directly in an async fn, async block, and possibly the poll fn of a Future impl.

This also means the lint would be significantly more useful with the ability to annotate user functions with an attribute, like #[blocking], that would activate the lint when they are called in an async context. Programmers can then use contextual information (like how sockets are configured) when deciding whether or not to use an annotation.

We could prototype something like this in clippy with a hardcoded list of functions to warn on. That would be useful, but I don't think we could implement something like the #[blocking] attribute in clippy.

tmandry commented 3 years ago
  • What is blocking? How does this change based on executor/use case/OS?

I think different programs (and libraries) will have different "degrees of caring," where for example I might care about disk I/O but not println. On the other hand, some latency-critical applications might care a lot about println. I think we may want individual lints for different categories, with a few "umbrella" lints you can allow/deny all at once.

Disk IO seems to be a clearly blocking operation. But... what if it's a read/write in somewhere like /sys or /proc? I don't think that should be "blocking" unless all system calls are "blocking".

If you want errors on disk I/O (say you've enabled a blocking_io lint), then you can always #[allow(blocking_io)] in certain places where you know what you're doing.

yoshuawuyts commented 3 years ago

I think different programs (and libraries) will have different "degrees of caring," where for example I might care about disk I/O but not println.

This seems true only for as long as the alternative is harder to use. If the stdlib included both blocking and non-blocking println implementations, recommending to always use the non-blocking version would be less controversial. I'm not sure how much we'd gain by introducing multiple lints.

An issue I've seen crop up several times with async-std users is confusion around using thread-related functionality inside async blocks. A #[blocking] attribute would have warned these people that they were doing something they probably shouldn't, saving them time and frustration.

I think it would be important to apply the attribute to the following functionality, both in the stdlib and as a recommendation for library authors:

Algorithms seem too hard to categorize broadly and should always be manually assessed.

Overall I'm very much in favor of introducing such a lint; I think this will be incredibly helpful for people working in async Rust!

LucioFranco commented 3 years ago

Disk IO seems to be a clearly blocking operation.

I would argue that actually calling File::read is totally fine in an async executor. Generally, it's quite bounded; the issue is mostly unbounded blocking.

any form of non-async locking

Using non-async locking is a totally normal thing in async applications. But say I add async locking, then add a #[allow(blocking)] to remove the warning because I know this is correct. Since we can't apply the lint to the specific line, only to the module, if I later add a File::read or some other blocking operation I will no longer get the benefits of this lint.

My general worry is people leaning too heavily on this. I feel like it's okay to lean heavily on must_use since that generally works quite well, but in this situation it doesn't always work well.

I personally think effort should go into documentation around this, and into infrastructure to help runtimes detect/expose slow futures that may be doing blocking work. The lint to me personally feels a bit costly for the benefit, whereas I think we could potentially invest in other areas with much better return.

Education is certainly part of the solution (it's something I'm actively working on), but I argue education and automated tooling are not mutually exclusive.

I agree, though we do have limited time and resources. Maybe it's just me who doesn't see this being as useful. But I think I've gotten that point across enough now.

Happy to move forward with this if people generally think it's useful. I personally plan on exploring how, as a runtime, we can expose these implicit relationships.
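One hedged sketch of what such runtime-side infrastructure could look like: a wrapper future (all names hypothetical, not any real runtime's API) that times each poll and counts polls that blow a latency budget. Real runtimes would log a warning with task details instead of counting.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::time::{Duration, Instant};

// Wrapper future that measures how long each poll() takes and counts
// polls exceeding a per-poll budget.
struct SlowPollDetector<F> {
    inner: Pin<Box<F>>,
    budget: Duration,
    violations: u32,
}

impl<F: Future> Future for SlowPollDetector<F> {
    type Output = (F::Output, u32);
    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        let this = self.get_mut(); // fine: all fields are Unpin
        let start = Instant::now();
        let result = this.inner.as_mut().poll(cx);
        if start.elapsed() > this.budget {
            this.violations += 1; // a real runtime might log a backtrace here
        }
        match result {
            Poll::Ready(v) => Poll::Ready((v, this.violations)),
            Poll::Pending => Poll::Pending,
        }
    }
}

// Minimal executor, just enough to drive the example.
struct NoopWaker;
impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(fut);
    loop {
        if let Poll::Ready(v) = fut.as_mut().poll(&mut cx) {
            return v;
        }
    }
}

fn main() {
    let task = async {
        std::thread::sleep(Duration::from_millis(20)); // simulated accidental blocking
        "done"
    };
    let detector = SlowPollDetector {
        inner: Box::pin(task),
        budget: Duration::from_micros(100), // a tight per-poll budget
        violations: 0,
    };
    let (out, violations) = block_on(detector);
    assert_eq!(out, "done");
    assert_eq!(violations, 1); // the single poll blew the budget
}
```

Unlike a lint, this catches blocking regardless of where it originates, but only after the fact and only under representative load.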

tmandry commented 3 years ago

Here's a relevant blog post by @Darksonn, which suggests a rule of thumb of spending no more than 10-100 usec between await points. (It also says that it depends on your application, but I think it's an interesting guideline nonetheless.)

https://ryhl.io/blog/async-what-is-blocking/

lahwran commented 3 years ago

My thinking through this: if I were writing a tool to analyze my async code, I'd want to be able to do something along the lines of querying the entire function call tree for any call that could potentially take too long for its context, where the deadlines for a given context vary based on the executor in use. So, I'd want a way to run some sort of worst-case profiler - what are the worst plausible cases for the functions I ran?

For some of these functions, such as reading from the filesystem or the network, the runtime may not be bounded by the Rust code at all: a potentially arbitrarily large amount of external work may need to occur. For those, I'd want to ensure they never get called in an async context, full stop, and I'd want that kind of function tagged with something like the name #[may_block] and then ideally scanned recursively through the call tree.

For other functions, such as ones that do context-dependent amounts of computation inside the Rust process, I'd want a way to characterize the timing variation of that function so that I can ensure it won't surprise me with a long runtime. That could potentially get as complex as a static analysis which attempts to prove a bound on the function's runtime and then warns me if the bound is not reliably short enough, or if it was unable to prove halting. However, that's a rather large amount of work - work which would likely be worth it, especially with Z3 in the world to make it possible, but which I certainly don't have the available time to figure out and do at the moment.

I would argue that actually calling File::read is totally fine in an async executor. Generally, its quite bounded, the issue is mostly unbounded blocking.

This seems like it might support a case for providing a field in the attribute to specify the warning message. I agree that File::read() would usually be fine, but how fine it is varies based on read size and where you're reading from, and it's not hard to get yourself into a situation where it's not fine and async or threaded file io is needed (of course, last I checked, it's not trivial to set up a filesystem that will support async file io!).

I think ultimately the sort of tool you need to solve this problem needs to be more powerful than local compiler lints, but it could go a long way to tag functions which are known to have high latency-worst-cases. Come to think of it, what if the lint allowed specifying estimated best case and worst case bounds? sketching an unreasonable extreme (I forget what the limitations of attribute syntax are - as flexible as a macro, iirc?):

#[may_block(
    milliseconds: "fastest I've ever seen it run", 
    unbounded: "may potentially take as much as 5 minutes if dumbsynclib's internal connection to irc times out")]
fn expensive_and_unpredictable() {}

#[may_block(microseconds: "simple cases", milliseconds: "bounded by max dependency list size")]
fn run_dependency_solver() {}

// or maybe only an upper bound:
#[may_block(seconds: "constant time cpu bound operation")]
fn calculate_password_hash() {}

#[may_block(milliseconds: "waits for gpu to catch up, guaranteed to be less than 30ms unless driver crashes")]
fn sync_gpu() {}

Then, in fantasyland where we can add this complexity to allow(): in main.rs, specify #[allow(may_block(any, microseconds))] to allow any upper bound up to microseconds, and in an async function where you're using something that internally calls File::read(), specify #![allow(may_block(microseconds, unbounded))] to state that the lower bound must stay low but the upper bound is currently irrelevant. Or something. Realistically, I'm guessing this would need severe whittling before it could be workable.

betamos commented 3 years ago

I committed to @yoshuawuyts to write an RFC for this issue. Looking at it again, I would ideally like to see this prototyped in Clippy beforehand, in order to get a feel for the workflow and build a stronger case. An MVP, in my mind, would work the following way:

Basically, such an MVP would have close to zero false positives. I tried a while ago but got sidetracked before I had a working solution. If anyone has a concrete strategy for how to parse the "immediate async context" I'd be delighted! I'm also down to discuss and review anything on this topic.
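To make the "immediate async context" question concrete, here is one reading of which call sites such an MVP would and would not flag (illustrative only; this is not an established compiler definition):

```rust
use std::time::{Duration, Instant};

// Stand-in for an entry on the hardcoded blocklist (e.g. std::thread::sleep).
fn blocking_call() {
    std::thread::sleep(Duration::from_millis(1));
}

async fn examples() {
    // 1. Call directly in the async fn body: inside the immediate async
    //    context, so the MVP lint should fire here.
    blocking_call();

    // 2. Call inside a closure defined here: NOT immediate. The closure
    //    might be handed off to something like spawn_blocking and run on
    //    another thread, so a zero-false-positive MVP should stay silent.
    let _deferred = || blocking_call();

    // 3. Call inside a nested async block: immediate again, since the
    //    block runs on the same task when awaited here.
    async { blocking_call() }.await;
}

fn main() {
    let start = Instant::now();
    blocking_call();
    assert!(start.elapsed() >= Duration::from_millis(1));
    let _ = examples(); // construct the future; this compiles without polling it
}
```

Case 2 is what makes "immediate" the key word: descending into closures or nested items reintroduces false positives, so the MVP stops at the first change of context.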

betamos commented 3 years ago

Oh, and @tmandry just informed me that there's already an open issue in Clippy for this.

ibraheemdev commented 2 years ago

C# lints when there are no awaits in an async function.