http-rs / tide

Fast and friendly HTTP server framework for async Rust
https://docs.rs/tide
Apache License 2.0

Dealing with {dead,live}locked applications #264

Open Nemo157 opened 5 years ago

Nemo157 commented 5 years ago

This is mostly an application design issue, but maybe there are ways in which Tide can help users to design their application, or provide middleware that can detect/help resolve issues.

There are a few ways an application can get itself {dead,live}locked, e.g. deadlocking while acquiring multiple mutexes from the application State, or panicking while holding a mutex so that all future requests instantly error with a PoisonError.
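For illustration (my sketch, not from the issue, with made-up handler and field names), the lock-ordering variant of the first case looks something like this:

```rust
use std::sync::Mutex;

struct State {
    accounts: Mutex<u64>,
    audit_log: Mutex<Vec<String>>,
}

// Handler A locks `accounts` then `audit_log`; handler B locks them in the
// opposite order. If two requests hit these handlers concurrently, each can
// end up holding one lock while waiting forever for the other: a classic
// lock-ordering deadlock that also stalls every later request touching
// either lock.
fn handler_a(state: &State) {
    let _accounts = state.accounts.lock().unwrap();
    let _log = state.audit_log.lock().unwrap();
    // ... do work ...
}

fn handler_b(state: &State) {
    let _log = state.audit_log.lock().unwrap();
    let _accounts = state.accounts.lock().unwrap();
    // ... do work ...
}
```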

A common way to deal with this in today's landscape of service supervisors is to have a "health check" endpoint which verifies whether the service is currently usable; a supervisor (such as Kubernetes) then uses this endpoint to decide whether an instance of the service should be taken out of rotation and restarted. Tide could potentially provide pieces of a solution like this to make it easier for developers, or include examples of how such an endpoint could be designed to guide them.
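To make the idea concrete, here is a rough sketch of such an endpoint, written against a later Tide API (roughly the 0.16-era tide::with_state / Request::state style rather than the Context-based API current when this issue was filed) and assuming async-std with the attributes feature; the State shape is made up for illustration:

```rust
use std::sync::{Arc, Mutex, TryLockError};

#[derive(Clone)]
struct State {
    the_thing: Arc<Mutex<u64>>,
}

#[async_std::main]
async fn main() -> tide::Result<()> {
    let state = State { the_thing: Arc::new(Mutex::new(0)) };
    let mut app = tide::with_state(state);

    // Health check: report 503 if the shared mutex is poisoned, so a
    // supervisor (e.g. Kubernetes) can take this instance out of rotation
    // and restart it.
    app.at("/healthz").get(|req: tide::Request<State>| async move {
        let status = match req.state().the_thing.try_lock() {
            Err(TryLockError::Poisoned(_)) => tide::StatusCode::ServiceUnavailable,
            _ => tide::StatusCode::Ok,
        };
        Ok(tide::Response::new(status))
    });

    app.listen("127.0.0.1:8080").await?;
    Ok(())
}
```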

(split off from https://github.com/rustasync/tide/issues/263#issuecomment-496177676)

prasannavl commented 5 years ago

I think it'd be really cool if we could have this as part of an even larger middleware that provides more common instrumentation, including custom metrics (similar to https://golang.org/pkg/expvar/).
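For reference, the core of the expvar idea translated to Rust could be a framework-agnostic registry of counters that instrumentation middleware updates and a /debug/vars-style endpoint serializes; this is only an illustrative sketch, not a proposed API:

```rust
use std::collections::BTreeMap;
use std::sync::atomic::{AtomicU64, Ordering};

// expvar-style registry: named counters updated by middleware and exposed
// as JSON by a debug endpoint (like Go's /debug/vars).
struct Vars {
    counters: BTreeMap<&'static str, AtomicU64>,
}

impl Vars {
    fn new(names: &[&'static str]) -> Self {
        Vars {
            counters: names.iter().map(|&n| (n, AtomicU64::new(0))).collect(),
        }
    }

    fn incr(&self, name: &str) {
        if let Some(c) = self.counters.get(name) {
            c.fetch_add(1, Ordering::Relaxed);
        }
    }

    fn to_json(&self) -> String {
        let fields: Vec<String> = self
            .counters
            .iter()
            .map(|(k, v)| format!("\"{}\": {}", k, v.load(Ordering::Relaxed)))
            .collect();
        format!("{{{}}}", fields.join(", "))
    }
}
```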

mehcode commented 5 years ago

Since we're talking about PoisonError and parking_lot: as a PSA, please look at futures::lock::Mutex. std::sync::Mutex is under the std::sync umbrella and is for synchronization between threads, not between tasks.

Usage becomes:

```rust
// in request
let the_thing = cx.state().the_thing.lock().await;
```

No PoisonError to be concerned with.
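Spelled out a bit more (an illustrative sketch using the same made-up State/the_thing names as the snippet above):

```rust
use std::sync::Arc;
use futures::lock::Mutex; // task-aware mutex from futures 0.3: lock() returns a future

#[derive(Clone)]
struct State {
    the_thing: Arc<Mutex<Vec<String>>>,
}

// Inside a request handler the guard is acquired with .await, so a
// contended lock parks the task instead of blocking the executor thread,
// and the Ok/Err PoisonError dance disappears from the call site.
async fn handler(state: &State) {
    let mut the_thing = state.the_thing.lock().await;
    the_thing.push("handled a request".to_string());
} // guard dropped here; dropping it during a panic does not poison the lock
```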


This is probably out of scope for Tide and more something that Rust itself should be doing, but IO-blocking behavior in an async fn can be a huge footgun (as it's not always obvious). Perhaps some kind of instrumentation in debug Rust builds could at least issue runtime warnings.
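A minimal illustration of that footgun (my example, not from the comment), assuming an async-std executor:

```rust
use std::time::Duration;

// Footgun: a blocking call inside an async fn occupies the executor thread
// for the whole second, starving every other task scheduled on it.
async fn bad() {
    std::thread::sleep(Duration::from_secs(1));
}

// Instead, use an async-aware sleep (or push genuinely blocking work onto a
// dedicated thread pool, e.g. async_std::task::spawn_blocking).
async fn good() {
    async_std::task::sleep(Duration::from_secs(1)).await;
}
```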

Nemo157 commented 5 years ago

No PoisonError to be concerned with.

Unless the waker panics while attempting to wake another task after we abandon acquiring the mutex; that will poison the std::sync::Mutex protecting the list of wakers.

The scope of poisoning is much smaller, since futures::lock::MutexGuard doesn't poison when dropped during a panic, but there are still situations that can poison a futures::lock::Mutex and result in live-locking all future requests that rely on it.

Probably would still be good to have a better example of live-locking that doesn't involve such unlikely situations or incorrect synchronization primitives (futures::lock really could be futures::sync; it contains utilities for synchronization between tasks).
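One candidate (mine, not from the thread): with a task-aware mutex you can live-lock every later request just by holding the guard across an .await that never completes; nothing is poisoned, the lock is simply never released.

```rust
use std::sync::Arc;
use futures::lock::Mutex;

// Hold the guard across an .await that never resolves (e.g. waiting on a
// reply that never arrives). Every subsequent state.lock().await now waits
// forever: a live-lock without any panic or PoisonError involved.
async fn handler(state: Arc<Mutex<u64>>) {
    let _guard = state.lock().await;
    futures::future::pending::<()>().await;
}
```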

mehcode commented 5 years ago

Unless the waker panics while attempting to wake another task after we abandon acquiring the mutex; that will poison the std::sync::Mutex protecting the list of wakers.

Fair enough. I had meant that you don't need to deal with the potential error return, but it seems it's a panic instead... not sure I like that. I didn't review the lock that closely.


Probably would still be good to have a better example [...]

Totally agree that good examples should be written. I do feel mutexes in general (even futures-aware ones) are weird in async code. You generally want a lock-free data structure if anything; state is bad (tm) and all that.
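For the simple cases, that might just mean an atomic instead of a mutex (illustrative sketch):

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

#[derive(Clone)]
struct State {
    // Lock-free counter: no guard to hold across .await points, nothing to
    // poison, nothing to deadlock on.
    hits: Arc<AtomicU64>,
}

fn record_hit(state: &State) -> u64 {
    state.hits.fetch_add(1, Ordering::Relaxed) + 1
}
```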


futures::lock really could be futures::sync; it contains utilities for synchronization between tasks

It was in futures v0.1. Not sure of the reasoning behind renaming the module in 0.3.

yoshuawuyts commented 5 years ago

Notes from triage: this is not a priority at the moment, but we do think we'll eventually want to revisit this. We're keeping the issue open to track the discussion.

skade commented 4 years ago

While I agree with the general notion, I believe the application should not continue running and wait for an external restart. It should immediately crash, letting a supervisor restart it.
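In code, that "crash and let the supervisor restart us" stance might look like aborting as soon as corruption is detected, rather than serving errors forever (illustrative sketch):

```rust
use std::sync::{Mutex, TryLockError};

// If shared state is found poisoned, abort the whole process so the
// supervisor (systemd, Kubernetes, ...) restarts it, instead of keeping
// the instance alive and failing every request.
fn abort_if_poisoned<T>(lock: &Mutex<T>) {
    if let Err(TryLockError::Poisoned(_)) = lock.try_lock() {
        eprintln!("shared state poisoned; aborting for supervisor restart");
        std::process::abort();
    }
}
```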