njsmith commented 7 years ago

Our Lock objects enforce the constraint that they can only be released by the same task that acquired them. ~~We might (probably should) add a version of Semaphore that enforces the same constraint (see discussion in #156).~~ CapacityLimit objects are like Semaphores that enforce the same constraint. We should take advantage of these constraints to provide better error detection. In particular:

Detecting never-released locks

This is pretty trivial, yet useful: if a Task exits while holding a Lock, then this is a bug (that Lock can never be released), and we should alert the user. (I guess technically it's not a bug if this is the only Task that knows about the Lock, but I don't think requiring people to issue a spurious release in this corner case is too onerous if it helps catch errors in general.) This would be pretty straightforward – just keep a record of the lock-or-lock-like objects held by each Task, and then when a task exits, if this set is non-empty, whine loudly.

Not entirely sure if this means "print a warning" or "raise an exception". There's also the option of forcibly releasing the lock, though I'm dubious about the wisdom of carrying on once your distributed system has lost internal consistency... maybe poison the lock so anyone who waits on it gets an exception, and if people are following good Erlang style then they'll all get rebooted and all may be well?

Full-fledged deadlock detection

We can also detect classic deadlocks by tracing the wait-for graph.

This scope would be limited, e.g. we can't detect deadlocks caused by two Tasks that are using two Queues for communication and are both blocked, because we don't have any way to know that that task there that's blocked is the only one that can send on this Queue. Maybe we should! (E.g., by having the concept of binding a Queue to a Task so only that Task can send on it.) But even if all we can detect are mutex-based deadlocks that's still potentially very useful!

There are a lot of possible variants on this. We don't want to add too much overhead; I don't know if tracing the wait-for graph on every contended acquire is fast enough to be acceptable. (It worked for these people, but that's a pretty different context.) It's probably a reasonable thing to try first, though. (One can also do stuff like trigger a trace only after a task has been blocked for N seconds. Though in practice keeping track of this on every contended acquire might be just as expensive as tracing the graph immediately, if the graph trace usually finishes almost immediately.)

Another thing that can be useful to detect is we see one Task acquire A-then-B, while another Task acquires B-then-A. This by itself doesn't prove that there's a deadlock, but it's certainly suspicious, and can help catch latent bugs before they hit.

Again the basic data needed for all of these is the ability to look at a Task and get a list of all the lock and lock-like objects its waiting on.

njsmith commented 7 years ago

This library also has an extremely simple feature: print a stacktrace whenever taking a lock takes too long, period. Which is surprisingly reasonable...

njsmith commented 7 years ago

In working on #181 (and #156, sorta), I realized that threads make this a bit more complicated. In particular, here's the problem I ran into: in run_in_worker_thread, the task that's about to start the thread needs to acquire a token, and then the token has to be released after the thread finishes... which might not be until after the original task has abandoned this and gone off to do something else! So the task releasing the token may not be the same task that acquired it.

Conceptually, I think the way we would want to model this for deadlock-detection purposes is:

tasks and threads are both first-class schedulable entities in the waits-for graph
a task executing run_in_worker_thread is waiting-for the thread
the thread owns the capacity token
a thread executing await_in_trio_thread is waiting-for the task that it spawned

So... I was initially thinking in terms of complicated APIs that let you transfer "ownership" of a lock/token to another task, by materializing it in an object and then passing that object over etc., but this suggests something much simpler. Maybe all we need is like limit.acquire_on_behalf_of(thread_id), limit.release_on_behalf_of(thread_id), where thread_id is an opaque object that will eventually be hooked into the deadlock detection system (or a Task of course). So run_in_worker_thread would do something like:

thread_id = new_thread_id()
# While blocked here, this task is *waiting-for* the limit
# But afterwards, the thread_id *holds* the limit
await limit.acquire_on_behalf_of(thread_id)
spawn_thread_with_id(...)

(If we start re-using threads, #6, then the "thread id" here would become more like a "job id", i.e. it refers to a particular chunk of work done by a thread corresponding to a single call to run_in_worker_thread, not that thread over all time.)

And eventually we'd want to stash the thread_id in a thread-local inside the worker thread, so that {run,await}_in_trio_thread can get at it for deadlock detection.

smurfix commented 4 years ago

While #1085 will take care of global deadlock detection, real-life code probably needs some more intelligence than that, as otherwise no server will ever have a deadlock (its accept call is not going to be affected by the deadlock; also there's the systemd keepalive task).

I don't think that that any timeout-based solution is sufficient, but some "this lock stays locked for more than N seconds" instrumentation (restarting the timer when it's immediately re-acquired by a task that's waiting on it) on selected Locks might be a good first step that should be implementable without much overhead.

python-trio / trio

Deadlock detection #182

Detecting never-released locks

Full-fledged deadlock detection