njsmith opened 5 years ago
Minor tricky bit: our SIGINT handler needs to always hold a `run_sync_soon` handle. But I think we're OK: technically it's sort of true that no program is ever really deadlocked, because the user could always hit control-C; in practice, though, if the only way to wake up your program is to hit control-C, then everyone considers that deadlocked. So we want to ignore this handle. That's easy, because it's buried inside `trio/_core` anyway, so it can cheat and use a special handle that gets left out of the accounting.
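A minimal sketch of what that accounting might look like (all names here are invented for illustration; this is not Trio's actual internal API): handles opened with an internal flag simply never count toward the "could anyone still wake us up?" question the deadlock detector asks.

```python
class EntryHandleRegistry:
    """Toy registry of run_sync_soon handles for deadlock accounting."""

    def __init__(self):
        self._external_handles = 0

    def open_handle(self, *, internal=False):
        # Internal handles (like the SIGINT machinery's) cheat: they are
        # left out of the accounting entirely.
        if not internal:
            self._external_handles += 1
        return _Handle(self, internal)

    def could_be_woken_externally(self):
        # The deadlock detector asks this; True means "don't fire".
        return self._external_handles > 0


class _Handle:
    def __init__(self, registry, internal):
        self._registry = registry
        self._internal = internal
        self._closed = False

    def close(self):
        if not self._closed and not self._internal:
            self._registry._external_handles -= 1
        self._closed = True
```

With this shape, the SIGINT handler's permanently-held handle never blocks detection, while ordinary user handles do until they're closed.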
`wait_all_tasks_blocked`, the autojump clock, and a global deadlock detector; see in particular https://github.com/python-trio/trio/issues/456#issuecomment-499750807

I have a branch to experiment with splitting `run_sync_soon` off into a dedicated object that you open and close. The two big issues I've found so far:
- `BlockingTrioPortal` assumes that `TrioToken`s are freely available, can be passed around, don't need closing, etc. So it's going to need to be reworked. We were already considering reworking it (#810), so possibly that's the first thing to do?
- `TrioEntryHandle` (returned by `trio.hazmat.open_trio_entry_handle()`). This is a super clunky name. Surely there's something better?

Some more interesting issues that occurred to me, around the interaction between deadlock detection and `run_sync_soon`:
`pytest-trio` should eventually get the feature of setting a timeout on your tests, and cancelling the test if the timeout is exceeded (https://github.com/python-trio/pytest-trio/issues/53), in order to avoid hanging on livelocks. This needs to use the system clock, not the Trio clock, because you want the timeout to keep working even when combined with the autojump clock. The simplest way to do this is to use a background thread. This background thread needs the capability to call back into Trio, to cancel things after the timeout expires. BUT, this should not prevent the deadlock detector from working! So `pytest-trio` needs a version of `run_sync_soon` that doesn't block the deadlock detector.
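The background-thread part can be sketched with the stdlib alone. Here `run_sync_soon` and `cancel_test` are plain callables standing in for the real thing (in actual `pytest-trio` this would be something like `TrioToken.run_sync_soon(cancel_scope.cancel)`, using whatever detector-friendly variant comes out of this issue):

```python
import threading


class WatchdogTimer:
    """Wall-clock test watchdog running in a background thread."""

    def __init__(self, timeout, run_sync_soon, cancel_test):
        # threading.Timer fires run_sync_soon(cancel_test) after `timeout`
        # seconds of system-clock time -- immune to the autojump clock.
        self._timer = threading.Timer(timeout, run_sync_soon, args=(cancel_test,))

    def start(self):
        self._timer.start()

    def stop(self):
        # The test finished in time: disarm the watchdog.
        self._timer.cancel()
```

The key point from the discussion is only that the `run_sync_soon` used here must not count as a potential deadlock-breaker.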
It's not ideal, but sometimes you need to use a `__del__` method to trigger something inside Trio. This requires `run_sync_soon`. Again, though, the existence of an object like this should not count as preventing deadlock – especially since in most cases, if your program is deadlocked, the GC will probably never run again, so in fact the `__del__` method can't be called. (And even if it could, it would be super weird design to have the `__del__` method be the thing that enables actual forward progress! Most likely it would just create a task that ran for a few moments to close stuff, then exit, and we'd be back in the deadlock.) So again, we probably need a version of `run_sync_soon` that doesn't block the deadlock detector.
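For concreteness, here's the shape of such a `__del__` user, with `run_sync_soon` again as a plain callable standing in for the hypothetical detector-friendly token:

```python
class LeakGuard:
    """Schedules cleanup from __del__ via a run_sync_soon-style hook."""

    def __init__(self, run_sync_soon, close_fn):
        # close_fn is the synchronous cleanup to schedule inside the loop.
        self._run_sync_soon = run_sync_soon
        self._close_fn = close_fn
        self._closed = False

    def close(self):
        self._closed = True

    def __del__(self):
        if not self._closed:
            # __del__ can't await, so hop back into the event loop instead.
            self._run_sync_soon(self._close_fn)
```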
(A `__del__` that basically works like described above: see #265.)

OTOH, `trio.to_thread.run_sync` is like the canonical example of a case where we do want `run_sync_soon` to block the deadlock detector. But, even here... if the thread uses `trio.from_thread` to re-enter Trio, then it's actually temporarily re-entering the zone where the deadlock detector can see what's happening. So ideally we should, like, temporarily re-enable the deadlock detector, and then block it again when the `trio.from_thread` call returns. Subtle!
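The re-enable/re-block dance above can be modeled as a counter of "tasks whose wakeup must come from outside Trio" (all names invented; this is a toy, not Trio's real bookkeeping):

```python
from contextlib import contextmanager


class DetectorState:
    """Toy counter of tasks whose wakeup must come from outside Trio."""

    def __init__(self):
        self.external_waiters = 0

    @contextmanager
    def thread_call(self):
        # to_thread.run_sync: while the thread runs, progress is invisible
        # to Trio, so this task counts as an external waiter.
        self.external_waiters += 1
        try:
            yield
        finally:
            self.external_waiters -= 1

    @contextmanager
    def reentered_from_thread(self):
        # from_thread.run: temporarily back inside Trio, so the detector
        # can see what's happening again.
        self.external_waiters -= 1
        try:
            yield
        finally:
            self.external_waiters += 1
```

The detector would only be allowed to fire when `external_waiters == 0`, and nesting the two context managers gives exactly the "re-enable, then re-block" behavior.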
A simple solution would be to have a flag you pass when creating a `TrioEntryTicket` (or whatever we call it) that controls whether this particular ticket blocks the deadlock detector or not. That would solve the first two cases. The last case is more subtle – maybe it means we need a way to toggle an entry ticket's deadlock-detector-blocking-ness property (DDBNP) on and off?
Alternatively, I think we could use a trick: when you call `trio.from_thread`, it could close the ticket after re-entering Trio, and then when it returns to the original thread, it could create a new one and pass it back.
This sounds a bit elaborate, but there's actually some theoretical benefit too: I don't think you can have a correct system that relies on toggling the DDBNP from False -> True from outside Trio. There's an inherent race condition: if you're relying on setting it to True to block the deadlock detector from firing, then how do you know that the deadlock detector won't fire just before you set it to True? But if people have to go into Trio to get a deadlock-detector-blocking-ticket, then that automatically gives you a synchronization point.
Of course, if someone has a ticket with DDBNP = False, they can still use that to enter Trio and get a ticket with DDBNP = True, and that has exactly the same issues as toggling a single ticket in-place.
We don't really need to solve this right now – first we need to get the new entry ticket API in place, and then add deadlock detection, and then we can worry about whether toggling this is important.
OK, better idea!

Let's add `inhibit_deadlock_detector=True/False` as an argument to `wait_task_rescheduled` (default `False`). When you set it to `True`, it means that this task might wake up from stuff that Trio can't see, so the deadlock detector shouldn't fire.

(Or maybe call it `could_deadlock=False/True`?)
Then:

- `trio.to_thread` can use a regular token that doesn't inhibit the deadlock detector, and instead pass `inhibit_deadlock_detector=True` when the task is blocked waiting for the thread; this automatically restricts the inhibition to the right moment.
- `open_signal_receiver` still probably wants an option to let you decide whether signal reception counts as something that can break a deadlock. But even if it can, you really only want to inhibit the deadlock detector while a task is actually blocked waiting for a signal – not the whole time the signal receiver exists!
- Simple `__del__` calls automatically don't inhibit the detector.
- The IOManager code is simpler – it just passes `inhibit_deadlock_detector=True` when blocking in `wait_task_rescheduled`, versus my original design where each IOManager had to reimplement the tracking of whether any tasks were blocked on I/O.
So maybe we don't even need to replace `TrioToken`? We still probably want some way to manually inhibit the deadlock detector – a `TrioToken` can be used externally to spawn a whole new task, not just wake up an existing one. But this seems like a pretty rare case. So maybe we should leave `TrioToken` alone, and just document that if you're using it, you should think about whether to use `with trio.hazmat.inhibit_deadlock_detector(): ...` or similar.
(`inhibit_deadlock_detector=True` – or maybe `external_event`?)
Anyway, the concept seems sound enough. However, I'd like to have the opposite option, i.e. the ability to mark a task as non-deadlock-breaking even if it is (or will shortly be) runnable.
Use case: a protocol's keepalive task which runs periodically and sends off an "I am alive" packet. As I understand it, without this option the existence of this task would be sufficient to inhibit deadlock detection.
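The keepalive use case suggests a per-task flag in the opposite direction. A toy model (names invented, echoing the `ignored_by_deadlock_detector` attribute floated later in this thread): a runnable task flagged as ignorable – like a periodic keepalive – doesn't count as evidence that the program can still make real progress.

```python
def looks_deadlocked(tasks):
    # tasks: list of (is_runnable, ignored_by_deadlock_detector) pairs.
    # Deadlocked unless some task is runnable AND not flagged as ignorable.
    return not any(runnable and not ignored for runnable, ignored in tasks)
```

So a program whose only runnable task is an ignorable keepalive would still be reported as deadlocked.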
That's an interesting idea. I haven't thought too much about cases like a keepalive task. For now I've mostly focused on figuring out how to make sure this never gives false positives, since a false positive is "we just blew up your working program" while a false negative is "your broken program could potentially have gotten better diagnostics of how it was broken, but didn't". And I think the biggest value will be for small programs, like simple scripts and for unit tests; I guess large programs will generally have something somewhere that keeps running, even if some subsystem is deadlocked.
But... a small-but-annoying design problem I've been thinking about is how to prevent the wakeup-pipe task from blocking the deadlock detector. It's definitely solvable, since this is a special internal task with privileged access to Trio's internals. But right now it just blocks in `wait_readable`, and `wait_readable` is exposed to users and should normally inhibit the deadlock detector. We could add a secret internal version of `wait_readable` that doesn't inhibit the deadlock detector, but it's a bit inelegant to do that multiple times for the different IO managers and then use some hard-coded hack to let the wakeup-pipe task access it. Your suggestion would in fact be a way to solve this.
Another example where this would be useful would be https://github.com/python-trio/urllib3/issues/125, where if you're using an HTTP library that silently uses background tasks to listen for pings on HTTP/2 and HTTP/3 connections, then having ever used HTTP/2 or HTTP/3 in your program could produce a situation where the deadlock detector never fires. (Depending on how patient the server is.) Not very nice!
I guess the implementation could be pretty trivial too... `wait_task_rescheduled` already has to mess around with the `Task` object. It would be trivial for it to do something like:

```python
async def wait_task_rescheduled(..., inhibit_deadlock_detector=...):
    task = current_task()
    ...
    inhibit_deadlock_detector = (
        inhibit_deadlock_detector and task.can_inhibit_deadlock_detector
    )
    ...
```
After #1588, I think actual deadlock detection could be as simple as:

```python
# in run loop
idle_primed = False
if runner.waiting_for_idle:
    ...
elif runner.clock_autojump_threshold < timeout:
    ...
elif runner.external_wakeup_count == 0 and not runner.guest_mode:
    timeout = 0
    idle_primed = True
    idle_primed_type = IdlePrimedTypes.DEADLOCK
...
if idle_primed and idle:
    if idle_primed_type is IdlePrimedTypes.DEADLOCK:
        ...  # crash with deadlock message
```
And `runner.external_wakeup_count` is maintained by a tool like `trio.lowlevel.not_deadlocked`:
```python
@contextmanager
def not_deadlocked():
    task = trio.current_task()
    if not task.ignored_by_deadlock_detector:
        GLOBAL_RUN_STATE.runner.external_wakeup_count += 1
        try:
            yield
        finally:
            GLOBAL_RUN_STATE.runner.external_wakeup_count -= 1
    else:
        yield
```
IO primitives do `with not_deadlocked(): await wait_task_rescheduled(...)`.

(Maybe it's also convenient to allow a `not_deadlocked=...` kwarg to `wait_task_rescheduled`, as a shorthand? Or whatever name we end up using.)
We would also need to update the autojump clock so that if the autojump is triggered but there is no next deadline, then it runs a version of the deadlock-detected code. Though probably with a custom message.
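The autojump-clock tweak could look something like this (function and callback names are invented for illustration):

```python
from math import inf


def autojump_fired(next_deadline, report_deadlock, jump_clock_to):
    # If every task is blocked and there is no deadline to jump to, nothing
    # will ever wake us: report a deadlock (with its own message) instead
    # of jumping the clock.
    if next_deadline == inf:
        report_deadlock("deadlock: all tasks blocked and no deadlines set")
    else:
        jump_clock_to(next_deadline)
```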
A minor open question: what to do with `open_signal_receiver`. I feel like we want a kwarg to control whether it counts as an external wakeup source (which will need a name), and we also have to pick a default.
This is all looking pretty straightforward. I think the main remaining blockers are getting code to dump the task tree when a deadlock happens (#927), and some way to kill everything with a custom exception (maybe the `KeyboardInterrupt` refactors will give us this, e.g. #1537?).
On further thought, going through the new `idle_primed` stuff is probably more convoluted than necessary. I think it would be enough to do something like this after all the timeout calculations:

```python
if timeout == inf and not runner.external_wakeup_count and not runner.guest_mode:
    ...  # deadlock detected
```
This also requires moving the `timeout = min(max(0, timeout), _MAX_TIMEOUT)` clamp down below all this, and making the default timeout value `inf` instead of `_MAX_TIMEOUT`.
Ugh this will be so cool and it's so easy, I can taste it. But we need to sort out those blockers first.
We have an issue for fancier deadlock detection, and API support to make it more useful (#182). This is about a simpler issue: detecting when the entire program has deadlocked, i.e. no tasks are runnable or will ever be runnable again. This is not nearly as fancy, but it would catch lots of real-world deadlock cases (e.g. in tests), and is potentially wayyy simpler. In particular, I believe a Trio program has deadlocked if:

- `IOManager`
- `wait_all_tasks_blocked`

(Did I miss anything?)
However, there is one practical problem: the `EntryQueue` task is always blocked in the `IOManager`, waiting for someone to call `run_sync_soon`.

Practical example of why this is important: from the Trio scheduler's point of view, `run_sync_in_worker_thread` puts a task to sleep, and then later a call to `reschedule(...)` magically appears through `run_sync_soon`. So... it's entirely normal to be in a state where the whole program looks deadlocked except for the possibility of getting a `run_sync_soon`, and the program actually isn't deadlocked. But, of course, 99% of the time there is absolutely, definitely no `run_sync_soon` call coming. There's just no way for Trio to know that.

So I guess to make this viable, we would need some way to recognize the 99% of cases where there is no chance of a `run_sync_soon`. I think that means we need to refactor `TrioToken` so that it uses an acquire/release pattern: you acquire the token only if you plan to call `run_sync_soon`, and then when you're done with it you explicitly close it.

This will break the other usage of `TrioToken`, which is that you can compare them with `is` to check whether two calls to `trio.run` are in fact the same. Maybe this is not even that useful? If it is, though, then we should split it off into a separate class, so that the only reason to acquire the `run_sync_soon` object is because you're going to call `run_sync_soon`.

Given that, I think we could implement this by extending the code at the top of the event loop like:
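(The snippet that originally followed here didn't survive; this is only a guess at its shape, with invented names.) At the top of the event loop, before sleeping in the IOManager, check whether anything could still wake us:

```python
def should_report_deadlock(runnable_tasks, pending_deadlines, open_tokens):
    # Deadlocked iff: nothing to run, no timeout that will ever fire, and
    # nobody holds an acquired run_sync_soon token.
    return (
        not runnable_tasks
        and not pending_deadlines
        and open_tokens == 0
    )
```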
This is probably super-cheap too, because we only do the extra checks when there are no runnable tasks or deadlines. No runnable tasks means we're either about to go to sleep for a while, so taking some extra time here is "free", or else we're about to block waiting for I/O – but if there's outstanding I/O then you should probably have a deadline set...