See #494. Today, if you evict a workflow that has incomplete coroutines, we simply delete it from the workflow map. This means Python will garbage collect these coroutines at some point in the future. Garbage collection of coroutines involves throwing a GeneratorExit within them, which can cause the coroutine to wake up on any thread to handle this GeneratorExit. Therefore, finally may execute in another thread which may be running another workflow at the time. This is very bad and despite all attempts, we cannot reasonably intercept Python's coroutine garbage collection or the GeneratorExit behavior here.
So we have refactored the eviction process to cancel all outstanding tasks and to ignore or raise during any side effects attempted (e.g. commands). This is similar to other SDKs that have to tear down coroutines. However there are cases where a user may have done something invalid and the cancel may not complete the coroutine. This will log an error, hang the eviction, and will forever use up that task slot. It will also prevent worker shutdown. We can discuss ways to improve this if needed.
This supersedes #325/#341 which we previously thought was good enough to handle GeneratorExit.
What was changed
See #494. Today, if you evict a workflow that has incomplete coroutines, we simply delete it from the workflow map. This means Python will garbage collect these coroutines at some point in the future. Garbage collection of coroutines involves throwing a
GeneratorExit
within them, which can cause the coroutine to wake up on any thread to handle thisGeneratorExit
. Therefore,finally
may execute in another thread which may be running another workflow at the time. This is very bad and despite all attempts, we cannot reasonably intercept Python's coroutine garbage collection or theGeneratorExit
behavior here.So we have refactored the eviction process to cancel all outstanding tasks and to ignore or raise during any side effects attempted (e.g. commands). This is similar to other SDKs that have to tear down coroutines. However there are cases where a user may have done something invalid and the cancel may not complete the coroutine. This will log an error, hang the eviction, and will forever use up that task slot. It will also prevent worker shutdown. We can discuss ways to improve this if needed.
This supersedes #325/#341 which we previously thought was good enough to handle
GeneratorExit
.What changed:
disable_safe_workflow_eviction
which, if set toTrue
, will perform the old behavior of letting GC collect coroutinesChecklist