I'm seeing this on a box w/ 4 T4s, as well as a box with 8x A100s, so it doesn't seem machine- or card-specific.
Could you check the `ucx` version that gets installed as well, and whether, besides the Python backtrace, there are any segmentation faults, assertion errors, or other C backtraces in any of the processes?
Added the UCX version to the issue description.
Could you then check if there are any other traces, as asked above? After that, could you try installing `ucx=1.13.1` and rerunning?
Actually, before downgrading to `ucx=1.13.1`, could you try rerunning with `UCX_RNDV_SCHEME=get_zcopy`?
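In case it helps, one way to set it (a sketch, assuming the variable only needs to be in the environment before UCX is initialized; exporting it in the launching shell is equivalent):

```python
import os

# UCX reads its configuration from the environment when it is initialized,
# so this must run before ucx-py/distributed set up any UCX endpoints.
os.environ["UCX_RNDV_SCHEME"] = "get_zcopy"

from dask_cuda import LocalCUDACluster
from distributed import Client

if __name__ == "__main__":
    cluster = LocalCUDACluster(protocol="ucx")
    client = Client(cluster)
```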
Here's the full trace with `export UCX_RNDV_SCHEME=get_zcopy` (w/ ucx 1.14.0) before running the repro script:
https://gist.github.com/randerzander/16e59b627fc8abf1efa349f819eba735
I'm seeing the same thing with UCX 1.14, with and without `UCX_RNDV_SCHEME=get_zcopy`.
This is an issue at shutdown but shouldn't prevent any usage. We think the issue is a change in connection handling in distributed -- bisecting to find out
https://github.com/dask/distributed/pull/7593 is the offending PR.
EDIT: I had pasted the wrong link before.
> This is an issue at shutdown but shouldn't prevent any usage.
Unfortunately it does impact usage. Simple things like `len(ddf)` work, but more complex jobs (Dask-SQL queries) do not. I'll try to boil it down if that's useful, but it seems like the problem has already been identified?
For the non-trivial failures, can you locally revert https://github.com/dask/distributed/pull/7644 and see if that passes for you?
@randerzander Can you try adding:

```python
import weakref

weakref.finalize(lambda: None, lambda: None)
```

to the top of your script? This must occur before `distributed` has been imported.
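Applied to the reproducer from this thread, that would look like the following (a sketch; the only essential point is that the `weakref.finalize` call happens before the first `distributed` import):

```python
import weakref

# Creating the first finalizer makes the weakref module register its own
# atexit hook now, ahead of the hooks distributed registers on import.
weakref.finalize(lambda: None, lambda: None)

from dask_cuda import LocalCUDACluster
from distributed import Client

if __name__ == "__main__":
    cluster = LocalCUDACluster(protocol="ucx")
    client = Client(cluster)
```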
Yes, that resolved both the cluster shutdown problem and the workload failures!
Thanks for the workaround!
@randerzander can you also check if the following works around the problem for you?
```python
def workaround_7726():
    from distributed._concurrent_futures_thread import _python_exit
    from distributed.client import _close_global_client
    from distributed.deploy.spec import close_clusters

    import distributed
    import atexit
    import weakref

    def shutdown():
        # This mimics a function in distributed.__init__ which we can't
        # get a handle on (because it is del'd), and must be registered
        # after the other functions
        distributed._python_shutting_down = True

    # These functions must be unregistered and then re-registered
    for fn in [_python_exit, _close_global_client, close_clusters]:
        atexit.unregister(fn)
    # So that this finalizer is in an atexit hook after them.
    # Note that atexit handlers are called last-in, first-out.
    # See https://docs.python.org/3/library/atexit.html
    weakref.finalize(lambda: None, lambda: None)
    # And re-register them.
    for fn in [_python_exit, close_clusters, _close_global_client, shutdown]:
        atexit.register(fn)
```
Run this function before you boot a cluster. If this solution works, I would be inclined to use this rather than put out a patch release of `distributed`.
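To make the ordering comment above concrete, here is a tiny standalone illustration (not from the thread) of the last-in, first-out behaviour of `atexit` that the workaround relies on:

```python
import atexit

atexit.register(lambda: print("registered first, runs last"))
atexit.register(lambda: print("registered last, runs first"))

# At interpreter exit this prints:
#   registered last, runs first
#   registered first, runs last
```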
The fact that this has to run before spinning up the cluster suggests it must be implemented by the user. Am I right, @wence-? Did you have success implementing it as part of Dask-CUDA directly (without requiring any user interaction)?
I haven't yet, but we could run it, for example, in `dask_cuda.__init__`.
This works for the minimal reproducer (not sure about cluster restarts):

```python
from dask_cuda import LocalCUDACluster
from distributed import Client

if __name__ == "__main__":
    cluster = LocalCUDACluster(protocol="ucx")
    client = Client(cluster)
    del cluster
    print("shutting down...")
    print(client)
```
```diff
diff --git a/dask_cuda/local_cuda_cluster.py b/dask_cuda/local_cuda_cluster.py
index 656f614..4781048 100644
--- a/dask_cuda/local_cuda_cluster.py
+++ b/dask_cuda/local_cuda_cluster.py
@@ -23,6 +23,35 @@ from .utils import (
 )
 
 
+def workaround_distributed_7726():
+    import atexit
+    import weakref
+
+    import distributed
+    from distributed._concurrent_futures_thread import _python_exit
+    from distributed.client import _close_global_client
+    from distributed.deploy.spec import close_clusters
+
+    def shutdown():
+        # This mimics a function in distributed.__init__ which we can't
+        # get a handle on (because it is del'd), and must be registered
+        # after the other functions
+        distributed._python_shutting_down = True
+
+    # These functions must be unregistered and then re-registered
+    for fn in [_python_exit, _close_global_client, close_clusters]:
+        atexit.unregister(fn)
+    # So that this finalizer is in an atexit hook after them
+    weakref.finalize(lambda: None, lambda: None)
+    # And re-register them.
+    for fn in [_python_exit, close_clusters, _close_global_client, shutdown]:
+        atexit.register(fn)
+
+
+workaround_distributed_7726()
+del workaround_distributed_7726
+
+
 class LoggedWorker(Worker):
     def __init__(self, *args, **kwargs):
         super().__init__(*args, **kwargs)
```
I can confirm that. The more fragile part seems to be that the client must `import dask_cuda`; that appears to be where the problem lies. The consequence is that the `dask cuda worker` CLI alone will not suffice: the client code must also `import dask_cuda`, even if it doesn't otherwise use it.
EDIT: Here's what I tested: if `import dask_cuda` is commented out in `repro.py`, the problem still occurs.
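So with workers launched via the `dask cuda worker` CLI, the client script still needs the import; a hypothetical client-side sketch (the scheduler address is made up):

```python
import dask_cuda  # noqa: F401  # imported only so dask_cuda's atexit workaround runs

from distributed import Client

if __name__ == "__main__":
    # Hypothetical address of an externally launched UCX scheduler.
    client = Client("ucx://10.0.0.1:8786")
```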
Yes, I think the client connections are the problematic ones :(
The client is problematic in the cases we've seen. I suspect @randerzander's case may be happening elsewhere too, but I'm not sure either.
This should now be resolved by the `distributed` 2023.3.2.1 release. Thank you @randerzander for raising this!
With a conda environment using the latest nightlies:

*(repro script and trace from the original report omitted)*