rapidsai / dask-cuda

Utilities for Dask and CUDA interactions
https://docs.rapids.ai/api/dask-cuda/stable/
Apache License 2.0

Increase close timeout of `Nanny` in `LocalCUDACluster` #1260

Closed pentschev closed 1 year ago

pentschev commented 1 year ago

Tests in CI have been failing more often, but those errors can't be reproduced locally. This is possibly related to Nanny's internal mechanism for establishing timeouts to kill processes: under higher load on the CI servers, tasks take longer, and since the process-killing logic accounts for the time already elapsed, the remaining timeout is drastically reduced, leaving little time to actually shut down processes. It is also not possible to programmatically set a different timeout with the existing Distributed API, since SpecCluster._correct_state_internal() currently calls close() without arguments.

Given the limitations described above, this change adds a new class whose sole purpose is to override the timeout of the Nanny.close() method with an increased value; LocalCUDACluster then uses that class via its worker_class argument.
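
A minimal sketch of the approach, for illustration only (the class name and the 120-second value are assumptions here, not necessarily what was merged):

```python
from distributed import Nanny

from dask_cuda import LocalCUDACluster


class IncreasedCloseTimeoutNanny(Nanny):
    """Nanny whose close() defaults to a larger timeout.

    SpecCluster._correct_state_internal() calls close() without arguments,
    so overriding the default is the only way to change the timeout.
    """

    async def close(self, timeout=120.0, **kwargs):
        # Forward the increased default timeout to the regular Nanny.close().
        return await super().close(timeout=timeout, **kwargs)


if __name__ == "__main__":
    # The custom Nanny subclass is plugged in via the worker_class argument.
    cluster = LocalCUDACluster(worker_class=IncreasedCloseTimeoutNanny)
```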

quasiben commented 1 year ago

/merge

pentschev commented 1 year ago

Thanks @quasiben !

csadorf commented 1 year ago

I am probably misunderstanding the underlying issue, but is there a risk that with this patch the tests no longer adequately capture how users would actually use the software, thus reducing their utility?

pentschev commented 1 year ago

> I am probably misunderstanding the underlying issue, but is there a risk that with this patch the tests no longer adequately capture how users would actually use the software, thus reducing their utility?

It is possible that this may occur in real use cases, but I've run the failing Dask-CUDA tests locally thousands of times over a period of many weeks/months and couldn't reproduce the issue once, which leads me to believe it is due to high load in CI. In any case, those issues only occur when the cluster is shutting down, so even if this manifests for users it would be problematic only if they rely on the process' exit code, which is indeed not great, but unfortunately shutdown is a difficult issue to resolve properly in Distributed.