Closed pentschev closed 1 year ago
/merge
Thanks @quasiben !
I am probably misunderstanding the underlying issue, but is there a risk that with this patch the tests no longer adequately capture how users would actually use the software, thus reducing the utility of the tests?
It is possible that this may occur in real use cases, but I've tried locally running the tests that have been failing in Dask-CUDA thousands of times over a period of many weeks/months and I couldn't reproduce this issue even once, which leads me to believe this is due to high load in CI. In any case, those issues only occur when the cluster is shutting down, so even if it manifests for users it would only be problematic if the user relies on the process' exit code, which is indeed not great, but shutdown is unfortunately a difficult issue to resolve properly in Distributed.
Tests in CI have been failing more often, but those errors can't be reproduced locally. This is possibly related to `Nanny`'s internal mechanism for establishing timeouts to kill processes: perhaps because of higher load on the servers, tasks take longer, and the process-killing logic takes the overall elapsed time into account when establishing a timeout, which then ends up drastically reduced and leaves little time to actually shut down processes. It is also not possible to set a different timeout programmatically with Distributed's existing API, which currently calls `close()` without arguments in `SpecCluster._correct_state_internal()`.

Given the limitations described above, this change adds a new class with the sole purpose of overriding the timeout of the `Nanny.close()` method with an increased value; the new class is then used when launching `LocalCUDACluster` via the `worker_class` argument.
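A minimal sketch of that approach, assuming an illustrative subclass name (`IncreasedCloseTimeoutNanny`) and an illustrative timeout value of 120 seconds; the actual names and values used by this change may differ:

```python
from distributed import Nanny

from dask_cuda import LocalCUDACluster


class IncreasedCloseTimeoutNanny(Nanny):
    """Nanny whose ``close()`` defaults to a longer timeout.

    ``SpecCluster._correct_state_internal()`` calls ``close()`` without
    arguments, so changing the default here is the only way to give worker
    processes more time to shut down cleanly.
    """

    async def close(self, timeout: float = 120.0, **kwargs):
        # 120 seconds is an illustrative value, not necessarily the one
        # chosen in this change; other keyword arguments pass through.
        return await super().close(timeout=timeout, **kwargs)


if __name__ == "__main__":
    # Use the subclass for all workers via the ``worker_class`` argument,
    # so every worker's nanny gets the increased close timeout.
    cluster = LocalCUDACluster(worker_class=IncreasedCloseTimeoutNanny)
    ...
    cluster.close()
```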