rapidsai / dask-cuda

Utilities for Dask and CUDA interactions
https://docs.rapids.ai/api/dask-cuda/stable/
Apache License 2.0

Failing tests on `distributed>2023.9.2` #1265

Open pentschev opened 1 year ago

pentschev commented 1 year ago

Two tests are currently failing after removing the distributed==2023.9.2 pin:

FAILED tests/test_local_cuda_cluster.py::test_pre_import_not_found - Failed: DID NOT RAISE <class 'RuntimeError'>
FAILED tests/test_local_cuda_cluster.py::test_death_timeout_raises - Failed: DID NOT RAISE <class 'asyncio.exceptions.TimeoutError'>

We need to bisect to find the source of those regressions, but for now we're xfailing them so the unpinning can go through.
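
A minimal sketch of where such markers would go, assuming pytest's `xfail` mechanism; the reason strings and placement are illustrative, not the actual dask-cuda patch:

```python
# Illustrative only: xfail markers in tests/test_local_cuda_cluster.py
# while the regression is investigated.
import pytest


@pytest.mark.xfail(
    reason="distributed>2023.9.2: errors raised while scaling up are no longer surfaced"
)
def test_pre_import_not_found():
    ...


@pytest.mark.xfail(
    reason="distributed>2023.9.2: errors raised while scaling up are no longer surfaced"
)
def test_death_timeout_raises():
    ...
```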

wence- commented 1 year ago

FAILED tests/test_local_cuda_cluster.py::test_pre_import_not_found

This one seems to be because SpecCluster now swallows any errors raised while scaling up (https://github.com/dask/distributed/issues/8309).
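
A rough approximation (not the actual test code) of the expectation that now fails: the worker's pre-import failure used to propagate as a `RuntimeError` during cluster startup, but with the SpecCluster change the error raised while scaling up is swallowed, so `pytest.raises` sees nothing. The module name below is a placeholder.

```python
import pytest

from dask_cuda import LocalCUDACluster


def test_pre_import_not_found():
    # Placeholder module name; pre-importing it on the worker fails with
    # ImportError, which previously surfaced as a RuntimeError while the
    # cluster started. With the SpecCluster change the error is swallowed,
    # so the block below never raises and pytest reports DID NOT RAISE.
    with pytest.raises(RuntimeError):
        with LocalCUDACluster(n_workers=1, pre_import="not_a_real_module"):
            pass
```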

wence- commented 1 year ago

FAILED tests/test_local_cuda_cluster.py::test_death_timeout_raises

This is the same cause.
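
Likewise, a rough approximation of what that test checks, assuming distributed's `gen_test` helper (the exact arguments differ in the real test): a `death_timeout` too short for workers to connect should surface an `asyncio.TimeoutError`, which is now also swallowed during scale-up.

```python
import asyncio

import pytest
from distributed.utils_test import gen_test

from dask_cuda import LocalCUDACluster


@gen_test()
async def test_death_timeout_raises():
    # With an absurdly small death_timeout the workers cannot come up in
    # time; startup used to fail with a TimeoutError, which is what the
    # test asserts. That error is now swallowed while scaling up.
    with pytest.raises(asyncio.exceptions.TimeoutError):
        async with LocalCUDACluster(death_timeout=1e-10, asynchronous=True):
            pass
```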

pentschev commented 1 year ago

Thanks for digging into that, @wence-.

I'm sure this is not the first time this has been problematic. It's definitely not something we need to do right now, but I'm wondering whether we should do a pass over Dask-CUDA's tests to identify those that could be generalized and submit them as part of Distributed's test suite to prevent regressions there.
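
For illustration, a hypothetical shape such a generalized test could take (the name and parametrization are assumptions, not an existing Distributed test): the same expectation exercised against both LocalCluster and LocalCUDACluster, so a change like dask/distributed#8309 would show up in Distributed's own CI.

```python
import asyncio

import pytest
from distributed import LocalCluster
from distributed.utils_test import gen_test

from dask_cuda import LocalCUDACluster


@pytest.mark.parametrize("cluster_class", [LocalCluster, LocalCUDACluster])
@gen_test()
async def test_death_timeout_raises(cluster_class):
    # Errors raised while the cluster scales up (here a worker startup
    # timeout) should propagate to the caller for any SpecCluster subclass.
    with pytest.raises(asyncio.exceptions.TimeoutError):
        async with cluster_class(
            n_workers=1, death_timeout=1e-10, asynchronous=True
        ):
            pass
```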