csubich opened 7 months ago
I suspect this is a lack of thread safety in dask. In particular, the tracebacks point at internal dask data structures. Is there anything to suggest this is something we could fix from the xarray side?
cc @fjetter @phofl for awareness
What happened?
I have a large-ish weather dataset (a mirrored version of a subset of the WeatherBench2 data), which is stored on-disk as a collection of Zarr datasets, segregated by time. Because processing is intensive and GPU-related, I handle the data in parallel as follows:

1. Open the large dataset with `open_mfdataset`.
2. Create several worker threads to assemble randomly-selected subsets of this data, using slicing on the generated Dask arrays[^1].
3. Use `dask.distributed.Client` in threading-only mode (`processes=False`) to convert the lazy Dask arrays into in-memory arrays with a thread-blocking call to `compute`.
4. Return the realized dataset to the main thread for subsequent processing, via a thread-safe queue.
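The producer side of the steps above can be sketched with the standard library alone. `load_subset` below is a hypothetical stand-in for the slicing-plus-`compute` step (the real code would slice the shared dataset and call `.compute()` under a `dask.distributed.Client(processes=False)`); only the thread-and-queue structure is meant to mirror the actual setup:

```python
import queue
import threading

def load_subset(i):
    # Hypothetical placeholder: in the real workflow this slices the
    # shared xarray dataset and calls .compute() to realize it in memory.
    return {"subset": i}

def worker(indices, out_queue):
    # Each worker realizes its randomly-selected subsets and hands the
    # in-memory results back to the main thread via a thread-safe queue.
    for i in indices:
        out_queue.put(load_subset(i))

results = queue.Queue()
threads = [
    threading.Thread(target=worker, args=(range(n, 8, 2), results))
    for n in range(2)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Drain the queue on the main thread for subsequent processing.
realized = [results.get_nowait() for _ in range(results.qsize())]
```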
When doing so, I intermittently receive two categories of exceptions from dask. The first is a `SystemError`:

while the second is a `KeyNotFound` error:
The computation will often succeed on a subsequent attempt if I catch the exception raised by `.compute` and re-execute the call.

This second error seems to disappear[^2] if I pass `cache=False` to `open_mfdataset`, but the first persists. Both errors seem to disappear if I use only a single worker thread[^3], or if I restrict the data being used to only one file's worth (that is, still open all the files from `open_mfdataset`, but only use entries that I know come from one of those source Zarr trees).

[^1]: I was previously using `.sel` to slice the data in a more semantic way, but I switched to the lower-level slicing to rule out any thread-unsafety inside the non-delayed-computing parts of Xarray when dealing with a shared dataset.

[^2]: The errors are random in exactly the annoying way that threading errors are. One factor that makes them more likely is debugging printouts, which seem to disrupt thread scheduling enough to trigger them more often.
[^3]: Unfortunately, this results in unacceptably slow performance. Loading the data in parallel really does improve throughput. I haven't yet tried using the asynchronous model for the dask client.
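For completeness, the catch-and-retry workaround described above can be sketched as follows. `FlakyLazy` and `compute_with_retry` are illustrative names, not part of xarray or dask; `FlakyLazy` is a toy stand-in for a lazy object whose `.compute()` fails intermittently:

```python
class FlakyLazy:
    # Toy stand-in for a lazy dask-backed object whose .compute()
    # raises intermittently, as in the reported behaviour.
    def __init__(self, failures):
        self.failures = failures

    def compute(self):
        if self.failures > 0:
            self.failures -= 1
            raise SystemError("intermittent dask failure")
        return "realized"

def compute_with_retry(lazy, retries=3):
    # Catch the intermittent exception and re-execute the call,
    # re-raising only once the retry budget is exhausted.
    for attempt in range(retries):
        try:
            return lazy.compute()
        except (SystemError, KeyError):
            if attempt == retries - 1:
                raise

result = compute_with_retry(FlakyLazy(failures=2))
```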
What did you expect to happen?
No response
Minimal Complete Verifiable Example
No response
MVCE confirmation
Relevant log output
No response
Anything else we need to know?
No response
Environment