Open rsignell-usgs opened 4 years ago
That's a strange one. It's likely an issue in distributed rather than rechunker.
What version of dask are you using? I'm not familiar with how SLURMCluster handles environments, but is the version "on the cluster" the same as your client (/home/rsignell/miniconda3/envs/pangeo3/bin/python)?
@TomAugspurger: I installed pangeo-dask 2020.09.19 from conda-forge, so I'm using:
(pangeo3) rsignell@denali-login2:~> conda list dask
# packages in environment at /home/rsignell/miniconda3/envs/pangeo3:
#
# Name Version Build Channel
dask 2.27.0 py_0 conda-forge
dask-core 2.27.0 py_0 conda-forge
dask-gateway 0.8.0 py37hc8dfbb8_0 conda-forge
dask-jobqueue 0.7.1 py_0 conda-forge
dask-kubernetes 0.10.1 py_0 conda-forge
dask-labextension 3.0.0 py_0 conda-forge
pangeo-dask 2020.09.19 0 conda-forge
My understanding is that dask-jobqueue (which provides SLURMCluster) submits jobs using the same environment as the notebook, so all versions should match, right?
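One quick way to verify that is to check package versions across the client, scheduler, and workers once the cluster is up; a minimal sketch using distributed's built-in version check (the SLURMCluster arguments here are illustrative, not the actual job settings):

```python
from dask_jobqueue import SLURMCluster
from dask.distributed import Client

# Illustrative resources; adjust cores/memory to the actual queue and node sizes.
cluster = SLURMCluster(cores=1, memory="6GB")
cluster.scale(jobs=4)
client = Client(cluster)

# Complains if package versions differ between client, scheduler, and workers,
# which would catch workers picking up a different environment.
client.get_versions(check=True)
```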
@TomAugspurger, well, you were right: there is something funky with my use of SLURMCluster. When I tried a LocalCluster, it worked fine.
@TomAugspurger, and now I've discovered the real problem. It wasn't a problem with SLURMCluster either. My problem was that I was specifying the rechunker max_mem to be the same as the memory on my workers. That apparently doesn't work. When I set my worker memory to 6GB and the rechunker max_mem to 4GB, both LocalCluster and SLURMCluster work fine.
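For reference, a minimal sketch of a configuration along those lines (6GB workers, max_mem at 4GB); the paths, chunk sizes, and job count are illustrative rather than the exact notebook setup:

```python
from dask.distributed import Client
from dask_jobqueue import SLURMCluster
from rechunker import rechunk
import zarr

# Workers get 6GB each; rechunker's max_mem stays well below that (4GB worked here).
cluster = SLURMCluster(cores=1, memory="6GB")
cluster.scale(jobs=8)
client = Client(cluster)

source = zarr.open("input.zarr", mode="r")  # an existing zarr array (illustrative path)

plan = rechunk(
    source,
    target_chunks=(1000, 100, 100),  # illustrative target chunking
    max_mem="4GB",
    target_store="output.zarr",
    temp_store="temp.zarr",
)
plan.execute()  # runs on the active distributed cluster by default
```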
So specifying max_mem to be 2/3 of the worker memory worked for me. Is it known what fraction should be used?
I think that rechunker will try to allocate as close to max_mem as possible. That should probably not exceed the various thresholds specified in https://docs.dask.org/en/latest/configuration-reference.html#distributed.worker.memory.target.
For this workload (where the memory usage can be very tightly controlled), the best option might be setting distributed.worker.memory.{target,spill,pause} very close to terminate (so something like 0.9) and then setting max_mem close to whatever that value is (0.9 * worker memory). You'll want at least some buffer for other stuff in the Python process (like module objects).
(If that's correct, then we should document it :smile:)
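A rough sketch of what that might look like; the 0.9/0.95 thresholds and the 6GB worker size are illustrative values, not tested recommendations:

```python
import dask

worker_memory_gb = 6  # illustrative; match it to the actual worker memory

# Push target/spill/pause close to terminate so workers hold on to data
# instead of spilling early. Note: workers launched in separate batch jobs
# read config from files/environment variables, so these may need to go in
# ~/.config/dask/ rather than being set in the client process.
dask.config.set({
    "distributed.worker.memory.target": 0.9,
    "distributed.worker.memory.spill": 0.9,
    "distributed.worker.memory.pause": 0.9,
    "distributed.worker.memory.terminate": 0.95,
})

# Leave some headroom for the rest of the Python process.
max_mem = f"{0.9 * worker_memory_gb:.1f}GB"  # "5.4GB" for 6GB workers
```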
For a worker memory of 6GB, I tried systematically reducing rechunk_max_mem from 6GB (didn't work), to 5GB (didn't work), to 4GB (did work). So for this workflow, setting rechunk_max_mem to 2/3 of the worker memory works.
My first time using rechunker! I'm running on HPC with a SLURMCluster, reading and writing zarr to fast local disk. This runs for about a minute, then barfs with KeyError: 'startstops': https://nbviewer.jupyter.org/gist/rsignell-usgs/fe0a1ec0cec562b14ddcc9a222ddc34f