pangeo-data / rechunker

Disk-to-disk chunk transformation for chunked arrays.
https://rechunker.readthedocs.io/
MIT License

Setting rechunk_max_mem to worker memory doesn't work #54

Open rsignell-usgs opened 4 years ago

rsignell-usgs commented 4 years ago

My first time using rechunker! I'm running on HPC with a SlurmCluster, reading and writing zarr to fast local disk.

This runs for about a minute, then barfs with a KeyError ('startstops'): https://nbviewer.jupyter.org/gist/rsignell-usgs/fe0a1ec0cec562b14ddcc9a222ddc34f
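For context, here is a minimal sketch of the setup described above (all paths, cluster sizes, and chunk shapes are placeholders, not values from the notebook):

    # Hedged sketch: SLURMCluster + rechunker, reading/writing zarr on local disk.
    import zarr
    from dask.distributed import Client
    from dask_jobqueue import SLURMCluster
    from rechunker import rechunk

    cluster = SLURMCluster(cores=4, memory="6GB", walltime="01:00:00")
    cluster.scale(jobs=4)
    client = Client(cluster)

    source = zarr.open("/local/fast/source.zarr", mode="r")  # existing chunked array
    plan = rechunk(
        source,
        target_chunks=(744, 100, 100),   # placeholder target chunking
        max_mem="6GB",                   # set equal to worker memory here, which
                                         # turns out to be the problem (see below)
        target_store="/local/fast/target.zarr",
        temp_store="/local/fast/temp.zarr",
    )
    plan.execute()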

TomAugspurger commented 4 years ago

That's a strange one. It's likely an issue in distributed rather than rechunker.

What version of dask are you using? I'm not familiar with how SLURMCluster handles environments, but is the version "on the cluster" the same as your client (/home/rsignell/miniconda3/envs/pangeo3/bin/python)?
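One way to check that from the notebook (a sketch; client here is the distributed Client connected to the SLURMCluster):

    # Compare package versions across client, scheduler, and workers;
    # check=True raises if the required packages don't match.
    versions = client.get_versions(check=True)
    print(versions["client"]["packages"]["dask"])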

rsignell-usgs commented 4 years ago

@TomAugspurger:

  1. I installed pangeo-dask:2020.09.19 from conda-forge, so I'm using:

    (pangeo3) rsignell@denali-login2:~> conda list dask
    # packages in environment at /home/rsignell/miniconda3/envs/pangeo3:
    #
    # Name                    Version                   Build  Channel
    dask                      2.27.0                     py_0    conda-forge
    dask-core                 2.27.0                     py_0    conda-forge
    dask-gateway              0.8.0            py37hc8dfbb8_0    conda-forge
    dask-jobqueue             0.7.1                      py_0    conda-forge
    dask-kubernetes           0.10.1                     py_0    conda-forge
    dask-labextension         3.0.0                      py_0    conda-forge
    pangeo-dask               2020.09.19                    0    conda-forge
  2. My understanding is that dask-jobqueue (which provides SLURMCluster) submits jobs using the same environment as the notebook, so all versions should match, right? (One way to check is to print the generated job script, as sketched below.)
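A quick way to confirm that (a sketch; cluster is the SLURMCluster object from the notebook):

    # dask-jobqueue builds the SLURM batch script from the client environment;
    # printing it shows which Python interpreter the worker jobs will run.
    print(cluster.job_script())
    # The script should launch the same interpreter as the client, e.g.
    # /home/rsignell/miniconda3/envs/pangeo3/bin/python -m distributed.cli.dask_worker ...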

rsignell-usgs commented 4 years ago

@TomAugspurger, well, you were right, there is something funky with my use of SLURMCluster. When I tried a LocalCluster, it worked fine.

rsignell-usgs commented 4 years ago

@TomAugspurger, and now I've discovered the real problem. It wasn't a problem with SLURMCluster either. My problem was that I was specifying the rechunker max_mem to be the same as the memory on my workers. That apparently doesn't work. When I set my worker memory to 6GB and the rechunker max_mem to 4GB, both LocalCluster and SLURMCluster work fine:

[screenshot of the working run]

So specifying max_mem to be 2/3 of the worker memory worked for me. Is it known what fraction should be used?
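For concreteness, the combination that worked, as a sketch (only the two memory values come from this comment; the rest reuses the placeholders from the earlier sketch):

    # Workers get 6GB each; rechunker's per-worker budget is capped at 4GB.
    cluster = SLURMCluster(cores=4, memory="6GB")
    plan = rechunk(
        source,
        target_chunks=(744, 100, 100),   # placeholder
        max_mem="4GB",                   # ~2/3 of the 6GB worker memory
        target_store="/local/fast/target.zarr",
        temp_store="/local/fast/temp.zarr",
    )
    plan.execute()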

TomAugspurger commented 4 years ago

I think that rechunker will try to allocate as close to max_mem as possible, so max_mem should probably not exceed the various thresholds specified in https://docs.dask.org/en/latest/configuration-reference.html#distributed.worker.memory.target.

For this workload (where the memory usage can be very tightly controlled), the best option might be setting distributed.worker.memory.{target,spill,pause} very close to terminate (so something like 0.9) and then setting max_mem close to whatever that value is (0.9 * worker memory).

You'll want at least some buffer for other stuff in the Python process (like module objects).

(If that's correct, then we should document it :smile:)
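As a sketch of that configuration (assuming the settings are in place before the workers start, e.g. via dask.config.set in the launching process or a dask config file on the compute nodes):

    import dask

    # Raise the worker memory thresholds toward the terminate fraction
    # (0.95 by default) so workers don't spill or pause before rechunker's
    # budget is reached.
    dask.config.set({
        "distributed.worker.memory.target": 0.90,
        "distributed.worker.memory.spill": 0.90,
        "distributed.worker.memory.pause": 0.90,
    })

    # Size max_mem off the per-worker memory, leaving headroom for the
    # Python process itself (modules, buffers, etc.).
    worker_memory_bytes = 6e9                  # example: 6GB workers
    max_mem = int(0.90 * worker_memory_bytes)  # pass this to rechunk(..., max_mem=max_mem)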

rsignell-usgs commented 3 years ago

For a worker memory of 6GB, I tried systematically reducing rechunk_max_mem from 6GB (didn't work) to 5GB (didn't work) to 4GB (did work). So for this workflow, setting rechunk_max_mem to 2/3 of the worker memory works.