pangeo-data / rechunker

Disk-to-disk chunk transformation for chunked arrays.
https://rechunker.readthedocs.io/
MIT License

heuristics for optimizing rechunker and dask and parallel filesystem #108

Closed: satra closed this issue 2 years ago

satra commented 2 years ago

first off - thank you for this awesome library. it has been a lifesaver.

we need to rechunk about 200TB of zarr objects stored in files that range from 10GB to 200GB. the highest-resolution data arrays are 2048x2048x(15k-50k) and have been stored in 64^3 chunks. i'm converting to 128^3 (see: https://discourse.pangeo.io/t/any-suggestions-s3-upload-optimizations-for-large-3d-zarr-datasets/2117 and https://forum.image.sc/t/deciding-on-optimal-chunk-size/63023). at 128^3, a lot of our problems have been reduced, but we can't go any larger. both sides use blosc(zstd) compression under the hood.
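
for context, the conversion itself is essentially one rechunker call per array, along these lines (a minimal sketch; the paths and the 2GB memory budget are placeholders, not the actual script):

```python
import zarr
from rechunker import rechunk

# source array stored with 64^3 chunks (paths here are placeholders)
source = zarr.open("/data/source.zarr/highres", mode="r")

plan = rechunk(
    source,
    target_chunks=(128, 128, 128),        # new 128^3 chunking
    max_mem="2GB",                        # per-task memory budget
    target_store="/data/target.zarr/highres",
    temp_store="/data/tmp.zarr/highres",  # intermediate store rechunker may need
)
plan.execute()  # runs on whatever dask scheduler/client is currently active
```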

i have been trying to figure out how best to optimize the rechunking between the filesystem (beegfs) and the number of processes and threads. i have a separate python script that runs a wrapper around rechunking (it does a bunch of checks and book-keeping). this script is launched in a process pool that runs one rechunker per zarr object. the rechunk wrapper initializes a dask client that connects to a cluster i'm starting separately (an example setting: --nworkers 30 --memory-limit 50GB --nthreads 10, on a fairly beefy node with 112 hyperthreaded cores, 750GB ram, connected over a 100G pipe to the file storage servers). i've tried setting the rechunker max_mem to various values (e.g., 100M, 500M, 2G, 5G). the wrapper is sketched below.
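
roughly, the wrapper looks like this (a sketch only; the checks and book-keeping are omitted, and the scheduler address, paths, and the function name `rechunk_one` are made up for illustration):

```python
import zarr
from dask.distributed import Client
from rechunker import rechunk

def rechunk_one(src_path, target_path, temp_path, max_mem="2GB"):
    # attach to the separately started dask cluster (address is a placeholder)
    client = Client("tcp://scheduler-host:8786")
    try:
        source = zarr.open(src_path, mode="r")
        plan = rechunk(
            source,
            target_chunks=(128, 128, 128),
            max_mem=max_mem,          # e.g. "100MB", "500MB", "2GB", "5GB"
            target_store=target_path,
            temp_store=temp_path,
        )
        plan.execute()                # runs the dask graph on the connected cluster
    finally:
        client.close()
```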

i've tried various permutations given all the interactions between the filesystem and the node, but i can't seem to figure out a good heuristic for what the worker/thread/memory settings should be (e.g., converting about 700G took about 2.5 hours). i don't need optimal, but currently i'm not really crossing 100 MiB/s read/write on average on the filesystem. at 5G max_mem, the workers alternate between reading for a while at 100 MiB/s and then writing (sometimes up to 500 MiB/s, but only for short periods).

i think part of this behavior comes from the many tiny files for the 64^3 source zarr.

i'm reaching out just in case someone has some ideas for heuristics.

rabernat commented 2 years ago

Hi @satra - sorry no one ever got back to you! The developers of rechunker are mostly working with cloud object storage, where the "many tiny files" constraint is not such a problem.

Let us know if you were able to find some good settings.

satra commented 2 years ago

@rabernat - no worries. i found something that worked "optimally" for my scenario of reading and writing to the same filesystem (5 rechunking processes at a time, with about 80 workers, 16 GB per worker, and 2 threads each). it took a while, but hopefully i won't have to do this too often :)
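
for the record, the top-level driver looked roughly like this (a sketch reusing the `rechunk_one` wrapper sketched above; paths are placeholders and the dask cluster itself is started separately):

```python
from concurrent.futures import ProcessPoolExecutor

# dask cluster started separately, roughly:
#   dask-worker tcp://scheduler-host:8786 --nworkers 80 --nthreads 2 --memory-limit 16GB
# rechunk_one is the wrapper function sketched in my earlier comment

jobs = [
    # (source_path, target_path, temp_path) per zarr object -- placeholders
    ("/data/src/obj1.zarr", "/data/dst/obj1.zarr", "/data/tmp/obj1.zarr"),
]

with ProcessPoolExecutor(max_workers=5) as pool:  # 5 rechunking processes at a time
    futures = [pool.submit(rechunk_one, *job) for job in jobs]
    for fut in futures:
        fut.result()  # surface any errors from the workers
```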

rabernat commented 2 years ago

Glad you got it working. It would be great to build up some documentation about good configurations. I may follow up with you later to get more details to add to the docs.