Closed: rabernat closed this issue 6 years ago.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
closing.
@rabernat wrote: http://pangeo-data.org/data.html#guide-to-preparing-cloud-optimized-data
This basically summarizes the best known workflow for moving zarr-like datasets to the cloud.
I am currently transferring a pretty large dataset (~11 TB) from a local server to GCS. Here is an abridged version of my basic workflow:
Each chunk in the dataset has 2700 x 3600 elements (about 75 MB), and there are 292000 total chunks in the dataset.
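As a sanity check, the stated chunk shape is consistent with that size if the elements are 8-byte floats (my assumption):

```python
# 2700 x 3600 elements, assumed float64 (8 bytes each).
elements = 2700 * 3600
chunk_mb = elements * 8 / 1e6
print(round(chunk_mb, 1))  # 77.8, i.e. "about 75 MB"
```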
I am doing this through dask.distributed using a single, multi-threaded worker (24 threads). I am watching the progress through the dashboard.
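The setup described above would look roughly like this (a sketch on my part; the exact arguments are not from the issue):

```python
from dask.distributed import Client, LocalCluster

# One worker process with 24 threads, matching the setup above.
cluster = LocalCluster(n_workers=1, threads_per_worker=24)
client = Client(cluster)

# The diagnostic dashboard is linked from the client:
print(client.dashboard_link)
```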
Once I call `to_zarr`, it takes a long time (about an hour) before anything happens, and I can't figure out what dask is doing during that time. At some point the client errors with the following exception: `tornado.application - ERROR - Future <tornado.concurrent.Future object at 0x7fe371f58a58> exception was never retrieved`. Nevertheless, the computation eventually hits the scheduler, and I can watch its progress.

I can see that there are over 1 million tasks. Most of the time is being spent in tasks called `open_dataset-concatenate` and `store-concatenate`. There are 315360 of each task, and each takes about 20 s. Doing the math, at this rate it will take a couple of days to upload the data, which is slower than `scp` by a factor of 2-5.

I'm not sure if it's possible to do better. Just raising this issue to start a discussion.
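Spelling that math out (my own back-of-the-envelope estimate, assuming the ~20 s store tasks dominate and run 24-way in parallel):

```python
tasks = 315360            # store-concatenate tasks reported above
seconds_per_task = 20
threads = 24

total_days = tasks * seconds_per_task / threads / 86400
print(round(total_days, 1))  # 3.0 days
```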
A command-line utility to import netCDF directly to GCS/zarr would be a very useful tool to have.
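A minimal sketch of what such a utility could look like. The name `nc2zarr` and its flags are hypothetical; the heavy imports are deferred into `main` so argument parsing works without cloud credentials:

```python
import argparse

def build_parser():
    p = argparse.ArgumentParser(
        prog="nc2zarr",  # hypothetical name
        description="Copy netCDF files into a zarr store on GCS.")
    p.add_argument("inputs", nargs="+", help="input netCDF files")
    p.add_argument("--target", required=True,
                   help="GCS path for the zarr store, e.g. my-bucket/ds.zarr")
    p.add_argument("--chunk", default="time:1",
                   help="chunking spec as dim:size")
    return p

def main(argv=None):
    args = build_parser().parse_args(argv)
    # Deferred: only needed for the actual transfer.
    import gcsfs
    import xarray as xr
    dim, size = args.chunk.split(":")
    ds = xr.open_mfdataset(args.inputs).chunk({dim: int(size)})
    store = gcsfs.GCSFileSystem().get_mapper(args.target)
    ds.to_zarr(store)

if __name__ == "__main__":
    main()
```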