Closed: eric-czech closed this issue 4 years ago
All further discussion on this has moved to https://github.com/related-sciences/data-team/issues/38.
tl;dr: CPU saturation is possible with larger chunk sizes, though those chunk sizes are arguably too large to be practical. The thread above covers several topics, one of which is overcoming small chunk sizes by using a different Zarr backend that places multiple chunks in a single object. Asynchronous chunk loading may also reduce how often CPUs sit idle waiting for data to download; support for that was recently added to Zarr.
What's the right way to do this when reading from GS within GKE deployments?
Here is a slide with some monitoring data during a GWAS run with a single VM: