related-sciences / ukb-gwas-pipeline-nealelab

Pipeline for reproduction of NealeLab 2018 UKB GWAS

Figure out how to identify network bottlenecks #18

Closed: eric-czech closed this issue 4 years ago

eric-czech commented 4 years ago

What's the right way to do this when reading from Google Cloud Storage (GS) within GKE deployments?

Here is a slide with some monitoring data during a GWAS run with a single VM:

[Image: Screen Shot 2020-09-01 at 5.12.29 PM]

eric-czech commented 4 years ago

All further discussion on this has moved to https://github.com/related-sciences/data-team/issues/38.

tl;dr CPU saturation is possible with larger chunk sizes, though those chunk sizes are arguably too large to be practical. The thread above covers several ideas, one of which is working around small chunk sizes by using a different Zarr backend that places multiple chunks in a single object. Asynchronous loading of chunks may also reduce how often CPUs sit idle waiting for data to download, and support for that has recently been added to Zarr.
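
To make the asynchronous-loading point concrete, here is a minimal sketch of why overlapping chunk downloads can cut the time spent waiting on the network. It does not use Zarr or GCS at all: `fetch_chunk` is a hypothetical stand-in that sleeps instead of downloading, and the latency and chunk count are made-up numbers chosen only for illustration.

```python
import asyncio
import time

# Hypothetical values for illustration only, not measurements from the GWAS run.
FETCH_LATENCY_S = 0.05  # simulated time to download one chunk
N_CHUNKS = 8


async def fetch_chunk(i: int) -> bytes:
    """Stand-in for a remote object read; sleeps instead of downloading."""
    await asyncio.sleep(FETCH_LATENCY_S)
    return bytes([i]) * 4


async def load_sequential(n: int) -> list[bytes]:
    # Each fetch starts only after the previous one has finished.
    return [await fetch_chunk(i) for i in range(n)]


async def load_concurrent(n: int) -> list[bytes]:
    # All fetches are in flight at once, so their waits overlap.
    return list(await asyncio.gather(*(fetch_chunk(i) for i in range(n))))


def timed(loader, n: int):
    """Run an async loader to completion and return (chunks, elapsed seconds)."""
    start = time.perf_counter()
    chunks = asyncio.run(loader(n))
    return chunks, time.perf_counter() - start


seq_chunks, seq_s = timed(load_sequential, N_CHUNKS)
con_chunks, con_s = timed(load_concurrent, N_CHUNKS)
print(f"sequential: {seq_s:.2f}s  concurrent: {con_s:.2f}s")
```

In this toy setup the sequential loader waits roughly `N_CHUNKS * FETCH_LATENCY_S`, while the concurrent one waits roughly one `FETCH_LATENCY_S`. Real gains would depend on bandwidth limits, decompression cost, and how the storage backend issues requests, none of which are modeled here.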