Update: April 15, 2018

kaipak commented 6 years ago

Getting the KubeCluster/Dask/GCP tests to work properly took more elbow grease than I expected, but I'm finally getting consistent results. The previously published tests from my forked repo have been updated with this more comprehensive set. Bear in mind that these tests can take a long time to run, because each one is repeated many times in order to be statistically meaningful. Many of them are truncated at the moment and may display weird results. Longer tests are currently running on GCP and my laptop, and I'll upload the results when they finish.

https://kaipak.github.io/storage-benchmarks/#/

Here's what we've written so far:

10 GB Zarr/Dask Array/KubeCluster/GCSFS on GCP
- Params: chunk sizes: chunks=([5, 10, 50], 1000, 1000), n workers: [5, 10, 20, 40, 80]
- Tests
  - read
  - write
  - compute mean
250 MB Synthetic Geosciences-like dataset Zarr/Xarray/Single Machine
- Storage/Format (all Zarr)
  - POSIX
  - GCSFS
  - FUSE
- Tests
  - read
  - write
  - mean
250 MB Synthetic random Numpy arrays
- Storage/Format
  - Zarr/POSIX
  - Zarr/GCSFS
  - Zarr/FUSE
  - HDF5/POSIX
  - HDF5/HSDS
- Tests
  - read
  - write
  - mean
350 LOCA Numpy
- Storage/Format
  - HDF5/HSDS
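
For readers unfamiliar with ASV, a parameter grid like the chunk-size and worker lists above is usually expressed through the `params` and `param_names` class attributes, and ASV then times each `time_*` method once per combination. The sketch below is only an illustration of that structure: the class name, the parameter names and the pure-Python stand-in workload are all hypothetical and are not taken from the storage-benchmarks repo.

```python
import itertools


class SyntheticMeanSuite:
    """Hypothetical ASV-style suite; the workload is a stand-in, not the real benchmark."""

    # ASV times each time_* method once per combination of these values.
    params = ([5, 10, 50], [5, 10, 20, 40, 80])
    param_names = ["chunk_len", "n_workers"]

    def setup(self, chunk_len, n_workers):
        # Stand-in for creating the dataset and the cluster. A real suite
        # would size the cluster from n_workers here; this sketch ignores it.
        n_values = 1000
        self.chunks = [
            list(range(start, min(start + chunk_len, n_values)))
            for start in range(0, n_values, chunk_len)
        ]

    def time_mean(self, chunk_len, n_workers):
        # Mean over all chunks. ASV ignores the return value of time_*
        # methods; it is returned here only so the sketch can be checked.
        total = sum(sum(chunk) for chunk in self.chunks)
        count = sum(len(chunk) for chunk in self.chunks)
        return total / count


def parameter_grid(suite):
    """List every parameter combination the suite would be run with."""
    return list(itertools.product(*suite.params))


if __name__ == "__main__":
    print(len(parameter_grid(SyntheticMeanSuite)), "combinations")
```

With three chunk sizes and five worker counts, each test in such a suite is run for 15 combinations, which is part of why full runs take so long.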

Because of the problems I ran into getting consistent runs in ASV with the plethora of pieces we're dealing with, I didn't get around to documentation or prettying up the plots as I had planned, but I will be focusing on that for the next couple of days. The ASV docs are also quite sparse, so I think it'll be worthwhile to have something more comprehensive here, especially since the behavior of some of its settings is not necessarily obvious.

I'd like to detail more fully the tests we have so far and what we plan on working on next. There is no set schedule per se, but I've been roughly favoring getting results out of Dask/Xarray/GCP. In the immediate future, I plan on writing tests that use real data (likely LLC4320 ocean general circulation simulation output). Since all these tests now work for synthetic data, it should be relatively straightforward to point them at actual datasets. Here's my rough idea of a schedule for the next round of test writing.

- Fix GCS/FUSE related issues with worker permissions. There's discussion on this in pangeo-data/pangeo#215 and pangeo-data/pangeo#209. GCSFS works with a hack mentioned in pangeo-data/pangeo#209.
- Uploading real datasets to GCS from server environments.
- Above tests using real datasets.

If there's a particular use case someone is dying to see, I'd be happy to take requests.

rabernat commented 6 years ago

Kai, this is great progress!

As discussed in our meeting today, here are some next priorities:

- make sure that you are actually using the number of workers that you expect
- make sure the test datasets have enough chunks to actually saturate the workers (if you only have 50 chunks but 80 workers, 30 workers will be idle)
- compare your benchmark results to informal benchmarking based on real data analysis on pangeo.pydata.org, just to make sure it's consistent
- finally, figure out how to read the .json file of the ASV results directly from a notebook, to make custom plots: start developing a notebook to summarize the results from all your experiments
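
The chunk-saturation point can be checked with simple arithmetic before any benchmark is run. The sketch below counts the chunks for each chunk size in the 10 GB test and reports how many workers would have nothing to do. The array shape is an assumption (1250 x 1000 x 1000 float64 values is 10 GB); the thread does not state the real shape, and the function names are made up for this example.

```python
import math


def n_chunks(shape, chunks):
    """Number of chunks when an array of `shape` is split into blocks of `chunks`."""
    return math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))


def idle_workers(shape, chunks, n_workers):
    """Workers left without a chunk if each worker takes at most one chunk at a time."""
    return max(0, n_workers - n_chunks(shape, chunks))


# Assumed shape: 1250 * 1000 * 1000 float64 values * 8 bytes = 10 GB.
SHAPE = (1250, 1000, 1000)

if __name__ == "__main__":
    for chunk_len in (5, 10, 50):
        chunks = (chunk_len, 1000, 1000)
        for workers in (5, 10, 20, 40, 80):
            print(
                f"chunk_len={chunk_len:>2} workers={workers:>2} "
                f"chunks={n_chunks(SHAPE, chunks):>3} "
                f"idle={idle_workers(SHAPE, chunks, workers)}"
            )
```

Under that assumed shape, the largest chunk size (50) yields only 25 chunks, so the 40-worker and 80-worker runs could not be saturated.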

rabernat commented 6 years ago

And be careful about load vs persist!
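
A minimal sketch of the distinction, assuming Dask's default local scheduler and a small synthetic array rather than the benchmark datasets: `persist()` computes the chunks but leaves the result as a Dask collection (held in worker memory on a cluster), while `compute()` (or `.load()` on an Xarray object) gathers everything into the client process as NumPy data. A benchmark that times one is therefore not measuring the same thing as a benchmark that times the other.

```python
import dask.array as da
import numpy as np

# Small synthetic array; the shape and chunking are arbitrary for this sketch.
x = da.ones((1000, 1000), chunks=(100, 1000))

# persist(): chunks are computed, but the result is still a Dask array.
# On a distributed cluster the data stays on the workers.
persisted = x.persist()

# compute(): the full result is gathered into the calling process as a
# NumPy array, so any timing includes the transfer back to the client.
loaded = x.compute()

# Reductions on the persisted array are still lazy until computed.
mean_value = float(persisted.mean().compute())

if __name__ == "__main__":
    print(type(persisted).__name__, type(loaded).__name__, mean_value)
```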

pangeo-data / storage-benchmarks

Update: April 15, 2018 #27