pangeo-data / storage-benchmarks

testing performance of different storage layers
Apache License 2.0

DEPRECATION NOTICE

This project is inactive and unfinished. More recent activity is at:

https://github.com/pangeo-data/benchmarking

storage-benchmarks

A modified ASV suite of benchmark tests for gathering IO performance metrics in Pangeo environments. Sets of tests exist for cloud, HPC, and workstation-like environments across different mixtures of storage backends and APIs. We're mainly concerned with benchmarking Xarray/Dask performance in both single and multiprocessor/multithreaded/clustered environments.

airspeed velocity (ASV) is the basis of these benchmarks, although the workflow has been modified to accommodate gathering IO statistics.

Basics and running the benchmarks.

ASV benchmarks are typically run through the asv command-line tool, but in this implementation the runs are conducted through a Python script:

usage: run_benchmarks.py [-h] -b BENCHMARK [BENCHMARK ...]
                         [-n N_RUNS [N_RUNS ...]]

Where BENCHMARK is a regex matching the benchmark tests you'd like to run. For example, if you want to run all the GCP Kubernetes read tests 10 times, you'd execute:

python run_benchmarks.py -b gcp_kubernetes_read* -n 10

This will then generate all the benchmark runs, scrape the resulting JSON output, and append the results to a CSV file. Data is collected from the most recent ASV JSON results file for the machine the tests are being run on. If your results directory contains results from a different machine, the script does not currently collect them.
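
The collection step described above can be sketched roughly as follows. This is not the project's run_benchmarks.py, only an illustration under stated assumptions: the function names, the CSV column names, and the default for -n are hypothetical; the results directory is assumed to follow ASV's usual results/<machine>/ layout with a machine.json beside the result files; and the JSON "results" entry is assumed to be a simple mapping from benchmark name to a number, which real ASV result files may not match depending on the ASV version.

```python
import argparse
import csv
import json
from pathlib import Path


def build_parser():
    """Mirror the usage line: -b is required, both options take one or more values."""
    parser = argparse.ArgumentParser(prog="run_benchmarks.py")
    parser.add_argument("-b", dest="benchmark", nargs="+", required=True,
                        help="regex of the benchmark tests to run")
    # Assumption: the default number of runs is not stated in the README.
    parser.add_argument("-n", dest="n_runs", nargs="+", type=int, default=[1],
                        help="number of runs")
    return parser


def latest_results_file(results_dir, machine):
    """Return the most recently modified results JSON for one machine, or None."""
    candidates = [
        p for p in Path(results_dir, machine).glob("*.json")
        if p.name != "machine.json"  # machine description, not a results file
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)


def append_results(json_path, csv_path):
    """Append one (benchmark, value) row per entry of the JSON "results" mapping."""
    with open(json_path) as f:
        data = json.load(f)
    # Assumption: "results" maps benchmark name -> number (simplified format).
    rows = sorted(data.get("results", {}).items())
    write_header = not Path(csv_path).exists()
    with open(csv_path, "a", newline="") as f:
        writer = csv.writer(f)
        if write_header:
            writer.writerow(["benchmark", "value"])  # hypothetical column names
        writer.writerows(rows)
    return len(rows)
```

Because the CSV is opened in append mode and the header is written only when the file does not yet exist, repeated runs accumulate rows in the same file. Results from other machines are ignored simply because only one machine's subdirectory is searched.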

Suite of Tests

The following performance tests are conducted:

Storage/Backend/API Combinations

  1. netcdf -> POSIX -> local storage
  2. netcdf -> POSIX -> some sort of disk presentation layer (e.g. FUSE) -> cloud bucket
  3. Zarr -> POSIX -> local storage
  4. Zarr -> cloud bucket
  5. h5netcdf -> POSIX -> local storage
  6. h5netcdf -> hsds -> cloud bucket
  7. TileDB -> cloud (currently only S3)
  8. TileDB -> POSIX (local, Lustre, etc.)
  9. TileDB -> HDFS
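
For orientation, a test for one of these combinations would typically take the shape of an ASV benchmark class: ASV calls setup before each timed method, times methods whose names start with time_, and calls teardown afterwards. The sketch below is a stand-in and not taken from this repository. The class name, parameter values, and file name are hypothetical, and it reads raw bytes from a temporary file rather than a netCDF or Zarr dataset, so it only illustrates the "-> POSIX -> local storage" shape of a read test.

```python
import os
import shutil
import tempfile


class LocalPosixReadSuite:
    """ASV-style suite: raw-bytes stand-in for a POSIX local-storage read test."""

    # ASV runs setup, each time_* method, and teardown once per parameter value.
    params = [1, 8]
    param_names = ["size_mib"]

    def setup(self, size_mib):
        # Create the scratch file here so the timed method measures only the read.
        self.tmpdir = tempfile.mkdtemp()
        self.path = os.path.join(self.tmpdir, "payload.bin")
        with open(self.path, "wb") as f:
            f.write(os.urandom(size_mib * 1024 * 1024))

    def time_read(self, size_mib):
        # The body of a time_* method is what ASV times.
        with open(self.path, "rb") as f:
            self.nbytes = len(f.read())

    def teardown(self, size_mib):
        # Remove the scratch directory created in setup.
        shutil.rmtree(self.tmpdir, ignore_errors=True)
```

Note that a file written immediately before being read is likely to be served from the operating system's page cache, so a sketch like this measures cached reads; a realistic storage benchmark would read data that already exists on the target storage.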