Scripts for benchmarking performance of Zarr and HDF5 in different contexts.
`parallel_read_write.py` is a command line tool for reading and writing data in parallel using MPI, HDF5, and Zarr.
```
Usage: parallel_read_write.py [OPTIONS]

Options:
  --nsteps INTEGER           Number of iterations to perform
  --size INTEGER             Length of each row of array
  --output_dir TEXT          Where to write the data
  --compression [none|gzip]
  --nested                   Whether to use zarr NestedDirectoryStore
  --help                     Show this message and exit.
```
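For example, a run under MPI might look like the following sketch; the process count, array size, and output path are placeholders, not recommended settings:

```bash
# Illustrative invocation: 4 MPI processes, 10 iterations of a 1,000,000-element
# row, gzip-compressed output. The process count, sizes, and path are placeholders.
mpirun -np 4 python parallel_read_write.py \
    --nsteps 10 \
    --size 1000000 \
    --output_dir /path/to/output \
    --compression gzip
```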
I suspect that the results are quite sensitive to the MPI configuration.
The conda environment is specified in `environment.yaml`.
I am using MPI libraries, HDF5 libraries, etc. installed from conda-forge.
This might not be optimal.
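A minimal way to recreate that environment, assuming `environment.yaml` sits at the repo root:

```bash
# Create the conda environment from the repo's environment file, then activate it.
# Use whatever name is set under "name:" in environment.yaml.
conda env create -f environment.yaml
conda activate <environment-name>
```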
The data is output from the scripts in CSV format and committed directly to this repo. The scripts can be run over and over, producing more results, which can just be added incrementally. You should never have to delete data.
The following scripts can be run on Cheyenne in batch mode (an example submission follows the list below). There are a few module tricks needed to make things work right; again, these could be affecting performance.
- `PBS_run_script_cheyenne_singlenode.sh` - profile read and write on a single node as a function of MPI procs. Results stored in `data_cheyenne_singlenode`.
- `PBS_run_script_cheyenne_multinode.sh` - profile read and write on multiple nodes using 9 mpiprocs per node. Results stored in `data_cheyenne_mpiprocs09`.
- `analyze_results.py <DATA_DIR>` - dump a bunch of results to the terminal in text form.
- `plot_all_results.ipynb` - notebook for plotting up the results.
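As a sketch of a typical workflow (the job script is assumed to carry its own PBS directives, and the data directory is just one of those listed above):

```bash
# Submit the single-node profiling job to Cheyenne's PBS scheduler,
# then dump the accumulated results as text once the job has finished.
qsub PBS_run_script_cheyenne_singlenode.sh
python analyze_results.py data_cheyenne_singlenode
```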