This repo provides benchmarks for common data analysis tasks in atmospheric science, accomplished with several different tools.
For an in-depth description and comparison of these tools, see TOOLS.md.
To run benchmarks, you must first generate sample data. This is done with the `data_generator.py` script. It generates NetCDF files of arbitrary resolution containing artificial temperature, zonal wind, and meridional wind data. It requires `xarray` and `dask`.
Usage:

```
./data_generator.py RESO [-h|--help] [-l|--lev=NLEV] [-t|--time=NTIME] [-d|--dir=DIR]
```
For the below results, data was generated as follows:

```
for reso in 20 10 7.5 5 3 2 1.5; do ./data_generator.py $reso -l=60 -t=200; done
```
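For reference, below is a minimal sketch of the kind of dataset the generator produces. This is not the actual `data_generator.py` implementation; the variable names (`t`, `u`, `v`), dimension names, grid construction, and output file name are assumptions for illustration.

```python
# Sketch: build an artificial lat-lon-lev-time dataset and save it to NetCDF.
# Variable and dimension names here are illustrative, not the generator's actual choices.
import numpy as np
import xarray as xr

reso, nlev, ntime = 5.0, 60, 200              # grid spacing (degrees), levels, timesteps
lat = np.arange(-90 + reso / 2, 90, reso)
lon = np.arange(0, 360, reso)
lev = np.linspace(1000, 10, nlev)             # pressure levels (hPa)
time = np.arange(ntime)                       # arbitrary time index

shape = (ntime, nlev, lat.size, lon.size)
dims = ("time", "lev", "lat", "lon")
rng = np.random.default_rng(0)
ds = xr.Dataset(
    data_vars={
        "t": (dims, 250 + 30 * rng.random(shape)),     # temperature (K)
        "u": (dims, 20 * rng.standard_normal(shape)),  # zonal wind (m/s)
        "v": (dims, 20 * rng.standard_normal(shape)),  # meridional wind (m/s)
    },
    coords={"time": time, "lev": lev, "lat": lat, "lon": lon},
)
ds.to_netcdf(f"data{reso}.nc")  # hypothetical output name
```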
To run your own benchmarks, use the shell scripts in the top-level directory. Example usage:

```
./test_name.sh DIR
```

where `DIR` is the directory containing the sample NetCDF files.
To make your own `test_name.sh` benchmark, start by copying an existing benchmark and work from there. Each `test_name.sh` benchmark does the following:

1. Source `header.sh`. This script declares some bash functions and `cd`s into the `testname` directory, where the language-specific test scripts must be stored.
2. Loop over the NetCDF files in `DIR`. Run the `init` bash function at the top of the loop, then for each script in the `testname` directory, pass the command-line call signature to the `benchmark` bash function. For example: to run `python test.py file.nc`, we use `benchmark python test.py file.nc` (a sketch of such a script appears below).
3. If the benchmark requires saving data, it should be saved into the `out` folder inside the `testname` directory.

Note that `header.sh` also creates a special `python` function that lets you name your python files the same name as existing python packages. For example: `xarray.py` is a valid file name.
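For illustration, here is a rough sketch of what a language-specific Python test script might look like. The variable name (`t`), the computation, and the output file name are hypothetical; the actual scripts in each `testname` directory will differ.

```python
#!/usr/bin/env python
# Sketch of a language-specific test script invoked as: benchmark python test.py file.nc
# The variable name "t" and the output file name are assumptions for illustration.
import os
import sys
import xarray as xr

filename = sys.argv[1]                      # NetCDF file passed in by the benchmark loop
ds = xr.open_dataset(filename)
result = ds["t"].mean(dim="time")           # some trivial computation to time
os.makedirs("out", exist_ok=True)           # saved data goes in the "out" folder
result.to_netcdf(os.path.join("out", "result.nc"))
```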
Results for each file are saved to markdown-style tables in the `results` directory. To generate plots of these tables (see below for an example), use the `plots.ipynb` IPython notebook. This requires the numpy and ProPlot packages. ProPlot is a matplotlib wrapper I developed.
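Below is a minimal sketch of the kind of plot the notebook makes. The arrays and labels are placeholders, since the actual notebook parses the markdown tables in `results`.

```python
# Sketch: plot benchmark timings with ProPlot. The arrays below are placeholders; the
# real notebook parses the markdown tables in the "results" directory.
import numpy as np
import proplot as plot

sizes = np.array([10, 50, 100, 500, 1000])             # file sizes (MB), placeholders
times = {"CDO": sizes * 0.01, "xarray": sizes * 0.05}  # wall-clock times (s), placeholders

fig, axs = plot.subplots()
ax = axs[0]
for name, t in times.items():
    ax.plot(sizes, t, label=name)
ax.format(xlabel="file size (MB)", ylabel="time (s)", title="Benchmark timings")
ax.legend()
```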
For this benchmark, eddy fluxes of heat and momentum were calculated and saved into new NetCDF files.
Climate Data Operators (CDO) is the clear winner here, followed closely by MATLAB, Fortran, and Python, whose relative ranking depends on the file size. For files smaller than 100MB though, the differences are pretty small, and the NCAR Command Language (NCL) and NetCDF Operators (NCO) are acceptable choices. xarray combined with dask really shines when used on machines with lots of cores.
The benchmark was run on my MacBook (first plot), and on a Cheyenne HPC cluster interactive node (second plot), which is a shared resource consisting of approximately 72 cores.
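For context, the xarray version of this calculation might look roughly like the sketch below, assuming eddies are defined as deviations from the zonal mean and that the variables are named `t`, `u`, and `v` (both assumptions).

```python
# Sketch: eddy heat and momentum fluxes with xarray + dask. Eddies are assumed to be
# deviations from the zonal mean, and the variable names (t, u, v) are illustrative.
import os
import xarray as xr

ds = xr.open_dataset("file.nc", chunks={"time": 10})    # lazy, dask-backed read
anom = ds - ds.mean(dim="lon")                           # deviations from the zonal mean
ehf = (anom["t"] * anom["v"]).mean(dim="lon")            # eddy heat flux, [T'v']
emf = (anom["u"] * anom["v"]).mean(dim="lon")            # eddy momentum flux, [u'v']
os.makedirs("out", exist_ok=True)
xr.Dataset({"ehf": ehf, "emf": emf}).to_netcdf("out/fluxes.nc")
```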
For this benchmark, the first quarter of the timesteps was selected and saved into a new NetCDF file.
The results here were interesting. NCO is the winner for small files, but CDO beats it for large files, at which point the time required for overhead operations is negligible. XArray is the slowest across all file sizes.
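The xarray version of this operation is essentially a one-liner, shown below for reference; the file names are placeholders and the time dimension is assumed to be named `time`.

```python
# Sketch: select the first quarter of the timesteps and write a new NetCDF file.
# Assumes the time dimension is named "time"; file names are placeholders.
import xarray as xr

ds = xr.open_dataset("file.nc")
nt = ds.sizes["time"]
ds.isel(time=slice(0, nt // 4)).to_netcdf("slice.nc")
```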
Coming soon!
Coming soon!
Coming soon!
This benchmark compares the only two obvious tools for interpolating from isobars to isentropes: NCL, and Python using the MetPy package.
This time, NCL was the clear winner! The MetPy script also issued a bunch of warnings when it ran. Evidently, the kinks in the MetPy algorithm haven't been ironed out yet.
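For reference, a rough sketch of the MetPy side of this comparison is below. The variable names, dimension order, and vertical-axis argument are assumptions, and the exact signature has changed across MetPy versions, so the call may need adjusting.

```python
# Sketch: interpolate temperature and winds from pressure levels onto isentropic surfaces
# with MetPy. Variable names (t, u, v, lev) and the vertical axis position are assumptions.
import numpy as np
import xarray as xr
import metpy.calc as mpcalc
from metpy.units import units

ds = xr.open_dataset("file.nc")
theta_levels = np.arange(300, 401, 10) * units.kelvin   # target isentropic surfaces
pressure = ds["lev"].values * units.hPa                  # 1-D pressure coordinate
temperature = ds["t"].values * units.kelvin              # (time, lev, lat, lon) assumed

# Returns pressure on the isentropic surfaces, followed by each extra field interpolated
# onto them; the vertical dimension is assumed to be axis 1 here.
p_isen, u_isen, v_isen = mpcalc.isentropic_interpolation(
    theta_levels, pressure, temperature, ds["u"].values, ds["v"].values, axis=1
)
```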