IMPORTANT: This repository is deprecated.
The benchmarks are now located at: https://github.com/mila-iqia/milabench
$ ./install_dependencies.sh
$ ./install_conda.sh
# reload bash with anaconda
$ exec bash
$ conda activate mlperf
$ ./install_python_dependencies.sh
# Install pytorch
$ conda install pytorch torchvision cudatoolkit=10.1 -c pytorch
$ ./cgroup_setup.sh
$ export BASE=~/data/
$ ./download_datasets.sh
The baselines should work on AMD GPUs, provided one installs a compatible version of PyTorch. For AMD GPUs, instead of conda install pytorch, follow these instructions: https://github.com/ROCmSoftwarePlatform/pytorch/wiki/Building-PyTorch-for-ROCm#option-4-install-directly-on-host
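After installation, an optional sanity check (not part of the install scripts) is to confirm from the mlperf environment that PyTorch can see the GPUs:

import torch

# Optional post-install check: confirms the PyTorch build has GPU support
# (CUDA, or ROCm for the AMD variant) and can enumerate the devices.
print("PyTorch version:", torch.__version__)
print("GPU support available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())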
The benchmarks should be run in the conda environment created during the installation.
# Set up environment and necessary variables
$ conda activate mlperf
$ export BASE=~/data/
$ export OUTDIR=~/results-$(date '+%Y-%m-%d.%H:%M:%S')/
# To run only once:
$ ./run.sh --jobs baselines.json --outdir $OUTDIR
# To run ten times:
$ ./run_10.sh --jobs baselines.json --outdir $OUTDIR
The test results will be stored as JSON files in the specified outdir, one file for each test. A result will have a name such as baselines.vae.R0.D0.20200106-160000-123456.json, which means: the test named vae, from the jobs file profiles/baselines.json, run 0, device 0, and then the date and time. If the tests are run 10 times and there are 8 GPUs, you should get 80 of these files for each test (R0 through R9 and D0 through D7). If a test fails, the filename will also contain the word FAIL (note: run number N corresponds to the option --uid N to run.sh).
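For scripting around the results, the naming scheme can be split apart with a small helper like the one below (an illustrative sketch based on the format described above, not a utility shipped with the repository):

from pathlib import Path

def parse_result_name(path):
    # Split a result filename such as
    # baselines.vae.R0.D0.20200106-160000-123456.json
    # into its components (hypothetical helper, for illustration only).
    parts = Path(path).name.split('.')
    return {
        'jobs': parts[0],             # e.g. "baselines"
        'test': parts[1],             # e.g. "vae"
        'run': int(parts[2][1:]),     # "R0" -> 0
        'device': int(parts[3][1:]),  # "D0" -> 0
        'timestamp': parts[4],
        'failed': 'FAIL' in Path(path).name,
    }

print(parse_result_name('baselines.vae.R0.D0.20200106-160000-123456.json'))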
The mlbench-report tool (the install procedure will install it automatically) can be used to generate an HTML report:
$ mlbench-report --name baselines --reports $OUTDIR --gpu-model RTX --title "Results for RTX" --html report.html
You may open the HTML report in any browser. It reports numeric performance results as compared to existing baselines for the chosen GPU model, results for all pass/fail criteria, a global score, and some handy tables comparing all GPUs to each other and highlighting performance discrepancies between them.
The command also accepts a --price argument (in dollars) to compute the price/score ratio.
To run a specific test, for example the vae test:
./run.sh --jobs baselines.json --name vae --outdir $OUTDIR
This is useful if one or more of the tests fail.
You can create tweaked baselines by modifying a copy of baselines.json. These tweaked baselines may be used to test something different, to debug, or to demonstrate further capabilities, if needed.
$ cp profiles/baselines.json profiles/tweaked.json
# modify tweaked.json to reflect the device capacity
$ ./run.sh --jobs tweaked.json --outdir $OUTDIR # run the tweaked version
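If you want to see what the jobs file contains before editing it, a schema-agnostic way to inspect it (which fields to tweak depends on the file itself):

import json

# Pretty-print the copied jobs file to see which entries and fields
# can be adjusted. No assumption is made about the schema beyond it
# being valid JSON.
with open('profiles/tweaked.json') as f:
    jobs = json.load(f)

print(json.dumps(jobs, indent=2))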
You can combine cgroups and docker using the commands below.
$ sudo docker run --cap-add=SYS_ADMIN --security-opt=apparmor:unconfined -it my_docker
$ apt-get install cgroup-bin cgroup-lite libcgroup1
$ mount -t tmpfs cgroup_root /sys/fs/cgroup
$ mkdir /sys/fs/cgroup/cpuset
$ mount -t cgroup cpuset -o cpuset /sys/fs/cgroup/cpuset
$ mkdir /sys/fs/cgroup/memory
$ mount -t cgroup memory -o memory /sys/fs/cgroup/memory
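To verify from inside the container that the controllers the benchmarks rely on are mounted, a simple check (assuming the cgroup v1 layout created by the commands above):

# Check that the cpuset and memory cgroup v1 hierarchies are mounted.
# Purely an illustrative check; it only inspects /proc/mounts.
with open('/proc/mounts') as f:
    mounts = f.read()

for controller in ('cpuset', 'memory'):
    mounted = f'/sys/fs/cgroup/{controller}' in mounts
    print(f'{controller}: {"mounted" if mounted else "NOT mounted"}')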
The benchmark starts with two toy examples to make sure everything is set up properly.
Each bench runs N_GPU times in parallel, with only N_CPU / N_GPU cores and RAM / N_GPU memory per run, to simulate multiple users.
Some tasks are allowed to use the machine entirely (scaling).
When installing PyTorch you have to make sure that it is compiled with LAPACK (needed for the QR decomposition); a quick check is shown below.
mlbench-report can be used at any time to check current results.
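A minimal way to run the LAPACK check mentioned above (torch.qr is the API of the PyTorch versions targeted here; newer releases use torch.linalg.qr):

import torch

# torch.qr requires a LAPACK-enabled PyTorch build; this raises a
# RuntimeError if LAPACK support is missing.
q, r = torch.qr(torch.randn(8, 8))
print('LAPACK OK:', q.shape, r.shape)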
Stop a run that is in progress
kill -9 $(ps | grep run | awk '{print $1}' | paste -s -d ' ')
kill -9 $(ps | grep python | awk '{print $1}' | paste -s -d ' ')
When running using the AMD stack, the initial compilation of each model can take a significant amount of time. You can skip this compilation step by using Mila's MIOpen compilation cache; to use it, simply execute copy_rocm_cache.sh.
If your machine supports SSE vector instructions, you are allowed to replace Pillow with pillow-simd for faster image loading times.
For machines with NUMA nodes, cgroups might be set up manually by the users, provided the constraints below are met.
Do all these benchmarks run/use GPUs or are some of them solely CPU-centric?
convnet and convnet_fp16 seem to be single GPU benchmarks but nvidia-smi shows activity on all GPUs in a node. Are the other GPUs used for workers?
We are using docker and sudo is not necessary.
You can set export SUDO='' to not use sudo.
Is there a multi-node benchmark in convnets? If yes, what was the reference run configuration?
What does the cgroup script do? It looks like it is an environment-specific script and may not be relevant to our environment. Can we comment out that line and run the script?
While running fp16 tasks, the warnings below are shown:
Warning: multi_tensor_applier fused unscale kernel is unavailable, possibly because apex was installed without --cuda_ext --cpp_ext. Using Python fallback. Original ImportError was: ModuleNotFoundError("No module named 'amp_C'",)
Attempting to unscale a grad with type torch.cuda.HalfTensor Unscaling non-fp32 grads may indicate an error. When using Amp, you don't need to call .half() on your model.
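The first warning indicates that apex's fused C++/CUDA extensions were not built and a Python fallback is used. A quick way to check whether they are present, based on the amp_C module named in the warning above:

# If this import fails, apex was installed without --cpp_ext --cuda_ext
# and Amp falls back to the slower pure-Python unscale kernel.
try:
    import amp_C  # noqa: F401
    print('apex fused kernels available')
except ImportError as err:
    print('apex fused kernels NOT available:', err)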
The ROCm cache is structured by default like so: .cache/miopen/2.1.0/<kernel_hash>/<compiled_kernel>*.cl.o, and the performance database is located at ~/.config/miopen/gfx906_60.HIP.2_1_0.ufdb.txt. We provide a zipped version of the MIOpen cache folder and a copy of our performance database file that you can unzip in your own cache location to speed up the first run of the benchmark.
unzip training/common/miopen.zip -d .cache/
cp training/common/gfx906_60.HIP.2_1_0.ufdb.txt ~/.config/miopen/
NB: the compile cache and performance database are both version dependent. They will only work if your version of MIOpen matches ours.
The idea is to have one consolidated repo that can run every bench in one run (as opposed to the MLPerf approach of everybody doing their own thing).
There is a single requirements.txt that consolidates all the requirements of all the examples, which means the dependencies need to play nice.
-- NO DOCKER --
$task/$model/$framework/run*.sh...
$task/$model/download_dataset.sh
The run script downloads the dataset and runs each run script one by one.
Each script sources config.env before running. The file defines the location of the datasets and useful locations (temp, data, output), as well as whether CUDA is available and the number of devices and processors available.
Each run*.sh script should be runnable from any working directory.
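Given the $task/$model/$framework/run*.sh layout above, the benchmark entry points can be enumerated with a glob; adjust the starting directory if the task folders sit under a subdirectory such as training/:

import glob

# List every benchmark entry point that follows the
# $task/$model/$framework/run*.sh layout described above.
for script in sorted(glob.glob('*/*/*/run*.sh')):
    print(script)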
du -hd 2 data/
205M data/wmt16/mosesdecoder
4.5G data/wmt16/data
828K data/wmt16/subword-nmt
13G data/wmt16
16G data/ImageNet/train
16G data/ImageNet
73M data/bsds500/BSR
73M data/bsds500
53M data/mnist/raw
106M data/mnist/MNIST
53M data/mnist/processed
211M data/mnist
19G data/coco/train2017
796M data/coco/annotations
788M data/coco/val2017
20G data/coco
1.2M data/time_series_prediction
1.8G data/ml-20m
50G data/
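If you want to double-check your local copy against the sizes above, a rough equivalent of the du listing in Python (apparent file sizes, so the figures may differ slightly from du's block-based numbers):

import os

# Rough per-dataset disk usage under $BASE, for comparison with the
# du listing above. Uses apparent file sizes, not allocated blocks.
base = os.path.expanduser(os.environ.get('BASE', '~/data'))
for name in sorted(os.listdir(base)):
    path = os.path.join(base, name)
    if not os.path.isdir(path):
        continue
    total = 0
    for root, _, files in os.walk(path):
        for f in files:
            fp = os.path.join(root, f)
            if os.path.isfile(fp):  # skip broken symlinks
                total += os.path.getsize(fp)
    print(f'{total / 2**30:6.1f}G  {name}')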
Through Academic Torrents
Fake datasets:
For each test we measure the compute time of number batches, repeat times; the first few observations are discarded and the average is reported. To get the samples per second, you need to compute batch_size / (train_time / number):
for r in range(args.repeat):
    with chrono.time('train') as t:
        for n in range(args.number):
            batch = next(batch_iterator)
            train(batch)
    print(f'[{r:3d}/{args.repeat}] ETA: {t.avg * (args.repeat - (r + 1)) / 60:6.2f} min')
Report output sample
{
    "batch-size": 128,
    "repeat": 25,
    "number": 5,
    "train": {
        "avg": 14.0371,
        "count": 20,
        "max": 20.0015,
        "min": 11.922,
        "sd": 1.9162,
        "unit": "s"
    },
    "train_item": {
        "avg": 45.59,   // 128 * 5 / 14.037
        "max": 53.68,   // 128 * 5 / 11.922
        "min": 31.98,   // 128 * 5 / 20.015
        "range": 21.69,
        "unit": "items/sec"  // img/sec in case of Image Batch
    }
}
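As a worked example of the batch_size / (train_time / number) formula, the train_item numbers above can be recomputed from the train timings (small rounding differences aside):

batch_size, number = 128, 5

# items/sec = batch_size / (train_time / number) = batch_size * number / train_time
# The fastest repeat (min train time) gives the max throughput, and vice versa.
avg_time, min_time, max_time = 14.0371, 11.922, 20.0015
print('avg:', round(batch_size * number / avg_time, 2), 'items/sec')  # ~45.59
print('max:', round(batch_size * number / min_time, 2), 'items/sec')  # ~53.68
print('min:', round(batch_size * number / max_time, 2), 'items/sec')  # ~32.00 (the sample rounds the max time to 20.015)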