mila-iqia / training


IMPORTANT: This repository is deprecated.

The benchmarks are now located at: https://github.com/mila-iqia/milabench

Training Benchmarks

Install

$ ./install_dependencies.sh
$ ./install_conda.sh

# reload bash with anaconda
$ exec bash
$ conda activate mlperf
$ ./install_python_dependencies.sh

# Install pytorch
$ conda install pytorch torchvision cudatoolkit=10.1 -c pytorch
$ ./cgroup_setup.sh
$ export BASE=~/data/
$ ./download_datasets.sh
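Before running anything, it can help to confirm that the install can see the GPUs. A quick check using standard PyTorch calls (run inside the mlperf environment):

import torch

# Sanity check: PyTorch imports and CUDA devices are visible.
print(torch.__version__)
print(torch.cuda.is_available(), torch.cuda.device_count())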

Installing on AMD

The baselines should work on AMD GPUs, provided one installs a compatible version of PyTorch. For AMD GPUs, instead of conda install pytorch, follow these instructions: https://github.com/ROCmSoftwarePlatform/pytorch/wiki/Building-PyTorch-for-ROCm#option-4-install-directly-on-host

Executing the benchmarks

The benchmarks should be run in the conda environment created during the installation.

# Set up environment and necessary variables
$ conda activate mlperf
$ export BASE=~/data/
$ export OUTDIR=~/results-$(date '+%Y-%m-%d.%H:%M:%S')/

# To run only once:
$ ./run.sh --jobs baselines.json --outdir $OUTDIR

# To run ten times:
$ ./run_10.sh --jobs baselines.json --outdir $OUTDIR

The test results will be stored as JSON files in the specified outdir, one file per test. A result file has a name such as baselines.vae.R0.D0.20200106-160000-123456.json, which encodes the test named vae from the jobs file profiles/baselines.json, run 0, device 0, followed by the date and time. If the tests are run 10 times on 8 GPUs, you should get 80 such files for each test (R0 through R9 and D0 through D7). If a test fails, the filename will also contain the word FAIL (note: run number N corresponds to the option --uid N of run.sh).
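As an illustration (not part of the repo), here is a small Python sketch that splits such a filename into its components, assuming the exact pattern described above:

import re

# Result files look like <jobs>.<test>.R<run>.D<device>.<datetime>.json
# (failed runs additionally contain the word FAIL somewhere in the name).
pattern = re.compile(r'(?P<jobs>\w+)\.(?P<test>\w+)\.R(?P<run>\d+)\.D(?P<device>\d+)\.(?P<stamp>[\d-]+)\.json')

name = 'baselines.vae.R0.D0.20200106-160000-123456.json'
match = pattern.match(name)
print(match.groupdict())
# {'jobs': 'baselines', 'test': 'vae', 'run': '0', 'device': '0', 'stamp': '20200106-160000-123456'}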

Reporting the results

The mlbench-report tool (installed automatically by the install procedure) can be used to generate an HTML report:

$ mlbench-report --name baselines --reports $OUTDIR --gpu-model RTX --title "Results for RTX" --html report.html

You may open the HTML report in any browser. It reports numeric performance results compared to the existing baselines for the chosen GPU model, the outcome of all pass/fail criteria, a global score, and some handy tables comparing all GPUs to each other and highlighting performance discrepancies between them.

The command also accepts a --price argument (in dollars) to compute the price/score ratio. For example (illustrative numbers only), a $3,000 GPU with a global score of 120 gives a ratio of 25; assuming a higher score is better, a lower ratio indicates better value for money.

Running a specific test

To run a specific test, for example the vae test:

$ ./run.sh --jobs baselines.json --name vae --outdir $OUTDIR

This is useful if one or more of the tests fail.
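To find out which tests need re-running, you can scan the output directory for the FAIL marker. A minimal sketch (not part of the repo), assuming the filename format described earlier, where the test name is the second dot-separated field:

import os

outdir = os.path.expanduser('~/results-2020-01-06.16:00:00/')  # your $OUTDIR

# Failed runs contain the word FAIL in the filename; the test name is the
# second dot-separated field (e.g. baselines.vae.R0.D0.<datetime>.json).
failed_tests = sorted({f.split('.')[1] for f in os.listdir(outdir) if 'FAIL' in f})

for test in failed_tests:
    print(f'./run.sh --jobs baselines.json --name {test} --outdir $OUTDIR')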

Baseline tweaks

You can create tweaked baselines by modifying a copy of baselines.json. Tweaked baselines can be used to test something different, to debug, or to demonstrate additional capabilities as needed.

$ cp profiles/baselines.json profiles/tweaked.json

# modify tweaked.json to reflect the device capacity

$ ./run.sh --jobs tweaked.json --outdir $OUTDIR  # run the tweaked version
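The schema of the jobs file is defined by the repo, so it is worth inspecting your copy before editing it. A small sketch that pretty-prints the copied profile so you can see which per-test settings to tweak:

import json

# Load the copied profile and pretty-print it for inspection.
with open('profiles/tweaked.json') as f:
    jobs = json.load(f)
print(json.dumps(jobs, indent=2))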

Docker [Experimental]

You can set up cgroups inside Docker using the commands below.

$ sudo docker run --cap-add=SYS_ADMIN --security-opt=apparmor:unconfined -it my_docker
$ apt-get install cgroup-bin cgroup-lite libcgroup1
$ mount -t tmpfs cgroup_root /sys/fs/cgroup

$ mkdir /sys/fs/cgroup/cpuset
$ mount -t cgroup cpuset -o cpuset /sys/fs/cgroup/cpuset

$ mkdir /sys/fs/cgroup/memory
$ mount -t cgroup memory -o memory /sys/fs/cgroup/memory

FAQ

ROCm Cache

By default, the ROCm cache is structured as .cache/miopen/2.1.0/<kernel_hash>/<compiled_kernel>*.cl.o, and the performance database is located at ~/.config/miopen/gfx906_60.HIP.2_1_0.ufdb.txt.

We provide a zipped version of the MIOpen cache folder and a copy of our performance database file that you can unpack into your own cache location to speed up the first run of the benchmark.

$ unzip training/common/miopen.zip -d .cache/
$ cp training/common/gfx906_60.HIP.2_1_0.ufdb.txt ~/.config/miopen/

NB: the compiled-kernel cache and the performance database are both version dependent; they will only work if your version of MIOpen matches ours.
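A quick way to confirm the files landed where MIOpen will look for them (paths as described above, assuming the cache lives under ~/.cache and MIOpen 2.1.0; adjust the version directory if yours differs):

import os

# Paths taken from the layout described above (MIOpen 2.1.0 assumed).
cache_dir = os.path.expanduser('~/.cache/miopen/2.1.0')
perf_db = os.path.expanduser('~/.config/miopen/gfx906_60.HIP.2_1_0.ufdb.txt')

print('kernel cache:', 'found' if os.path.isdir(cache_dir) else 'missing', cache_dir)
print('perf db:', 'found' if os.path.isfile(perf_db) else 'missing', perf_db)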

Details

The idea is to have one consolidated repo that can run every benchmark in a single run (as opposed to the MLPerf approach, where each benchmark does its own thing).

There is a single requirements.txt that consolidates the requirements of all the examples, which means all the dependencies need to play nicely together.

-- NO DOCKER --

Directory Layout

$task/$model/$framework/run*.sh...
$task/$model/download_dataset.sh

The run script downloads the dataset and runs each run script one by one. Each script sources config.env before running. That file defines the locations of the datasets and other useful locations (temp, data, output), as well as whether CUDA is available and the number of devices and processors available.

Each run*.sh script should be runnable from any working directory.
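Since scripts must work from any working directory, internal paths should be resolved relative to the script itself rather than the current directory. A Python sketch of the idiom (the actual variable names in config.env are defined by the repo; the ones below are illustrative):

import os

# Resolve paths relative to this file so the script works from any cwd.
SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))

# The kind of locations config.env provides (names here are hypothetical).
DATA_DIR = os.environ.get('DATA_DIR', os.path.join(SCRIPT_DIR, 'data'))
TEMP_DIR = os.environ.get('TEMP_DIR', '/tmp')
OUTPUT_DIR = os.environ.get('OUTPUT_DIR', os.path.join(SCRIPT_DIR, 'output'))

print(DATA_DIR, TEMP_DIR, OUTPUT_DIR)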

Datasets

$ du -hd 2 data/
205M    data/wmt16/mosesdecoder
4.5G    data/wmt16/data
828K    data/wmt16/subword-nmt
13G     data/wmt16
16G     data/ImageNet/train
16G     data/ImageNet
73M     data/bsds500/BSR
73M     data/bsds500
53M     data/mnist/raw
106M    data/mnist/MNIST
53M     data/mnist/processed
211M    data/mnist
19G     data/coco/train2017
796M    data/coco/annotations
788M    data/coco/val2017
20G     data/coco
1.2M    data/time_series_prediction
1.8G    data/ml-20m
50G     data/
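A quick sanity check (illustrative, not part of the repo) that the expected top-level dataset directories exist under $BASE after the download step:

import os

base = os.environ.get('BASE', os.path.expanduser('~/data/'))

# Top-level dataset directories from the listing above.
expected = ['wmt16', 'ImageNet', 'bsds500', 'mnist', 'coco',
            'time_series_prediction', 'ml-20m']

for name in expected:
    path = os.path.join(base, name)
    print('ok     ' if os.path.isdir(path) else 'MISSING', path)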

Benchmark methodology

For each test, we measure the compute time of number batches, repeat times; the first few observations are discarded and the average of the rest is reported (in the sample output below, repeat is 25 but train.count is 20, i.e. the first 5 measurements were dropped). To get samples per second, compute batch_size / (train_time / number).

# chrono is the repo's timing helper; batch_iterator and train are
# provided by each benchmark.
for r in range(args.repeat):

    # Time `number` batches as a single observation.
    with chrono.time('train') as t:

        for n in range(args.number):
            batch = next(batch_iterator)
            train(batch)

    # Estimate time remaining from the average observation so far.
    print(f'[{r:3d}/{args.repeat}] ETA: {t.avg * (args.repeat - (r + 1)) / 60:6.2f} min')

Report output sample

{
    "batch-size": 128,
    "repeat": 25,
    "number": 5,
    "train": {
        "avg": 14.0371,
        "count": 20,
        "max": 20.0015,
        "min": 11.922,
        "sd": 1.9162,
        "unit": "s"
    },
    "train_item": {
        "avg": 45.59,         // 128 * 5 / 14.037
        "max": 53.68,         // 128 * 5 / 11.922
        "min": 31.98,         // 128 * 5 / 20.015
        "range": 21.69,
        "unit": "items/sec"   // img/sec in the case of image batches
    }
}
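To see where the train_item numbers come from, here is the arithmetic from the sample above, applying the batch_size * number / train_time formula from the methodology (the min and max throughputs use the max and min train times, respectively):

# Values taken from the sample report above.
batch_size, number = 128, 5
train_avg, train_min = 14.0371, 11.922  # seconds per `number` batches

print(round(batch_size * number / train_avg, 2))  # 45.59 -> train_item["avg"]
print(round(batch_size * number / train_min, 2))  # 53.68 -> train_item["max"]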