qzan9 / osu-micro-benchmarks

Add HIP support to test GPUDirect capability of Hygon DCU

OMB (OSU Micro-Benchmarks)

The OSU Micro-Benchmarks use the GNU build system. Therefore you can simply use the following steps to build the MPI benchmarks.

Example:

./configure CC=/path/to/mpicc CXX=/path/to/mpicxx
make
make install

CC and CXX can also be set to other wrapper scripts to build the OpenSHMEM or UPC++ benchmarks. Based on this setting, configure will detect whether your library supports MPI-1, MPI-2, MPI-3, OpenSHMEM, and UPC++ and compile the corresponding benchmarks. See http://mvapich.cse.ohio-state.edu/benchmarks/ to download the latest version of this package.
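
For instance, a hedged sketch of an OpenSHMEM build, assuming your OpenSHMEM installation provides the commonly used oshcc/oshc++ compiler wrappers (a UPC++ build would substitute the corresponding UPC++ wrappers):

./configure CC=/path/to/oshcc CXX=/path/to/oshc++
make
make install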

This package also distributes UPC put, get, and collective benchmarks. These are located in the upc subdirectory and can be compiled by the following:

    for bench in osu_upc_memput              \
                 osu_upc_memget              \
                 osu_upc_all_scatter         \
                 osu_upc_all_reduce          \
                 osu_upc_all_gather          \
                 osu_upc_all_gather_all      \
                 osu_upc_all_exchange        \
                 osu_upc_all_broadcast       \
                 osu_upc_all_barrier
    do
        echo "Compiling $bench..."
        upcc $bench.c ../util/osu_util.c -o $bench
    done

The MPI Multiple Bandwidth / Message Rate (osu_mbw_mr), OpenSHMEM Put Message Rate (osu_oshm_put_mr), and OpenSHMEM Atomics (osu_oshm_atomics) tests are intended to be used with block assigned ranks. This means that all processes on the same machine are assigned ranks sequentially.

Rank    Block    Cyclic
0       host1    host1
1       host1    host2
2       host1    host1
3       host1    host2
4       host2    host1
5       host2    host2
6       host2    host1
7       host2    host2

If you're using mpirun_rsh, the ranks are assigned in the order they are seen in the hostfile or on the command line. Please see your process manager's documentation for information on how to control the distribution of the rank-to-host mapping.
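
For example, with mpirun_rsh a hostfile that lists each host consecutively produces the block assignment shown above (a minimal sketch; the hostnames and the choice of osu_mbw_mr are placeholders):

$ cat hostfile
host1
host1
host1
host1
host2
host2
host2
host2
$ mpirun_rsh -np 8 -hostfile hostfile ./osu_mbw_mr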

Point-to-Point MPI Benchmarks

osu_latency - Latency Test

osu_latency_mt - Multi-threaded Latency Test

osu_bw - Bandwidth Test

osu_bibw - Bidirectional Bandwidth Test

osu_mbw_mr - Multiple Bandwidth / Message Rate Test

osu_multi_lat - Multi-pair Latency Test
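
Most of these tests run between a single pair of MPI processes (osu_mbw_mr and osu_multi_lat use multiple pairs). A minimal launch sketch using mpirun_rsh, assuming a hostfile with one entry per rank:

mpirun_rsh -np 2 -hostfile hostfile ./osu_latency
mpirun_rsh -np 2 -hostfile hostfile ./osu_bw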

Collective MPI Benchmarks

osu_allgather - MPI_Allgather Latency Test
osu_allgatherv - MPI_Allgatherv Latency Test
osu_allreduce - MPI_Allreduce Latency Test
osu_alltoall - MPI_Alltoall Latency Test
osu_alltoallv - MPI_Alltoallv Latency Test
osu_barrier - MPI_Barrier Latency Test
osu_bcast - MPI_Bcast Latency Test
osu_gather - MPI_Gather Latency Test
osu_gatherv - MPI_Gatherv Latency Test
osu_reduce - MPI_Reduce Latency Test
osu_reduce_scatter - MPI_Reduce_scatter Latency Test
osu_scatter - MPI_Scatter Latency Test
osu_scatterv - MPI_Scatterv Latency Test

Collective Latency Tests

Support for CUDA Managed Memory

The following benchmarks have been extended to evaluate performance of MPI communication from and to buffers allocated using CUDA Managed Memory.

* osu_bibw              - Bidirectional Bandwidth Test
* osu_bw                - Bandwidth Test
* osu_latency           - Latency Test
* osu_allgather         - MPI_Allgather Latency Test
* osu_allgatherv        - MPI_Allgatherv Latency Test
* osu_allreduce         - MPI_Allreduce Latency Test
* osu_alltoall          - MPI_Alltoall Latency Test
* osu_alltoallv         - MPI_Alltoallv Latency Test
* osu_bcast             - MPI_Bcast Latency Test
* osu_gather            - MPI_Gather Latency Test
* osu_gatherv           - MPI_Gatherv Latency Test
* osu_reduce            - MPI_Reduce Latency Test
* osu_reduce_scatter    - MPI_Reduce_scatter Latency Test
* osu_scatter           - MPI_Scatter Latency Test
* osu_scatterv          - MPI_Scatterv Latency Test

In addition to supporting communication to and from GPU memory allocated using CUDA or OpenACC, we now provide the additional capability of performing communication to and from buffers allocated as CUDA Managed Memory. CUDA Managed (or Unified) Memory allows applications to allocate memory that is accessible from both the CPU and the GPU using the cudaMallocManaged() call; the runtime migrates the buffer between CPU and GPU memory transparently to the user. Currently, we offer benchmarking with CUDA Managed Memory using the tests listed above.

These benchmarks accept additional run-time options for selecting managed-memory buffers; consult each benchmark's -h output for the exact flags.
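
As a hedged illustration only: the 'M' buffer placement and the '-d managed' selector below are assumptions made by analogy with the 'D'/'H' placements and '-d cuda' documented later in this README, and should be verified against each benchmark's -h output.

# Point-to-point latency with managed buffers on both ranks
# ('M' placement is an assumption; check ./osu_latency -h)
mpirun_rsh -np 2 -hostfile hostfile MV2_USE_CUDA=1 ./osu_latency M M

# Collective with managed buffers ('-d managed' is an assumption)
mpirun_rsh -np 4 -hostfile hostfile MV2_USE_CUDA=1 ./osu_allreduce -d managed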

Non-Blocking Collective MPI Benchmarks

osu_iallgather - MPI_Iallgather Latency Test
osu_iallgatherv - MPI_Iallgatherv Latency Test
osu_iallreduce - MPI_Iallreduce Latency Test
osu_ialltoall - MPI_Ialltoall Latency Test
osu_ialltoallv - MPI_Ialltoallv Latency Test
osu_ialltoallw - MPI_Ialltoallw Latency Test
osu_ibarrier - MPI_Ibarrier Latency Test
osu_ibcast - MPI_Ibcast Latency Test
osu_igather - MPI_Igather Latency Test
osu_igatherv - MPI_Igatherv Latency Test
osu_ireduce - MPI_Ireduce Latency Test
osu_iscatter - MPI_Iscatter Latency Test
osu_iscatterv - MPI_Iscatterv Latency Test

Non-Blocking Collective Latency Tests

One-sided MPI Benchmarks

osu_put_latency - Latency Test for Put with Active/Passive Synchronization

osu_get_latency - Latency Test for Get with Active/Passive Synchronization

osu_put_bw - Bandwidth Test for Put with Active/Passive Synchronization

osu_get_bw - Bandwidth Test for Get with Active/Passive Synchronization

osu_put_bibw - Bi-directional Bandwidth Test for Put with Active Synchronization

osu_acc_latency - Latency Test for Accumulate with Active/Passive Synchronization

osu_cas_latency - Latency Test for Compare and Swap with Active/Passive Synchronization

osu_fop_latency - Latency Test for Fetch and Op with Active/Passive Synchronization

osu_get_acc_latency - Latency Test for Get_accumulate with Active/Passive Synchronization

Point-to-Point OpenSHMEM Benchmarks

osu_oshm_put.c - Latency Test for OpenSHMEM Put Routine

osu_oshm_put_nb.c - Latency Test for OpenSHMEM Non-blocking Put Routine

osu_oshm_get.c - Latency Test for OpenSHMEM Get Routine

osu_oshm_get_nb.c - Latency Test for OpenSHMEM Non-blocking Get Routine

osu_oshm_put_mr.c - Message Rate Test for OpenSHMEM Put Routine

osu_oshm_put_mr_nb.c - Message Rate Test for Non-blocking OpenSHMEM Put Routine

osu_oshm_get_mr_nb.c - Message Rate Test for Non-blocking OpenSHMEM Get Routine

osu_oshm_put_overlap.c - Non-blocking Message Rate Overlap Test

osu_oshm_atomics.c - Latency and Operation Rate Test for OpenSHMEM Atomics Routines
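
As a hedged sketch, the OpenSHMEM tests are launched with the launcher provided by your OpenSHMEM implementation; oshrun is used here as an assumed example, and two PEs suffice for the point-to-point tests:

oshrun -np 2 ./osu_oshm_put
oshrun -np 2 ./osu_oshm_get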

Collective OpenSHMEM Benchmarks

osu_oshm_collect - OpenSHMEM Collect Latency Test
osu_oshm_fcollect - OpenSHMEM FCollect Latency Test
osu_oshm_broadcast - OpenSHMEM Broadcast Latency Test
osu_oshm_reduce - OpenSHMEM Reduce Latency Test
osu_oshm_barrier - OpenSHMEM Barrier Latency Test

Collective Latency Tests

Point-to-Point UPC Benchmarks

osu_upc_memput.c - Put Latency

osu_upc_memget.c - Get Latency

Collective UPC Benchmarks

osu_upc_all_barrier - UPC Barrier Latency Test
osu_upc_all_broadcast - UPC Broadcast Latency Test
osu_upc_all_scatter - UPC Scatter Latency Test
osu_upc_all_gather - UPC Gather Latency Test
osu_upc_all_gather_all - UPC GatherAll Latency Test
osu_upc_all_reduce - UPC Reduce Latency Test
osu_upc_all_exchange - UPC Exchange Latency Test

Collective Latency Tests

Point-to-Point UPC++ Benchmarks

osu_upcxx_async_copy_put.c - Put Latency

osu_upcxx_async_copy_get.c - Get Latency

Collective UPC++ Benchmarks

osu_upcxx_allgather - UPC++ Allgather Latency Test
osu_upcxx_alltoall - UPC++ Alltoall Latency Test
osu_upcxx_bcast - UPC++ Broadcast Latency Test
osu_upcxx_gather - UPC++ Gather Latency Test
osu_upcxx_reduce - UPC++ Reduce Latency Test
osu_upcxx_scatter - UPC++ Scatter Latency Test

Collective Latency Tests

Startup Benchmarks

osu_init.c - This benchmark measures the minimum, maximum, and average time each process takes to complete MPI_Init.

osu_hello.c - This is a simple hello world program. Users can take advantage of this to time how long it takes for all processes to execute MPI_Init and MPI_Finalize.

CUDA and OpenACC Extensions to OMB

CUDA Extensions to OMB can be enabled by configuring the benchmark suite with the --enable-cuda option as shown below. Similarly, OpenACC Extensions can be enabled by specifying the --enable-openacc option. The MPI library used should be able to support MPI communication from buffers in GPU device memory.

./configure CC=/path/to/mpicc \
            CXX=/path/to/mpicxx \
            --enable-cuda \
            --with-cuda-include=/path/to/cuda/include \
            --with-cuda-libpath=/path/to/cuda/lib
make
make install
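
The text above also mentions --enable-openacc; as a hedged sketch (paths are placeholders), an OpenACC-enabled build might be configured as follows:

./configure CC=/path/to/mpicc \
            CXX=/path/to/mpicxx \
            --enable-openacc
make
make install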

The following benchmarks have been extended to evaluate performance of MPI communication using buffers on NVIDIA GPU devices.

osu_bibw           - Bidirectional Bandwidth Test
osu_bw             - Bandwidth Test
osu_latency        - Latency Test
osu_put_latency    - Latency Test for Put
osu_get_latency    - Latency Test for Get
osu_put_bw         - Bandwidth Test for Put
osu_get_bw         - Bandwidth Test for Get
osu_put_bibw       - Bidirectional Bandwidth Test for Put
osu_acc_latency    - Latency Test for Accumulate
osu_cas_latency    - Latency Test for Compare and Swap
osu_fop_latency    - Latency Test for Fetch and Op
osu_allgather      - MPI_Allgather Latency Test
osu_allgatherv     - MPI_Allgatherv Latency Test
osu_allreduce      - MPI_Allreduce Latency Test
osu_alltoall       - MPI_Alltoall Latency Test
osu_alltoallv      - MPI_Alltoallv Latency Test
osu_bcast          - MPI_Bcast Latency Test
osu_gather         - MPI_Gather Latency Test
osu_gatherv        - MPI_Gatherv Latency Test
osu_reduce         - MPI_Reduce Latency Test
osu_reduce_scatter - MPI_Reduce_scatter Latency Test
osu_scatter        - MPI_Scatter Latency Test
osu_scatterv       - MPI_Scatterv Latency Test
osu_iallgather     - MPI_Iallgather Latency Test
osu_iallgatherv    - MPI_Iallgatherv Latency Test
osu_iallreduce     - MPI_Iallreduce Latency Test
osu_ialltoall      - MPI_Ialltoall Latency Test
osu_ialltoallv     - MPI_Ialltoallv Latency Test
osu_ialltoallw     - MPI_Ialltoallw Latency Test
osu_ibcast         - MPI_Ibcast Latency Test
osu_igather        - MPI_Igather Latency Test
osu_igatherv       - MPI_Igatherv Latency Test
osu_ireduce        - MPI_Ireduce Latency Test
osu_iscatter       - MPI_Iscatter Latency Test
osu_iscatterv      - MPI_Iscatterv Latency Test

If both CUDA and OpenACC support is enabled you can switch between the modes using the -d [cuda|openacc] option to the benchmarks. Whether a process allocates its communication buffers on the GPU device or on the host can be controlled at run-time. Use the -h option for more help.

./osu_latency -h
Usage: osu_latency [options] [RANK0 RANK1]

RANK0 and RANK1 may be `D' or `H' which specifies whether
the buffer is allocated on the accelerator device or host
memory for each mpi rank

options:
  -d TYPE   accelerator device buffers can be of TYPE `cuda' or `openacc'
  -h        print this help message

Each of the pt2pt benchmarks takes two input parameters. The first parameter indicates the location of the buffers at rank 0 and the second parameter indicates the location of the buffers at rank 1. The value of each parameter can be either 'H' or 'D' to indicate whether the buffers are to be on the host or on the device, respectively. When no parameters are specified, the buffers are allocated on the host. The collective benchmarks will use buffers allocated on the device if the -d option is used; otherwise, the buffers will be allocated on the host.

Examples:

- mpirun_rsh -np 2 -hostfile hostfile MV2_USE_CUDA=1 osu_latency D D

In this run, the latency test allocates buffers at both rank 0 and rank 1 on the GPU devices.

- mpirun_rsh -np 2 -hostfile hostfile MV2_USE_CUDA=1 osu_bw D H

In this run, the bandwidth test allocates buffers at rank 0 on the GPU device and buffers at rank 1 on the host.
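
For the collective benchmarks, device buffers are selected with the -d option described above. A minimal sketch, reusing mpirun_rsh and the MV2_USE_CUDA flag from the examples above:

mpirun_rsh -np 4 -hostfile hostfile MV2_USE_CUDA=1 ./osu_allreduce -d cuda

In this run, the MPI_Allreduce latency test allocates its buffers on the GPU device of every rank.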

Setting GPU affinity

GPU affinity for processes is set before MPI_Init is called in the benchmarks. The process rank on a node is normally used to do this and different MPI launchers expose this information through different environment variables. The benchmarks use an environment variable called LOCAL_RANK to get this information.

Starting with OMB v5.4.4, the benchmarks automatically identify the process rank on a node for MVAPICH2 when launched with mpirun_rsh. However, a script like the one below can be used to export this environment variable when running OMB with other MPI launchers and libraries.

#!/bin/bash

# mpirun_rsh/MVAPICH2 provides the node-local rank in MV2_COMM_WORLD_LOCAL_RANK
export LOCAL_RANK=$MV2_COMM_WORLD_LOCAL_RANK
exec $*

A copy of this script is installed as get_local_rank alongside the benchmarks. It can be used as follows:

mpirun_rsh -np 2 -hostfile hostfile MV2_USE_CUDA=1 get_local_rank \
    ./osu_latency D D
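
For other launchers, a hedged variant of the same wrapper can export LOCAL_RANK from the launcher-specific variable; OMPI_COMM_WORLD_LOCAL_RANK (Open MPI) and SLURM_LOCALID (Slurm) are given here as assumed examples to verify for your setup:

#!/bin/bash
# Hedged sketch: export whichever node-local rank variable your launcher provides.
export LOCAL_RANK=${OMPI_COMM_WORLD_LOCAL_RANK:-$SLURM_LOCALID}
# Quote "$@" so arguments containing spaces are passed through unchanged.
exec "$@"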